# AI models research
## Quantization impact on Qwen 27B
### My on-premise setup
As a side note: I run an NVIDIA RTX 5070 Ti with 16 GB of VRAM, and its Blackwell architecture allows performance improvements with 4-bit quantized AI models.
### Motivation
I want to find out how much quantization degrades intelligence and improves speed, based on real-world reports and comparisons.
### AI model
The model I would like to see compared is 'Qwen3.6-27B', but since this model is quite new, comparisons of 'Qwen3.5-27B' would also be OK.
I am interested in both reasoning / thinking mode and instruct mode.
### Quantizations
I would like comparisons between the original 16-bit model weights, 8-bit quantization, and (most important) 4-bit quantization.
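
As a rough back-of-the-envelope illustration of what these three precisions mean in memory terms, the sketch below estimates weight-only storage. It assumes exactly 27e9 parameters (taken from the "27B" in the model name) and ideal bit widths; it ignores quantization metadata such as scales and zero points, the KV cache, and activations, so real checkpoints will be somewhat larger.

```python
# Weight-only memory estimate. Assumptions (not from any report):
# a nominal 27e9 parameters and ideal bits per weight, with no
# overhead for quantization scales, KV cache, or activations.
PARAMS = 27e9


def weight_gib(params: float, bits_per_weight: float) -> float:
    """Return the weight storage in GiB for a given bit width."""
    return params * bits_per_weight / 8 / 2**30


for label, bits in [("16-bit (BF16)", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>14}: {weight_gib(PARAMS, bits):5.1f} GiB")
```

Under these assumptions the weights come to roughly 50 GiB, 25 GiB, and 12.6 GiB respectively.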
### Your task
Please perform deep research to find the requested experience reports and comparisons.
### Ask questions first
Before starting, ask me between 2 and 5 questions to completely understand the situation and your task.

---

Instruction following, terminal coding, logical reasoning

Only Qwen models with 27B parameters or fewer

---

My on-premise setup was provided just as side information. There is no need to take it into account for the deep research.

- To 1.: No model offloading at all.
- To 2.: My inference framework plans are not relevant for the deep research.
- To 3.: Instruction following, terminal coding, logical reasoning.
- To 4.: Comparisons with FP4 would be great, yes. Please try to find such reports.

---

You are wrong. Here are the Hugging Face model pages showing that the models exist (but were obviously released after your knowledge cutoff date):

- [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B)
- [Qwen/Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B)

---

## To 1. **Precision formats you care about most**
- By “16‑bit”, I mean the original Qwen model, which has BF16 tensors.
- By “8‑bit”, I mean any of GPTQ‑Int8, AWQ‑Int8, and GGUF‑Q8_0.
- By “4‑bit”, I mean any of GPTQ‑Int4, AWQ‑Int4, GGUF‑Q4_K_M, or other variants such as NVFP4.

I am open to whichever 4‑bit/8‑bit quantization is best‑studied for Qwen 27B models.

## To 2. **Workload focus: reasoning vs coding vs general chat**
I care most about these two: *reasoning* and *code generation / debugging*.
## To 3. **Metric priorities**
For “intelligence loss”, I want *standard eval scores* and *task‑specific pass rates*.
For “speed”, I care about both *first‑token latency* and *throughput*.
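
To make the two speed metrics concrete, here is a minimal sketch of how they could be derived from per-token timestamps. The `GenerationTrace` type, its field names, and the sample timestamps are hypothetical, made up purely for illustration; individual reports and inference stacks may define these metrics slightly differently (for example, by including the first token in the throughput figure).

```python
from dataclasses import dataclass


@dataclass
class GenerationTrace:
    """Hypothetical record of one generation run (illustrative only)."""
    request_sent: float       # seconds: when the prompt was submitted
    token_times: list[float]  # seconds: arrival time of each generated token


def first_token_latency(trace: GenerationTrace) -> float:
    """Seconds between submitting the prompt and receiving the first token."""
    return trace.token_times[0] - trace.request_sent


def decode_throughput(trace: GenerationTrace) -> float:
    """Tokens per second during decoding, measured after the first token."""
    n = len(trace.token_times)
    if n < 2:
        return 0.0
    return (n - 1) / (trace.token_times[-1] - trace.token_times[0])


# Made-up timestamps: first token after 0.5 s, then one token every 0.1 s.
trace = GenerationTrace(request_sent=0.0, token_times=[0.5, 0.6, 0.7, 0.8, 0.9])
print(f"first-token latency: {first_token_latency(trace):.2f} s")
print(f"decode throughput:   {decode_throughput(trace):.1f} tokens/s")
```

Separating the two matters for quantization comparisons: first-token latency is dominated by prompt processing, while throughput reflects the token-by-token decoding phase.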
## To 4. **Inference stack hints**
For the deep research, my plans for the inference stack are not relevant. Any stack is of interest, since the findings might influence which stack I choose.
## To 5. **Local‑only vs “cloud‑style” scores**
I'm also okay with *multi‑GPU BF16 numbers* that illustrate the “ceiling” of un‑quantized performance.

---