AI models research
Quantization impact on Qwen 27B
My on-premise setup
Just as side information: I run an NVIDIA RTX 5070 Ti with 16 GB VRAM, and its Blackwell architecture allows performance improvements with 4-bit quantized AI models.
Motivation
I want to find out how quantization degrades intelligence and improves speed by looking at real-world reports and comparisons.
AI model
The model I would like to see compared is 'Qwen3.6-27B' but since this model is pretty new it also would be OK to see comparisons of 'Qwen3.5-27B'. I am interested in both reasoning / thinking mode and instruct mode.
Quantizations
I want comparisons between the original 16-bit model weights and the quantized versions: 8-bit and, most importantly, 4-bit.
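As a rough orientation for these three precisions, the weight storage can be estimated from the parameter count and the bits per weight. This is a minimal sketch, assuming a nominal 27e9 parameters (taking "27B" literally) and ideal bit widths; it ignores quantization scales, metadata, the KV cache, and activations, so real files and real VRAM usage will be larger. The function name `weight_footprint_gib` is hypothetical.

```python
def weight_footprint_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB, ignoring scales and metadata."""
    return num_params * bits_per_weight / 8 / 2**30


# Assumption: "27B" is taken literally as 27e9 parameters.
NOMINAL_PARAMS = 27e9

# Ideal bit widths only; real quantized formats store extra scale data.
footprints = {
    bits: weight_footprint_gib(NOMINAL_PARAMS, bits) for bits in (16, 8, 4)
}

for bits, gib in footprints.items():
    print(f"{bits:>2}-bit: {gib:.1f} GiB")
```

Under these assumptions, only the 4-bit weights come in below 16 GiB, which is why the 4-bit comparison matters most for a single 16 GB card.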
Your task
Please perform deep research to find the requested experience reports and comparisons.
Ask questions first
Before starting, ask me between 2 and 5 questions to completely understand the situation and your task.
Instruction following, terminal coding, logic reasoning
Only Qwen models with 27B parameters or fewer
My on-premise setup was provided only as side information. No need to take it into account for the deep research.
To 1.: No model offloading at all. To 2.: My inference framework plans are not relevant for the deep research. To 3.: Instruction following, terminal coding, logic reasoning. To 4.: Comparisons with FP4 would be great, yes, try to find such reports.
You are wrong. Here are the Hugging Face model pages showing that the models exist (they were obviously released after your knowledge cutoff date):
To 1. Precision formats you care about most
- With “16‑bit”, I mean the original Qwen model which has BF16 tensors.
- With “8‑bit”, I mean all of GPTQ‑Int8, AWQ‑Int8, and GGUF‑Q8_0.
- With “4‑bit”, I mean all of GPTQ‑Int4, AWQ‑Int4, GGUF‑Q4_K_M or other variants like NVFP4.
I am open to whichever 4‑bit/8‑bit quantization is best‑studied for Qwen 27B models.
To 2. Workload focus: reasoning vs coding vs general chat
I care most about these two: reasoning and code generation / debugging.
To 3. Metric priorities
For “intelligence loss”, I want standard eval scores and task‑specific pass rates. For “speed”, I care about both first‑token latency and throughput.
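To make the two speed metrics concrete, here is a minimal sketch of how first-token latency and decode throughput could be derived from per-token arrival timestamps. The class `GenerationTiming`, both function names, and the timestamps are hypothetical and made up for illustration; reports may define these metrics differently, for example by including the first token in the throughput figure.

```python
from dataclasses import dataclass


@dataclass
class GenerationTiming:
    request_sent: float       # seconds, when the prompt was submitted
    token_times: list[float]  # seconds, arrival time of each generated token


def time_to_first_token(timing: GenerationTiming) -> float:
    """Latency between submitting the prompt and receiving the first token."""
    return timing.token_times[0] - timing.request_sent


def decode_throughput(timing: GenerationTiming) -> float:
    """Tokens per second, measured after the first token has arrived."""
    if len(timing.token_times) < 2:
        raise ValueError("need at least two tokens to measure throughput")
    span = timing.token_times[-1] - timing.token_times[0]
    return (len(timing.token_times) - 1) / span


# Synthetic timestamps, not measurements of any real model.
example = GenerationTiming(
    request_sent=0.0,
    token_times=[0.5, 0.75, 1.0, 1.25, 1.5],
)

ttft = time_to_first_token(example)
tokens_per_second = decode_throughput(example)
print(f"first-token latency: {ttft:.2f} s")
print(f"decode throughput:   {tokens_per_second:.1f} tok/s")
```

Separating the two matters because first-token latency is dominated by prompt processing, while throughput reflects the token-by-token decode phase, and quantization may affect them differently.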
To 4. Inference stack hints
For the deep research, my plans for the inference stack are not relevant. Any stack is interesting and might impact my inference stack preference.
To 5. Local‑only vs “cloud‑style” scores
I'm also okay with multi‑GPU BF16 numbers that illustrate the “ceiling” of un‑quantized performance.