# AI models research

## Quantization impact on Qwen 27B

### My on-premise setup

Just as side information: I run an NVIDIA RTX 5070 Ti with 16 GB VRAM, and its Blackwell architecture allows performance improvements with 4-bit quantized AI models.

### Motivation

I want to find out how much quantization degrades intelligence and improves speed by looking at real-world reports and comparisons.

### AI model

The model I would like to see compared is 'Qwen3.6-27B', but since this model is pretty new, comparisons of 'Qwen3.5-27B' would also be OK. I am interested in both reasoning / thinking mode and instruct mode.

### Quantizations

I want comparisons between the original 16-bit model weights and the 8-bit and (most important) 4-bit quantized versions.

### Your task

Please perform deep research to find the requested experience reports and comparisons.

### Ask questions first

Before starting, ask me between 2 and 5 questions to completely understand the situation and your task.

---

Instruction following, terminal coding, logical reasoning

Only Qwen models with 27B parameters or fewer

---

My on-premise setup was provided just as side information. No need to take it into account for the deep research.

To 1.: No model offloading at all.

To 2.: My inference framework plans are not relevant for the deep research.

To 3.: Instruction following, terminal coding, logical reasoning.

To 4.: Comparisons with FP4 would be great, yes, try to find such reports.

---

You are wrong. Here are the Hugging Face model pages to show you that the models exist (but obviously were released after your knowledge cutoff date):

- [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B)
- [Qwen/Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B)

---

## To 1. **Precision formats you care about most**

- With “16‑bit”, I mean the original Qwen model, which has BF16 tensors.
- With “8‑bit”, I mean all of GPTQ‑Int8, AWQ‑Int8, and GGUF‑Q8_0.
- With “4‑bit”, I mean all of GPTQ‑Int4, AWQ‑Int4, GGUF‑Q4_K_M, or other variants like NVFP4.

I am open to whichever 4‑bit/8‑bit quantization is best‑studied for Qwen 27B models.

## To 2. **Workload focus: reasoning vs coding vs general chat**

I care most about these two: *reasoning* and *code generation / debugging*.

## To 3. **Metric priorities**

For “intelligence loss”, I want *standard eval scores* and *task‑specific pass‑rates*. For “speed”, I care about both *first‑token latency* and *throughput*.

## To 4. **Inference stack hints**

For the deep research, my plans for the inference stack are not relevant. Any stack is interesting and might influence my inference stack preference.

## To 5. **Local‑only vs “cloud‑style” scores**

I'm also okay with *multi‑GPU BF16 numbers* that illustrate the “ceiling” of un‑quantized performance.

---