prompts/AI-Models_Research.md
2026-04-30 12:30:14 +02:00

AI models research

Quantization impact on Qwen 27B

My on-premise setup

As a side note: I run an NVIDIA RTX 5070 Ti with 16 GB VRAM, and its Blackwell architecture allows performance improvements with 4-bit quantized AI models.

Motivation

I want to find out how much quantization degrades intelligence and improves speed, based on real-world reports and comparisons.

AI model

The model I would like to see compared is 'Qwen3.6-27B', but since this model is quite new, comparisons of 'Qwen3.5-27B' would also be OK. I am interested in both reasoning/thinking mode and instruct mode.

Quantizations

I want comparisons between the original 16-bit model weights and their 8-bit and (most importantly) 4-bit quantized versions.
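
To make the three precision tiers concrete, the weight storage can be roughly estimated from the bits per weight alone. This is a minimal sketch, assuming "27B" means 27 billion parameters. It ignores quantization scales, KV cache, and activations, so real checkpoint files and runtime memory use will be larger. The function name `weight_footprint_gb` is hypothetical.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: "27B" is read as a nominal 27 billion parameters.
NOMINAL_PARAMS = 27e9

# Nominal bit widths only; real quantized formats add per-block metadata.
footprints = {bits: weight_footprint_gb(NOMINAL_PARAMS, bits) for bits in (16, 8, 4)}

for bits, gb in footprints.items():
    print(f"{bits:>2}-bit weights: ~{gb:.1f} GB")
```

Under these assumptions the nominal sizes are about 54 GB, 27 GB, and 13.5 GB, which is one reason the 4-bit tier is the most important for me.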

Your task

Please perform deep research to find the requested experience reports and comparisons.

Ask questions first

Before starting, ask me between 2 and 5 questions to completely understand the situation and your task.


Instruction following, terminal coding, logic reasoning

Only Qwen models with 27B parameters or fewer


My on-premise setup was provided only as a side note. No need to take it into account for the deep research.

To 1.: No model offloading at all. To 2.: My inference framework plans are not relevant for the deep research. To 3.: Instruction following, terminal coding, logic reasoning. To 4.: Comparisons with FP4 would be great, yes, try to find such reports.


You are wrong. Here are the Hugging Face model webpages to show you that the models exist (but obviously were released after your knowledge cutoff date):


To 1. Precision formats you care about most

  • With “16-bit”, I mean the original Qwen model which has BF16 tensors.
  • With “8-bit”, I mean all of GPTQ-Int8, AWQ-Int8, and GGUF Q8_0.
  • With “4-bit”, I mean all of GPTQ-Int4, AWQ-Int4, GGUF Q4_K_M or other variants like NVFP4.

I am open to whichever 4-bit/8-bit quantization is best-studied for Qwen 27B models.
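
When collecting reports, the format labels above could be grouped into the three tiers with a small lookup. This is only a sketch: the names `PRECISION_TIERS` and `tier_of` are hypothetical, and the normalization simply ignores hyphens, underscores, spaces, and case so that spelling variants of the same label match.

```python
from typing import Optional

# Tiers and formats exactly as listed above; nothing else is assumed.
PRECISION_TIERS = {
    "16-bit": ["BF16"],
    "8-bit": ["GPTQ-Int8", "AWQ-Int8", "GGUF Q8_0"],
    "4-bit": ["GPTQ-Int4", "AWQ-Int4", "GGUF Q4_K_M", "NVFP4"],
}


def _normalize(label: str) -> str:
    """Drop separators and case so 'GGUF-Q4_K_M' and 'gguf q4_k_m' compare equal."""
    return "".join(ch for ch in label.lower() if ch not in "-_ ")


def tier_of(label: str) -> Optional[str]:
    """Return the tier a format label belongs to, or None if it is not listed."""
    wanted = _normalize(label)
    for tier, formats in PRECISION_TIERS.items():
        if any(_normalize(fmt) == wanted for fmt in formats):
            return tier
    return None


print(tier_of("gguf-q4_k_m"))
print(tier_of("AWQ Int8"))
```

Labels that are not in the list return `None`, so other 4-bit variants would have to be added explicitly.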

To 2. Workload focus: reasoning vs coding vs general chat

I care most about these two: reasoning and code generation/debugging.

To 3. Metric priorities

For “intelligence loss”, I want standard eval scores and task-specific pass rates. For “speed”, I care about both first-token latency and throughput.
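
To make these metrics comparable across reports, they could be computed as in the sketch below. The definitions are assumptions, not a standard: score loss is the relative drop against the BF16 baseline, first-token latency is the time from request to first token, and throughput counts only tokens generated after the first one. All names are hypothetical and the numbers are made-up placeholders, not measurements.

```python
from dataclasses import dataclass


@dataclass
class RunTiming:
    request_sent_s: float   # timestamp when the request was sent
    first_token_s: float    # timestamp when the first token arrived
    last_token_s: float     # timestamp when the last token arrived
    generated_tokens: int   # total number of generated tokens


def first_token_latency_s(run: RunTiming) -> float:
    """Seconds between sending the request and receiving the first token."""
    return run.first_token_s - run.request_sent_s


def decode_throughput_tps(run: RunTiming) -> float:
    """Tokens per second after the first token (prefill time excluded)."""
    if run.generated_tokens < 2:
        raise ValueError("need at least two tokens to measure decode throughput")
    return (run.generated_tokens - 1) / (run.last_token_s - run.first_token_s)


def relative_score_drop_pct(baseline_score: float, quantized_score: float) -> float:
    """Relative loss of a quantized model against its baseline, in percent."""
    return (baseline_score - quantized_score) / baseline_score * 100


# Placeholder values for illustration only.
example_run = RunTiming(request_sent_s=0.0, first_token_s=0.25,
                        last_token_s=10.25, generated_tokens=401)
print(f"first-token latency: {first_token_latency_s(example_run):.2f} s")
print(f"decode throughput:   {decode_throughput_tps(example_run):.1f} tokens/s")
print(f"relative score drop: {relative_score_drop_pct(80.0, 78.0):.1f} %")
```

Reports that measure throughput including prefill time, or report absolute instead of relative score differences, would need to be converted before comparing them.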

To 4. Inference stack hints

For the deep research, my plans for the inference stack are not relevant. Any stack is interesting and might impact my inference stack preference.

To 5. Local-only vs “cloud-style” scores

I'm also okay with multi-GPU BF16 numbers that illustrate the “ceiling” of unquantized performance.