prompts/AI-Models_Research.md

# AI models research

## Quantization impact on Qwen 27B

### My on-premise setup

As a side note: I run an NVIDIA RTX 5070 Ti with 16 GB of VRAM, and its Blackwell architecture allows performance improvements with 4-bit quantized AI models.

### Motivation

I want to find out how much quantization degrades intelligence and how much it improves speed, based on real-world reports and comparisons.

### AI model

The model I would like to see compared is 'Qwen3.6-27B', but since this model is quite new, comparisons of 'Qwen3.5-27B' would also be OK.
I am interested in both reasoning / thinking mode and instruct mode.

### Quantizations

I want comparisons between the original 16-bit model weights, 8-bit quantized weights, and (most importantly) 4-bit quantized weights.
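
For orientation, the weight-only memory footprint at each bit width can be estimated with simple arithmetic. The sketch below is a rough illustration, not a measurement: it assumes a nominal 27 billion parameters and ignores quantization metadata (scales, zero points), any layers kept at higher precision, the KV cache, and activations, so real checkpoints will differ. The function and variable names are made up for this example.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: "27B" is taken as exactly 27 billion weights.
NOMINAL_PARAMS = 27e9

# Weight-only footprint for the three bit widths named above.
footprints = {
    bits: weight_footprint_gb(NOMINAL_PARAMS, bits) for bits in (16, 8, 4)
}

for bits, gb in footprints.items():
    print(f"{bits:>2}-bit: ~{gb:.1f} GB")
```

Under these assumptions the weights come to roughly 54 GB at 16-bit, 27 GB at 8-bit, and 13.5 GB at 4-bit.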

### Your task

Please perform deep research to find the requested experience reports and comparisons.

### Ask questions first

Before starting, ask me between 2 and 5 questions to completely understand the situation and your task.

---

Instruction following, terminal coding, logical reasoning

Only Qwen models with 27B parameters or fewer

---

My on-premise setup was provided only as a side note. There is no need to take it into account for the deep research.

To 1.: No model offloading at all.
To 2.: My inference framework plans are not relevant for the deep research.
To 3.: Instruction following, terminal coding, logical reasoning.
To 4.: Comparisons with FP4 would be great, yes, try to find such reports.

---

You are wrong. Here are the Hugging Face model pages showing that the models exist (but were obviously released after your knowledge cutoff date):
- [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B)
- [Qwen/Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B)

---

## To 1. **Precision formats you care about most**

- By “16‑bit”, I mean the original Qwen model, which has BF16 tensors.
- By “8‑bit”, I mean any of GPTQ‑Int8, AWQ‑Int8, and GGUF‑Q8_0.
- By “4‑bit”, I mean any of GPTQ‑Int4, AWQ‑Int4, GGUF‑Q4_K_M, or other variants such as NVFP4.

I am open to whichever 4‑bit/8‑bit quantization is best‑studied for Qwen 27B models.

## To 2. **Workload focus: reasoning vs coding vs general chat**

I care most about these two: *reasoning* and *code generation / debugging*.

## To 3. **Metric priorities**

For “intelligence loss”, I want *standard eval scores* and *task‑specific pass‑rates*.
For “speed”, I care about both *first‑token latency* and *throughput*.
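
One possible way to pin down these metrics is sketched below. It is only an illustration under stated assumptions: throughput is taken as decode tokens per second after the first token, and intelligence loss as the relative drop of a score against the BF16 baseline. All names are hypothetical, and the numbers are made-up placeholders, not measurements of any model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GenerationTiming:
    request_sent: float       # wall-clock time the prompt was submitted (seconds)
    token_times: List[float]  # arrival time of each generated token (seconds)


def first_token_latency(t: GenerationTiming) -> float:
    """Seconds between submitting the prompt and receiving the first token."""
    return t.token_times[0] - t.request_sent


def decode_throughput(t: GenerationTiming) -> float:
    """Tokens per second, counted from the first token to the last one."""
    if len(t.token_times) < 2:
        raise ValueError("need at least two tokens to measure throughput")
    elapsed = t.token_times[-1] - t.token_times[0]
    return (len(t.token_times) - 1) / elapsed


def relative_score_drop(baseline: float, quantized: float) -> float:
    """Score lost relative to the unquantized baseline, in percent."""
    return 100.0 * (baseline - quantized) / baseline


# Placeholder values for illustration only (not real benchmark results).
example = GenerationTiming(request_sent=0.0,
                           token_times=[0.5, 0.75, 1.0, 1.25, 1.5])

print(f"first-token latency: {first_token_latency(example):.2f} s")
print(f"decode throughput:   {decode_throughput(example):.1f} tok/s")
print(f"relative score drop: {relative_score_drop(80.0, 76.0):.1f} %")
```

Reports often define these quantities differently (for example, throughput including prompt processing, or batched versus single-stream), so it helps to note which definition each source uses.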

## To 4. **Inference stack hints**

For the deep research, my plans for the inference stack are not relevant. Results from any stack are of interest and might influence which stack I choose.

## To 5. **Local‑only vs “cloud‑style” scores**

I'm also okay with *multi‑GPU BF16 numbers* that illustrate the “ceiling” of un‑quantized performance.

---