prompts/AI-Models_Research.md
2026-04-30 12:30:14 +02:00

# AI models research
## Quantization impact on Qwen 27B
### My on-premise setup
Just as side information: I run an NVIDIA RTX 5070 Ti with 16 GB of VRAM, and its Blackwell architecture allows performance improvements with 4-bit quantized AI models.
### Motivation
I want to find out how much quantization degrades intelligence and improves speed by looking at real-world reports and comparisons.
### AI model
The model I would like to see compared is 'Qwen3.6-27B', but since this model is fairly new, comparisons of 'Qwen3.5-27B' would also be OK.
I am interested in both reasoning / thinking mode and instruct mode.
### Quantizations
I want comparisons between the original 16-bit model weights, 8-bit quantized weights, and (most important) 4-bit quantized weights.
### Your task
Please perform deep research to find the requested experience reports and comparisons.
### Ask questions first
Before starting, ask me between 2 and 5 questions to completely understand the situation and your task.
---
Instruction following, terminal coding, logical reasoning
Only Qwen models with 27B parameters or fewer
---
My on-premise setup was provided only as side information. There is no need to take it into account for the deep research.
To 1.: No model offloading at all.
To 2.: My inference framework plans are not relevant for the deep research.
To 3.: Instruction following, terminal coding, logical reasoning.
To 4.: Comparisons with FP4 would be great, yes, try to find such reports.
---
You are wrong. Here are the Hugging Face model pages to show you that the models exist (but obviously were released after your knowledge cutoff date):
- [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B)
- [Qwen/Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B)
---
## To 1. **Precision formats you care about most**
- With “16-bit”, I mean the original Qwen model, which has BF16 tensors.
- With “8-bit”, I mean all of GPTQ-Int8, AWQ-Int8, and GGUF-Q8_0.
- With “4-bit”, I mean all of GPTQ-Int4, AWQ-Int4, GGUF-Q4_K_M, or other variants like NVFP4.
I am open to whichever 4-bit/8-bit quantization is best-studied for Qwen 27B models.
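
For orientation, the nominal weight footprint at each precision can be estimated with simple arithmetic. The sketch below is only a rough illustration: it assumes exactly 27 billion parameters and the nominal bit width per weight, and it ignores quantization scales and zero-points, mixed-precision layers, the KV cache, and activations, so real files (for example GGUF Q4_K_M) will be larger. The function name `weight_footprint_gb` is hypothetical.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Nominal size of the weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: a "27B" model has exactly 27e9 parameters.
PARAMS = 27e9

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_footprint_gb(PARAMS, bits):.1f} GB")
# 16-bit: 54.0 GB
# 8-bit: 27.0 GB
# 4-bit: 13.5 GB
```

Under these assumptions, only the 4-bit weights come in below 16 GB, before any runtime overhead is counted.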
## To 2. **Workload focus: reasoning vs coding vs general chat**
I care most about these two: *reasoning* and *code generation / debugging*
## To 3. **Metric priorities**
For “intelligence loss”, I want *standard eval scores* and *task-specific pass rates*.
For “speed”, I care about both *first-token latency* and *throughput*.
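
To make the two speed metrics concrete, here is a minimal sketch of how they are commonly derived from per-token timestamps. The function name `latency_and_throughput` and its inputs are hypothetical, and the definition of throughput used here (tokens generated after the first one, divided by the decode time span) is one common convention; individual reports may define it differently.

```python
def latency_and_throughput(request_time: float, token_times: list[float]) -> tuple[float, float]:
    """Return (first-token latency in s, decode throughput in tokens/s).

    request_time: timestamp when the request was sent.
    token_times:  timestamps at which each output token arrived, in order.
    """
    if not token_times:
        raise ValueError("at least one output token is required")
    first_token_latency = token_times[0] - request_time
    decode_span = token_times[-1] - token_times[0]
    tokens_after_first = len(token_times) - 1
    # Throughput is undefined when only one token was produced.
    throughput = tokens_after_first / decode_span if decode_span > 0 else float("nan")
    return first_token_latency, throughput


# Example: request at t=0.0 s, five tokens arriving every 0.5 s from t=0.5 s.
ttft, tps = latency_and_throughput(0.0, [0.5, 1.0, 1.5, 2.0, 2.5])
print(f"first-token latency: {ttft:.2f} s, throughput: {tps:.2f} tokens/s")
```

Separating the two matters because first-token latency is dominated by prompt processing, while throughput reflects the token-by-token decode phase.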
## To 4. **Inference stack hints**
For the deep research, my plans for the inference stack are not relevant. Any stack is interesting and might impact my inference stack preference.
## To 5. **Local-only vs “cloud-style” scores**
I'm also okay with *multi-GPU BF16 numbers* that illustrate the “ceiling” of unquantized performance.
---