Added three prompt files

tlg
2026-04-30 12:30:14 +02:00
parent f004d00c28
commit ded8d3c5fb
3 changed files with 269 additions and 0 deletions

AI-Models_Research.md Normal file

@@ -0,0 +1,78 @@
# AI models research
## Quantization impact on Qwen 27B
### My on-premise setup
As a side note: I run an NVIDIA RTX 5070 Ti with 16 GB VRAM, and its Blackwell architecture allows performance improvements with 4-bit quantized AI models.
### Motivation
I want to find out how much quantization degrades intelligence and improves speed, based on real-world reports and comparisons.
### AI model
The model I would most like to see compared is 'Qwen3.6-27B', but since this model is fairly new, comparisons of 'Qwen3.5-27B' would also be OK.
I am interested in both reasoning / thinking mode and instruct mode.
### Quantizations
I want comparisons between the original 16-bit model weights, an 8-bit quantization, and (most important) a 4-bit quantization.
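
For orientation, the three precisions differ mainly in how many bits each weight occupies. The following is a minimal back-of-the-envelope sketch, not a measurement: it assumes a nominal 27 billion parameters (taken from the "27B" in the model name) and counts the weights only, ignoring KV cache, activations, and the per-block scale overhead that real quantized formats add. The name `weight_gib` is hypothetical and used only for illustration.

```python
# Rough weight-only memory estimate per precision.
# Assumption: 27e9 parameters, taken from the "27B" in the model name.
# Ignored: KV cache, activations, and quantization metadata (scales, zero points).
PARAMS = 27e9


def weight_gib(params: float, bits_per_weight: float) -> float:
    """Return the size of the weights alone, in GiB."""
    total_bytes = params * bits_per_weight / 8
    return total_bytes / 2**30


for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_gib(PARAMS, bits):.1f} GiB")
```

Under these assumptions, only the 4-bit weights come in below 16 GiB, which may be one reason the 4-bit comparison matters most here; real files will be somewhat larger than this estimate.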
### Your task
Please perform deep research to find the requested experience reports and comparisons.
### Ask questions first
Before starting, ask me between 2 and 5 questions to completely understand the situation and your task.
---
Instruction following, terminal coding, logical reasoning
Only Qwen models with 27B parameters or fewer
---
My on-premise setup was provided only as side information. There is no need to take it into account for the deep research.
To 1.: No model offloading at all.
To 2.: My inference framework plans are not relevant for the deep research.
To 3.: Instruction following, terminal coding, logical reasoning.
To 4.: Yes, comparisons with FP4 would be great. Please try to find such reports.
---
You are wrong. Here are the Hugging Face model pages showing that the models exist (they were obviously released after your knowledge cutoff date):
- [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B)
- [Qwen/Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B)
---
## To 1. **Precision formats you care about most**
- With “16-bit”, I mean the original Qwen model which has BF16 tensors.
- With “8-bit”, I mean all of GPTQ-Int8, AWQ-Int8, and GGUF-Q8_0.
- With “4-bit”, I mean all of GPTQ-Int4, AWQ-Int4, GGUF-Q4_K_M or other variants like NVFP4.
I am open to whichever 4-bit/8-bit quantization is best-studied for Qwen 27B models.
## To 2. **Workload focus: reasoning vs coding vs general chat**
I care most about these two: *reasoning* and *code generation / debugging*
## To 3. **Metric priorities**
For “intelligence loss”, I want *standard eval scores* and *task-specific pass rates*.
For “speed”, I care about both *first-token latency* and *throughput*.
## To 4. **Inference stack hints**
For the deep research, my plans for the inference stack are not relevant. Any stack is interesting and might impact my inference stack preference.
## To 5. **Local-only vs “cloud-style” scores**
I'm also okay with *multi-GPU BF16 numbers* that illustrate the “ceiling” of unquantized performance.
---