feat: Jinja template thinking toggle, Qwen3.5-9B GGUF Q8_0

- Thinking/Instruct toggle via Jinja template patching in llama-cpp
  backend: creates separate handlers for thinking-enabled and
  thinking-disabled modes
- Replace lovedheart/Qwen3.5-9B-FP8 (safetensors, 15.8GB, OOMs) with
  unsloth/Qwen3.5-9B-GGUF Q8_0 (9.2GB, fits in VRAM)
- Enable flash_attn in llama-cpp for better performance
- GGUF path resolution falls back to flat gguf/ directory
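The thinking/instruct toggle described above can be sketched roughly as follows. This is a minimal illustration, not the actual Qwen chat template or the handler code from this commit: the template string and `render_prompt` helper are hypothetical, assuming the common pattern where disabling thinking pre-fills an empty `<think>` block so the model skips its reasoning phase, and each handler renders with a fixed `enable_thinking` value.

```python
from jinja2 import Template

# Hypothetical stand-in for a Qwen-style chat template (not the real one):
# when thinking is disabled, an empty <think> block is pre-filled in the
# assistant turn so the model proceeds straight to the answer.
CHAT_TEMPLATE = (
    "{% for m in messages %}"
    "<|im_start|>{{ m.role }}\n{{ m.content }}<|im_end|>\n"
    "{% endfor %}"
    "<|im_start|>assistant\n"
    "{% if not enable_thinking %}<think>\n\n</think>\n\n{% endif %}"
)

def render_prompt(messages: list[dict], enable_thinking: bool) -> str:
    """Render the prompt for one of the two fixed-mode handlers."""
    return Template(CHAT_TEMPLATE).render(
        messages=messages, enable_thinking=enable_thinking
    )
```

In this scheme the two handlers differ only in the `enable_thinking` value they bake into the rendered template, so the rest of the llama-cpp call path stays identical.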

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
tlg
2026-04-06 09:44:02 +02:00
parent 7a0ff55eb5
commit 7c4bbe0b29
2 changed files with 68 additions and 19 deletions


@@ -1,10 +1,11 @@
 physical_models:
   qwen3.5-9b-fp8:
     type: llm
-    backend: transformers
-    model_id: "lovedheart/Qwen3.5-9B-FP8"
-    estimated_vram_gb: 9
-    supports_vision: true
+    backend: llamacpp
+    model_id: "unsloth/Qwen3.5-9B-GGUF"
+    model_file: "Qwen3.5-9B-Q8_0.gguf"
+    estimated_vram_gb: 10
+    supports_vision: false
     supports_tools: true
   qwen3.5-9b-fp8-uncensored:
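The "falls back to flat gguf/ directory" behavior from the commit message can be sketched as below. This is a hedged illustration, not the repository's actual resolver: the function name, directory layout, and `base_dir`/`repo_id` parameters are assumptions, showing only the two-step lookup (repo-style path first, then a flat `gguf/` directory).

```python
from pathlib import Path

def resolve_gguf_path(base_dir: str, repo_id: str, model_file: str) -> Path:
    """Locate a GGUF file, preferring a per-repo layout and falling
    back to a flat gguf/ directory (hypothetical layout names)."""
    # Preferred layout: <base_dir>/<repo_id>/<model_file>
    candidate = Path(base_dir) / repo_id / model_file
    if candidate.is_file():
        return candidate
    # Fallback layout: <base_dir>/gguf/<model_file>
    fallback = Path(base_dir) / "gguf" / model_file
    if fallback.is_file():
        return fallback
    raise FileNotFoundError(f"{model_file} not found under {base_dir}")
```

With the config above, `model_file: "Qwen3.5-9B-Q8_0.gguf"` would be found in either location, so a flat download directory keeps working without the full repo path.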