- Thinking/Instruct toggle via Jinja template patching in the llama-cpp backend: creates separate handlers for thinking-enabled and thinking-disabled modes
- Replace lovedheart/Qwen3.5-9B-FP8 (safetensors, 15.8 GB, OOM) with unsloth/Qwen3.5-9B-GGUF Q8_0 (9.2 GB, fits)
- Enable flash_attn in llama-cpp for better performance
- GGUF path resolution falls back to a flat gguf/ directory

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
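The GGUF fallback could be sketched roughly as follows. This is an illustrative helper, not the backend's actual code; the function name `resolve_gguf_path` and the exact lookup order are assumptions.

```python
from pathlib import Path


def resolve_gguf_path(repo_dir: str, filename: str) -> Path:
    """Resolve a GGUF file path, falling back to a flat gguf/ directory.

    Hypothetical sketch of the fallback described in the commit: try the
    expected relative path first, then look for the bare filename under
    a flat gguf/ subdirectory of the repo root.
    """
    base = Path(repo_dir)
    candidate = base / filename
    if candidate.is_file():
        return candidate
    # Fall back to a flat gguf/ directory containing just the file name.
    fallback = base / "gguf" / Path(filename).name
    if fallback.is_file():
        return fallback
    raise FileNotFoundError(f"{filename} not found under {repo_dir}")
```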