- Add harmony.py: strip GPT-OSS-20B analysis/thinking channel from both
streaming and non-streaming responses (HarmonyStreamFilter + extract_final_text)
- Add per-model asyncio.Lock in llamacpp backend to prevent concurrent C++
access that caused container segfaults (exit 139)
- Fix chat handler swap for streaming: perform the swap inside
  _stream_generate, within the lock scope (previously a try/finally
  restored the handler before the stream was consumed)
- Filter /v1/models to return only LLM models (hide ASR/TTS from chat dropdown)
- Correct Qwen3.5-4B estimated_vram_gb: 4 → 9 (actual allocation ~8GB)
- Add GPU memory verification after eviction with retry loop in vram_manager
- Add HF_TOKEN_PATH support in main.py for gated model access
- Add /v1/audio/models and /v1/audio/voices discovery endpoints (no auth)
- Add OOM error handling in both backends and chat route
- Add AUDIO_STT_SUPPORTED_CONTENT_TYPES for webm/wav/mp3/ogg
- Add performance test script (scripts/perf_test.py)
- Update tests to match current config (42 tests pass)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The aoxo model shipped without quantization (BF16, ~40GB, OOM). The
HauhauCS model uses the MXFP4 GGUF format and loads at 11.9GB via the
llama-cpp backend. All three reasoning levels (Low/Medium/High) work.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Thinking/Instruct toggle via Jinja template patching in llama-cpp
backend: creates separate handlers for thinking-enabled and
thinking-disabled modes
- Replace lovedheart/Qwen3.5-9B-FP8 (safetensors, 15.8GB OOM) with
unsloth/Qwen3.5-9B-GGUF Q8_0 (9.2GB, fits)
- Enable flash_attn in llama-cpp for better performance
- GGUF path resolution falls back to flat gguf/ directory
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GGML_TYPE_Q8_0 for type_k/type_v is not supported in this
llama-cpp-python version. Keep the reduced n_ctx=4096 for the VRAM
savings.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add 'kernels' package to Dockerfile for native MXFP4 execution
(fixes gpt-oss-20b OOM: 15.2GB→13.5GB)
- Reduce GGUF n_ctx from 8192 to 4096 and quantize KV cache to Q8_0
to reduce VRAM usage
- Use GGML_TYPE_Q8_0 constant instead of string for type_k/type_v
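A configuration sketch of the load call this implies, assuming llama-cpp-python's `Llama` constructor with its `type_k`/`type_v`/`flash_attn` kwargs (the model path is hypothetical, and per the commit above, the installed version later rejected the Q8_0 KV-cache setting):

```python
from llama_cpp import Llama, GGML_TYPE_Q8_0  # numeric constant, not the string "q8_0"

llm = Llama(
    model_path="/models/gguf/model.Q8_0.gguf",  # hypothetical path
    n_ctx=4096,              # halved from 8192 to shrink the KV cache
    type_k=GGML_TYPE_Q8_0,   # quantize KV-cache keys...
    type_v=GGML_TYPE_Q8_0,   # ...and values to Q8_0
    flash_attn=True,
)
```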
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ChatterboxTTS and ChatterboxMultilingualTTS are separate classes.
Turbo variant doesn't exist in chatterbox-tts 0.1.7.
Multilingual generate() requires language_id parameter.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
llama-cpp-python backend now uses huggingface_hub to resolve GGUF
file paths within the HF cache structure instead of assuming flat
/models/ directory.
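The lookup can be illustrated with a plain-pathlib sketch (hypothetical names; the real backend delegates to huggingface_hub rather than walking the cache itself): search the HF cache layout (`models--org--name/snapshots/<rev>/<file>`) first, then fall back to a flat `gguf/` directory.

```python
from pathlib import Path
from typing import Optional

def resolve_gguf(cache_dir: Path, repo_id: str, filename: str) -> Optional[Path]:
    """Find a GGUF file inside the HF cache layout, falling back to a
    flat gguf/ directory. Returns None if the file is nowhere."""
    # HF cache convention: "org/name" -> "models--org--name"
    repo_dir = cache_dir / ("models--" + repo_id.replace("/", "--"))
    for snapshot in sorted((repo_dir / "snapshots").glob("*")):
        candidate = snapshot / filename
        if candidate.is_file():
            return candidate
    flat = cache_dir / "gguf" / filename
    return flat if flat.is_file() else None
```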
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- gc.collect() + torch.cuda.empty_cache() in unload for reliable VRAM release
- POST /admin/clear-vram endpoint unloads all models and reports GPU memory
- VRAMManager.clear_all() method for programmatic VRAM cleanup
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Force gc.collect() before torch.cuda.empty_cache() to ensure all
model references are released
- Set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True in container
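The ordering matters and can be sketched as below (function name hypothetical; torch is guarded so the sketch also runs on CPU-only machines): `torch.cuda.empty_cache()` can only return blocks that nothing references anymore, so the garbage collection must happen first.

```python
import gc

def release_vram() -> None:
    """Collect first so Python drops its last references to model
    tensors, then ask the CUDA caching allocator to return the freed
    blocks to the driver."""
    gc.collect()
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass  # CPU-only environment: nothing to release
```

The container additionally sets `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` so the allocator can grow existing segments instead of fragmenting VRAM across loads.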
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Multi-stage: devel image builds llama-cpp-python with CUDA, runtime
image gets the compiled library via COPY
- chatterbox-tts installed --no-deps to prevent torch 2.6 downgrade
- librosa and diskcache added as explicit chatterbox/llama-cpp deps
- All imports verified with GPU passthrough
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Removed librosa (unused), torch, and pyyaml from the install list; the
latter two are already in the base image. This avoids a numpy rebuild
conflict.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Podman requires the docker.io/ prefix on unqualified image names when
unqualified-search registries are not configured.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The original eviction logic blocked ASR eviction even when an LLM
genuinely needed all 16GB of VRAM (e.g., gpt-oss-20b at 13GB). Now uses
two-pass eviction: the first pass evicts lower- or same-priority
models, then cascades to higher-priority ones as a last resort. Added
tests for the ASR-survives and full-cascade scenarios.
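The policy can be sketched as a pure function (hypothetical names; the real manager mutates its own state rather than returning a plan): pass 1 considers only models at or below the incoming model's priority, and pass 2 widens to everything, so the protected ASR is touched only when nothing else frees enough VRAM.

```python
PRIORITY = {"llm": 0, "tts": 1, "asr": 2}  # higher number = evicted last

def plan_eviction(loaded, needed_gb, free_gb, new_priority):
    """Return the names of models to evict, lowest priority first.
    `loaded` is a list of (name, priority, vram_gb) tuples."""
    evict = []
    for pass_cap in (new_priority, max(PRIORITY.values())):
        for name, prio, gb in sorted(loaded, key=lambda m: m[1]):
            if name in evict or prio > pass_cap:
                continue            # pass 1 skips higher-priority models
            if free_gb >= needed_gb:
                break               # enough VRAM reclaimed already
            evict.append(name)
            free_gb += gb
        if free_gb >= needed_gb:
            break                   # no need to cascade into pass 2
    return evict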
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tracks GPU VRAM usage (16GB) and handles model loading/unloading with
priority-based eviction: LLM (lowest) -> TTS -> ASR (highest,
protected). Uses an asyncio Lock for concurrency safety.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements create_api_key_dependency() FastAPI dependency that validates
Bearer tokens against a configured list of ApiKey objects (401 on missing,
malformed, or unknown tokens). Includes 5 TDD tests covering all cases.
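The core check behind such a dependency can be sketched framework-free (names hypothetical; in the real app the failure path raises FastAPI's `HTTPException(status_code=401)` instead of the stand-in exception used here):

```python
from dataclasses import dataclass

@dataclass
class ApiKey:
    name: str
    token: str

class AuthError(Exception):
    """Stand-in for HTTPException(status_code=401) in this sketch."""

def validate_authorization(header, keys):
    """The Authorization header must be 'Bearer <token>' with a token
    matching a configured ApiKey. Raises AuthError on a missing,
    malformed, or unknown token."""
    if header is None:
        raise AuthError("missing Authorization header")
    scheme, _, token = header.partition(" ")
    if scheme != "Bearer" or not token:
        raise AuthError("malformed Authorization header")
    for key in keys:
        if key.token == token:
            return key
    raise AuthError("unknown token")
```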
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Covers project scaffolding, config, auth, VRAM manager, all four
backends, API routes, Dockerfile, deployment scripts, and four
phases of testing (integration, functional, VRAM, performance).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Covers architecture, model registry, VRAM management, API endpoints,
container setup, Open WebUI integration, Traefik routing, and
four-phase testing plan.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>