- Add harmony.py: strip GPT-OSS-20B analysis/thinking channel from both
streaming and non-streaming responses (HarmonyStreamFilter + extract_final_text)
- Add per-model asyncio.Lock in llamacpp backend to prevent concurrent C++
access that caused container segfaults (exit 139)
- Fix chat handler swap for streaming: perform the swap inside
  _stream_generate, within the lock scope (previously a try/finally
  restored the handler before the stream was consumed)
- Filter /v1/models to return only LLM models (hide ASR/TTS from chat dropdown)
- Correct Qwen3.5-4B estimated_vram_gb: 4 → 9 (actual allocation ~8GB)
- Add GPU memory verification after eviction with retry loop in vram_manager
- Add HF_TOKEN_PATH support in main.py for gated model access
- Add /v1/audio/models and /v1/audio/voices discovery endpoints (no auth)
- Add OOM error handling in both backends and chat route
- Add AUDIO_STT_SUPPORTED_CONTENT_TYPES for webm/wav/mp3/ogg
- Add performance test script (scripts/perf_test.py)
- Update tests to match current config (42 tests pass)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The aoxo model shipped without quantization (BF16, ~40GB, OOM). The
HauhauCS model uses the MXFP4 GGUF format and loads at 11.9GB via the
llama-cpp backend. All three reasoning levels (Low/Medium/High) work.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Thinking/Instruct toggle via Jinja template patching in llama-cpp
backend: creates separate handlers for thinking-enabled and
thinking-disabled modes
- Replace lovedheart/Qwen3.5-9B-FP8 (safetensors, 15.8GB OOM) with
unsloth/Qwen3.5-9B-GGUF Q8_0 (9.2GB, fits)
- Enable flash_attn in llama-cpp for better performance
- GGUF path resolution falls back to flat gguf/ directory
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GGML_TYPE_Q8_0 for type_k/type_v is not supported in this
llama-cpp-python version. Keep the reduced n_ctx=4096 for the VRAM
savings.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add 'kernels' package to Dockerfile for native MXFP4 execution
(fixes gpt-oss-20b OOM: 15.2GB→13.5GB)
- Reduce GGUF n_ctx from 8192 to 4096 and quantize KV cache to Q8_0
to reduce VRAM usage
- Use GGML_TYPE_Q8_0 constant instead of string for type_k/type_v
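A configuration sketch of the load call this implies, assuming llama-cpp-python's `Llama` constructor with its `type_k`/`type_v`/`flash_attn` kwargs (the model path is hypothetical, and per the commit above, the installed version later rejected the Q8_0 KV-cache setting):

```python
from llama_cpp import Llama, GGML_TYPE_Q8_0  # numeric constant, not the string "q8_0"

llm = Llama(
    model_path="/models/gguf/model.Q8_0.gguf",  # hypothetical path
    n_ctx=4096,              # halved from 8192 to shrink the KV cache
    type_k=GGML_TYPE_Q8_0,   # quantize KV-cache keys...
    type_v=GGML_TYPE_Q8_0,   # ...and values to Q8_0
    flash_attn=True,
)
```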
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ChatterboxTTS and ChatterboxMultilingualTTS are separate classes.
Turbo variant doesn't exist in chatterbox-tts 0.1.7.
Multilingual generate() requires language_id parameter.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
llama-cpp-python backend now uses huggingface_hub to resolve GGUF
file paths within the HF cache structure instead of assuming flat
/models/ directory.
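The lookup can be illustrated with a plain-pathlib sketch (hypothetical names; the real backend delegates to huggingface_hub rather than walking the cache itself): search the HF cache layout (`models--org--name/snapshots/<rev>/<file>`) first, then fall back to a flat `gguf/` directory.

```python
from pathlib import Path
from typing import Optional

def resolve_gguf(cache_dir: Path, repo_id: str, filename: str) -> Optional[Path]:
    """Find a GGUF file inside the HF cache layout, falling back to a
    flat gguf/ directory. Returns None if the file is nowhere."""
    # HF cache convention: "org/name" -> "models--org--name"
    repo_dir = cache_dir / ("models--" + repo_id.replace("/", "--"))
    for snapshot in sorted((repo_dir / "snapshots").glob("*")):
        candidate = snapshot / filename
        if candidate.is_file():
            return candidate
    flat = cache_dir / "gguf" / filename
    return flat if flat.is_file() else None
```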
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- gc.collect() + torch.cuda.empty_cache() in unload for reliable VRAM release
- POST /admin/clear-vram endpoint unloads all models and reports GPU memory
- VRAMManager.clear_all() method for programmatic VRAM cleanup
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Force gc.collect() before torch.cuda.empty_cache() to ensure all
model references are released
- Set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True in container
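The ordering matters and can be sketched as below (function name hypothetical; torch is guarded so the sketch also runs on CPU-only machines): `torch.cuda.empty_cache()` can only return blocks that nothing references anymore, so the garbage collection must happen first.

```python
import gc

def release_vram() -> None:
    """Collect first so Python drops its last references to model
    tensors, then ask the CUDA caching allocator to return the freed
    blocks to the driver."""
    gc.collect()
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass  # CPU-only environment: nothing to release
```

The container additionally sets `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` so the allocator can grow existing segments instead of fragmenting VRAM across loads.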
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Multi-stage: devel image builds llama-cpp-python with CUDA, runtime
image gets the compiled library via COPY
- chatterbox-tts installed --no-deps to prevent torch 2.6 downgrade
- librosa and diskcache added as explicit chatterbox/llama-cpp deps
- All imports verified with GPU passthrough
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Removed librosa (unused), torch, and pyyaml from the install list; the
latter two are already in the base image. This avoids a numpy rebuild
conflict.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Podman requires the docker.io/ prefix on unqualified image names when
unqualified-search registries are not configured.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The original eviction logic blocked ASR eviction even when an LLM
genuinely needed all 16GB of VRAM (e.g., gpt-oss-20b at 13GB). Now uses
two-pass eviction: the first pass evicts lower- or same-priority
models, then cascades to higher-priority ones as a last resort. Added
tests for the ASR-survives and full-cascade scenarios.
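The policy can be sketched as a pure function (hypothetical names; the real manager mutates its own state rather than returning a plan): pass 1 considers only models at or below the incoming model's priority, and pass 2 widens to everything, so the protected ASR is touched only when nothing else frees enough VRAM.

```python
PRIORITY = {"llm": 0, "tts": 1, "asr": 2}  # higher number = evicted last

def plan_eviction(loaded, needed_gb, free_gb, new_priority):
    """Return the names of models to evict, lowest priority first.
    `loaded` is a list of (name, priority, vram_gb) tuples."""
    evict = []
    for pass_cap in (new_priority, max(PRIORITY.values())):
        for name, prio, gb in sorted(loaded, key=lambda m: m[1]):
            if name in evict or prio > pass_cap:
                continue            # pass 1 skips higher-priority models
            if free_gb >= needed_gb:
                break               # enough VRAM reclaimed already
            evict.append(name)
            free_gb += gb
        if free_gb >= needed_gb:
            break                   # no need to cascade into pass 2
    return evict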
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tracks GPU VRAM usage (16GB) and handles model loading/unloading with
priority-based eviction: LLM (lowest) -> TTS -> ASR (highest,
protected). Uses an asyncio Lock for concurrency safety.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements create_api_key_dependency() FastAPI dependency that validates
Bearer tokens against a configured list of ApiKey objects (401 on missing,
malformed, or unknown tokens). Includes 5 TDD tests covering all cases.
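The core check behind such a dependency can be sketched framework-free (names hypothetical; in the real app the failure path raises FastAPI's `HTTPException(status_code=401)` instead of the stand-in exception used here):

```python
from dataclasses import dataclass

@dataclass
class ApiKey:
    name: str
    token: str

class AuthError(Exception):
    """Stand-in for HTTPException(status_code=401) in this sketch."""

def validate_authorization(header, keys):
    """The Authorization header must be 'Bearer <token>' with a token
    matching a configured ApiKey. Raises AuthError on a missing,
    malformed, or unknown token."""
    if header is None:
        raise AuthError("missing Authorization header")
    scheme, _, token = header.partition(" ")
    if scheme != "Bearer" or not token:
        raise AuthError("malformed Authorization header")
    for key in keys:
        if key.token == token:
            return key
    raise AuthError("unknown token")
```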
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Covers project scaffolding, config, auth, VRAM manager, all four
backends, API routes, Dockerfile, deployment scripts, and four
phases of testing (integration, functional, VRAM, performance).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Covers architecture, model registry, VRAM management, API endpoints,
container setup, Open WebUI integration, Traefik routing, and
four-phase testing plan.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>