Commit Graph

36 Commits

Author SHA256 Message Date
tlg
06923d51b4 fix: streaming response fix + GPT-OSS-20B-Uncensored MXFP4 GGUF
- Fix async generator streaming: _stream_generate yields directly
  instead of returning nested _iter(), route handler awaits generate()
  then passes async generator to StreamingResponse
- Replace aoxo/gpt-oss-20b-uncensored (no quant, OOM) with
  HauhauCS MXFP4 GGUF via llama-cpp backend

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 22:21:22 +02:00
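
The streaming fix described in 06923d51b4 can be illustrated with a minimal, framework-free sketch. It is not the project's code: the bodies of `generate` and `_stream_generate` and the `collect` helper are assumptions made for illustration, and the real route handler would pass the async generator to `StreamingResponse` instead of collecting it.

```python
import asyncio
from typing import AsyncIterator, List, Union


async def _stream_generate(prompt: str) -> AsyncIterator[str]:
    # Yield chunks directly: calling this function returns an async generator,
    # with no nested _iter() wrapper to unwrap.
    for token in prompt.split():
        await asyncio.sleep(0)  # stand-in for waiting on the model
        yield token + " "


async def generate(prompt: str, stream: bool = False) -> Union[str, AsyncIterator[str]]:
    # generate() is a coroutine, so the caller must await it first.
    if stream:
        return _stream_generate(prompt)
    return "".join([chunk async for chunk in _stream_generate(prompt)])


async def collect(prompt: str) -> List[str]:
    # Hypothetical stand-in for the route handler: await generate(), then
    # consume the async generator (the real handler would hand it to
    # StreamingResponse instead).
    chunks = await generate(prompt, stream=True)
    return [chunk async for chunk in chunks]
```

Awaiting `generate()` before iterating is the key step: iterating over the un-awaited coroutine fails because a coroutine object is not an async iterator.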
tlg
61308703dc feat: replace gpt-oss-20b-uncensored with HauhauCS MXFP4 GGUF
aoxo model had no quantization (BF16, ~40GB OOM). HauhauCS model
uses MXFP4 GGUF format, loads at 11.9GB via llama-cpp backend.
All three reasoning levels (Low/Medium/High) work.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 16:41:41 +02:00
tlg
7c4bbe0b29 feat: Jinja template thinking toggle, Qwen3.5-9B GGUF Q8_0
- Thinking/Instruct toggle via Jinja template patching in llama-cpp
  backend: creates separate handlers for thinking-enabled and
  thinking-disabled modes
- Replace lovedheart/Qwen3.5-9B-FP8 (safetensors, 15.8GB OOM) with
  unsloth/Qwen3.5-9B-GGUF Q8_0 (9.2GB, fits)
- Enable flash_attn in llama-cpp for better performance
- GGUF path resolution falls back to flat gguf/ directory

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 09:44:02 +02:00
tlg
7a0ff55eb5 fix: remove unsupported KV cache quantization in llama-cpp backend
GGML_TYPE_Q8_0 for type_k/type_v not supported in this llama-cpp-python
version. Keep reduced n_ctx=4096 for VRAM savings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 23:35:05 +02:00
tlg
da35e94b16 fix: add triton kernels for MXFP4, fix GGUF KV cache quantization
- Add 'kernels' package to Dockerfile for native MXFP4 execution
  (fixes gpt-oss-20b OOM: 15.2GB→13.5GB)
- Reduce GGUF n_ctx from 8192 to 4096 and quantize KV cache to Q8_0
  to reduce VRAM usage
- Use GGML_TYPE_Q8_0 constant instead of string for type_k/type_v

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 22:49:16 +02:00
tlg
a88f0afb8a chore: add .gitignore for venv, caches, and local dirs
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 22:17:42 +02:00
tlg
d615bb4553 fix: Chatterbox uses separate classes per variant, remove turbo
ChatterboxTTS and ChatterboxMultilingualTTS are separate classes.
Turbo variant doesn't exist in chatterbox-tts 0.1.7.
Multilingual generate() requires language_id parameter.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 21:43:40 +02:00
tlg
f24a225baf fix: resolve GGUF paths through HF cache, add model_id to GGUF config
llama-cpp-python backend now uses huggingface_hub to resolve GGUF
file paths within the HF cache structure instead of assuming flat
/models/ directory.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 21:33:36 +02:00
tlg
38e1523d7e feat: proper VRAM cleanup and admin clear-vram endpoint
- gc.collect() + torch.cuda.empty_cache() in unload for reliable VRAM release
- POST /admin/clear-vram endpoint unloads all models and reports GPU memory
- VRAMManager.clear_all() method for programmatic VRAM cleanup

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 21:03:39 +02:00
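
The unload order named in 38e1523d7e and aa7a160118 (drop references, `gc.collect()`, then `torch.cuda.empty_cache()`) might look roughly like the sketch below. The `release_vram` function and the `holder` dictionary are hypothetical names, not the project's `VRAMManager` API, and the torch import is optional so the sketch also runs on a machine without PyTorch.

```python
import gc

try:
    import torch  # optional here; the real backend depends on it
except ImportError:
    torch = None


def release_vram(holder: dict, key: str) -> bool:
    """Drop the model stored under `key` and ask CUDA to release cached memory."""
    model = holder.pop(key, None)
    del model  # remove the last local reference to the model object
    gc.collect()  # collect reference cycles before touching the CUDA cache
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()  # return unused cached blocks to the driver
    return key not in holder
```

Running `gc.collect()` first matters because `empty_cache()` can only release blocks that no live tensor still occupies.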
tlg
aa7a160118 fix: proper VRAM cleanup on model unload + CUDA alloc config
- Force gc.collect() before torch.cuda.empty_cache() to ensure all
  model references are released
- Set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True in container

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 17:59:23 +02:00
tlg
d3285bad8a fix: add accelerate package for transformers device_map support
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 17:19:17 +02:00
tlg
f2f73d204c fix: Dockerfile multi-stage build with working dependency resolution
- Multi-stage: devel image builds llama-cpp-python with CUDA, runtime
  image gets the compiled library via COPY
- chatterbox-tts installed --no-deps to prevent torch 2.6 downgrade
- librosa and diskcache added as explicit chatterbox/llama-cpp deps
- All imports verified with GPU passthrough

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 15:46:34 +02:00
tlg
d6a3fe5427 fix: Dockerfile uses explicit pip install, skip pre-installed packages
Removed librosa (unused), torch, pyyaml from install list since
they're in the base image. Avoid numpy rebuild conflict.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 14:10:07 +02:00
tlg
8816a06369 fix: add --break-system-packages for pip in container
PyTorch base image uses PEP 668 externally-managed Python.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 14:07:14 +02:00
tlg
8a6f6a5097 fix: use LLMUX_SRC env var for Dockerfile path in pod creation script
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 13:05:38 +02:00
tlg
d5a98879c9 fix: use full Docker Hub registry path in Dockerfile
Podman requires docker.io/ prefix when unqualified-search registries
are not configured.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 13:04:53 +02:00
tlg
2f4d242f55 fix: use llm venv paths for huggingface-cli and python in download script
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 12:52:09 +02:00
tlg
1a26d34ea5 feat: Dockerfile, model download script, and pod creation script
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 10:09:34 +02:00
tlg
17818a3860 feat: FastAPI app assembly with all routes and backend wiring
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 10:04:56 +02:00
tlg
d55c80ae35 feat: API routes for models, chat, transcription, speech, and admin
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 10:04:45 +02:00
tlg
ef44bc09b9 feat: Chatterbox TTS backend with turbo/multilingual/default variants
2026-04-04 09:40:42 +02:00
tlg
c6677dcab3 feat: llama-cpp-python backend with GGUF, vision, and tool support
2026-04-04 09:40:40 +02:00
tlg
de25b5e2a7 feat: transformers ASR backend for cohere-transcribe
2026-04-04 09:40:39 +02:00
tlg
449e37d318 feat: abstract base class for model backends
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 09:29:35 +02:00
tlg
813bbe0ad0 fix: VRAM eviction cascades through all tiers for large LLM loads
The original eviction logic blocked ASR eviction even when an LLM
genuinely needed all 16GB VRAM (e.g., gpt-oss-20b at 13GB). Now uses
two-pass eviction: first evicts lower/same priority, then cascades to
higher priority as last resort. Added tests for ASR-survives and
full-cascade scenarios.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 09:22:14 +02:00
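
The two-pass eviction described in 813bbe0ad0 can be sketched as follows. This is an illustrative reading of the commit message, not the actual VRAM manager: `Loaded`, `PRIORITY`, and `plan_eviction` are hypothetical names, and the model sizes in the usage note are made up.

```python
from dataclasses import dataclass
from typing import List

# Assumed ordering from the commit messages: LLM lowest, ASR highest.
PRIORITY = {"llm": 0, "tts": 1, "asr": 2}


@dataclass
class Loaded:
    name: str
    kind: str
    vram_gb: float


def plan_eviction(loaded: List[Loaded], new_kind: str, needed_gb: float,
                  total_gb: float = 16.0) -> List[str]:
    """Return the names to evict so that `needed_gb` fits, lowest priority first."""
    free = total_gb - sum(m.vram_gb for m in loaded)
    new_prio = PRIORITY[new_kind]

    def by_priority(m: Loaded) -> int:
        return PRIORITY[m.kind]

    # Pass 1: models of lower or equal priority.
    pass_one = sorted((m for m in loaded if PRIORITY[m.kind] <= new_prio), key=by_priority)
    # Pass 2: higher-priority models, touched only as a last resort.
    pass_two = sorted((m for m in loaded if PRIORITY[m.kind] > new_prio), key=by_priority)

    evicted: List[str] = []
    for model in pass_one + pass_two:
        if free >= needed_gb:
            break
        evicted.append(model.name)
        free += model.vram_gb
    if free < needed_gb:
        raise MemoryError("model does not fit even after evicting everything")
    return evicted
```

With a 2 GB ASR model, a 3 GB TTS model, and a 9 GB LLM loaded, a new 13 GB LLM evicts the old LLM and the TTS model while ASR survives. A 15.5 GB request cascades through all three.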
tlg
d7a091df8c feat: VRAM manager with priority-based model eviction
Tracks GPU VRAM usage (16GB) and handles model loading/unloading with
priority-based eviction: LLM (lowest) -> TTS -> ASR (highest, protected).
Uses asyncio Lock for concurrency safety.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 09:14:41 +02:00
tlg
969bcb3292 feat: API key authentication dependency
Implements create_api_key_dependency() FastAPI dependency that validates
Bearer tokens against a configured list of ApiKey objects (401 on missing,
malformed, or unknown tokens). Includes 5 TDD tests covering all cases.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 07:31:30 +02:00
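
The Bearer-token check in 969bcb3292 can be approximated without FastAPI as shown below. Only the `ApiKey` name and the 401 behaviour come from the commit message. The `ApiKey` fields, the `Unauthorized` exception, and the `authenticate` function are assumptions for illustration, and the real `create_api_key_dependency()` would raise an HTTP 401 through the framework instead.

```python
import hmac
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ApiKey:
    name: str  # hypothetical field: label for the client
    key: str   # hypothetical field: the secret token


class Unauthorized(Exception):
    status_code = 401


def authenticate(authorization: Optional[str], api_keys: Sequence[ApiKey]) -> ApiKey:
    """Validate an Authorization header value against the configured keys."""
    if not authorization:
        raise Unauthorized("missing Authorization header")
    scheme, _, token = authorization.partition(" ")
    token = token.strip()
    if scheme.lower() != "bearer" or not token:
        raise Unauthorized("malformed Authorization header")
    for candidate in api_keys:
        # Constant-time comparison avoids leaking key contents through timing.
        if hmac.compare_digest(token.encode(), candidate.key.encode()):
            return candidate
    raise Unauthorized("unknown API key")
```

The three failure branches correspond to the missing, malformed, and unknown token cases listed in the commit message.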
tlg
c4eaf5088b feat: model registry with virtual-to-physical resolution
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 07:31:10 +02:00
tlg
690ad46d88 feat: config loading for models.yaml and api_keys.yaml
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 07:30:13 +02:00
tlg
a64f32b590 feat: project scaffolding with config files and test fixtures
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 07:23:14 +02:00
tlg
cf7c77b3b5 Add llmux implementation plan (30 tasks)
Covers project scaffolding, config, auth, VRAM manager, all four
backends, API routes, Dockerfile, deployment scripts, and four
phases of testing (integration, functional, VRAM, performance).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 22:43:37 +02:00
tlg
45947e80a4 Update manual steps: DNS done, Open WebUI config automated
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 22:25:51 +02:00
tlg
7187c58c5e Add llmux product requirements in StrictDoc format
42 requirements covering architecture, runtimes, models, VRAM
management, API, authentication, configuration, integration,
and four-phase testing plan.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 21:11:05 +02:00
tlg
bd0ed74d32 Clarify VRAM eviction rule for cross-priority edge case
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 13:20:53 +02:00
tlg
e6be9dcb85 Add llmux design specification
Covers architecture, model registry, VRAM management, API endpoints,
container setup, Open WebUI integration, Traefik routing, and
four-phase testing plan.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 13:15:46 +02:00
tlg
e7cf075e2f Initial commit with .gitignore
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 17:58:54 +02:00