From e6be9dcb8590b27ef754cc2385fffa8b9509b4d3234d9512deb4defa9764a099 Mon Sep 17 00:00:00 2001
From: tlg
Date: Fri, 3 Apr 2026 13:15:46 +0200
Subject: [PATCH] Add llmux design specification

Covers architecture, model registry, VRAM management, API endpoints,
container setup, Open WebUI integration, Traefik routing, and four-phase
testing plan.

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 .../specs/2026-04-03-llmux-design.md | 482 ++++++++++++++++++
 1 file changed, 482 insertions(+)
 create mode 100644 kischdle/llmux/docs/superpowers/specs/2026-04-03-llmux-design.md

diff --git a/kischdle/llmux/docs/superpowers/specs/2026-04-03-llmux-design.md b/kischdle/llmux/docs/superpowers/specs/2026-04-03-llmux-design.md
new file mode 100644
index 0000000..d1e33be
--- /dev/null
+++ b/kischdle/llmux/docs/superpowers/specs/2026-04-03-llmux-design.md
@@ -0,0 +1,482 @@

# llmux Design Specification

## Overview

llmux is a single-process FastAPI application that manages multiple AI models on a single GPU (NVIDIA RTX 5070 Ti, 16GB VRAM). It provides an OpenAI-compatible API for chat completions, speech-to-text, and text-to-speech, serving as the unified AI backend for Open WebUI and external clients on the Kischdle on-premise system.

## Hardware Constraints

- GPU: NVIDIA RTX 5070 Ti, 16GB VRAM, compute capability 12.0 (Blackwell/SM12.0)
- CPU: AMD Ryzen 9 9900X
- RAM: 64GB DDR5
- Storage: ~1.3TB free on /home
- OS: Debian 12 (Bookworm)
- NVIDIA driver: 590.48 (CUDA 13.1 capable)
- Host CUDA toolkit: 12.8

## Architecture

### Single Process Design

llmux is a monolithic FastAPI application. One Python process handles all model loading/unloading, VRAM management, and inference routing. This keeps the system simple and gives full control over GPU memory.

### Runtimes

Three inference runtimes coexist within the single process:

| Runtime | Purpose | Models |
|---------|---------|--------|
| transformers (HuggingFace) | HF safetensors models | Qwen3.5-9B-FP8, Qwen3.5-4B, gpt-oss-20b, gpt-oss-20b-uncensored, cohere-transcribe |
| llama-cpp-python | GGUF models | Qwen3.5-9B-FP8-Uncensored |
| chatterbox | TTS | Chatterbox-Turbo, Chatterbox-Multilingual, Chatterbox |

### Why transformers (not vLLM)

vLLM lacks stable support for SM12.0 (RTX Blackwell consumer GPUs). Specifically, NVFP4 MoE kernels fail on SM12.0 (vllm-project/vllm#33416). The PyTorch transformers stack works with PyTorch 2.7+ and CUDA 12.8+ on SM12.0. vLLM can be reconsidered once SM12.0 support matures.

## Physical Models

| ID | Type | Backend | HuggingFace / Source | Estimated VRAM | Vision | Tools |
|----|------|---------|---------------------|---------------|--------|-------|
| qwen3.5-9b-fp8 | LLM | transformers | lovedheart/Qwen3.5-9B-FP8 | ~9GB | yes | yes |
| qwen3.5-9b-fp8-uncensored | LLM | llamacpp | HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive (Q8_0 GGUF + mmproj GGUF) | ~9GB | yes | yes |
| qwen3.5-4b | LLM | transformers | Qwen/Qwen3.5-4B | ~4GB | yes | yes |
| gpt-oss-20b | LLM | transformers | openai/gpt-oss-20b (MXFP4 quantized MoE, designed for 16GB VRAM) | ~13GB | no | yes |
| gpt-oss-20b-uncensored | LLM | transformers | aoxo/gpt-oss-20b-uncensored | ~13GB | no | yes |
| cohere-transcribe | ASR | transformers | CohereLabs/cohere-transcribe-03-2026 (gated, terms accepted) | ~4GB | n/a | n/a |
| chatterbox-turbo | TTS | chatterbox | resemble-ai/chatterbox (turbo variant) | ~2GB | n/a | n/a |
| chatterbox-multilingual | TTS | chatterbox | resemble-ai/chatterbox (multilingual variant) | ~2GB | n/a | n/a |
| chatterbox | TTS | chatterbox | resemble-ai/chatterbox (default variant) | ~2GB | n/a | n/a |

## Virtual Models

Virtual models are what Open WebUI and API clients see. Multiple virtual models can map to the same physical model with different behavior parameters. Switching between virtual models that share a physical model has zero VRAM cost.

| Virtual Model Name | Physical Model | Behavior |
|--------------------|---------------|----------|
| Qwen3.5-9B-FP8-Thinking | qwen3.5-9b-fp8 | Thinking enabled (default Qwen3.5 behavior) |
| Qwen3.5-9B-FP8-Instruct | qwen3.5-9b-fp8 | enable_thinking=False |
| Qwen3.5-9B-FP8-Uncensored-Thinking | qwen3.5-9b-fp8-uncensored | Thinking enabled |
| Qwen3.5-9B-FP8-Uncensored-Instruct | qwen3.5-9b-fp8-uncensored | enable_thinking=False |
| Qwen3.5-4B-Thinking | qwen3.5-4b | Thinking enabled |
| Qwen3.5-4B-Instruct | qwen3.5-4b | enable_thinking=False |
| GPT-OSS-20B-Low | gpt-oss-20b | System prompt prefix: "Reasoning: low" |
| GPT-OSS-20B-Medium | gpt-oss-20b | System prompt prefix: "Reasoning: medium" |
| GPT-OSS-20B-High | gpt-oss-20b | System prompt prefix: "Reasoning: high" |
| GPT-OSS-20B-Uncensored-Low | gpt-oss-20b-uncensored | System prompt prefix: "Reasoning: low" |
| GPT-OSS-20B-Uncensored-Medium | gpt-oss-20b-uncensored | System prompt prefix: "Reasoning: medium" |
| GPT-OSS-20B-Uncensored-High | gpt-oss-20b-uncensored | System prompt prefix: "Reasoning: high" |
| cohere-transcribe | cohere-transcribe | ASR (used via /v1/audio/transcriptions) |
| Chatterbox-Turbo | chatterbox-turbo | TTS (used via /v1/audio/speech) |
| Chatterbox-Multilingual | chatterbox-multilingual | TTS |
| Chatterbox | chatterbox | TTS |

## VRAM Manager

### Preemption Policy

Models remain loaded until VRAM is needed for another model. No idle timeout — a model stays in VRAM indefinitely until evicted.

### Priority (highest to lowest)

1. ASR (cohere-transcribe) — highest priority, evicted only as last resort
2. TTS (one Chatterbox variant at a time)
3. LLM (one at a time) — lowest priority, evicted first

### Loading Algorithm

When a request arrives:

1. If the physical model is already loaded, proceed immediately.
2. If it fits in available VRAM, load alongside existing models.
3. If it doesn't fit, evict models by priority (lowest first) until enough VRAM is free:
   - Evict LLM first
   - Evict TTS second
   - Evict ASR only as last resort
   - A higher-priority model is evicted only when the request cannot be served any other way
4. Load the requested model.

### Concurrency

- An asyncio Lock ensures only one load/unload operation at a time.
- Requests arriving during a model swap await the lock.
- Inference requests hold a read-lock on their model to prevent eviction mid-inference.

### Typical Scenarios

| Current State | Request | Action |
|---------------|---------|--------|
| ASR + Qwen3.5-4B (~8GB) | Chat with Qwen3.5-4B | Proceed, already loaded |
| ASR + TTS + Qwen3.5-4B (~10GB) | Chat with Qwen3.5-9B-FP8 | Evict LLM (4B), load 9B (~9GB). ASR+TTS+9B≈15GB, fits. |
| ASR + TTS + Qwen3.5-4B (~10GB) | Chat with GPT-OSS-20B | Evict LLM first, then TTS, then ASR if needed. Load gpt-oss-20b alone (~13GB). |
| GPT-OSS-20B loaded (~13GB) | Transcription request | Evict LLM (gpt-oss-20b). Load ASR (~4GB). |
| ASR + Qwen3.5-4B (~8GB) | TTS request | Fits (~10GB). Load Chatterbox alongside. |

## API Endpoints

All endpoints on `127.0.0.1:8081`. All `/v1/*` endpoints require Bearer token authentication.

### GET /v1/models

Returns all 16 virtual models in OpenAI format, regardless of what's currently loaded. Users can freely select any model; llmux handles swapping.

### POST /v1/chat/completions

OpenAI-compatible chat completions. Accepts a `model` parameter matching a virtual model name. Supports `stream: true` for SSE streaming. The virtual-to-physical mapping and behavior modification (thinking toggle, reasoning system prompt) are applied transparently. Tool/function calling is passed through to models that support it.

### POST /v1/audio/transcriptions

OpenAI Whisper-compatible endpoint. Accepts a multipart form with an audio file and a `model` parameter. Returns the transcript in OpenAI response format. Supports a `language` parameter (required by cohere-transcribe — default "en", also "de"). Supported audio formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm.

### POST /v1/audio/speech

OpenAI TTS-compatible endpoint. Accepts JSON with `model`, `input` (text), and `voice` (maps to Chatterbox voice/speaker config). Returns audio bytes.

### GET /health

Unauthenticated. Returns service status and currently loaded models.

## Authentication

- All `/v1/*` endpoints require a Bearer token (`Authorization: Bearer <key>`)
- API keys stored in `config/api_keys.yaml`, mounted read-only into the container
- Multiple keys: one per client (Open WebUI, remote Whisper, OpenCode, etc.)
- `GET /health` is unauthenticated for monitoring/readiness probes
- Traefik acts purely as a router, no auth on its side

## Container & Pod Architecture

### Pod

- Pod name: `llmux_pod`
- Single container: `llmux_ctr`
- Port: `127.0.0.1:8081:8081`
- GPU: NVIDIA CDI (`--device nvidia.com/gpu=all`)
- Network: default (no host loopback needed)

### Base Image

`pytorch/pytorch:2.11.0-cuda12.8-cudnn9-runtime`

Verified compatible with SM12.0 (Blackwell). PyTorch 2.7+ with CUDA 12.8+ supports the RTX 5070 Ti. Host driver 590.48 (CUDA 13.1) is backwards compatible.

### Dockerfile Layers

1. System deps: libsndfile, ffmpeg (audio processing)
2. pip install: FastAPI, uvicorn, transformers (>=5.4.0), llama-cpp-python (CUDA build), chatterbox, soundfile, librosa, sentencepiece, protobuf, PyYAML
3. Copy llmux application code
4. Entrypoint: `uvicorn llmux.main:app --host 0.0.0.0 --port 8081`

### Bind Mounts

| Host Path | Container Path | Mode |
|-----------|---------------|------|
| /home/llm/.local/share/llmux_pod/models/ | /models | read-only |
| /home/llm/.local/share/llmux_pod/config/ | /config | read-only |

### Systemd

Managed via `create_pod_llmux.sh` following the Kischdle pattern: create pod, create container, generate systemd units, enable service.

## Application Structure

```
llmux/
├── Dockerfile
├── requirements.txt
├── config/
│   ├── models.yaml
│   └── api_keys.yaml
├── llmux/
│   ├── main.py              # FastAPI app, startup/shutdown, health endpoint
│   ├── auth.py              # API key validation middleware
│   ├── vram_manager.py      # VRAM tracking, load/unload, eviction logic
│   ├── model_registry.py    # Parse models.yaml, virtual→physical mapping
│   ├── routes/
│   │   ├── models.py        # GET /v1/models
│   │   ├── chat.py          # POST /v1/chat/completions
│   │   ├── transcription.py # POST /v1/audio/transcriptions
│   │   └── speech.py        # POST /v1/audio/speech
│   └── backends/
│       ├── base.py          # Abstract base class for model backends
│       ├── transformers.py  # HuggingFace transformers backend
│       ├── llamacpp.py      # llama-cpp-python backend (GGUF)
│       └── chatterbox.py    # Chatterbox TTS backend
└── scripts/
    ├── download_models.sh   # Pre-download all model weights
    └── create_pod_llmux.sh  # Podman pod creation script
```

### Key Design Decisions

- `backends/` encapsulates runtime differences. Each backend knows how to load, unload, and run inference. Route handlers are backend-agnostic.
- `vram_manager.py` is the single authority on what's loaded. Route handlers call `vram_manager.ensure_loaded(physical_model_id)` before inference.
- `model_registry.py` handles virtual-to-physical mapping and injects behavior params (thinking toggle, system prompts) before passing to the backend.
- Streaming for chat completions uses FastAPI `StreamingResponse` with SSE, matching OpenAI streaming format.
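
The loading algorithm and priority policy can be sketched in pure Python. This is an illustrative sketch: `Model`, `PRIORITY`, and `ensure_loaded` are assumed names, and the real manager would query actual GPU memory rather than static estimates:

```python
# Sketch of the VRAM manager's preemption policy: evict lowest-priority
# models first until the requested model fits. Names and the static VRAM
# numbers are illustrative, not the actual llmux implementation.
from dataclasses import dataclass

PRIORITY = {"asr": 3, "tts": 2, "llm": 1}  # higher = evicted later
TOTAL_VRAM_GB = 16.0

@dataclass
class Model:
    id: str
    kind: str       # "asr" | "tts" | "llm"
    vram_gb: float

def ensure_loaded(loaded: list[Model], requested: Model) -> list[Model]:
    """Return the resident set after loading `requested`, evicting by priority."""
    if any(m.id == requested.id for m in loaded):
        return list(loaded)  # already resident: no-op
    free = TOTAL_VRAM_GB - sum(m.vram_gb for m in loaded)
    survivors: list[Model] = []
    # Walk residents from lowest to highest priority (LLM, then TTS, then ASR),
    # evicting only while the requested model still does not fit.
    for m in sorted(loaded, key=lambda m: PRIORITY[m.kind]):
        if free < requested.vram_gb:
            free += m.vram_gb      # evict
        else:
            survivors.append(m)    # keep
    if free < requested.vram_gb:
        raise RuntimeError(f"cannot fit {requested.id}: only {free:.1f}GB free")
    return survivors + [requested]
```

This reproduces the scenario table: requesting gpt-oss-20b (~13GB) while ASR + TTS + Qwen3.5-4B (~10GB) are resident evicts the LLM, then TTS, then ASR, and loads gpt-oss-20b alone.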

## Model Downloads

All models are pre-downloaded before the pod is created. The `scripts/download_models.sh` script runs as user `llm` and downloads to `/home/llm/.local/share/llmux_pod/models/`.

| Model | Method | Approx Size |
|-------|--------|-------------|
| lovedheart/Qwen3.5-9B-FP8 | huggingface-cli download | ~9GB |
| HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive (Q8_0 + mmproj GGUF) | huggingface-cli download (specific files) | ~10GB |
| Qwen/Qwen3.5-4B | huggingface-cli download | ~8GB |
| openai/gpt-oss-20b | huggingface-cli download | ~13GB |
| aoxo/gpt-oss-20b-uncensored | huggingface-cli download | ~13GB |
| CohereLabs/cohere-transcribe-03-2026 | huggingface-cli download (gated, terms accepted) | ~4GB |
| resemble-ai/chatterbox (3 variants) | per Chatterbox install docs | ~2GB |

Total estimated: ~60GB. The script is idempotent (skips existing models). A HuggingFace access token is required for gated models (stored at ~/.cache/huggingface/token).

## Open WebUI Configuration

Open WebUI (user `wbg`, port 8080) connects to llmux:

### Connections (Admin > Settings > Connections)

- OpenAI API Base URL: `http://127.0.0.1:8081/v1`
- API Key: the key from api_keys.yaml designated for Open WebUI

### Audio (Admin > Settings > Audio)

- STT Engine: openai
- STT OpenAI API Base URL: `http://127.0.0.1:8081/v1`
- STT Model: cohere-transcribe
- TTS Engine: openai
- TTS OpenAI API Base URL: `http://127.0.0.1:8081/v1`
- TTS Model: Chatterbox-Multilingual
- TTS Voice: to be configured based on Chatterbox options

### User Experience

- Model dropdown lists all 16 virtual models
- Chat works on any model selection (with potential swap delay for first request)
- Dictation uses cohere-transcribe
- Audio playback uses Chatterbox
- Voice chat combines ASR, LLM, and TTS

## Traefik Routing

New dynamic config at `/home/trf/.local/share/traefik_pod/dynamic/llmux.yml`:

```yaml
http:
  routers:
    llmux:
      entryPoints: ["wghttp"]
      rule: "Host(`kidirekt.kischdle.com`)"
      priority: 100
      service: llmux

  services:
    llmux:
      loadBalancer:
        servers:
          - url: "http://10.0.2.2:8081"
```

- Routed through WireGuard VPN entry point
- No Traefik-level auth (llmux handles API key auth)
- DNS setup for kidirekt.kischdle.com is a manual step

## Configuration Files

### config/models.yaml

```yaml
physical_models:
  qwen3.5-9b-fp8:
    type: llm
    backend: transformers
    model_id: "lovedheart/Qwen3.5-9B-FP8"
    estimated_vram_gb: 9
    supports_vision: true
    supports_tools: true

  qwen3.5-9b-fp8-uncensored:
    type: llm
    backend: llamacpp
    model_file: "Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-Q8_0.gguf"
    mmproj_file: "mmproj-Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-BF16.gguf"
    estimated_vram_gb: 9
    supports_vision: true
    supports_tools: true

  qwen3.5-4b:
    type: llm
    backend: transformers
    model_id: "Qwen/Qwen3.5-4B"
    estimated_vram_gb: 4
    supports_vision: true
    supports_tools: true

  gpt-oss-20b:
    type: llm
    backend: transformers
    model_id: "openai/gpt-oss-20b"
    estimated_vram_gb: 13
    supports_vision: false
    supports_tools: true

  gpt-oss-20b-uncensored:
    type: llm
    backend: transformers
    model_id: "aoxo/gpt-oss-20b-uncensored"
    estimated_vram_gb: 13
    supports_vision: false
    supports_tools: true

  cohere-transcribe:
    type: asr
    backend: transformers
    model_id: "CohereLabs/cohere-transcribe-03-2026"
    estimated_vram_gb: 4
    default_language: "en"

  chatterbox-turbo:
    type: tts
    backend: chatterbox
    variant: "turbo"
    estimated_vram_gb: 2

  chatterbox-multilingual:
    type: tts
    backend: chatterbox
    variant: "multilingual"
    estimated_vram_gb: 2

  chatterbox:
    type: tts
    backend: chatterbox
    variant: "default"
    estimated_vram_gb: 2

virtual_models:
  Qwen3.5-9B-FP8-Thinking:
    physical: qwen3.5-9b-fp8
    params: { enable_thinking: true }
  Qwen3.5-9B-FP8-Instruct:
    physical: qwen3.5-9b-fp8
    params: { enable_thinking: false }

  Qwen3.5-9B-FP8-Uncensored-Thinking:
    physical: qwen3.5-9b-fp8-uncensored
    params: { enable_thinking: true }
  Qwen3.5-9B-FP8-Uncensored-Instruct:
    physical: qwen3.5-9b-fp8-uncensored
    params: { enable_thinking: false }

  Qwen3.5-4B-Thinking:
    physical: qwen3.5-4b
    params: { enable_thinking: true }
  Qwen3.5-4B-Instruct:
    physical: qwen3.5-4b
    params: { enable_thinking: false }

  GPT-OSS-20B-Low:
    physical: gpt-oss-20b
    params: { system_prompt_prefix: "Reasoning: low" }
  GPT-OSS-20B-Medium:
    physical: gpt-oss-20b
    params: { system_prompt_prefix: "Reasoning: medium" }
  GPT-OSS-20B-High:
    physical: gpt-oss-20b
    params: { system_prompt_prefix: "Reasoning: high" }

  GPT-OSS-20B-Uncensored-Low:
    physical: gpt-oss-20b-uncensored
    params: { system_prompt_prefix: "Reasoning: low" }
  GPT-OSS-20B-Uncensored-Medium:
    physical: gpt-oss-20b-uncensored
    params: { system_prompt_prefix: "Reasoning: medium" }
  GPT-OSS-20B-Uncensored-High:
    physical: gpt-oss-20b-uncensored
    params: { system_prompt_prefix: "Reasoning: high" }

  cohere-transcribe:
    physical: cohere-transcribe
  Chatterbox-Turbo:
    physical: chatterbox-turbo
  Chatterbox-Multilingual:
    physical: chatterbox-multilingual
  Chatterbox:
    physical: chatterbox
```

### config/api_keys.yaml

```yaml
api_keys:
  - key: "sk-llmux-openwebui-"
    name: "Open WebUI"
  - key: "sk-llmux-whisper-"
    name: "Remote Whisper clients"
  - key: "sk-llmux-opencode-"
    name: "OpenCode"
```

Keys generated at deployment time.

## Testing & Verification

### Phase 1: System Integration (iterative, fix issues before proceeding)

1. Container build — Dockerfile builds successfully, image contains all dependencies
2. GPU passthrough — container sees RTX 5070 Ti (nvidia-smi works inside container)
3. Model mount — container can read model weights from /models
4. Service startup — llmux starts, port 8081 reachable from host
5. Open WebUI connection — model list populates in Open WebUI
6. Traefik routing — kidirekt.kischdle.com routes to llmux (when DNS configured)
7. Systemd lifecycle — start/stop/restart works, service survives reboot

### Phase 2: Functional Tests

8. Auth — requests without valid API key get 401
9. Model listing — GET /v1/models returns all 16 virtual models
10. Chat inference — for each physical LLM, chat via Open WebUI as user "try":
    - Qwen3.5-9B-FP8 (Thinking + Instruct)
    - Qwen3.5-9B-FP8-Uncensored (Thinking + Instruct)
    - Qwen3.5-4B (Thinking + Instruct)
    - GPT-OSS-20B (Low, Medium, High)
    - GPT-OSS-20B-Uncensored (Low, Medium, High)
11. Streaming — chat responses stream token-by-token in Open WebUI
12. ASR — Open WebUI dictation transcribes speech (English and German)
13. TTS — Open WebUI audio playback speaks text
14. Vision — image + text prompt to each vision-capable model:
    - Qwen3.5-4B
    - Qwen3.5-9B-FP8
    - Qwen3.5-9B-FP8-Uncensored
15. Tool usage — verify tool calling for each runtime and tool-capable model:
    - Qwen3.5-9B-FP8 (transformers)
    - Qwen3.5-9B-FP8-Uncensored (llama-cpp-python)
    - GPT-OSS-20B (transformers)
    - GPT-OSS-20B-Uncensored (transformers)

### Phase 3: VRAM Management Tests

16. Small LLM — load Qwen3.5-4B (~4GB), verify ASR and TTS remain loaded (~10GB total)
17. Medium LLM — load Qwen3.5-9B-FP8 (~9GB), verify ASR and TTS remain loaded (~15GB total)
18. Large LLM — load GPT-OSS-20B (~13GB), verify ASR and TTS are evicted. Next ASR request evicts LLM first.
19. Model swapping — switch between two LLMs, verify second loads and first is evicted

### Phase 4: Performance Tests

20. Transformers GPU vs CPU — for each transformers-backed physical model, run same prompt on GPU and CPU, verify GPU is at least 5x faster. Requires admin test endpoint or CLI tool to force CPU execution.
    - Qwen3.5-9B-FP8
    - Qwen3.5-4B
    - gpt-oss-20b
    - gpt-oss-20b-uncensored
    - cohere-transcribe
21. llama-cpp-python GPU vs CPU — run inference for Qwen3.5-9B-FP8-Uncensored with n_gpu_layers=-1 (GPU) and n_gpu_layers=0 (CPU), verify GPU is at least 5x faster. Same admin test endpoint.
22. Chatterbox performance — run TTS synthesis, verify audio generation time is reasonable relative to audio duration.

## Manual Steps

These require human action and cannot be automated:

- DNS setup for kidirekt.kischdle.com (during implementation)
- HuggingFace terms for cohere-transcribe: accepted 2026-04-03
- HuggingFace token configured at ~/.cache/huggingface/token (done for user tlg, needs setup for user llm during deployment)
- Open WebUI admin configuration (connections, audio settings)
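
As a closing illustration, the virtual-to-physical resolution that `model_registry.py` performs on each request can be sketched as follows. Only two of the 16 virtual models are shown, and the `resolve` function and its return shape are assumptions, not the actual implementation:

```python
# Sketch of virtual->physical model resolution: apply behavior params
# (thinking toggle, reasoning system-prompt prefix) before dispatching to a
# backend. Illustrative only; shows two of the 16 virtual models.
VIRTUAL_MODELS = {
    "GPT-OSS-20B-High": {
        "physical": "gpt-oss-20b",
        "params": {"system_prompt_prefix": "Reasoning: high"},
    },
    "Qwen3.5-4B-Instruct": {
        "physical": "qwen3.5-4b",
        "params": {"enable_thinking": False},
    },
}

def resolve(virtual_name: str, messages: list[dict]) -> tuple[str, list[dict], dict]:
    """Map a virtual model name to (physical_id, rewritten_messages, template_kwargs)."""
    entry = VIRTUAL_MODELS[virtual_name]
    params = entry.get("params", {})
    msgs = [dict(m) for m in messages]
    prefix = params.get("system_prompt_prefix")
    if prefix:
        # Prepend the reasoning directive to the system prompt (gpt-oss style).
        if msgs and msgs[0]["role"] == "system":
            msgs[0]["content"] = f"{prefix}\n{msgs[0]['content']}"
        else:
            msgs.insert(0, {"role": "system", "content": prefix})
    template_kwargs = {}
    if "enable_thinking" in params:
        # Forwarded to the chat template (Qwen3.5 thinking toggle).
        template_kwargs["enable_thinking"] = params["enable_thinking"]
    return entry["physical"], msgs, template_kwargs
```

The backend then sees only physical-model concerns; clients such as Open WebUI never observe the rewriting, which is why switching between virtual models sharing a physical model costs no VRAM.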