From e6be9dcb8590b27ef754cc2385fffa8b9509b4d3234d9512deb4defa9764a099 Mon Sep 17 00:00:00 2001
From: tlg
Date: Fri, 3 Apr 2026 13:15:46 +0200
Subject: [PATCH] Add llmux design specification

Covers architecture, model registry, VRAM management, API endpoints,
container setup, Open WebUI integration, Traefik routing, and four-phase
testing plan.

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 .../specs/2026-04-03-llmux-design.md | 482 ++++++++++++++++++
 1 file changed, 482 insertions(+)
 create mode 100644 kischdle/llmux/docs/superpowers/specs/2026-04-03-llmux-design.md

diff --git a/kischdle/llmux/docs/superpowers/specs/2026-04-03-llmux-design.md b/kischdle/llmux/docs/superpowers/specs/2026-04-03-llmux-design.md
new file mode 100644
index 0000000..d1e33be
--- /dev/null
+++ b/kischdle/llmux/docs/superpowers/specs/2026-04-03-llmux-design.md
@@ -0,0 +1,482 @@

# llmux Design Specification

## Overview

llmux is a single-process FastAPI application that manages multiple AI models on a single GPU (NVIDIA RTX 5070 Ti, 16GB VRAM). It provides an OpenAI-compatible API for chat completions, speech-to-text, and text-to-speech, serving as the unified AI backend for Open WebUI and external clients on the Kischdle on-premise system.

## Hardware Constraints

- GPU: NVIDIA RTX 5070 Ti, 16GB VRAM, compute capability 12.0 (Blackwell/SM12.0)
- CPU: AMD Ryzen 9 9900X
- RAM: 64GB DDR5
- Storage: ~1.3TB free on /home
- OS: Debian 12 (Bookworm)
- NVIDIA driver: 590.48 (CUDA 13.1 capable)
- Host CUDA toolkit: 12.8

## Architecture

### Single Process Design

llmux is a monolithic FastAPI application. One Python process handles all model loading/unloading, VRAM management, and inference routing. This keeps the system simple and gives full control over GPU memory.

### Runtimes

Three inference runtimes coexist within the single process:

| Runtime | Purpose | Models |
|---------|---------|--------|
| transformers (HuggingFace) | HF safetensors models | Qwen3.5-9B-FP8, Qwen3.5-4B, gpt-oss-20b, gpt-oss-20b-uncensored, cohere-transcribe |
| llama-cpp-python | GGUF models | Qwen3.5-9B-FP8-Uncensored |
| chatterbox | TTS | Chatterbox-Turbo, Chatterbox-Multilingual, Chatterbox |

### Why transformers (not vLLM)

vLLM lacks stable support for SM12.0 (RTX Blackwell consumer GPUs). Specifically, NVFP4 MoE kernels fail on SM12.0 (vllm-project/vllm#33416). The PyTorch transformers stack works with PyTorch 2.7+ and CUDA 12.8+ on SM12.0. vLLM can be reconsidered once SM12.0 support matures.

## Physical Models

| ID | Type | Backend | HuggingFace / Source | Estimated VRAM | Vision | Tools |
|----|------|---------|---------------------|---------------|--------|-------|
| qwen3.5-9b-fp8 | LLM | transformers | lovedheart/Qwen3.5-9B-FP8 | ~9GB | yes | yes |
| qwen3.5-9b-fp8-uncensored | LLM | llamacpp | HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive (Q8_0 GGUF + mmproj GGUF) | ~9GB | yes | yes |
| qwen3.5-4b | LLM | transformers | Qwen/Qwen3.5-4B | ~4GB | yes | yes |
| gpt-oss-20b | LLM | transformers | openai/gpt-oss-20b (MXFP4 quantized MoE, designed for 16GB VRAM) | ~13GB | no | yes |
| gpt-oss-20b-uncensored | LLM | transformers | aoxo/gpt-oss-20b-uncensored | ~13GB | no | yes |
| cohere-transcribe | ASR | transformers | CohereLabs/cohere-transcribe-03-2026 (gated, terms accepted) | ~4GB | n/a | n/a |
| chatterbox-turbo | TTS | chatterbox | resemble-ai/chatterbox (turbo variant) | ~2GB | n/a | n/a |
| chatterbox-multilingual | TTS | chatterbox | resemble-ai/chatterbox (multilingual variant) | ~2GB | n/a | n/a |
| chatterbox | TTS | chatterbox | resemble-ai/chatterbox (default variant) | ~2GB | n/a | n/a |

## Virtual Models

Virtual models are what Open WebUI and API clients see. Multiple virtual models can map to the same physical model with different behavior parameters. Switching between virtual models that share a physical model has zero VRAM cost.

| Virtual Model Name | Physical Model | Behavior |
|--------------------|---------------|----------|
| Qwen3.5-9B-FP8-Thinking | qwen3.5-9b-fp8 | Thinking enabled (default Qwen3.5 behavior) |
| Qwen3.5-9B-FP8-Instruct | qwen3.5-9b-fp8 | enable_thinking=False |
| Qwen3.5-9B-FP8-Uncensored-Thinking | qwen3.5-9b-fp8-uncensored | Thinking enabled |
| Qwen3.5-9B-FP8-Uncensored-Instruct | qwen3.5-9b-fp8-uncensored | enable_thinking=False |
| Qwen3.5-4B-Thinking | qwen3.5-4b | Thinking enabled |
| Qwen3.5-4B-Instruct | qwen3.5-4b | enable_thinking=False |
| GPT-OSS-20B-Low | gpt-oss-20b | System prompt prefix: "Reasoning: low" |
| GPT-OSS-20B-Medium | gpt-oss-20b | System prompt prefix: "Reasoning: medium" |
| GPT-OSS-20B-High | gpt-oss-20b | System prompt prefix: "Reasoning: high" |
| GPT-OSS-20B-Uncensored-Low | gpt-oss-20b-uncensored | System prompt prefix: "Reasoning: low" |
| GPT-OSS-20B-Uncensored-Medium | gpt-oss-20b-uncensored | System prompt prefix: "Reasoning: medium" |
| GPT-OSS-20B-Uncensored-High | gpt-oss-20b-uncensored | System prompt prefix: "Reasoning: high" |
| cohere-transcribe | cohere-transcribe | ASR (used via /v1/audio/transcriptions) |
| Chatterbox-Turbo | chatterbox-turbo | TTS (used via /v1/audio/speech) |
| Chatterbox-Multilingual | chatterbox-multilingual | TTS |
| Chatterbox | chatterbox | TTS |

## VRAM Manager

### Preemption Policy

Models remain loaded until VRAM is needed for another model. No idle timeout — a model stays in VRAM indefinitely until evicted.

### Priority (highest to lowest)

1. ASR (cohere-transcribe) — highest priority, evicted only as last resort
2. TTS (one Chatterbox variant at a time)
3. LLM (one at a time) — lowest priority, evicted first

### Loading Algorithm

When a request arrives:

1. If the physical model is already loaded, proceed immediately.
2. If it fits in available VRAM, load alongside existing models.
3. If it doesn't fit, evict models by priority (lowest first) until enough VRAM is free:
   - Evict LLM first
   - Evict TTS second
   - Evict ASR only as last resort
   - A higher-priority model is evicted only when the request cannot be served any other way
4. Load the requested model.

### Concurrency

- An asyncio Lock ensures only one load/unload operation at a time.
- Requests arriving during a model swap await the lock.
- Inference requests hold a read-lock on their model to prevent eviction mid-inference.

### Typical Scenarios

| Current State | Request | Action |
|---------------|---------|--------|
| ASR + Qwen3.5-4B (~8GB) | Chat with Qwen3.5-4B | Proceed, already loaded |
| ASR + TTS + Qwen3.5-4B (~10GB) | Chat with Qwen3.5-9B-FP8 | Evict LLM (4B), load 9B (~9GB). ASR+TTS+9B≈15GB, fits. |
| ASR + TTS + Qwen3.5-4B (~10GB) | Chat with GPT-OSS-20B | Evict LLM first, then TTS, then ASR if needed. Load gpt-oss-20b alone (~13GB). |
| GPT-OSS-20B loaded (~13GB) | Transcription request | Evict LLM (gpt-oss-20b). Load ASR (~4GB). |
| ASR + Qwen3.5-4B (~8GB) | TTS request | Fits (~10GB). Load Chatterbox alongside. |

## API Endpoints

All endpoints on `127.0.0.1:8081`. All `/v1/*` endpoints require Bearer token authentication.

### GET /v1/models

Returns all 16 virtual models in OpenAI format, regardless of what's currently loaded. Users can freely select any model; llmux handles swapping.

### POST /v1/chat/completions

OpenAI-compatible chat completions. Accepts a `model` parameter matching a virtual model name. Supports `stream: true` for SSE streaming. The virtual-to-physical mapping and behavior modification (thinking toggle, reasoning system prompt) are applied transparently. Tool/function calling is passed through to models that support it.

### POST /v1/audio/transcriptions

OpenAI Whisper-compatible endpoint. Accepts a multipart form with an audio file and a `model` parameter. Returns the transcript in OpenAI response format. Supports a `language` parameter (required by cohere-transcribe — default "en", also "de"). Supported audio formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm.

### POST /v1/audio/speech

OpenAI TTS-compatible endpoint. Accepts JSON with `model`, `input` (text), and `voice` (maps to Chatterbox voice/speaker config). Returns audio bytes.

### GET /health

Unauthenticated. Returns service status and currently loaded models.

## Authentication

- All `/v1/*` endpoints require a Bearer token (`Authorization: Bearer <key>`)
- API keys stored in `config/api_keys.yaml`, mounted read-only into the container
- Multiple keys: one per client (Open WebUI, remote Whisper, OpenCode, etc.)
- `GET /health` is unauthenticated for monitoring/readiness probes
- Traefik acts purely as a router, no auth on its side

## Container & Pod Architecture

### Pod

- Pod name: `llmux_pod`
- Single container: `llmux_ctr`
- Port: `127.0.0.1:8081:8081`
- GPU: NVIDIA CDI (`--device nvidia.com/gpu=all`)
- Network: default (no host loopback needed)

### Base Image

`pytorch/pytorch:2.11.0-cuda12.8-cudnn9-runtime`

Verified compatible with SM12.0 (Blackwell). PyTorch 2.7+ with CUDA 12.8+ supports the RTX 5070 Ti. Host driver 590.48 (CUDA 13.1) is backwards compatible.

### Dockerfile Layers

1. System deps: libsndfile, ffmpeg (audio processing)
2. pip install: FastAPI, uvicorn, transformers (>=5.4.0), llama-cpp-python (CUDA build), chatterbox, soundfile, librosa, sentencepiece, protobuf, PyYAML
3. Copy llmux application code
4. Entrypoint: `uvicorn llmux.main:app --host 0.0.0.0 --port 8081`

### Bind Mounts

| Host Path | Container Path | Mode |
|-----------|---------------|------|
| /home/llm/.local/share/llmux_pod/models/ | /models | read-only |
| /home/llm/.local/share/llmux_pod/config/ | /config | read-only |

### Systemd

Managed via `create_pod_llmux.sh` following the Kischdle pattern: create pod, create container, generate systemd units, enable service.

## Application Structure

```
llmux/
├── Dockerfile
├── requirements.txt
├── config/
│   ├── models.yaml
│   └── api_keys.yaml
├── llmux/
│   ├── main.py              # FastAPI app, startup/shutdown, health endpoint
│   ├── auth.py              # API key validation middleware
│   ├── vram_manager.py      # VRAM tracking, load/unload, eviction logic
│   ├── model_registry.py    # Parse models.yaml, virtual→physical mapping
│   ├── routes/
│   │   ├── models.py        # GET /v1/models
│   │   ├── chat.py          # POST /v1/chat/completions
│   │   ├── transcription.py # POST /v1/audio/transcriptions
│   │   └── speech.py        # POST /v1/audio/speech
│   └── backends/
│       ├── base.py          # Abstract base class for model backends
│       ├── transformers.py  # HuggingFace transformers backend
│       ├── llamacpp.py      # llama-cpp-python backend (GGUF)
│       └── chatterbox.py    # Chatterbox TTS backend
└── scripts/
    ├── download_models.sh   # Pre-download all model weights
    └── create_pod_llmux.sh  # Podman pod creation script
```

### Key Design Decisions

- `backends/` encapsulates runtime differences. Each backend knows how to load, unload, and run inference. Route handlers are backend-agnostic.
- `vram_manager.py` is the single authority on what's loaded. Route handlers call `vram_manager.ensure_loaded(physical_model_id)` before inference.
- `model_registry.py` handles virtual-to-physical mapping and injects behavior params (thinking toggle, system prompts) before passing to the backend.
- Streaming for chat completions uses FastAPI `StreamingResponse` with SSE, matching OpenAI streaming format.
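
The loading algorithm and priority policy can be sketched in pure Python. This is an illustrative sketch: `Model`, `PRIORITY`, and `ensure_loaded` are assumed names, and the real manager would query actual GPU memory rather than static estimates:

```python
# Sketch of the VRAM manager's preemption policy: evict lowest-priority
# models first until the requested model fits. Names and the static VRAM
# numbers are illustrative, not the actual llmux implementation.
from dataclasses import dataclass

PRIORITY = {"asr": 3, "tts": 2, "llm": 1}  # higher = evicted later
TOTAL_VRAM_GB = 16.0

@dataclass
class Model:
    id: str
    kind: str       # "asr" | "tts" | "llm"
    vram_gb: float

def ensure_loaded(loaded: list[Model], requested: Model) -> list[Model]:
    """Return the resident set after loading `requested`, evicting by priority."""
    if any(m.id == requested.id for m in loaded):
        return list(loaded)  # already resident: no-op
    free = TOTAL_VRAM_GB - sum(m.vram_gb for m in loaded)
    survivors: list[Model] = []
    # Walk residents from lowest to highest priority (LLM, then TTS, then ASR),
    # evicting only while the requested model still does not fit.
    for m in sorted(loaded, key=lambda m: PRIORITY[m.kind]):
        if free < requested.vram_gb:
            free += m.vram_gb      # evict
        else:
            survivors.append(m)    # keep
    if free < requested.vram_gb:
        raise RuntimeError(f"cannot fit {requested.id}: only {free:.1f}GB free")
    return survivors + [requested]
```

This reproduces the scenario table: requesting gpt-oss-20b (~13GB) while ASR + TTS + Qwen3.5-4B (~10GB) are resident evicts the LLM, then TTS, then ASR, and loads gpt-oss-20b alone.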

## Model Downloads

All models are pre-downloaded before the pod is created. The `scripts/download_models.sh` script runs as user `llm` and downloads to `/home/llm/.local/share/llmux_pod/models/`.

| Model | Method | Approx Size |
|-------|--------|-------------|
| lovedheart/Qwen3.5-9B-FP8 | huggingface-cli download | ~9GB |
| HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive (Q8_0 + mmproj GGUF) | huggingface-cli download (specific files) | ~10GB |
| Qwen/Qwen3.5-4B | huggingface-cli download | ~8GB |
| openai/gpt-oss-20b | huggingface-cli download | ~13GB |
| aoxo/gpt-oss-20b-uncensored | huggingface-cli download | ~13GB |
| CohereLabs/cohere-transcribe-03-2026 | huggingface-cli download (gated, terms accepted) | ~4GB |
| resemble-ai/chatterbox (3 variants) | per Chatterbox install docs | ~2GB |

Total estimated: ~60GB. The script is idempotent (skips existing models). A HuggingFace access token is required for gated models (stored at ~/.cache/huggingface/token).

## Open WebUI Configuration

Open WebUI (user `wbg`, port 8080) connects to llmux:

### Connections (Admin > Settings > Connections)

- OpenAI API Base URL: `http://127.0.0.1:8081/v1`
- API Key: the key from api_keys.yaml designated for Open WebUI

### Audio (Admin > Settings > Audio)

- STT Engine: openai
- STT OpenAI API Base URL: `http://127.0.0.1:8081/v1`
- STT Model: cohere-transcribe
- TTS Engine: openai
- TTS OpenAI API Base URL: `http://127.0.0.1:8081/v1`
- TTS Model: Chatterbox-Multilingual
- TTS Voice: to be configured based on Chatterbox options

### User Experience

- Model dropdown lists all 16 virtual models
- Chat works on any model selection (with potential swap delay for first request)
- Dictation uses cohere-transcribe
- Audio playback uses Chatterbox
- Voice chat combines ASR, LLM, and TTS

## Traefik Routing

New dynamic config at `/home/trf/.local/share/traefik_pod/dynamic/llmux.yml`:

```yaml
http:
  routers:
    llmux:
      entryPoints: ["wghttp"]
      rule: "Host(`kidirekt.kischdle.com`)"
      priority: 100
      service: llmux

  services:
    llmux:
      loadBalancer:
        servers:
          - url: "http://10.0.2.2:8081"
```

- Routed through WireGuard VPN entry point
- No Traefik-level auth (llmux handles API key auth)
- DNS setup for kidirekt.kischdle.com is a manual step

## Configuration Files

### config/models.yaml

```yaml
physical_models:
  qwen3.5-9b-fp8:
    type: llm
    backend: transformers
    model_id: "lovedheart/Qwen3.5-9B-FP8"
    estimated_vram_gb: 9
    supports_vision: true
    supports_tools: true

  qwen3.5-9b-fp8-uncensored:
    type: llm
    backend: llamacpp
    model_file: "Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-Q8_0.gguf"
    mmproj_file: "mmproj-Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-BF16.gguf"
    estimated_vram_gb: 9
    supports_vision: true
    supports_tools: true

  qwen3.5-4b:
    type: llm
    backend: transformers
    model_id: "Qwen/Qwen3.5-4B"
    estimated_vram_gb: 4
    supports_vision: true
    supports_tools: true

  gpt-oss-20b:
    type: llm
    backend: transformers
    model_id: "openai/gpt-oss-20b"
    estimated_vram_gb: 13
    supports_vision: false
    supports_tools: true

  gpt-oss-20b-uncensored:
    type: llm
    backend: transformers
    model_id: "aoxo/gpt-oss-20b-uncensored"
    estimated_vram_gb: 13
    supports_vision: false
    supports_tools: true

  cohere-transcribe:
    type: asr
    backend: transformers
    model_id: "CohereLabs/cohere-transcribe-03-2026"
    estimated_vram_gb: 4
    default_language: "en"

  chatterbox-turbo:
    type: tts
    backend: chatterbox
    variant: "turbo"
    estimated_vram_gb: 2

  chatterbox-multilingual:
    type: tts
    backend: chatterbox
    variant: "multilingual"
    estimated_vram_gb: 2

  chatterbox:
    type: tts
    backend: chatterbox
    variant: "default"
    estimated_vram_gb: 2

virtual_models:
  Qwen3.5-9B-FP8-Thinking:
    physical: qwen3.5-9b-fp8
    params: { enable_thinking: true }
  Qwen3.5-9B-FP8-Instruct:
    physical: qwen3.5-9b-fp8
    params: { enable_thinking: false }

  Qwen3.5-9B-FP8-Uncensored-Thinking:
    physical: qwen3.5-9b-fp8-uncensored
    params: { enable_thinking: true }
  Qwen3.5-9B-FP8-Uncensored-Instruct:
    physical: qwen3.5-9b-fp8-uncensored
    params: { enable_thinking: false }

  Qwen3.5-4B-Thinking:
    physical: qwen3.5-4b
    params: { enable_thinking: true }
  Qwen3.5-4B-Instruct:
    physical: qwen3.5-4b
    params: { enable_thinking: false }

  GPT-OSS-20B-Low:
    physical: gpt-oss-20b
    params: { system_prompt_prefix: "Reasoning: low" }
  GPT-OSS-20B-Medium:
    physical: gpt-oss-20b
    params: { system_prompt_prefix: "Reasoning: medium" }
  GPT-OSS-20B-High:
    physical: gpt-oss-20b
    params: { system_prompt_prefix: "Reasoning: high" }

  GPT-OSS-20B-Uncensored-Low:
    physical: gpt-oss-20b-uncensored
    params: { system_prompt_prefix: "Reasoning: low" }
  GPT-OSS-20B-Uncensored-Medium:
    physical: gpt-oss-20b-uncensored
    params: { system_prompt_prefix: "Reasoning: medium" }
  GPT-OSS-20B-Uncensored-High:
    physical: gpt-oss-20b-uncensored
    params: { system_prompt_prefix: "Reasoning: high" }

  cohere-transcribe:
    physical: cohere-transcribe
  Chatterbox-Turbo:
    physical: chatterbox-turbo
  Chatterbox-Multilingual:
    physical: chatterbox-multilingual
  Chatterbox:
    physical: chatterbox
```

### config/api_keys.yaml

```yaml
api_keys:
  - key: "sk-llmux-openwebui-"
    name: "Open WebUI"
  - key: "sk-llmux-whisper-"
    name: "Remote Whisper clients"
  - key: "sk-llmux-opencode-"
    name: "OpenCode"
```

Keys generated at deployment time.

## Testing & Verification

### Phase 1: System Integration (iterative, fix issues before proceeding)

1. Container build — Dockerfile builds successfully, image contains all dependencies
2. GPU passthrough — container sees RTX 5070 Ti (nvidia-smi works inside container)
3. Model mount — container can read model weights from /models
4. Service startup — llmux starts, port 8081 reachable from host
5. Open WebUI connection — model list populates in Open WebUI
6. Traefik routing — kidirekt.kischdle.com routes to llmux (when DNS configured)
7. Systemd lifecycle — start/stop/restart works, service survives reboot

### Phase 2: Functional Tests

8. Auth — requests without valid API key get 401
9. Model listing — GET /v1/models returns all 16 virtual models
10. Chat inference — for each physical LLM, chat via Open WebUI as user "try":
    - Qwen3.5-9B-FP8 (Thinking + Instruct)
    - Qwen3.5-9B-FP8-Uncensored (Thinking + Instruct)
    - Qwen3.5-4B (Thinking + Instruct)
    - GPT-OSS-20B (Low, Medium, High)
    - GPT-OSS-20B-Uncensored (Low, Medium, High)
11. Streaming — chat responses stream token-by-token in Open WebUI
12. ASR — Open WebUI dictation transcribes speech (English and German)
13. TTS — Open WebUI audio playback speaks text
14. Vision — image + text prompt to each vision-capable model:
    - Qwen3.5-4B
    - Qwen3.5-9B-FP8
    - Qwen3.5-9B-FP8-Uncensored
15. Tool usage — verify tool calling for each runtime and tool-capable model:
    - Qwen3.5-9B-FP8 (transformers)
    - Qwen3.5-9B-FP8-Uncensored (llama-cpp-python)
    - GPT-OSS-20B (transformers)
    - GPT-OSS-20B-Uncensored (transformers)

### Phase 3: VRAM Management Tests

16. Small LLM — load Qwen3.5-4B (~4GB), verify ASR and TTS remain loaded (~10GB total)
17. Medium LLM — load Qwen3.5-9B-FP8 (~9GB), verify ASR and TTS remain loaded (~15GB total)
18. Large LLM — load GPT-OSS-20B (~13GB), verify ASR and TTS are evicted. Next ASR request evicts LLM first.
19. Model swapping — switch between two LLMs, verify second loads and first is evicted

### Phase 4: Performance Tests

20. Transformers GPU vs CPU — for each transformers-backed physical model, run same prompt on GPU and CPU, verify GPU is at least 5x faster. Requires admin test endpoint or CLI tool to force CPU execution.
    - Qwen3.5-9B-FP8
    - Qwen3.5-4B
    - gpt-oss-20b
    - gpt-oss-20b-uncensored
    - cohere-transcribe
21. llama-cpp-python GPU vs CPU — run inference for Qwen3.5-9B-FP8-Uncensored with n_gpu_layers=-1 (GPU) and n_gpu_layers=0 (CPU), verify GPU is at least 5x faster. Same admin test endpoint.
22. Chatterbox performance — run TTS synthesis, verify audio generation time is reasonable relative to audio duration.

## Manual Steps

These require human action and cannot be automated:

- DNS setup for kidirekt.kischdle.com (during implementation)
- HuggingFace terms for cohere-transcribe: accepted 2026-04-03
- HuggingFace token configured at ~/.cache/huggingface/token (done for user tlg, needs setup for user llm during deployment)
- Open WebUI admin configuration (connections, audio settings)
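
As a closing illustration, the virtual-to-physical resolution that `model_registry.py` performs on each request can be sketched as follows. Only two of the 16 virtual models are shown, and the `resolve` function and its return shape are assumptions, not the actual implementation:

```python
# Sketch of virtual->physical model resolution: apply behavior params
# (thinking toggle, reasoning system-prompt prefix) before dispatching to a
# backend. Illustrative only; shows two of the 16 virtual models.
VIRTUAL_MODELS = {
    "GPT-OSS-20B-High": {
        "physical": "gpt-oss-20b",
        "params": {"system_prompt_prefix": "Reasoning: high"},
    },
    "Qwen3.5-4B-Instruct": {
        "physical": "qwen3.5-4b",
        "params": {"enable_thinking": False},
    },
}

def resolve(virtual_name: str, messages: list[dict]) -> tuple[str, list[dict], dict]:
    """Map a virtual model name to (physical_id, rewritten_messages, template_kwargs)."""
    entry = VIRTUAL_MODELS[virtual_name]
    params = entry.get("params", {})
    msgs = [dict(m) for m in messages]
    prefix = params.get("system_prompt_prefix")
    if prefix:
        # Prepend the reasoning directive to the system prompt (gpt-oss style).
        if msgs and msgs[0]["role"] == "system":
            msgs[0]["content"] = f"{prefix}\n{msgs[0]['content']}"
        else:
            msgs.insert(0, {"role": "system", "content": prefix})
    template_kwargs = {}
    if "enable_thinking" in params:
        # Forwarded to the chat template (Qwen3.5 thinking toggle).
        template_kwargs["enable_thinking"] = params["enable_thinking"]
    return entry["physical"], msgs, template_kwargs
```

The backend then sees only physical-model concerns; clients such as Open WebUI never observe the rewriting, which is why switching between virtual models sharing a physical model costs no VRAM.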