DesTEngSsv006_swd/kischdle/llmux/docs/superpowers/specs/2026-04-03-llmux-design.md

llmux Design Specification

Overview

llmux is a single-process FastAPI application that manages multiple AI models on a single GPU (NVIDIA RTX 5070 Ti, 16GB VRAM). It provides an OpenAI-compatible API for chat completions, speech-to-text, and text-to-speech, serving as the unified AI backend for Open WebUI and external clients on the Kischdle on-premise system.

Hardware Constraints

  • GPU: NVIDIA RTX 5070 Ti, 16GB VRAM, compute capability 12.0 (Blackwell/SM12.0)
  • CPU: AMD Ryzen 9 9900X
  • RAM: 64GB DDR5
  • Storage: ~1.3TB free on /home
  • OS: Debian 12 (Bookworm)
  • NVIDIA driver: 590.48 (CUDA 13.1 capable)
  • Host CUDA toolkit: 12.8

Architecture

Single Process Design

llmux is a monolithic FastAPI application. One Python process handles all model loading/unloading, VRAM management, and inference routing. This keeps the system simple and gives full control over GPU memory.

Runtimes

Three inference runtimes coexist within the single process:

| Runtime | Purpose | Models |
| --- | --- | --- |
| transformers (HuggingFace) | HF safetensors models | Qwen3.5-9B-FP8, Qwen3.5-4B, gpt-oss-20b, gpt-oss-20b-uncensored, cohere-transcribe |
| llama-cpp-python | GGUF models | Qwen3.5-9B-FP8-Uncensored |
| chatterbox | TTS | Chatterbox-Turbo, Chatterbox-Multilingual, Chatterbox |

Why transformers (not vLLM)

vLLM lacks stable support for SM12.0 (RTX Blackwell consumer GPUs). Specifically, NVFP4 MoE kernels fail on SM12.0 (vllm-project/vllm#33416). The PyTorch transformers stack works with PyTorch 2.7+ and CUDA 12.8+ on SM12.0. vLLM can be reconsidered once SM12.0 support matures.

Physical Models

| ID | Type | Backend | HuggingFace / Source | Estimated VRAM | Vision | Tools |
| --- | --- | --- | --- | --- | --- | --- |
| qwen3.5-9b-fp8 | LLM | transformers | lovedheart/Qwen3.5-9B-FP8 | ~9GB | yes | yes |
| qwen3.5-9b-fp8-uncensored | LLM | llamacpp | HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive (Q8_0 GGUF + mmproj GGUF) | ~9GB | yes | yes |
| qwen3.5-4b | LLM | transformers | Qwen/Qwen3.5-4B | ~4GB | yes | yes |
| gpt-oss-20b | LLM | transformers | openai/gpt-oss-20b (MXFP4 quantized MoE, designed for 16GB VRAM) | ~13GB | no | yes |
| gpt-oss-20b-uncensored | LLM | transformers | aoxo/gpt-oss-20b-uncensored | ~13GB | no | yes |
| cohere-transcribe | ASR | transformers | CohereLabs/cohere-transcribe-03-2026 (gated, terms accepted) | ~4GB | n/a | n/a |
| chatterbox-turbo | TTS | chatterbox | resemble-ai/chatterbox (turbo variant) | ~2GB | n/a | n/a |
| chatterbox-multilingual | TTS | chatterbox | resemble-ai/chatterbox (multilingual variant) | ~2GB | n/a | n/a |
| chatterbox | TTS | chatterbox | resemble-ai/chatterbox (default variant) | ~2GB | n/a | n/a |

Virtual Models

Virtual models are what Open WebUI and API clients see. Multiple virtual models can map to the same physical model with different behavior parameters. Switching between virtual models that share a physical model has zero VRAM cost.

| Virtual Model Name | Physical Model | Behavior |
| --- | --- | --- |
| Qwen3.5-9B-FP8-Thinking | qwen3.5-9b-fp8 | Thinking enabled (default Qwen3.5 behavior) |
| Qwen3.5-9B-FP8-Instruct | qwen3.5-9b-fp8 | enable_thinking=False |
| Qwen3.5-9B-FP8-Uncensored-Thinking | qwen3.5-9b-fp8-uncensored | Thinking enabled |
| Qwen3.5-9B-FP8-Uncensored-Instruct | qwen3.5-9b-fp8-uncensored | enable_thinking=False |
| Qwen3.5-4B-Thinking | qwen3.5-4b | Thinking enabled |
| Qwen3.5-4B-Instruct | qwen3.5-4b | enable_thinking=False |
| GPT-OSS-20B-Low | gpt-oss-20b | System prompt prefix: "Reasoning: low" |
| GPT-OSS-20B-Medium | gpt-oss-20b | System prompt prefix: "Reasoning: medium" |
| GPT-OSS-20B-High | gpt-oss-20b | System prompt prefix: "Reasoning: high" |
| GPT-OSS-20B-Uncensored-Low | gpt-oss-20b-uncensored | System prompt prefix: "Reasoning: low" |
| GPT-OSS-20B-Uncensored-Medium | gpt-oss-20b-uncensored | System prompt prefix: "Reasoning: medium" |
| GPT-OSS-20B-Uncensored-High | gpt-oss-20b-uncensored | System prompt prefix: "Reasoning: high" |
| cohere-transcribe | cohere-transcribe | ASR (used via /v1/audio/transcriptions) |
| Chatterbox-Turbo | chatterbox-turbo | TTS (used via /v1/audio/speech) |
| Chatterbox-Multilingual | chatterbox-multilingual | TTS |
| Chatterbox | chatterbox | TTS |
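The virtual-to-physical mapping boils down to a small lookup. The sketch below is illustrative: the dict mirrors a subset of the `virtual_models` section of `models.yaml`, and `resolve()` is a hypothetical helper, not the actual llmux API.

```python
# Illustrative virtual -> physical resolution. The registry dict mirrors
# the virtual_models section of models.yaml; resolve() is a hypothetical
# helper name, not the real model_registry.py interface.

VIRTUAL_MODELS = {
    "Qwen3.5-4B-Thinking": {"physical": "qwen3.5-4b",
                            "params": {"enable_thinking": True}},
    "Qwen3.5-4B-Instruct": {"physical": "qwen3.5-4b",
                            "params": {"enable_thinking": False}},
    "GPT-OSS-20B-High": {"physical": "gpt-oss-20b",
                         "params": {"system_prompt_prefix": "Reasoning: high"}},
}

def resolve(virtual_name: str) -> tuple[str, dict]:
    """Map a client-facing model name to (physical_model_id, behavior_params)."""
    try:
        entry = VIRTUAL_MODELS[virtual_name]
    except KeyError:
        raise ValueError(f"unknown model: {virtual_name}")
    return entry["physical"], entry.get("params", {})
```

Because both 4B entries resolve to the same physical id, switching between them never touches VRAM.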

VRAM Manager

Preemption Policy

Models remain loaded until VRAM is needed for another model. No idle timeout — a model stays in VRAM indefinitely until evicted.

Priority (highest to lowest)

  1. ASR (cohere-transcribe) — highest priority, evicted only as last resort
  2. TTS (one Chatterbox variant at a time)
  3. LLM (one at a time) — lowest priority, evicted first

Loading Algorithm

When a request arrives for a model whose physical model is not loaded:

  1. If the physical model is already loaded, proceed immediately.
  2. If it fits in available VRAM, load alongside existing models.
  3. If it doesn't fit, evict models by priority (lowest first) until enough VRAM is free:
    • Evict LLM first
    • Evict TTS second
    • Evict ASR only as last resort
    • Prefer evicting lower-priority models: never evict a higher-priority model when evicting lower-priority ones would free enough VRAM (e.g., never evict ASR to make room for TTS; evict the LLM instead). A higher-priority model is evicted only when the request cannot fit any other way, as when gpt-oss-20b needs most of the GPU.
  4. Load the requested model.
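Steps 1–3 can be sketched as a pure planning function. This is a minimal illustration, assuming models are tracked as (priority, VRAM) pairs; the function name and signature are not the actual vram_manager.py interface.

```python
# Sketch of the eviction planner. loaded maps model_id -> (priority, vram_gb),
# where lower numbers are higher priority: 1 = ASR, 2 = TTS, 3 = LLM.
# plan_load() is a hypothetical helper, not the real llmux API.

TOTAL_VRAM_GB = 16.0

def plan_load(loaded: dict, new_id: str, new_vram: float) -> list:
    """Return model ids to evict, lowest priority first, so new_id fits."""
    if new_id in loaded:
        return []  # step 1: already loaded, nothing to do
    free = TOTAL_VRAM_GB - sum(vram for _, vram in loaded.values())
    evictions = []
    # step 3: evict lowest-priority models first (LLM, then TTS,
    # then ASR only as last resort)
    for mid, (prio, vram) in sorted(loaded.items(), key=lambda kv: -kv[1][0]):
        if new_vram <= free:
            break  # step 2: fits in the currently free VRAM
        evictions.append(mid)
        free += vram
    if new_vram > free:
        raise RuntimeError("model does not fit even after full eviction")
    return evictions
```

Run against the scenarios below, this reproduces them: loading the 9B model evicts only the 4B LLM, while loading gpt-oss-20b clears the GPU entirely.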

Concurrency

  • An asyncio Lock ensures only one load/unload operation at a time.
  • Requests arriving during a model swap await the lock.
  • Inference requests hold a read-lock on their model to prevent eviction mid-inference.
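A minimal sketch of these three rules, assuming a global `asyncio.Lock` for swaps and a per-model reference count standing in for the read-lock; class and method names are illustrative, not the real llmux code.

```python
import asyncio

# Concurrency sketch: one lock serializes load/unload, and a per-model
# refcount blocks eviction while inference is in flight. VramGate is an
# illustrative name, not the actual vram_manager.py class.

class VramGate:
    def __init__(self):
        self._swap_lock = asyncio.Lock()   # only one load/unload at a time
        self._refs: dict[str, int] = {}    # in-flight inferences per model

    async def ensure_loaded(self, model_id: str):
        async with self._swap_lock:        # swap requests queue here
            if model_id not in self._refs:
                self._refs[model_id] = 0   # placeholder for the actual load

    def can_evict(self, model_id: str) -> bool:
        return self._refs.get(model_id, 0) == 0

    async def infer(self, model_id: str, fn):
        await self.ensure_loaded(model_id)
        self._refs[model_id] += 1          # "read-lock": eviction blocked
        try:
            return await fn()
        finally:
            self._refs[model_id] -= 1
```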

Typical Scenarios

| Current State | Request | Action |
| --- | --- | --- |
| ASR + Qwen3.5-4B (~8GB) | Chat with Qwen3.5-4B | Proceed, already loaded |
| ASR + TTS + Qwen3.5-4B (~10GB) | Chat with Qwen3.5-9B-FP8 | Evict LLM (4B), load 9B (~9GB). ASR+TTS+9B ≈ 15GB, fits. |
| ASR + TTS + Qwen3.5-4B (~10GB) | Chat with GPT-OSS-20B | Evict LLM first, then TTS, then ASR if needed. Load gpt-oss-20b alone (~13GB). |
| GPT-OSS-20B loaded (~13GB) | Transcription request | Evict LLM (gpt-oss-20b). Load ASR (~4GB). |
| ASR + Qwen3.5-4B (~8GB) | TTS request | Fits (~10GB). Load Chatterbox alongside. |

API Endpoints

All endpoints on 127.0.0.1:8081. All /v1/* endpoints require Bearer token authentication.

GET /v1/models

Returns all 16 virtual models in OpenAI format, regardless of what's currently loaded. Users can freely select any model; llmux handles swapping.

POST /v1/chat/completions

OpenAI-compatible chat completions. Accepts model parameter matching a virtual model name. Supports stream: true for SSE streaming. The virtual-to-physical mapping and behavior modification (thinking toggle, reasoning system prompt) are applied transparently. Tool/function calling is passed through to models that support it.
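The behavior modification can be sketched as a message-list transform. This is illustrative: `apply_behavior` is a hypothetical helper (in llmux this logic would live in routes/chat.py), and `enable_thinking` is not message surgery at all but a chat-template kwarg handled by the backend.

```python
# Sketch of applying virtual-model params to an OpenAI-style message list.
# system_prompt_prefix is injected here; enable_thinking would be passed
# through to the tokenizer's chat template instead. Hypothetical helper.

def apply_behavior(messages: list[dict], params: dict) -> list[dict]:
    prefix = params.get("system_prompt_prefix")
    if not prefix:
        return messages
    if messages and messages[0]["role"] == "system":
        # prepend to the client's existing system message
        merged = {"role": "system",
                  "content": f"{prefix}\n{messages[0]['content']}"}
        return [merged] + messages[1:]
    return [{"role": "system", "content": prefix}] + messages
```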

POST /v1/audio/transcriptions

OpenAI Whisper-compatible endpoint. Accepts multipart form with audio file and model parameter. Returns transcript in OpenAI response format. Supports language parameter (required by cohere-transcribe — default "en", also "de"). Supported audio formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm.

POST /v1/audio/speech

OpenAI TTS-compatible endpoint. Accepts JSON with model, input (text), voice (maps to Chatterbox voice/speaker config). Returns audio bytes.

GET /health

Unauthenticated. Returns service status and currently loaded models.
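An illustrative response shape (field names are a sketch, not a finalized schema):

```json
{
  "status": "ok",
  "loaded_models": ["cohere-transcribe", "qwen3.5-4b"]
}
```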

Authentication

  • All /v1/* endpoints require a Bearer token (Authorization: Bearer <api-key>)
  • API keys stored in config/api_keys.yaml, mounted read-only into the container
  • Multiple keys: one per client (Open WebUI, remote Whisper, OpenCode, etc.)
  • GET /health is unauthenticated for monitoring/readiness probes
  • Traefik acts purely as a router, no auth on its side

Container & Pod Architecture

Pod

  • Pod name: llmux_pod
  • Single container: llmux_ctr
  • Port: 127.0.0.1:8081:8081
  • GPU: NVIDIA CDI (--device nvidia.com/gpu=all)
  • Network: default (no host loopback needed)

Base Image

pytorch/pytorch:2.11.0-cuda12.8-cudnn9-runtime

Verified compatible with SM12.0 (Blackwell). PyTorch 2.7+ with CUDA 12.8+ supports RTX 5070 Ti. Host driver 590.48 (CUDA 13.1) is backwards compatible.

Dockerfile Layers

  1. System deps: libsndfile, ffmpeg (audio processing)
  2. pip install: FastAPI, uvicorn, transformers (>=5.4.0), llama-cpp-python (CUDA build), chatterbox, soundfile, librosa, sentencepiece, protobuf, PyYAML
  3. Copy llmux application code
  4. Entrypoint: uvicorn llmux.main:app --host 0.0.0.0 --port 8081

Bind Mounts

| Host Path | Container Path | Mode |
| --- | --- | --- |
| /home/llm/.local/share/llmux_pod/models/ | /models | read-only |
| /home/llm/.local/share/llmux_pod/config/ | /config | read-only |

Systemd

Managed via create_pod_llmux.sh following the Kischdle pattern: create pod, create container, generate systemd units, enable service.

Application Structure

llmux/
├── Dockerfile
├── requirements.txt
├── config/
│   ├── models.yaml
│   └── api_keys.yaml
├── llmux/
│   ├── main.py              # FastAPI app, startup/shutdown, health endpoint
│   ├── auth.py              # API key validation middleware
│   ├── vram_manager.py      # VRAM tracking, load/unload, eviction logic
│   ├── model_registry.py    # Parse models.yaml, virtual→physical mapping
│   ├── routes/
│   │   ├── models.py        # GET /v1/models
│   │   ├── chat.py          # POST /v1/chat/completions
│   │   ├── transcription.py # POST /v1/audio/transcriptions
│   │   └── speech.py        # POST /v1/audio/speech
│   └── backends/
│       ├── base.py          # Abstract base class for model backends
│       ├── transformers.py  # HuggingFace transformers backend
│       ├── llamacpp.py      # llama-cpp-python backend (GGUF)
│       └── chatterbox.py    # Chatterbox TTS backend
└── scripts/
    ├── download_models.sh   # Pre-download all model weights
    └── create_pod_llmux.sh  # Podman pod creation script

Key Design Decisions

  • backends/ encapsulates runtime differences. Each backend knows how to load, unload, and run inference. Route handlers are backend-agnostic.
  • vram_manager.py is the single authority on what's loaded. Route handlers call vram_manager.ensure_loaded(physical_model_id) before inference.
  • model_registry.py handles virtual-to-physical mapping and injects behavior params (thinking toggle, system prompts) before passing to the backend.
  • Streaming for chat completions uses FastAPI StreamingResponse with SSE, matching OpenAI streaming format.
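The backend abstraction described above might look like the following sketch. Method names are illustrative; the real interface lives in backends/base.py.

```python
from abc import ABC, abstractmethod

# Sketch of the per-runtime backend interface: the VRAM manager calls
# load/unload, route handlers call infer. Illustrative, not the actual
# backends/base.py definition.

class ModelBackend(ABC):
    def __init__(self, model_id: str, estimated_vram_gb: float):
        self.model_id = model_id
        self.estimated_vram_gb = estimated_vram_gb
        self.loaded = False

    @abstractmethod
    def load(self) -> None:
        """Bring weights into VRAM."""

    @abstractmethod
    def unload(self) -> None:
        """Free VRAM; must be safe to call when no inference is in flight."""

    @abstractmethod
    def infer(self, request: dict) -> dict:
        """Run one request; shape depends on the backend (chat/ASR/TTS)."""
```

Each concrete backend (transformers, llamacpp, chatterbox) subclasses this, keeping route handlers backend-agnostic.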

Model Downloads

All models are pre-downloaded before the pod is created. The scripts/download_models.sh script runs as user llm and downloads to /home/llm/.local/share/llmux_pod/models/.

| Model | Method | Approx Size |
| --- | --- | --- |
| lovedheart/Qwen3.5-9B-FP8 | huggingface-cli download | ~9GB |
| HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive (Q8_0 + mmproj GGUF) | huggingface-cli download (specific files) | ~10GB |
| Qwen/Qwen3.5-4B | huggingface-cli download | ~8GB |
| openai/gpt-oss-20b | huggingface-cli download | ~13GB |
| aoxo/gpt-oss-20b-uncensored | huggingface-cli download | ~13GB |
| CohereLabs/cohere-transcribe-03-2026 | huggingface-cli download (gated, terms accepted) | ~4GB |
| resemble-ai/chatterbox (3 variants) | per Chatterbox install docs | ~2GB |

Total estimated: ~60GB. The script is idempotent (skips existing models). A HuggingFace access token is required for gated models (stored at ~/.cache/huggingface/token).

Open WebUI Configuration

Open WebUI (user wbg, port 8080) connects to llmux:

Connections (Admin > Settings > Connections)

  • OpenAI API Base URL: http://127.0.0.1:8081/v1
  • API Key: the key from api_keys.yaml designated for Open WebUI

Audio (Admin > Settings > Audio)

  • STT Engine: openai
  • STT OpenAI API Base URL: http://127.0.0.1:8081/v1
  • STT Model: cohere-transcribe
  • TTS Engine: openai
  • TTS OpenAI API Base URL: http://127.0.0.1:8081/v1
  • TTS Model: Chatterbox-Multilingual
  • TTS Voice: to be configured based on Chatterbox options

User Experience

  • Model dropdown lists all 16 virtual models
  • Chat works on any model selection (with potential swap delay for first request)
  • Dictation uses cohere-transcribe
  • Audio playback uses Chatterbox
  • Voice chat combines ASR, LLM, and TTS

Traefik Routing

New dynamic config at /home/trf/.local/share/traefik_pod/dynamic/llmux.yml:

http:
  routers:
    llmux:
      entryPoints: ["wghttp"]
      rule: "Host(`kidirekt.kischdle.com`)"
      priority: 100
      service: llmux

  services:
    llmux:
      loadBalancer:
        servers:
          - url: "http://10.0.2.2:8081"

  • Routed through WireGuard VPN entry point
  • No Traefik-level auth (llmux handles API key auth)
  • DNS setup for kidirekt.kischdle.com is a manual step

Configuration Files

config/models.yaml

physical_models:
  qwen3.5-9b-fp8:
    type: llm
    backend: transformers
    model_id: "lovedheart/Qwen3.5-9B-FP8"
    estimated_vram_gb: 9
    supports_vision: true
    supports_tools: true

  qwen3.5-9b-fp8-uncensored:
    type: llm
    backend: llamacpp
    model_file: "Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-Q8_0.gguf"
    mmproj_file: "mmproj-Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-BF16.gguf"
    estimated_vram_gb: 9
    supports_vision: true
    supports_tools: true

  qwen3.5-4b:
    type: llm
    backend: transformers
    model_id: "Qwen/Qwen3.5-4B"
    estimated_vram_gb: 4
    supports_vision: true
    supports_tools: true

  gpt-oss-20b:
    type: llm
    backend: transformers
    model_id: "openai/gpt-oss-20b"
    estimated_vram_gb: 13
    supports_vision: false
    supports_tools: true

  gpt-oss-20b-uncensored:
    type: llm
    backend: transformers
    model_id: "aoxo/gpt-oss-20b-uncensored"
    estimated_vram_gb: 13
    supports_vision: false
    supports_tools: true

  cohere-transcribe:
    type: asr
    backend: transformers
    model_id: "CohereLabs/cohere-transcribe-03-2026"
    estimated_vram_gb: 4
    default_language: "en"

  chatterbox-turbo:
    type: tts
    backend: chatterbox
    variant: "turbo"
    estimated_vram_gb: 2

  chatterbox-multilingual:
    type: tts
    backend: chatterbox
    variant: "multilingual"
    estimated_vram_gb: 2

  chatterbox:
    type: tts
    backend: chatterbox
    variant: "default"
    estimated_vram_gb: 2

virtual_models:
  Qwen3.5-9B-FP8-Thinking:
    physical: qwen3.5-9b-fp8
    params: { enable_thinking: true }
  Qwen3.5-9B-FP8-Instruct:
    physical: qwen3.5-9b-fp8
    params: { enable_thinking: false }

  Qwen3.5-9B-FP8-Uncensored-Thinking:
    physical: qwen3.5-9b-fp8-uncensored
    params: { enable_thinking: true }
  Qwen3.5-9B-FP8-Uncensored-Instruct:
    physical: qwen3.5-9b-fp8-uncensored
    params: { enable_thinking: false }

  Qwen3.5-4B-Thinking:
    physical: qwen3.5-4b
    params: { enable_thinking: true }
  Qwen3.5-4B-Instruct:
    physical: qwen3.5-4b
    params: { enable_thinking: false }

  GPT-OSS-20B-Low:
    physical: gpt-oss-20b
    params: { system_prompt_prefix: "Reasoning: low" }
  GPT-OSS-20B-Medium:
    physical: gpt-oss-20b
    params: { system_prompt_prefix: "Reasoning: medium" }
  GPT-OSS-20B-High:
    physical: gpt-oss-20b
    params: { system_prompt_prefix: "Reasoning: high" }

  GPT-OSS-20B-Uncensored-Low:
    physical: gpt-oss-20b-uncensored
    params: { system_prompt_prefix: "Reasoning: low" }
  GPT-OSS-20B-Uncensored-Medium:
    physical: gpt-oss-20b-uncensored
    params: { system_prompt_prefix: "Reasoning: medium" }
  GPT-OSS-20B-Uncensored-High:
    physical: gpt-oss-20b-uncensored
    params: { system_prompt_prefix: "Reasoning: high" }

  cohere-transcribe:
    physical: cohere-transcribe
  Chatterbox-Turbo:
    physical: chatterbox-turbo
  Chatterbox-Multilingual:
    physical: chatterbox-multilingual
  Chatterbox:
    physical: chatterbox

config/api_keys.yaml

api_keys:
  - key: "sk-llmux-openwebui-<generated>"
    name: "Open WebUI"
  - key: "sk-llmux-whisper-<generated>"
    name: "Remote Whisper clients"
  - key: "sk-llmux-opencode-<generated>"
    name: "OpenCode"

Keys generated at deployment time.
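One way to generate keys matching the `sk-llmux-<client>-<generated>` pattern above, sketched with the stdlib `secrets` module; the exact length and encoding are a deployment-time choice, not specified here.

```python
import secrets

# Illustrative key generation for api_keys.yaml entries. The token
# length is an arbitrary choice for this sketch.

def make_api_key(client: str, nbytes: int = 24) -> str:
    return f"sk-llmux-{client}-{secrets.token_urlsafe(nbytes)}"
```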

Testing & Verification

Phase 1: System Integration (iterative, fix issues before proceeding)

  1. Container build — Dockerfile builds successfully, image contains all dependencies
  2. GPU passthrough — container sees RTX 5070 Ti (nvidia-smi works inside container)
  3. Model mount — container can read model weights from /models
  4. Service startup — llmux starts, port 8081 reachable from host
  5. Open WebUI connection — model list populates in Open WebUI
  6. Traefik routing — kidirekt.kischdle.com routes to llmux (when DNS configured)
  7. Systemd lifecycle — start/stop/restart works, service survives reboot

Phase 2: Functional Tests

  1. Auth — requests without valid API key get 401
  2. Model listing — GET /v1/models returns all 16 virtual models
  3. Chat inference — for each physical LLM, chat via Open WebUI as user "try":
    • Qwen3.5-9B-FP8 (Thinking + Instruct)
    • Qwen3.5-9B-FP8-Uncensored (Thinking + Instruct)
    • Qwen3.5-4B (Thinking + Instruct)
    • GPT-OSS-20B (Low, Medium, High)
    • GPT-OSS-20B-Uncensored (Low, Medium, High)
  4. Streaming — chat responses stream token-by-token in Open WebUI
  5. ASR — Open WebUI dictation transcribes speech (English and German)
  6. TTS — Open WebUI audio playback speaks text
  7. Vision — image + text prompt to each vision-capable model:
    • Qwen3.5-4B
    • Qwen3.5-9B-FP8
    • Qwen3.5-9B-FP8-Uncensored
  8. Tool usage — verify tool calling for each runtime and tool-capable model:
    • Qwen3.5-9B-FP8 (transformers)
    • Qwen3.5-9B-FP8-Uncensored (llama-cpp-python)
    • GPT-OSS-20B (transformers)
    • GPT-OSS-20B-Uncensored (transformers)

Phase 3: VRAM Management Tests

  1. Small LLM — load Qwen3.5-4B (~4GB), verify ASR and TTS remain loaded (~10GB total)
  2. Medium LLM — load Qwen3.5-9B-FP8 (~9GB), verify ASR and TTS remain loaded (~15GB total)
  3. Large LLM — load GPT-OSS-20B (~13GB), verify ASR and TTS are evicted. Next ASR request evicts LLM first.
  4. Model swapping — switch between two LLMs, verify second loads and first is evicted

Phase 4: Performance Tests

  1. Transformers GPU vs CPU — for each transformers-backed physical model, run same prompt on GPU and CPU, verify GPU is at least 5x faster. Requires admin test endpoint or CLI tool to force CPU execution.
    • Qwen3.5-9B-FP8
    • Qwen3.5-4B
    • gpt-oss-20b
    • gpt-oss-20b-uncensored
    • cohere-transcribe
  2. llama-cpp-python GPU vs CPU — run inference for Qwen3.5-9B-FP8-Uncensored with n_gpu_layers=-1 (GPU) and n_gpu_layers=0 (CPU), verify GPU is at least 5x faster. Same admin test endpoint.
  3. Chatterbox performance — run TTS synthesis, verify audio generation time is reasonable relative to audio duration.
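The GPU-vs-CPU comparison in items 1 and 2 reduces to timing the same workload on both paths and checking the ratio. A minimal sketch; `run_gpu`/`run_cpu` are placeholders for whatever the admin test endpoint ends up invoking.

```python
import time

# Sketch of the speedup check: time one callable per execution path and
# compare the ratio against the 5x threshold. The callables are
# placeholders for real inference runs.

def speedup(run_gpu, run_cpu, min_factor: float = 5.0) -> tuple[float, bool]:
    t0 = time.perf_counter(); run_gpu(); gpu_s = time.perf_counter() - t0
    t0 = time.perf_counter(); run_cpu(); cpu_s = time.perf_counter() - t0
    factor = cpu_s / gpu_s
    return factor, factor >= min_factor
```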

Manual Steps

These require human action and cannot be automated:

  • DNS setup for kidirekt.kischdle.com (during implementation)
  • HuggingFace terms for cohere-transcribe: accepted 2026-04-03
  • HuggingFace token configured at ~/.cache/huggingface/token (done for user tlg, needs setup for user llm during deployment)
  • Open WebUI admin configuration (connections, audio settings)