DesTEngSsv006_swd/kischdle/llmux/docs/superpowers/specs/2026-04-03-llmux-design.md

llmux Design Specification

Overview

llmux is a single-process FastAPI application that manages multiple AI models on a single GPU (NVIDIA RTX 5070 Ti, 16GB VRAM). It provides an OpenAI-compatible API for chat completions, speech-to-text, and text-to-speech, serving as the unified AI backend for Open WebUI and external clients on the Kischdle on-premise system.

Hardware Constraints

  • GPU: NVIDIA RTX 5070 Ti, 16GB VRAM, compute capability 12.0 (Blackwell/SM12.0)
  • CPU: AMD Ryzen 9 9900X
  • RAM: 64GB DDR5
  • Storage: ~1.3TB free on /home
  • OS: Debian 12 (Bookworm)
  • NVIDIA driver: 590.48 (CUDA 13.1 capable)
  • Host CUDA toolkit: 12.8

Architecture

Single Process Design

llmux is a monolithic FastAPI application. One Python process handles all model loading/unloading, VRAM management, and inference routing. This keeps the system simple and gives full control over GPU memory.

Runtimes

Three inference runtimes coexist within the single process:

| Runtime | Purpose | Models |
| --- | --- | --- |
| transformers (HuggingFace) | HF safetensors models | Qwen3.5-9B-FP8, Qwen3.5-4B, gpt-oss-20b, gpt-oss-20b-uncensored, cohere-transcribe |
| llama-cpp-python | GGUF models | Qwen3.5-9B-FP8-Uncensored |
| chatterbox | TTS | Chatterbox-Turbo, Chatterbox-Multilingual, Chatterbox |

Why transformers (not vLLM)

vLLM lacks stable support for SM12.0 (RTX Blackwell consumer GPUs). Specifically, NVFP4 MoE kernels fail on SM12.0 (vllm-project/vllm#33416). The PyTorch transformers stack works with PyTorch 2.7+ and CUDA 12.8+ on SM12.0. vLLM can be reconsidered once SM12.0 support matures.

Physical Models

| ID | Type | Backend | HuggingFace / Source | Estimated VRAM | Vision | Tools |
| --- | --- | --- | --- | --- | --- | --- |
| qwen3.5-9b-fp8 | LLM | transformers | lovedheart/Qwen3.5-9B-FP8 | ~9GB | yes | yes |
| qwen3.5-9b-fp8-uncensored | LLM | llamacpp | HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive (Q8_0 GGUF + mmproj GGUF) | ~9GB | yes | yes |
| qwen3.5-4b | LLM | transformers | Qwen/Qwen3.5-4B | ~4GB | yes | yes |
| gpt-oss-20b | LLM | transformers | openai/gpt-oss-20b (MXFP4 quantized MoE, designed for 16GB VRAM) | ~13GB | no | yes |
| gpt-oss-20b-uncensored | LLM | transformers | aoxo/gpt-oss-20b-uncensored | ~13GB | no | yes |
| cohere-transcribe | ASR | transformers | CohereLabs/cohere-transcribe-03-2026 (gated, terms accepted) | ~4GB | n/a | n/a |
| chatterbox-turbo | TTS | chatterbox | resemble-ai/chatterbox (turbo variant) | ~2GB | n/a | n/a |
| chatterbox-multilingual | TTS | chatterbox | resemble-ai/chatterbox (multilingual variant) | ~2GB | n/a | n/a |
| chatterbox | TTS | chatterbox | resemble-ai/chatterbox (default variant) | ~2GB | n/a | n/a |

Virtual Models

Virtual models are what Open WebUI and API clients see. Multiple virtual models can map to the same physical model with different behavior parameters. Switching between virtual models that share a physical model has zero VRAM cost.

| Virtual Model Name | Physical Model | Behavior |
| --- | --- | --- |
| Qwen3.5-9B-FP8-Thinking | qwen3.5-9b-fp8 | Thinking enabled (default Qwen3.5 behavior) |
| Qwen3.5-9B-FP8-Instruct | qwen3.5-9b-fp8 | enable_thinking=False |
| Qwen3.5-9B-FP8-Uncensored-Thinking | qwen3.5-9b-fp8-uncensored | Thinking enabled |
| Qwen3.5-9B-FP8-Uncensored-Instruct | qwen3.5-9b-fp8-uncensored | enable_thinking=False |
| Qwen3.5-4B-Thinking | qwen3.5-4b | Thinking enabled |
| Qwen3.5-4B-Instruct | qwen3.5-4b | enable_thinking=False |
| GPT-OSS-20B-Low | gpt-oss-20b | System prompt prefix: "Reasoning: low" |
| GPT-OSS-20B-Medium | gpt-oss-20b | System prompt prefix: "Reasoning: medium" |
| GPT-OSS-20B-High | gpt-oss-20b | System prompt prefix: "Reasoning: high" |
| GPT-OSS-20B-Uncensored-Low | gpt-oss-20b-uncensored | System prompt prefix: "Reasoning: low" |
| GPT-OSS-20B-Uncensored-Medium | gpt-oss-20b-uncensored | System prompt prefix: "Reasoning: medium" |
| GPT-OSS-20B-Uncensored-High | gpt-oss-20b-uncensored | System prompt prefix: "Reasoning: high" |
| cohere-transcribe | cohere-transcribe | ASR (used via /v1/audio/transcriptions) |
| Chatterbox-Turbo | chatterbox-turbo | TTS (used via /v1/audio/speech) |
| Chatterbox-Multilingual | chatterbox-multilingual | TTS |
| Chatterbox | chatterbox | TTS |
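The virtual-to-physical mapping boils down to a small lookup. The sketch below is illustrative: the dict mirrors a subset of the `virtual_models` section of `models.yaml`, and `resolve()` is a hypothetical helper, not the actual llmux API.

```python
# Illustrative virtual -> physical resolution. The registry dict mirrors
# the virtual_models section of models.yaml; resolve() is a hypothetical
# helper name, not the real model_registry.py interface.

VIRTUAL_MODELS = {
    "Qwen3.5-4B-Thinking": {"physical": "qwen3.5-4b",
                            "params": {"enable_thinking": True}},
    "Qwen3.5-4B-Instruct": {"physical": "qwen3.5-4b",
                            "params": {"enable_thinking": False}},
    "GPT-OSS-20B-High": {"physical": "gpt-oss-20b",
                         "params": {"system_prompt_prefix": "Reasoning: high"}},
}

def resolve(virtual_name: str) -> tuple[str, dict]:
    """Map a client-facing model name to (physical_model_id, behavior_params)."""
    try:
        entry = VIRTUAL_MODELS[virtual_name]
    except KeyError:
        raise ValueError(f"unknown model: {virtual_name}")
    return entry["physical"], entry.get("params", {})
```

Because both 4B entries resolve to the same physical id, switching between them never touches VRAM.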

VRAM Manager

Preemption Policy

Models remain loaded until VRAM is needed for another model. No idle timeout — a model stays in VRAM indefinitely until evicted.

Priority (highest to lowest)

  1. ASR (cohere-transcribe) — highest priority, evicted only as last resort
  2. TTS (one Chatterbox variant at a time)
  3. LLM (one at a time) — lowest priority, evicted first

Loading Algorithm

When a request arrives for a model whose physical model is not loaded:

  1. If the physical model is already loaded, proceed immediately.
  2. If it fits in available VRAM, load alongside existing models.
  3. If it doesn't fit, evict models by priority (lowest first) until enough VRAM is free:
    • Evict LLM first
    • Evict TTS second
    • Evict ASR only as last resort
    • Prefer evicting lower-priority models: never evict a higher-priority model when evicting lower-priority ones would free enough VRAM (e.g., never evict ASR to make room for TTS; evict the LLM instead). A higher-priority model is evicted only when the request cannot fit any other way, as when gpt-oss-20b needs most of the GPU.
  4. Load the requested model.
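Steps 1–3 can be sketched as a pure planning function. This is a minimal illustration, assuming models are tracked as (priority, VRAM) pairs; the function name and signature are not the actual vram_manager.py interface.

```python
# Sketch of the eviction planner. loaded maps model_id -> (priority, vram_gb),
# where lower numbers are higher priority: 1 = ASR, 2 = TTS, 3 = LLM.
# plan_load() is a hypothetical helper, not the real llmux API.

TOTAL_VRAM_GB = 16.0

def plan_load(loaded: dict, new_id: str, new_vram: float) -> list:
    """Return model ids to evict, lowest priority first, so new_id fits."""
    if new_id in loaded:
        return []  # step 1: already loaded, nothing to do
    free = TOTAL_VRAM_GB - sum(vram for _, vram in loaded.values())
    evictions = []
    # step 3: evict lowest-priority models first (LLM, then TTS,
    # then ASR only as last resort)
    for mid, (prio, vram) in sorted(loaded.items(), key=lambda kv: -kv[1][0]):
        if new_vram <= free:
            break  # step 2: fits in the currently free VRAM
        evictions.append(mid)
        free += vram
    if new_vram > free:
        raise RuntimeError("model does not fit even after full eviction")
    return evictions
```

Run against the scenarios below, this reproduces them: loading the 9B model evicts only the 4B LLM, while loading gpt-oss-20b clears the GPU entirely.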

Concurrency

  • An asyncio Lock ensures only one load/unload operation at a time.
  • Requests arriving during a model swap await the lock.
  • Inference requests hold a read-lock on their model to prevent eviction mid-inference.
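A minimal sketch of these three rules, assuming a global `asyncio.Lock` for swaps and a per-model reference count standing in for the read-lock; class and method names are illustrative, not the real llmux code.

```python
import asyncio

# Concurrency sketch: one lock serializes load/unload, and a per-model
# refcount blocks eviction while inference is in flight. VramGate is an
# illustrative name, not the actual vram_manager.py class.

class VramGate:
    def __init__(self):
        self._swap_lock = asyncio.Lock()   # only one load/unload at a time
        self._refs: dict[str, int] = {}    # in-flight inferences per model

    async def ensure_loaded(self, model_id: str):
        async with self._swap_lock:        # swap requests queue here
            if model_id not in self._refs:
                self._refs[model_id] = 0   # placeholder for the actual load

    def can_evict(self, model_id: str) -> bool:
        return self._refs.get(model_id, 0) == 0

    async def infer(self, model_id: str, fn):
        await self.ensure_loaded(model_id)
        self._refs[model_id] += 1          # "read-lock": eviction blocked
        try:
            return await fn()
        finally:
            self._refs[model_id] -= 1
```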

Typical Scenarios

| Current State | Request | Action |
| --- | --- | --- |
| ASR + Qwen3.5-4B (~8GB) | Chat with Qwen3.5-4B | Proceed, already loaded |
| ASR + TTS + Qwen3.5-4B (~10GB) | Chat with Qwen3.5-9B-FP8 | Evict LLM (4B), load 9B (~9GB). ASR+TTS+9B ≈ 15GB, fits. |
| ASR + TTS + Qwen3.5-4B (~10GB) | Chat with GPT-OSS-20B | Evict LLM first, then TTS, then ASR if needed. Load gpt-oss-20b alone (~13GB). |
| GPT-OSS-20B loaded (~13GB) | Transcription request | Evict LLM (gpt-oss-20b). Load ASR (~4GB). |
| ASR + Qwen3.5-4B (~8GB) | TTS request | Fits (~10GB). Load Chatterbox alongside. |

API Endpoints

All endpoints on 127.0.0.1:8081. All /v1/* endpoints require Bearer token authentication.

GET /v1/models

Returns all 16 virtual models in OpenAI format, regardless of what's currently loaded. Users can freely select any model; llmux handles swapping.

POST /v1/chat/completions

OpenAI-compatible chat completions. Accepts model parameter matching a virtual model name. Supports stream: true for SSE streaming. The virtual-to-physical mapping and behavior modification (thinking toggle, reasoning system prompt) are applied transparently. Tool/function calling is passed through to models that support it.
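The behavior modification can be sketched as a message-list transform. This is illustrative: `apply_behavior` is a hypothetical helper (in llmux this logic would live in routes/chat.py), and `enable_thinking` is not message surgery at all but a chat-template kwarg handled by the backend.

```python
# Sketch of applying virtual-model params to an OpenAI-style message list.
# system_prompt_prefix is injected here; enable_thinking would be passed
# through to the tokenizer's chat template instead. Hypothetical helper.

def apply_behavior(messages: list[dict], params: dict) -> list[dict]:
    prefix = params.get("system_prompt_prefix")
    if not prefix:
        return messages
    if messages and messages[0]["role"] == "system":
        # prepend to the client's existing system message
        merged = {"role": "system",
                  "content": f"{prefix}\n{messages[0]['content']}"}
        return [merged] + messages[1:]
    return [{"role": "system", "content": prefix}] + messages
```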

POST /v1/audio/transcriptions

OpenAI Whisper-compatible endpoint. Accepts multipart form with audio file and model parameter. Returns transcript in OpenAI response format. Supports language parameter (required by cohere-transcribe — default "en", also "de"). Supported audio formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm.

POST /v1/audio/speech

OpenAI TTS-compatible endpoint. Accepts JSON with model, input (text), voice (maps to Chatterbox voice/speaker config). Returns audio bytes.

GET /health

Unauthenticated. Returns service status and currently loaded models.
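An illustrative response shape (field names are a sketch, not a finalized schema):

```json
{
  "status": "ok",
  "loaded_models": ["cohere-transcribe", "qwen3.5-4b"]
}
```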

Authentication

  • All /v1/* endpoints require a Bearer token (Authorization: Bearer <api-key>)
  • API keys stored in config/api_keys.yaml, mounted read-only into the container
  • Multiple keys: one per client (Open WebUI, remote Whisper, OpenCode, etc.)
  • GET /health is unauthenticated for monitoring/readiness probes
  • Traefik acts purely as a router, no auth on its side

Container & Pod Architecture

Pod

  • Pod name: llmux_pod
  • Single container: llmux_ctr
  • Port: 127.0.0.1:8081:8081
  • GPU: NVIDIA CDI (--device nvidia.com/gpu=all)
  • Network: default (no host loopback needed)

Base Image

pytorch/pytorch:2.11.0-cuda12.8-cudnn9-runtime

Verified compatible with SM12.0 (Blackwell). PyTorch 2.7+ with CUDA 12.8+ supports RTX 5070 Ti. Host driver 590.48 (CUDA 13.1) is backwards compatible.

Dockerfile Layers

  1. System deps: libsndfile, ffmpeg (audio processing)
  2. pip install: FastAPI, uvicorn, transformers (>=5.4.0), llama-cpp-python (CUDA build), chatterbox, soundfile, librosa, sentencepiece, protobuf, PyYAML
  3. Copy llmux application code
  4. Entrypoint: uvicorn llmux.main:app --host 0.0.0.0 --port 8081

Bind Mounts

| Host Path | Container Path | Mode |
| --- | --- | --- |
| /home/llm/.local/share/llmux_pod/models/ | /models | read-only |
| /home/llm/.local/share/llmux_pod/config/ | /config | read-only |

Systemd

Managed via create_pod_llmux.sh following the Kischdle pattern: create pod, create container, generate systemd units, enable service.

Application Structure

llmux/
├── Dockerfile
├── requirements.txt
├── config/
│   ├── models.yaml
│   └── api_keys.yaml
├── llmux/
│   ├── main.py              # FastAPI app, startup/shutdown, health endpoint
│   ├── auth.py              # API key validation middleware
│   ├── vram_manager.py      # VRAM tracking, load/unload, eviction logic
│   ├── model_registry.py    # Parse models.yaml, virtual→physical mapping
│   ├── routes/
│   │   ├── models.py        # GET /v1/models
│   │   ├── chat.py          # POST /v1/chat/completions
│   │   ├── transcription.py # POST /v1/audio/transcriptions
│   │   └── speech.py        # POST /v1/audio/speech
│   └── backends/
│       ├── base.py          # Abstract base class for model backends
│       ├── transformers.py  # HuggingFace transformers backend
│       ├── llamacpp.py      # llama-cpp-python backend (GGUF)
│       └── chatterbox.py    # Chatterbox TTS backend
└── scripts/
    ├── download_models.sh   # Pre-download all model weights
    └── create_pod_llmux.sh  # Podman pod creation script

Key Design Decisions

  • backends/ encapsulates runtime differences. Each backend knows how to load, unload, and run inference. Route handlers are backend-agnostic.
  • vram_manager.py is the single authority on what's loaded. Route handlers call vram_manager.ensure_loaded(physical_model_id) before inference.
  • model_registry.py handles virtual-to-physical mapping and injects behavior params (thinking toggle, system prompts) before passing to the backend.
  • Streaming for chat completions uses FastAPI StreamingResponse with SSE, matching OpenAI streaming format.
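The backend abstraction described above might look like the following sketch. Method names are illustrative; the real interface lives in backends/base.py.

```python
from abc import ABC, abstractmethod

# Sketch of the per-runtime backend interface: the VRAM manager calls
# load/unload, route handlers call infer. Illustrative, not the actual
# backends/base.py definition.

class ModelBackend(ABC):
    def __init__(self, model_id: str, estimated_vram_gb: float):
        self.model_id = model_id
        self.estimated_vram_gb = estimated_vram_gb
        self.loaded = False

    @abstractmethod
    def load(self) -> None:
        """Bring weights into VRAM."""

    @abstractmethod
    def unload(self) -> None:
        """Free VRAM; must be safe to call when no inference is in flight."""

    @abstractmethod
    def infer(self, request: dict) -> dict:
        """Run one request; shape depends on the backend (chat/ASR/TTS)."""
```

Each concrete backend (transformers, llamacpp, chatterbox) subclasses this, keeping route handlers backend-agnostic.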

Model Downloads

All models are pre-downloaded before the pod is created. The scripts/download_models.sh script runs as user llm and downloads to /home/llm/.local/share/llmux_pod/models/.

| Model | Method | Approx Size |
| --- | --- | --- |
| lovedheart/Qwen3.5-9B-FP8 | huggingface-cli download | ~9GB |
| HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive (Q8_0 + mmproj GGUF) | huggingface-cli download (specific files) | ~10GB |
| Qwen/Qwen3.5-4B | huggingface-cli download | ~8GB |
| openai/gpt-oss-20b | huggingface-cli download | ~13GB |
| aoxo/gpt-oss-20b-uncensored | huggingface-cli download | ~13GB |
| CohereLabs/cohere-transcribe-03-2026 | huggingface-cli download (gated, terms accepted) | ~4GB |
| resemble-ai/chatterbox (3 variants) | per Chatterbox install docs | ~2GB |

Total estimated: ~60GB. The script is idempotent (skips existing models). A HuggingFace access token is required for gated models (stored at ~/.cache/huggingface/token).

Open WebUI Configuration

Open WebUI (user wbg, port 8080) connects to llmux:

Connections (Admin > Settings > Connections)

  • OpenAI API Base URL: http://127.0.0.1:8081/v1
  • API Key: the key from api_keys.yaml designated for Open WebUI

Audio (Admin > Settings > Audio)

  • STT Engine: openai
  • STT OpenAI API Base URL: http://127.0.0.1:8081/v1
  • STT Model: cohere-transcribe
  • TTS Engine: openai
  • TTS OpenAI API Base URL: http://127.0.0.1:8081/v1
  • TTS Model: Chatterbox-Multilingual
  • TTS Voice: to be configured based on Chatterbox options

User Experience

  • Model dropdown lists all 16 virtual models
  • Chat works on any model selection (with potential swap delay for first request)
  • Dictation uses cohere-transcribe
  • Audio playback uses Chatterbox
  • Voice chat combines ASR, LLM, and TTS

Traefik Routing

New dynamic config at /home/trf/.local/share/traefik_pod/dynamic/llmux.yml:

http:
  routers:
    llmux:
      entryPoints: ["wghttp"]
      rule: "Host(`kidirekt.kischdle.com`)"
      priority: 100
      service: llmux

  services:
    llmux:
      loadBalancer:
        servers:
          - url: "http://10.0.2.2:8081"

  • Routed through WireGuard VPN entry point
  • No Traefik-level auth (llmux handles API key auth)
  • DNS setup for kidirekt.kischdle.com is a manual step

Configuration Files

config/models.yaml

physical_models:
  qwen3.5-9b-fp8:
    type: llm
    backend: transformers
    model_id: "lovedheart/Qwen3.5-9B-FP8"
    estimated_vram_gb: 9
    supports_vision: true
    supports_tools: true

  qwen3.5-9b-fp8-uncensored:
    type: llm
    backend: llamacpp
    model_file: "Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-Q8_0.gguf"
    mmproj_file: "mmproj-Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-BF16.gguf"
    estimated_vram_gb: 9
    supports_vision: true
    supports_tools: true

  qwen3.5-4b:
    type: llm
    backend: transformers
    model_id: "Qwen/Qwen3.5-4B"
    estimated_vram_gb: 4
    supports_vision: true
    supports_tools: true

  gpt-oss-20b:
    type: llm
    backend: transformers
    model_id: "openai/gpt-oss-20b"
    estimated_vram_gb: 13
    supports_vision: false
    supports_tools: true

  gpt-oss-20b-uncensored:
    type: llm
    backend: transformers
    model_id: "aoxo/gpt-oss-20b-uncensored"
    estimated_vram_gb: 13
    supports_vision: false
    supports_tools: true

  cohere-transcribe:
    type: asr
    backend: transformers
    model_id: "CohereLabs/cohere-transcribe-03-2026"
    estimated_vram_gb: 4
    default_language: "en"

  chatterbox-turbo:
    type: tts
    backend: chatterbox
    variant: "turbo"
    estimated_vram_gb: 2

  chatterbox-multilingual:
    type: tts
    backend: chatterbox
    variant: "multilingual"
    estimated_vram_gb: 2

  chatterbox:
    type: tts
    backend: chatterbox
    variant: "default"
    estimated_vram_gb: 2

virtual_models:
  Qwen3.5-9B-FP8-Thinking:
    physical: qwen3.5-9b-fp8
    params: { enable_thinking: true }
  Qwen3.5-9B-FP8-Instruct:
    physical: qwen3.5-9b-fp8
    params: { enable_thinking: false }

  Qwen3.5-9B-FP8-Uncensored-Thinking:
    physical: qwen3.5-9b-fp8-uncensored
    params: { enable_thinking: true }
  Qwen3.5-9B-FP8-Uncensored-Instruct:
    physical: qwen3.5-9b-fp8-uncensored
    params: { enable_thinking: false }

  Qwen3.5-4B-Thinking:
    physical: qwen3.5-4b
    params: { enable_thinking: true }
  Qwen3.5-4B-Instruct:
    physical: qwen3.5-4b
    params: { enable_thinking: false }

  GPT-OSS-20B-Low:
    physical: gpt-oss-20b
    params: { system_prompt_prefix: "Reasoning: low" }
  GPT-OSS-20B-Medium:
    physical: gpt-oss-20b
    params: { system_prompt_prefix: "Reasoning: medium" }
  GPT-OSS-20B-High:
    physical: gpt-oss-20b
    params: { system_prompt_prefix: "Reasoning: high" }

  GPT-OSS-20B-Uncensored-Low:
    physical: gpt-oss-20b-uncensored
    params: { system_prompt_prefix: "Reasoning: low" }
  GPT-OSS-20B-Uncensored-Medium:
    physical: gpt-oss-20b-uncensored
    params: { system_prompt_prefix: "Reasoning: medium" }
  GPT-OSS-20B-Uncensored-High:
    physical: gpt-oss-20b-uncensored
    params: { system_prompt_prefix: "Reasoning: high" }

  cohere-transcribe:
    physical: cohere-transcribe
  Chatterbox-Turbo:
    physical: chatterbox-turbo
  Chatterbox-Multilingual:
    physical: chatterbox-multilingual
  Chatterbox:
    physical: chatterbox

config/api_keys.yaml

api_keys:
  - key: "sk-llmux-openwebui-<generated>"
    name: "Open WebUI"
  - key: "sk-llmux-whisper-<generated>"
    name: "Remote Whisper clients"
  - key: "sk-llmux-opencode-<generated>"
    name: "OpenCode"

Keys generated at deployment time.
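One way to generate keys matching the `sk-llmux-<client>-<generated>` pattern above, sketched with the stdlib `secrets` module; the exact length and encoding are a deployment-time choice, not specified here.

```python
import secrets

# Illustrative key generation for api_keys.yaml entries. The token
# length is an arbitrary choice for this sketch.

def make_api_key(client: str, nbytes: int = 24) -> str:
    return f"sk-llmux-{client}-{secrets.token_urlsafe(nbytes)}"
```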

Testing & Verification

Phase 1: System Integration (iterative, fix issues before proceeding)

  1. Container build — Dockerfile builds successfully, image contains all dependencies
  2. GPU passthrough — container sees RTX 5070 Ti (nvidia-smi works inside container)
  3. Model mount — container can read model weights from /models
  4. Service startup — llmux starts, port 8081 reachable from host
  5. Open WebUI connection — model list populates in Open WebUI
  6. Traefik routing — kidirekt.kischdle.com routes to llmux (when DNS configured)
  7. Systemd lifecycle — start/stop/restart works, service survives reboot

Phase 2: Functional Tests

  1. Auth — requests without valid API key get 401
  2. Model listing — GET /v1/models returns all 16 virtual models
  3. Chat inference — for each physical LLM, chat via Open WebUI as user "try":
    • Qwen3.5-9B-FP8 (Thinking + Instruct)
    • Qwen3.5-9B-FP8-Uncensored (Thinking + Instruct)
    • Qwen3.5-4B (Thinking + Instruct)
    • GPT-OSS-20B (Low, Medium, High)
    • GPT-OSS-20B-Uncensored (Low, Medium, High)
  4. Streaming — chat responses stream token-by-token in Open WebUI
  5. ASR — Open WebUI dictation transcribes speech (English and German)
  6. TTS — Open WebUI audio playback speaks text
  7. Vision — image + text prompt to each vision-capable model:
    • Qwen3.5-4B
    • Qwen3.5-9B-FP8
    • Qwen3.5-9B-FP8-Uncensored
  8. Tool usage — verify tool calling for each runtime and tool-capable model:
    • Qwen3.5-9B-FP8 (transformers)
    • Qwen3.5-9B-FP8-Uncensored (llama-cpp-python)
    • GPT-OSS-20B (transformers)
    • GPT-OSS-20B-Uncensored (transformers)

Phase 3: VRAM Management Tests

  1. Small LLM — load Qwen3.5-4B (~4GB), verify ASR and TTS remain loaded (~10GB total)
  2. Medium LLM — load Qwen3.5-9B-FP8 (~9GB), verify ASR and TTS remain loaded (~15GB total)
  3. Large LLM — load GPT-OSS-20B (~13GB), verify ASR and TTS are evicted. Next ASR request evicts LLM first.
  4. Model swapping — switch between two LLMs, verify second loads and first is evicted

Phase 4: Performance Tests

  1. Transformers GPU vs CPU — for each transformers-backed physical model, run same prompt on GPU and CPU, verify GPU is at least 5x faster. Requires admin test endpoint or CLI tool to force CPU execution.
    • Qwen3.5-9B-FP8
    • Qwen3.5-4B
    • gpt-oss-20b
    • gpt-oss-20b-uncensored
    • cohere-transcribe
  2. llama-cpp-python GPU vs CPU — run inference for Qwen3.5-9B-FP8-Uncensored with n_gpu_layers=-1 (GPU) and n_gpu_layers=0 (CPU), verify GPU is at least 5x faster. Same admin test endpoint.
  3. Chatterbox performance — run TTS synthesis, verify audio generation time is reasonable relative to audio duration.
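The GPU-vs-CPU comparison in items 1 and 2 reduces to timing the same workload on both paths and checking the ratio. A minimal sketch; `run_gpu`/`run_cpu` are placeholders for whatever the admin test endpoint ends up invoking.

```python
import time

# Sketch of the speedup check: time one callable per execution path and
# compare the ratio against the 5x threshold. The callables are
# placeholders for real inference runs.

def speedup(run_gpu, run_cpu, min_factor: float = 5.0) -> tuple[float, bool]:
    t0 = time.perf_counter(); run_gpu(); gpu_s = time.perf_counter() - t0
    t0 = time.perf_counter(); run_cpu(); cpu_s = time.perf_counter() - t0
    factor = cpu_s / gpu_s
    return factor, factor >= min_factor
```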

Manual Steps

These require human action and cannot be automated:

  • DNS setup for kidirekt.kischdle.com (during implementation)
  • HuggingFace terms for cohere-transcribe: accepted 2026-04-03
  • HuggingFace token configured at ~/.cache/huggingface/token (done for user tlg, needs setup for user llm during deployment)
  • Open WebUI admin configuration (connections, audio settings)