fix: Open WebUI integration — Harmony stripping, VRAM eviction, concurrency lock · 3edc055299 - DesTEngSsv006_swd

- Add harmony.py: strip GPT-OSS-20B analysis/thinking channel from both
  streaming and non-streaming responses (HarmonyStreamFilter + extract_final_text)
- Add per-model asyncio.Lock in llamacpp backend to prevent concurrent C++
  access that caused container segfaults (exit 139)
- Fix chat handler swap for streaming: move inside _stream_generate within
  lock scope (was broken by try/finally running before stream was consumed)
- Filter /v1/models to return only LLM models (hide ASR/TTS from chat dropdown)
- Correct Qwen3.5-4B estimated_vram_gb: 4 → 9 (actual allocation ~8GB)
- Add GPU memory verification after eviction with retry loop in vram_manager
- Add HF_TOKEN_PATH support in main.py for gated model access
- Add /v1/audio/models and /v1/audio/voices discovery endpoints (no auth)
- Add OOM error handling in both backends and chat route
- Add AUDIO_STT_SUPPORTED_CONTENT_TYPES for webm/wav/mp3/ogg
- Add performance test script (scripts/perf_test.py)
- Update tests to match current config (42 tests pass)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

tlg

2026-04-08 21:50:39 +02:00

parent 06923d51b4

commit 3edc055299

15 changed files with 634 additions and 74 deletions

									
kischdle/llmux/config/models.yaml (2 changed lines)
@@ -22,7 +22,7 @@ physical_models:
     type: llm
     backend: transformers
     model_id: "Qwen/Qwen3.5-4B"
-    estimated_vram_gb: 4
+    estimated_vram_gb: 9
     supports_vision: true
     supports_tools: true
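
The commit message mentions verifying GPU memory after eviction with a retry loop. The sketch below shows one way such a check could look; it is not the vram_manager code. `wait_for_free_vram`, `get_free_gb`, and the default retry values are hypothetical, and passing a model's `estimated_vram_gb` as `required_gb` is an assumption about how the estimate is used.

```python
import time
from typing import Callable


def wait_for_free_vram(
    required_gb: float,
    get_free_gb: Callable[[], float],
    retries: int = 5,
    delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll free GPU memory until it reaches required_gb or retries run out.

    get_free_gb is supplied by the caller. With PyTorch it could wrap
    torch.cuda.mem_get_info(), which returns (free_bytes, total_bytes);
    that wiring is an assumption and is not shown here.
    """
    for attempt in range(retries):
        if get_free_gb() >= required_gb:
            return True
        # Memory may be released asynchronously after eviction, so wait
        # before the next reading (but not after the final attempt).
        if attempt < retries - 1:
            sleep(delay_s)
    return False


if __name__ == "__main__":
    # Simulated readings: memory frees up gradually after an eviction.
    readings = iter([5.0, 7.0, 9.5])
    ok = wait_for_free_vram(9.0, lambda: next(readings), sleep=lambda s: None)
    print(ok)
```

Returning a boolean leaves the decision to the caller, for example evicting another model or reporting an out-of-memory error when the check fails.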
