The original eviction logic blocked ASR eviction even when an LLM
genuinely needed all 16GB VRAM (e.g., gpt-oss-20b at 13GB). Now uses
two-pass eviction: first evicts lower/same priority, then cascades to
higher priority as last resort. Added tests for ASR-survives and
full-cascade scenarios.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tracks GPU VRAM usage (16GB) and handles model loading/unloading with
priority-based eviction: LLM (lowest) -> TTS -> ASR (highest, protected).
Uses asyncio Lock for concurrency safety.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>