Description
Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
Add support for VK_EXT_memory_priority in the Vulkan backend so that model allocations (the model weights and the KV cache) are assigned high priority (1.0f). This tells the driver to prefer keeping these allocations resident in device-local VRAM under memory pressure.
Current behavior:
All Vulkan allocations (weights, KV cache) initially reside in VRAM. When memory pressure occurs while llama.cpp is idle, the driver may evict some of these allocations (weights and/or KV cache) to GTT (GPU-accessible system memory).
Expected behavior:
- Model allocations get priority=1.0f -> they are preferred to stay in VRAM
- Desktop apps / other processes with default priority (0.5f when unspecified) are more likely to be evicted first under pressure.
Control
- Add flags and/or environment variables to control this, separately for the model weights and the KV cache if possible.
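As a rough illustration of the environment-variable option, the sketch below reads a priority from a variable and clamps it to the range Vulkan accepts (0.0f to 1.0f). The function names and the variable names mentioned afterwards are hypothetical; no such settings exist in llama.cpp today.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdlib>

// Hypothetical helper: parses a priority string and clamps it to [0, 1].
// Returns false for null, empty, non-numeric, or NaN input.
inline bool parse_priority(const char* text, float& out) {
    if (text == nullptr || *text == '\0') return false;
    char* end = nullptr;
    const float value = std::strtof(text, &end);
    if (end == text || *end != '\0') return false;  // reject trailing junk
    if (std::isnan(value)) return false;
    out = std::min(1.0f, std::max(0.0f, value));
    return true;
}

// Hypothetical helper: reads the priority from an environment variable,
// falling back to the given default when it is unset or invalid.
inline float priority_from_env(const char* name, float fallback) {
    float value = fallback;
    return parse_priority(std::getenv(name), value) ? value : fallback;
}
```

Separate variables for weights and KV cache (for example the made-up names `GGML_VK_WEIGHTS_PRIORITY` and `GGML_VK_KV_PRIORITY`) could each be read this way, with 0.5f as the fallback to match the default when no priority is specified.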
References:
Motivation
Universal problem across GPU sizes: When llama.cpp runs alongside desktop environments or other apps:
- Model loads -> weights fully resident in VRAM
- Idle period -> switch to desktop/browser
- Memory pressure -> compositor requests VRAM
- Driver evicts -> driver sees llama.cpp buffers as "idle", moves allocations to GTT
- Resume generation -> allocations reload from host, but VRAM is still partially occupied -> more data stays in GTT and generation is slower
This especially affects users with little VRAM available.
VK_EXT_memory_priority is designed exactly for this situation: with priority set to 1.0f for critical allocations, the driver can keep model allocations in VRAM and prefer evicting lower‑priority data (which defaults to 0.5f when not specified), reducing eviction‑induced stalls and performance drops.
Possible Implementation
No response