Feature Request: Add VK_EXT_memory_priority support for model allocations (Vulkan backend) #17605

@hikki-gd

Description

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Add support for VK_EXT_memory_priority in the Vulkan backend to assign a high priority (1.0f) to model allocations, i.e. the model weights and the KV cache. This tells the driver to prefer keeping these allocations resident in device-local VRAM under memory pressure.
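For reference, the extension works by chaining a `VkMemoryPriorityAllocateInfoEXT` onto the `VkMemoryAllocateInfo` pNext chain at allocation time. The sketch below illustrates that chaining pattern; the struct and constant definitions are minimal stand-ins mirroring the Vulkan headers so the snippet is self-contained, and `make_model_alloc_info` is a hypothetical helper, not actual ggml-vulkan code.

```cpp
#include <cassert>
#include <cstdint>

// Minimal stand-ins mirroring <vulkan/vulkan.h> so this sketch is
// self-contained; the real backend would use the SDK headers.
using VkStructureType = uint32_t;
constexpr VkStructureType VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO              = 5;
constexpr VkStructureType VK_STRUCTURE_TYPE_MEMORY_PRIORITY_ALLOCATE_INFO_EXT = 1000238000;

struct VkMemoryPriorityAllocateInfoEXT {
    VkStructureType sType;
    const void *    pNext;
    float           priority; // 0.0f..1.0f; the driver default is 0.5f
};

struct VkMemoryAllocateInfo {
    VkStructureType sType;
    const void *    pNext;
    uint64_t        allocationSize;
    uint32_t        memoryTypeIndex;
};

// Hypothetical allocation helper: chain a priority struct onto the
// allocate info used for model weight / KV cache buffers.
VkMemoryAllocateInfo make_model_alloc_info(uint64_t size, uint32_t type_index,
                                           const VkMemoryPriorityAllocateInfoEXT *prio) {
    VkMemoryAllocateInfo info{};
    info.sType           = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
    info.pNext           = prio; // the pNext chain carries the priority hint
    info.allocationSize  = size;
    info.memoryTypeIndex = type_index;
    return info;
}
```

The same `VkMemoryAllocateInfo` would then be passed to `vkAllocateMemory` as usual; drivers without the extension enabled simply never see the chained struct.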

Current behavior:

All Vulkan allocations (weights, KV cache) initially reside in VRAM. When memory pressure occurs and llama.cpp is idle, the driver may evict some of these allocations (weights and/or KV cache) to GTT.

Expected behavior:

  • Model allocations get priority=1.0f -> they are preferred to stay in VRAM
  • Desktop apps / other processes with default priority (0.5f when unspecified) are more likely to be evicted first under pressure.

Control

  • If possible, allow this behavior to be controlled through command-line flags and/or environment variables, separately for the model weights and the KV cache
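One possible shape for the environment-variable control, as a sketch only; the variable names used below (e.g. `GGML_VK_WEIGHT_PRIORITY`) are made up for illustration and do not exist in llama.cpp:

```cpp
#include <cstdlib>

// Hypothetical helper: read a priority override from an environment
// variable, clamped to the valid VK_EXT_memory_priority range [0.0, 1.0].
static float vk_priority_from_env(const char *name, float fallback) {
    const char *val = std::getenv(name);
    if (val == nullptr || *val == '\0') {
        return fallback;
    }
    float p = std::strtof(val, nullptr);
    if (p < 0.0f) p = 0.0f;
    if (p > 1.0f) p = 1.0f;
    return p;
}
```

The backend could then call something like `vk_priority_from_env("GGML_VK_WEIGHT_PRIORITY", 1.0f)` for weight buffers and a separate variable for the KV cache, keeping the two independently tunable as requested above.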

Motivation

This is a universal problem across GPU sizes. When llama.cpp runs alongside desktop environments or other apps:

  1. Model loads -> weights fully resident in VRAM
  2. Idle period -> switch to desktop/browser
  3. Memory pressure -> compositor requests VRAM
  4. Driver evicts -> driver sees llama.cpp buffers as "idle", moves allocations to GTT
  5. Resume generation -> allocations are reloaded from host memory, but VRAM is still partially occupied -> more data is offloaded to GTT and generation slows down

This especially affects users with limited VRAM.

VK_EXT_memory_priority is designed exactly for this situation: with priority set to 1.0f for critical allocations, the driver can keep model allocations in VRAM and prefer evicting lower‑priority data (which defaults to 0.5f when not specified), reducing eviction‑induced stalls and performance drops.
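Since VK_EXT_memory_priority is an optional device extension, the backend would first need to confirm the device advertises it (and enable it at device creation) before chaining priorities. A minimal sketch of that availability check, assuming the extension names have already been collected via `vkEnumerateDeviceExtensionProperties` (the enumeration itself is omitted):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Sketch: given the extension names reported by
// vkEnumerateDeviceExtensionProperties, decide whether memory
// priorities can be used at all; otherwise fall back to the
// current behavior with no priority hints.
bool supports_memory_priority(const std::vector<std::string> &device_extensions) {
    return std::find(device_extensions.begin(), device_extensions.end(),
                     "VK_EXT_memory_priority") != device_extensions.end();
}
```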

Possible Implementation

No response

Metadata

Labels

Vulkan (Issues specific to the Vulkan backend), enhancement (New feature or request)
