Feature Request: Add VK_EXT_memory_priority support for model allocations (Vulkan backend) #17605

### Prerequisites

- [x] I am running the latest code. Mention the version if possible as well.
- [x] I carefully followed the [README.md](https://github.com/ggml-org/llama.cpp/blob/master/README.md).
- [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [x] I reviewed the [Discussions](https://github.com/ggml-org/llama.cpp/discussions), and have a new and useful enhancement to share.

### Feature Description

Add support for `VK_EXT_memory_priority` in the Vulkan backend to assign **high priority (1.0f)** to model allocations, i.e. the model weights and the KV cache. This tells the driver to prefer keeping these allocations resident in device-local VRAM under memory pressure.
#### Current behavior:
All Vulkan allocations (weights, KV cache) initially reside in VRAM. When memory pressure occurs while llama.cpp is idle, the driver may evict some of these allocations to GTT (system memory mapped for GPU access).
#### Expected behavior:
- Model allocations get `priority=1.0f` -> they are preferred to stay in VRAM
- Desktop apps / other processes with default priority (0.5f when unspecified) are more likely to be evicted first under pressure.
#### Control
- Allow controlling this through command-line flags and/or environment variables, separately for the model weights and the KV cache if possible
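
As a rough illustration of the environment-variable route, the sketch below reads one priority per allocation class and clamps it to the range the extension accepts. The variable names `GGML_VK_WEIGHTS_PRIORITY` and `GGML_VK_KV_PRIORITY`, the `vk_priority_config` struct, and both functions are hypothetical; nothing like them exists in llama.cpp today.

```cpp
#include <cmath>
#include <cstdlib>

// Hypothetical environment variable names, used for illustration only.
static const char * const ENV_WEIGHTS_PRIORITY = "GGML_VK_WEIGHTS_PRIORITY";
static const char * const ENV_KV_PRIORITY      = "GGML_VK_KV_PRIORITY";

// Priority Vulkan assumes when no VkMemoryPriorityAllocateInfoEXT is chained.
static const float DEFAULT_PRIORITY = 0.5f;

// Parse a priority string. Missing or malformed input yields `fallback`;
// out-of-range values are clamped to [0.0, 1.0].
inline float parse_priority(const char * raw, float fallback) {
    if (raw == nullptr || *raw == '\0') {
        return fallback;
    }
    char * end = nullptr;
    const float value = std::strtof(raw, &end);
    if (end == raw || *end != '\0' || std::isnan(value)) {
        return fallback;  // not a number, or trailing garbage
    }
    if (value < 0.0f) return 0.0f;
    if (value > 1.0f) return 1.0f;
    return value;
}

// Separate priorities for model weights and KV cache.
struct vk_priority_config {
    float weights;
    float kv_cache;
};

inline vk_priority_config load_priority_config() {
    vk_priority_config cfg;
    cfg.weights  = parse_priority(std::getenv(ENV_WEIGHTS_PRIORITY), DEFAULT_PRIORITY);
    cfg.kv_cache = parse_priority(std::getenv(ENV_KV_PRIORITY), DEFAULT_PRIORITY);
    return cfg;
}
```

Falling back to 0.5f keeps the current behavior when neither variable is set, so the feature would be opt-in. Defaulting to 1.0f instead is a design choice left open here.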
#### References:
- [Vulkan reference page for `VkMemoryPriorityAllocateInfoEXT`](https://docs.vulkan.org/refpages/latest/refpages/source/VkMemoryPriorityAllocateInfoEXT.html)
- [GPUOpen Vulkan Memory Allocator guide on `VK_EXT_memory_priority`](https://gpuopen-librariesandsdks.github.io/VulkanMemoryAllocator/html/vk_ext_memory_priority.html)

### Motivation

**Universal problem across GPU sizes:** When llama.cpp runs alongside desktop environments or other apps:
1. Model loads -> weights fully resident in VRAM 
2. Idle period -> switch to desktop/browser
3. Memory pressure -> compositor requests VRAM
4. Driver evicts -> driver sees llama.cpp buffers as "idle", moves allocations to GTT
5. Resume generation -> allocations are reloaded from host memory, but VRAM is still partially occupied, so some data remains in GTT and generation slows down

This especially affects users with little available VRAM.

`VK_EXT_memory_priority` is designed exactly for this situation: with priority set to 1.0f for critical allocations, the driver can keep model allocations in VRAM and prefer evicting lower‑priority data (which defaults to 0.5f when not specified), reducing eviction‑induced stalls and performance drops.

### Possible Implementation

_No response_
