Vulkan: missing GATED_DELTA_NET compute shader; ROCm/HIP: fused kernel underperforms on RDNA 3.5 (gfx1151) #20354

PR #19504 (merged in b8233) added the fused `GGML_OP_GATED_DELTA_NET` operation with CPU and CUDA backends.

On AMD hardware (Strix Halo, RDNA 3.5, gfx1151), both non-CUDA paths perform poorly:

1. **Vulkan**: No `GATED_DELTA_NET` compute shader exists — GDN ops fall back to CPU
2. **ROCm/HIP**: The CUDA kernel cross-compiles and runs on GPU (no CPU fallback), but achieves the **same ~12 t/s** as the CPU path — suggesting the kernel is not effective on RDNA 3.5

## Benchmarks

**Hardware:** AMD Ryzen AI Max+ 395, Radeon 8060S (gfx1151, RDNA 3.5, 40 CUs), 128 GB LPDDR5X-8000 unified memory (~256 GB/s bandwidth)

### Qwen3.5-27B Q4_K_M (16 GB, GatedDeltaNet architecture)

| Backend | Build | pp512 (t/s) | tg128 (t/s) | GDN execution |
|---------|-------|-------------|-------------|---------------|
| Vulkan (RADV Mesa 26.0.1) | b8234 | 284.4 | **11.87** | CPU fallback (no Vulkan shader) |
| ROCm 7.2 (HIP) | b8234 | 330.5 | **11.81** | GPU (fused kernel via HIP) |

### Reference: Qwen3-Coder-Next UD-Q4_K_XL (42 GB, standard attention, no GDN)

| Backend | Build | pp512 (t/s) | tg128 (t/s) |
|---------|-------|-------------|-------------|
| Vulkan | b8234 | 633 | **47.1** |
| ROCm 7.2 | b8234 | 665 | **44.4** |

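The 16 GB GDN model generates tokens about **4x slower** than the 42 GB non-GDN reference on the same hardware (~11.8 vs. 44-47 t/s). The expectation that Qwen3.5-27B should reach **50-80 t/s** with proper GPU-accelerated GDN is a rough estimate and needs checking: if all 16 GB of weights are read once per token, ~256 GB/s caps generation at roughly 16 t/s, so the reference model's higher speed may partly reflect fewer bytes read per token rather than the absence of GDN.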
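As a sanity check on the expected speed, the memory-bandwidth ceiling for token generation can be estimated from the figures quoted above. This is a minimal sketch that assumes every weight byte is read exactly once per generated token (true for a dense model, not for a sparse one) and ignores KV/state traffic and cache effects. The function names are illustrative and not part of llama.cpp.

```python
def bandwidth_bound_tps(bandwidth_gb_s, gb_read_per_token):
    """Upper bound on tokens/s if generation is purely bandwidth-limited."""
    return bandwidth_gb_s / gb_read_per_token


def implied_bandwidth_gb_s(tokens_per_s, gb_read_per_token):
    """Effective bandwidth a measured tokens/s figure would correspond to."""
    return tokens_per_s * gb_read_per_token


if __name__ == "__main__":
    BANDWIDTH_GB_S = 256.0   # ~256 GB/s, from the hardware description
    MODEL_GB = 16.0          # Qwen3.5-27B Q4_K_M size, from the benchmark heading
    MEASURED_TG = 11.87      # Vulkan tg128 result from the table

    ceiling = bandwidth_bound_tps(BANDWIDTH_GB_S, MODEL_GB)
    effective = implied_bandwidth_gb_s(MEASURED_TG, MODEL_GB)
    print(f"bandwidth ceiling: {ceiling:.1f} t/s")
    print(f"implied bandwidth: {effective:.1f} GB/s "
          f"({effective / BANDWIDTH_GB_S:.0%} of peak)")
```

Under these assumptions the measured ~12 t/s already corresponds to a large share of peak bandwidth, so a per-op profile is a better way to judge the GDN kernel than the end-to-end tg128 figure alone.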

## Analysis

**Vulkan** — straightforward: no shader exists, GDN falls back to CPU. Needs a dedicated Vulkan compute shader similar to how `SSM_CONV`/`SSM_SCAN` were added (#19957).
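For anyone porting the op to a compute shader, a scalar reference of the recurrence is useful for checking results. The sketch below follows the commonly published gated delta rule, S_t = g_t S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T with output o_t = S_t q_t. It is an assumption that this matches the exact gating, scaling, and layout conventions of `GGML_OP_GATED_DELTA_NET`; the function name and argument order are illustrative only.

```python
def gated_delta_step(S, q, k, v, g, beta):
    """One token of the gated delta rule for a single head.

    S    : d_v x d_k state matrix (list of rows)
    q, k : length d_k vectors
    v    : length d_v vector
    g    : scalar decay gate in (0, 1]  (assumed per-head scalar)
    beta : scalar write strength in [0, 1]
    Returns (new_state, output).
    """
    d_v, d_k = len(S), len(S[0])
    # 1. Decay the previous state.
    S = [[g * s for s in row] for row in S]
    # 2. Read what the decayed state currently predicts for key k.
    pred = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    # 3. Delta-rule correction: move the prediction toward v.
    delta = [beta * (v[i] - pred[i]) for i in range(d_v)]
    # 4. Rank-1 update of the state.
    S = [[S[i][j] + delta[i] * k[j] for j in range(d_k)] for i in range(d_v)]
    # 5. Query the updated state.
    o = [sum(S[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]
    return S, o


if __name__ == "__main__":
    state = [[0.0, 0.0], [0.0, 0.0]]
    state, out = gated_delta_step(state, q=[1.0, 0.0], k=[1.0, 0.0],
                                  v=[3.0, 4.0], g=1.0, beta=1.0)
    print(out)  # the state now maps key [1, 0] to value [3, 4]
```

Each row of `S` depends only on its own previous values, which is why a GPU implementation can give each thread its own slice of the state.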

**ROCm/HIP** — more subtle. The fused kernel from #19504 compiles and runs on GPU via HIP (verified: no "fused Gated Delta Net not supported" warning, `supports_op` returns true). However, performance is identical to CPU fallback. Possible causes:

- **Register pressure**: The kernel uses `float s[S_v]` per thread (up to 128 floats = 512 bytes of registers). On RDNA 3.5 in wave32 mode, this may cause register spilling to scratch memory or reduce occupancy
- **Known `hipMemcpyWithStream` bottleneck on gfx1151**: pytorch/pytorch#171687 documents that 92-95% of decode time is spent in `hipMemcpyWithStream` for models >15 GB on this architecture
- **Kernel was designed and tuned for NVIDIA CUDA** — warp size, register file, memory hierarchy all differ on RDNA
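The register-pressure hypothesis above can be put in rough numbers. The sketch below computes the footprint of the per-thread state array alone, assuming one 32-bit register per float per lane. The `vgpr_budget_per_lane` default of 256 is an assumption to be checked against the RDNA 3.5 ISA documentation, and the function is hypothetical rather than taken from llama.cpp.

```python
def state_register_footprint(s_v=128, wave_size=32, vgpr_budget_per_lane=256):
    """Footprint of a per-thread `float s[S_v]` array held in registers.

    Returns (bytes_per_thread, bytes_per_wave, fraction_of_vgpr_budget).
    vgpr_budget_per_lane is an ASSUMED per-lane register limit.
    """
    bytes_per_thread = s_v * 4                    # 4 bytes per float
    bytes_per_wave = bytes_per_thread * wave_size
    budget_fraction = s_v / vgpr_budget_per_lane  # state only, no temporaries
    return bytes_per_thread, bytes_per_wave, budget_fraction


if __name__ == "__main__":
    per_thread, per_wave, frac = state_register_footprint()
    print(f"{per_thread} B/thread, {per_wave} B/wave, "
          f"{frac:.0%} of the assumed per-lane register budget")
```

This does not show that spilling occurs. It only indicates how much of the budget the state takes before q/k/v values, addresses, and temporaries are counted; the compiler's reported register and scratch usage for the kernel would settle the question.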

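Additional data point: `pp1 = 11.73 t/s`, i.e. single-token prompt processing runs at the same speed as token generation. That is expected, since both evaluate one token per step, so this figure alone does not separate kernel execution overhead from memory bandwidth; a per-op profile would be needed to confirm where the time goes.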

## Affected models

All models using GatedDeltaNet architecture:
- **Qwen3.5-27B** (72.4% SWE-bench Verified, 16 GB — would be excellent for local inference)
- Qwen3.5-35B-A3B
- Qwen3.5-122B-A10B
- Community fine-tunes (e.g. Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled)

These models run far below expectations on the AMD hardware tested here, despite their small size and strong benchmark results.

## Context

- PR #19504 added CPU + CUDA kernels. Metal was mentioned by @ggerganov as planned.
- @jeffbolznv asked about Vulkan plans in #19504 — no response yet.
- The SSM ops (`SSM_CONV`, `SSM_SCAN`) already have Vulkan shaders (#19957). `GATED_DELTA_NET` is a separate, newer op.
- UMA systems such as AMD Strix Halo are particularly affected, since Vulkan is one of the primary high-performance backends there. Apple Silicon would only hit this path when running Vulkan through MoltenVK; its native backend is Metal.

## Environment

- CPU/GPU: AMD Ryzen AI Max+ 395 / Radeon 8060S (gfx1151, RDNA 3.5, integrated)
- Vulkan driver: RADV (Mesa 26.0.1)
- ROCm: 7.2
- RAM: 128 GB LPDDR5X-8000 (unified, ~256 GB/s)
- llama.cpp: b8234 (commit 213c4a0b8)
