Vulkan: missing GATED_DELTA_NET compute shader; ROCm/HIP: fused kernel underperforms on RDNA 3.5 (gfx1151) #20354

@nsyring

PR #19504 (merged in b8233) added the fused GGML_OP_GATED_DELTA_NET operation with CPU and CUDA backends.

On AMD hardware (Strix Halo, RDNA 3.5, gfx1151), both non-CUDA paths perform poorly:

  1. Vulkan: No GATED_DELTA_NET compute shader exists — GDN ops fall back to CPU
  2. ROCm/HIP: The CUDA kernel cross-compiles and runs on the GPU (no CPU fallback), but token generation reaches only the same ~12 t/s as the Vulkan build with CPU fallback, which suggests the kernel is not effective on RDNA 3.5

Benchmarks

Hardware: AMD Ryzen AI Max+ 395, Radeon 8060S (gfx1151, RDNA 3.5, 40 CUs), 128 GB LPDDR5X-8000 unified memory (~256 GB/s bandwidth)

Qwen3.5-27B Q4_K_M (16 GB, GatedDeltaNet architecture)

| Backend | Build | pp512 (t/s) | tg128 (t/s) | GDN execution |
|---|---|---|---|---|
| Vulkan (RADV Mesa 26.0.1) | b8234 | 284.4 | 11.87 | CPU fallback (no Vulkan shader) |
| ROCm 7.2 (HIP) | b8234 | 330.5 | 11.81 | GPU (fused kernel via HIP) |

Reference: Qwen3-Coder-Next UD-Q4_K_XL (42 GB, standard attention, no GDN)

| Backend | Build | pp512 (t/s) | tg128 (t/s) |
|---|---|---|---|
| Vulkan | b8234 | 633 | 47.1 |
| ROCm 7.2 | b8234 | 665 | 44.4 |

The 16 GB GDN model generates tokens about 4x slower (tg128) and processes prompts about 2x slower (pp512) than a 42 GB non-GDN model on the same hardware. Based on model size and memory bandwidth, Qwen3.5-27B should theoretically achieve 50-80 t/s with proper GPU-accelerated GDN.
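The ratios and the bandwidth argument can be sanity-checked with a few lines of arithmetic. This is a rough sketch, assuming token generation is capped by the bytes streamed from memory per generated token. How many bytes Qwen3.5-27B actually reads per token is not measured in this issue, so it is left as a parameter. The function names are made up for illustration. Under this assumption, streaming the full 16 GB per token would cap generation at 16 t/s, while 50-80 t/s would correspond to roughly 3.2 to 5.1 GB read per token.

```python
# Back-of-the-envelope check of the figures quoted in this issue.
# Assumption: decode speed is capped by memory traffic per generated token.

BANDWIDTH_GB_S = 256.0  # approximate unified-memory bandwidth quoted above
MODEL_GB = 16.0         # Qwen3.5-27B Q4_K_M size quoted above


def slowdown(reference_tps, measured_tps):
    """How many times slower the measured run is than the reference run."""
    return reference_tps / measured_tps


def bandwidth_ceiling_tps(bandwidth_gb_s, gb_read_per_token):
    """Upper bound on tokens/s if every token streams gb_read_per_token."""
    return bandwidth_gb_s / gb_read_per_token


def gb_per_token_for(bandwidth_gb_s, target_tps):
    """Memory traffic per token that a target tokens/s would imply."""
    return bandwidth_gb_s / target_tps


# tg128 ratios taken from the two tables above
print(f"Vulkan tg128 slowdown: {slowdown(47.1, 11.87):.2f}x")
print(f"ROCm   tg128 slowdown: {slowdown(44.4, 11.81):.2f}x")

# Ceiling if the whole 16 GB file had to be read for every token
print(f"Ceiling at full read:  {bandwidth_ceiling_tps(BANDWIDTH_GB_S, MODEL_GB):.1f} t/s")

# Traffic per token implied by the 50-80 t/s estimate
for target in (50.0, 80.0):
    print(f"{target:.0f} t/s implies {gb_per_token_for(BANDWIDTH_GB_S, target):.2f} GB/token")
```

Comparing the measured ~12 t/s against whichever ceiling applies to this model helps separate how much of the gap is kernel overhead and how much is plain memory traffic.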

Analysis

Vulkan — straightforward: no shader exists, GDN falls back to CPU. Needs a dedicated Vulkan compute shader similar to how SSM_CONV/SSM_SCAN were added (#19957).
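For anyone scoping the shader work, here is a minimal single-head reference of the recurrence a GDN kernel has to evaluate. It is a sketch of the gated delta rule as it is usually written for Gated DeltaNet (decay the state, then apply a delta-rule correction), not a port of the ggml operator. The tensor layout, scaling, and any normalization used by GGML_OP_GATED_DELTA_NET are not described in this issue, so the function names and argument order below are hypothetical.

```python
# Single-head gated delta rule in plain Python (reference only, not optimized).
# State S has shape d_v x d_k. Per token:
#   S <- alpha * S                      (gated decay)
#   S <- S + beta * (v - S k) k^T       (delta-rule correction)
#   o  = S q                            (readout)


def gated_delta_step(S, q, k, v, alpha, beta):
    """Advance the state by one token and return (new_state, output)."""
    d_v, d_k = len(S), len(S[0])
    # Gated decay of the previous state
    S = [[alpha * S[i][j] for j in range(d_k)] for i in range(d_v)]
    # What the decayed state currently predicts for key k
    pred = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    # Move the prediction toward v, scaled by the write strength beta
    delta = [beta * (v[i] - pred[i]) for i in range(d_v)]
    S = [[S[i][j] + delta[i] * k[j] for j in range(d_k)] for i in range(d_v)]
    # Read out with the query
    out = [sum(S[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]
    return S, out


def gated_delta_scan(qs, ks, vs, alphas, betas, d_v, d_k):
    """Run the recurrence over a sequence, starting from a zero state."""
    S = [[0.0] * d_k for _ in range(d_v)]
    outputs = []
    for q, k, v, a, b in zip(qs, ks, vs, alphas, betas):
        S, o = gated_delta_step(S, q, k, v, a, b)
        outputs.append(o)
    return S, outputs


# Tiny usage example: write v under key k, then read it back with q = k
state, outs = gated_delta_scan(
    qs=[[1.0, 0.0]], ks=[[1.0, 0.0]], vs=[[3.0, 4.0]],
    alphas=[1.0], betas=[1.0], d_v=2, d_k=2,
)
print(outs[0])  # [3.0, 4.0]
```

The loop over tokens is inherently sequential and carries a d_v x d_k state per head, so a GPU implementation has to get its parallelism from heads and from the rows or columns of the state rather than from the sequence dimension.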

ROCm/HIP — more subtle. The fused kernel from #19504 compiles and runs on GPU via HIP (verified: no "fused Gated Delta Net not supported" warning, supports_op returns true). However, token generation is essentially identical to the Vulkan run with CPU fallback (11.81 vs. 11.87 t/s tg128). Possible causes:

Additional data point: pp1 = 11.73 t/s (single-token prompt processing matches TG speed), which points to per-token kernel execution overhead rather than memory bandwidth as the bottleneck.

Affected models

All models using GatedDeltaNet architecture:

  • Qwen3.5-27B (72.4% SWE-bench Verified, 16 GB — would be excellent for local inference)
  • Qwen3.5-35B-A3B
  • Qwen3.5-122B-A10B
  • Community fine-tunes (e.g. Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled)

These models are effectively unusable on AMD hardware despite their small size and strong benchmark results.

Context

Environment

  • CPU/GPU: AMD Ryzen AI Max+ 395 / Radeon 8060S (gfx1151, RDNA 3.5, integrated)
  • Vulkan driver: RADV (Mesa 26.0.1)
  • ROCm: 7.2
  • RAM: 128 GB LPDDR5X-8000 (unified, ~256 GB/s)
  • llama.cpp: b8234 (commit 213c4a0)
