PR #19504 (merged in b8233) added the fused `GGML_OP_GATED_DELTA_NET` operation with CPU and CUDA backends.
On AMD hardware (Strix Halo, RDNA 3.5, gfx1151), both non-CUDA paths perform poorly:
- Vulkan: no `GATED_DELTA_NET` compute shader exists, so GDN ops fall back to CPU
- ROCm/HIP: the CUDA kernel cross-compiles and runs on the GPU (no CPU fallback), but reaches the same ~12 t/s as the CPU path, which suggests the kernel is not effective on RDNA 3.5
Benchmarks
Hardware: AMD Ryzen AI Max+ 395, Radeon 8060S (gfx1151, RDNA 3.5, 40 CUs), 128 GB LPDDR5X-8000 unified memory (~256 GB/s bandwidth)
Qwen3.5-27B Q4_K_M (16 GB, GatedDeltaNet architecture)
| Backend | Build | pp512 (t/s) | tg128 (t/s) | GDN execution |
|---|---|---|---|---|
| Vulkan (RADV Mesa 26.0.1) | b8234 | 284.4 | 11.87 | CPU fallback (no Vulkan shader) |
| ROCm 7.2 (HIP) | b8234 | 330.5 | 11.81 | GPU (fused kernel via HIP) |
Reference: Qwen3-Coder-Next UD-Q4_K_XL (42 GB, standard attention, no GDN)
| Backend | Build | pp512 (t/s) | tg128 (t/s) |
|---|---|---|---|
| Vulkan | b8234 | 633 | 47.1 |
| ROCm 7.2 | b8234 | 665 | 44.4 |
In token generation, the 16 GB GDN model is about 4x slower than a 42 GB non-GDN model on the same hardware (roughly 11.8 t/s vs 44-47 t/s); in prompt processing it is about 2x slower. Based on model size and memory bandwidth, Qwen3.5-27B should theoretically achieve 50-80 t/s with proper GPU-accelerated GDN.
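The slowdown factors can be checked directly from the two benchmark tables. This is a minimal sketch that only divides the reported throughputs; the dictionary layout and the helper name `slowdown` are illustrative, not part of any llama.cpp tooling.

```python
# Throughput figures copied from the two benchmark tables (tokens/second).
gdn = {
    "Vulkan": {"pp512": 284.4, "tg128": 11.87},
    "ROCm": {"pp512": 330.5, "tg128": 11.81},
}
ref = {
    "Vulkan": {"pp512": 633.0, "tg128": 47.1},
    "ROCm": {"pp512": 665.0, "tg128": 44.4},
}


def slowdown(backend, metric):
    """How many times slower the GDN model is than the reference model."""
    return ref[backend][metric] / gdn[backend][metric]


for backend in gdn:
    for metric in ("pp512", "tg128"):
        print(f"{backend:6s} {metric}: {slowdown(backend, metric):.2f}x slower")
```

Token generation comes out at roughly 3.8-4.0x slower and prompt processing at roughly 2.0-2.2x slower, so the gap is largest in the decode path.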
Analysis
Vulkan: straightforward. No shader exists, so GDN falls back to CPU. This needs a dedicated Vulkan compute shader, similar to how `SSM_CONV`/`SSM_SCAN` were added (#19957).
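For anyone writing that shader, a small reference for the per-token recurrence may be useful to test against. The sketch below follows the gated delta rule as published for Gated DeltaNet (decay the state, then apply a delta-rule correction). It is not taken from the ggml kernel: the per-head scalar gate `alpha`, the write strength `beta`, the state layout (one row per value channel), and the function name `gated_delta_step` are all assumptions for illustration, and any scaling or normalization of `q` and `k` is left out.

```python
def gated_delta_step(S, q, k, v, alpha, beta):
    """One token of the gated delta rule for a single head (reference sketch).

    S     : list of d_v rows, each a list of d_k floats (recurrent state)
    q, k  : lists of d_k floats
    v     : list of d_v floats
    alpha : scalar decay gate (assumed to be one scalar per head)
    beta  : scalar write strength
    Returns (new_state, output); output has d_v entries.
    """
    new_S, out = [], []
    for row, v_i in zip(S, v):
        # Each row depends only on itself, k, q and v_i, so rows can be
        # updated independently (for example, one GPU thread per row).
        decayed = [alpha * s for s in row]
        pred = sum(s * k_j for s, k_j in zip(decayed, k))  # value recalled for k
        delta = beta * (v_i - pred)                        # correction toward v_i
        updated = [s + delta * k_j for s, k_j in zip(decayed, k)]
        new_S.append(updated)
        out.append(sum(s * q_j for s, q_j in zip(updated, q)))
    return new_S, out


# Usage: write v under key k into an empty state, then read it back with q = k.
S = [[0.0, 0.0], [0.0, 0.0]]
S, o = gated_delta_step(S, q=[1.0, 0.0], k=[1.0, 0.0], v=[3.0, 5.0],
                        alpha=1.0, beta=1.0)
print(o)  # [3.0, 5.0]
```

Because every row of the state is updated independently, a per-thread array holding one row is a natural mapping, which would be consistent with the `float s[S_v]` array mentioned for the CUDA kernel.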
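ROCm/HIP: more subtle. The fused kernel from #19504 compiles and runs on the GPU via HIP (verified: no "fused Gated Delta Net not supported" warning, and `supports_op` returns true). However, performance is identical to the CPU fallback. Possible causes:
- `float s[S_v]` per thread (up to 128 floats = 512 bytes of registers). On RDNA 3.5 with wave size 32, this likely causes register spilling to VRAM/LDS
- `hipMemcpyWithStream` bottleneck on gfx1151: pytorch/pytorch#171687 ("gfx1151 (Strix Halo): LLM decode is ~90% hipMemcpyWithStream in FP16 & 4-bit; kernels not compute-bound") documents that 92-95% of decode time is spent in `hipMemcpyWithStream` for models >15 GB on this architecture

Additional data point: `pp1 = 11.73 t/s` (single-token prompt processing matches TG speed), which points to kernel execution overhead rather than memory bandwidth as the bottleneck.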
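One hypothesis for the HIP result is register pressure: a per-thread `float s[S_v]` array of up to 128 floats may spill on RDNA 3.5 at wave size 32. Its size can be worked out with simple arithmetic. The sketch below uses only figures quoted in this issue (S_v up to 128, 4-byte floats, wave size 32); the 256-register per-thread ceiling is an assumption about the hardware limit, not something measured here, and whether spilling actually happens depends on what the compiler allocates.

```python
FLOAT_BYTES = 4
S_V = 128                # upper bound on the per-thread state array length
WAVE_SIZE = 32           # wave size quoted for RDNA 3.5
ASSUMED_MAX_VGPRS = 256  # assumption: per-thread 32-bit vector register ceiling


def state_footprint(s_v=S_V, wave_size=WAVE_SIZE):
    """Return (bytes per thread, bytes per wave, share of assumed register ceiling)."""
    per_thread_bytes = s_v * FLOAT_BYTES
    per_wave_bytes = per_thread_bytes * wave_size
    vgpr_fraction = s_v / ASSUMED_MAX_VGPRS
    return per_thread_bytes, per_wave_bytes, vgpr_fraction


per_thread, per_wave, fraction = state_footprint()
print(f"state per thread: {per_thread} bytes")
print(f"state per wave:   {per_wave} bytes")
print(f"share of assumed register ceiling: {fraction:.0%}")
```

Under these assumptions the state array alone takes 512 bytes per thread, 16 KiB per wave, and half of the assumed register ceiling, leaving the rest for `q`, `k`, accumulators and addressing. Checking the compiler's reported register and scratch usage for the kernel would confirm or rule out spilling.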
Affected models
All models using GatedDeltaNet architecture:
- Qwen3.5-27B (72.4% SWE-bench Verified, 16 GB — would be excellent for local inference)
- Qwen3.5-35B-A3B
- Qwen3.5-122B-A10B
- Community fine-tunes (e.g. Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled)
These models are effectively unusable on AMD hardware despite their small size and strong benchmark results.
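Context
- `SSM_CONV` and `SSM_SCAN` already have Vulkan shaders (#19957). `GATED_DELTA_NET` is a separate, newer op.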
Environment
- CPU/GPU: AMD Ryzen AI Max+ 395 / Radeon 8060S (gfx1151, RDNA 3.5, integrated)
- Vulkan driver: RADV (Mesa 26.0.1)
- ROCm: 7.2
- RAM: 128 GB LPDDR5X-8000 (unified, ~256 GB/s)
- llama.cpp: b8234 (commit 213c4a0)