
Feat/Bug: Vulkan backend missing ggml_ssm_conv / ggml_ssm_scan kernels — Qwen3.5-35B-A3B (qwen3_5moe) CPU-only on Vulkan #19957

@ai-joe-git

Description

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Summary

Qwen3.5-35B-A3B (qwen3_5moe architecture) uses DeltaNet-style SSM layers
(ggml_ssm_conv, ggml_ssm_scan) which have no Vulkan compute shader
implementation. This causes the model to silently fall back to CPU for those
ops, producing corrupt hidden state across the GPU↔CPU boundary and making the
model unusable with -ngl 99 on Vulkan backends.

Hardware

  • GPU: Intel(R) Arc(TM) 140V GPU (16GB UMA, Arrow Lake)
  • Backend: Vulkan (KHR_coopmat available, disabled via GGML_VK_DISABLE_COOPMAT=1)
  • OS: Windows 11, llama.cpp build 27 (d903f30), compiled with GCC (GNU) 15.2.0

Steps to Reproduce

./llama-server \
  -m Qwen3.5-35B-A3B-Q4_K_M.gguf \
  -ngl 99 \
  --port 8080

Observed Behavior

  • Model loads and offloads all layers to GPU successfully
  • First inference produces corrupted / garbage output
  • Subsequent calls crash with vk::DeviceLostError:

    terminate called after throwing an instance of 'vk::DeviceLostError'
      what():  vk::Device::getFenceStatus: ErrorDeviceLost

Root Cause

src/models/qwen35.cpp calls build_delta_net_chunking(),
build_delta_net_recurrent(), and build_delta_net_autoregressive(),
which internally use:

  • ggml_ssm_conv
  • ggml_ssm_scan

Neither of these ops has a Vulkan kernel in ggml-vulkan.cpp.
The CUDA backend has full implementations. Vulkan silently routes
these to CPU, causing state corruption on the GPU↔CPU boundary
during autoregressive generation.

Contrast: Qwen3-30B-A3B Works Fine

qwen3moe (Qwen3-30B-A3B) is pure attention MoE — no SSM layers —
and works correctly on Vulkan after disabling coopmat:

set GGML_VK_DISABLE_COOPMAT=1
set GGML_VK_DISABLE_COOPMAT2=1

→ Stable at ~27 t/s generation on Intel Arc 140V.

qwen3_5moe (Qwen3.5-35B-A3B) has interleaved DeltaNet SSM layers
and cannot be fixed with env vars — it needs proper Vulkan kernels.

Expected Behavior

ggml_ssm_conv and ggml_ssm_scan should have Vulkan compute shader
implementations analogous to the existing CUDA kernels, allowing
qwen3_5moe to run fully GPU-accelerated on Vulkan backends.

Related

Motivation

Qwen3.5-35B-A3B (architecture: qwen3_5moe) is a highly capable hybrid MoE model
that uses DeltaNet-style SSM layers interleaved with standard attention layers.
These SSM ops (ggml_ssm_conv, ggml_ssm_scan) have no Vulkan compute shader
implementation, making the model completely unusable on Vulkan backends despite
loading successfully and offloading all layers to GPU.

This affects all users running llama.cpp on Vulkan (Intel Arc, AMD, mobile GPUs)
who cannot use CUDA. The model produces corrupt output on the first inference
call and crashes with vk::DeviceLostError on subsequent calls, due to corrupt
hidden state from CPU↔GPU boundary crossing during SSM layer computation.

By contrast, Qwen3-30B-A3B (pure attention MoE, qwen3moe architecture) works
correctly on Vulkan after disabling coopmat. The only blocker for qwen3_5moe is
the missing Vulkan kernels for ggml_ssm_conv and ggml_ssm_scan.

Qwen3.5-35B-A3B is one of the best open-weight models available at its size and
is widely used for agentic/coding tasks. Full Vulkan support would unlock it for
the large community of non-CUDA GPU users.

Possible Implementation

No response

Metadata

Labels: enhancement (New feature or request)