Prerequisites
Feature Description
Summary
Qwen3.5-35B-A3B (qwen3_5moe architecture) uses DeltaNet-style SSM layers built
on ggml_ssm_conv and ggml_ssm_scan, two ops that have no Vulkan compute shader
implementation. The model therefore silently falls back to CPU for those ops,
producing corrupt hidden state across the GPU↔CPU boundary and making the
model unusable with -ngl 99 on Vulkan backends.
Hardware
- GPU: Intel(R) Arc(TM) 140V GPU (16GB UMA, Lunar Lake)
- Backend: Vulkan (KHR_coopmat available, disabled via GGML_VK_DISABLE_COOPMAT=1)
- OS: Windows 11, llama.cpp build 27 (d903f30), GNU 15.2.0
Steps to Reproduce
./llama-server \
-m Qwen3.5-35B-A3B-Q4_K_M.gguf \
-ngl 99 \
--port 8080
Observed Behavior
- Model loads and offloads all layers to GPU successfully
- First inference produces corrupted / garbage output
- Subsequent calls crash with vk::DeviceLostError:
terminate called after throwing an instance of 'vk::DeviceLostError'
what(): vk::Device::getFenceStatus: ErrorDeviceLost
Root Cause
src/models/qwen35.cpp calls build_delta_net_chunking(),
build_delta_net_recurrent(), and build_delta_net_autoregressive(),
which internally use:
ggml_ssm_conv
ggml_ssm_scan
Neither of these ops has a Vulkan kernel in ggml-vulkan.cpp.
The CUDA backend has full implementations. Vulkan silently routes
these to CPU, causing state corruption on the GPU↔CPU boundary
during autoregressive generation.
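To make the gap easy to verify, here is a minimal standalone diagnostic sketch
(not part of llama.cpp or of the repro above). It assumes the current ggml
backend-device API (ggml_backend_dev_supports_op and friends) and uses made-up
tensor shapes; the idea is simply to ask each registered device whether it
accepts a GGML_OP_SSM_CONV node:

#include "ggml.h"
#include "ggml-backend.h"
#include <cstdio>

int main() {
    // no_alloc = true: only tensor metadata is needed to describe the op
    struct ggml_init_params params = { /*mem_size*/ 16*1024*1024,
                                       /*mem_buffer*/ NULL,
                                       /*no_alloc*/ true };
    struct ggml_context * ctx = ggml_init(params);

    // illustrative shapes: 4-tap conv, 1536 inner channels, 8 tokens, 1 sequence
    const int64_t d_conv = 4, d_inner = 1536, n_tok = 8, n_seq = 1;
    struct ggml_tensor * sx = ggml_new_tensor_3d(ctx, GGML_TYPE_F32,
                                                 d_conv - 1 + n_tok, d_inner, n_seq);
    struct ggml_tensor * c  = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, d_conv, d_inner);
    struct ggml_tensor * op = ggml_ssm_conv(ctx, sx, c); // builds a GGML_OP_SSM_CONV node

    for (size_t i = 0; i < ggml_backend_dev_count(); ++i) {
        ggml_backend_dev_t dev = ggml_backend_dev_get(i);
        printf("%-16s SSM_CONV: %s\n", ggml_backend_dev_name(dev),
               ggml_backend_dev_supports_op(dev, op) ? "supported" : "NOT supported");
    }

    ggml_free(ctx);
    return 0;
}

A device reporting the op as unsupported is exactly what makes the scheduler
split the graph there and route the op to the CPU backend, matching the silent
fallback described above.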
Contrast: Qwen3-30B-A3B Works Fine
qwen3moe (Qwen3-30B-A3B) is pure attention MoE — no SSM layers —
and works correctly on Vulkan after disabling coopmat:
set GGML_VK_DISABLE_COOPMAT=1
set GGML_VK_DISABLE_COOPMAT2=1
→ Stable at ~27 t/s generation on Intel Arc 140V.
qwen3_5moe (Qwen3.5-35B-A3B) has interleaved DeltaNet SSM layers
and cannot be fixed with env vars — it needs proper Vulkan kernels.
Expected Behavior
ggml_ssm_conv and ggml_ssm_scan should have Vulkan compute shader
implementations analogous to the existing CUDA kernels, allowing
qwen3_5moe to run fully GPU-accelerated on Vulkan backends.
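To make "analogous to the existing CUDA kernels" concrete, here is a scalar C++
reference of the semantics a Vulkan shader would have to reproduce. The conv
reference follows the general shape of the op (a causal depthwise 1-D
convolution per inner channel), but the indexing and layout are simplified
assumptions rather than ggml's exact strides, and the scan is sketched only as
a generic per-element recurrence, not the exact ggml_ssm_scan/DeltaNet math:

// Scalar reference for ggml_ssm_conv: each inner channel i convolves a
// sliding window of d_conv past inputs with its own d_conv-tap filter.
// Layout simplified to row-major, single sequence.
void ssm_conv_ref(const float * sx,  // [d_inner][d_conv - 1 + n_tok]
                  const float * w,   // [d_inner][d_conv]
                  float       * out, // [n_tok][d_inner]
                  int d_conv, int d_inner, int n_tok) {
    for (int t = 0; t < n_tok; ++t) {
        for (int i = 0; i < d_inner; ++i) {
            float sum = 0.0f;
            for (int k = 0; k < d_conv; ++k) {
                sum += sx[i*(d_conv - 1 + n_tok) + t + k] * w[i*d_conv + k];
            }
            out[t*d_inner + i] = sum;
        }
    }
}

// Generic shape of the scan (illustrative only): the state h is updated
// token by token, so the loop over t is inherently sequential and the
// state must stay resident on one device for the whole sequence --
// bouncing it across the GPU/CPU boundary is what goes wrong today.
void ssm_scan_shape(float * h,        // [d_state], carried across tokens
                    const float * a,  // [n_tok][d_state] decay terms
                    const float * bx, // [n_tok][d_state] input terms
                    float * y,        // [n_tok] output (placeholder readout)
                    int d_state, int n_tok) {
    for (int t = 0; t < n_tok; ++t) {
        float acc = 0.0f;
        for (int s = 0; s < d_state; ++s) {
            h[s] = a[t*d_state + s] * h[s] + bx[t*d_state + s];
            acc += h[s];
        }
        y[t] = acc;
    }
}

The practical takeaway for a Vulkan kernel: parallelize freely over channels
and state dimensions, keep the token loop ordered, and keep the recurrent
state on-device for the whole sequence.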
Related
- src/models/qwen35.cpp — DeltaNet layer implementation
- ggml-vulkan.cpp — missing GGML_OP_SSM_CONV / GGML_OP_SSM_SCAN dispatch
Motivation
Qwen3.5-35B-A3B (architecture: qwen3_5moe) is a highly capable hybrid MoE model
that uses DeltaNet-style SSM layers interleaved with standard attention layers.
These SSM ops (ggml_ssm_conv, ggml_ssm_scan) have no Vulkan compute shader
implementation, making the model completely unusable on Vulkan backends despite
loading successfully and offloading all layers to GPU.
This affects all users running llama.cpp on Vulkan (Intel Arc, AMD, mobile GPUs)
who cannot use CUDA. The first inference call produces garbage output, and
subsequent calls crash with vk::DeviceLostError, due to corrupt hidden state
from CPU↔GPU boundary crossing during SSM layer computation.
By contrast, Qwen3-30B-A3B (pure attention MoE, qwen3moe architecture) works
correctly on Vulkan after disabling coopmat. The only blocker for qwen3_5moe is
the missing Vulkan kernels for ggml_ssm_conv and ggml_ssm_scan.
Qwen3.5-35B-A3B is one of the best open-weight models available at its size and
is widely used for agentic/coding tasks. Full Vulkan support would unlock it for
the large community of non-CUDA GPU users.
Possible Implementation
No response