Eval bug: Qwen3.5-35B-A3B produces garbage output on Vulkan backend for specific prompts (Intel Arc MTL iGPU), correct on CPU #21888

@spf1983

Description

Summary

On the Intel Arc Graphics (Meteor Lake, Xe-LPG) integrated GPU with Mesa ANV 25.2.8, the qwen3_5moe / Qwen3.5-35B-A3B Q4_K_M model produces valid output for short, high-frequency greedy completions (e.g. "The capital of France is" → "Paris"), but emits deterministic repeating-punctuation garbage (!"!"""...) for other prompts — notably anything with "Fourier transform" — while the same model + same quant + same prompt on CPU (-ngl 0) works correctly. The GGML_VK_DISABLE_F16=1 workaround from #18969 does not fix this.

Accompanying symptom: when a prompt reproduces the bug, prompt eval drops from ~13 tok/s to ~0.48 tok/s (roughly CPU speed), strongly suggesting that a specific kernel hits a correctness failure and/or a CPU fallback for these token sequences.

This may be related to #19957 (SSM kernels), #20354 (Gated Delta Net shader), #20610 (Qwen3.5-27B garbage since b8184), or #18969 (f16 acc), but is distinct from all of them in behavior. Reporting separately in case it's a residual GDN / hybrid-memory path bug.

Name and Version

version: 1 (e21cdc1)
built with GNU 13.3.0 for Linux x86_64

Built from commit e21cdc1 (tip of master at the time of testing, which already contains the ssm_conv.comp, ssm_scan.comp, and gated_delta_net.comp shaders).

Build command:

cmake -B build -DGGML_VULKAN=ON -DBUILD_SHARED_LIBS=OFF -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc) --target llama-cli llama-bench llama-simple
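
As a quick sanity check that the resulting binary actually sees the iGPU (standard Mesa vulkaninfo plus llama.cpp's own device listing; nothing here is specific to this bug):

$ vulkaninfo --summary
$ build/bin/llama-cli --list-devices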

Operating systems

Linux

GGML backends

Vulkan

Hardware

  • CPU: Intel Core Ultra 5 125H (Meteor Lake, 18 threads)
  • iGPU: Intel Arc Graphics (MTL), PCI [8086:7d55], Xe-LPG
    • Kernel driver: i915 (xe also loaded)
    • Mesa 25.2.8-0ubuntu0.24.04.1 (ANV / Intel open-source Mesa driver)
    • Vulkan API 1.4.318, driver 25.2.8
    • uma: 1, fp16: 1, bf16: 0, warp size 32, no matrix cores
  • RAM: 32 GiB DDR5-4800 (2×16G)
  • Kernel: 6.8.1-t6-generic (Ubuntu 24.04 derivative)
  • /dev/dri/renderD128 accessible (user in render + video groups)

vulkaninfo --summary excerpt:

GPU0:
  apiVersion         = 1.4.318
  driverVersion      = 25.2.8
  vendorID           = 0x8086
  deviceID           = 0x7d55
  deviceType         = PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
  deviceName         = Intel(R) Arc(tm) Graphics (MTL)
  driverID           = DRIVER_ID_INTEL_OPEN_SOURCE_MESA
  driverInfo         = Mesa 25.2.8-0ubuntu0.24.04.1

Models

unsloth/Qwen3.5-35B-A3B-GGUF, file Qwen3.5-35B-A3B-Q4_K_M.gguf

  • Size: 22016023168 bytes (20.49 GiB)
  • SHA256: 3b46d1066bc91cc2d613e3bc22ce691dd77e6f0d33c9060690d24ce6de494375
  • llama.cpp architecture label: qwen35moe 35B.A3B Q4_K - Medium, 34.66 B params
  • Loader reports use of llama_memory_recurrent, fused Gated Delta Net (autoregressive/chunked) enabled

Reproduction

Step 1 — Load the model and confirm Vulkan full offload works

$ build/bin/llama-bench -m Qwen3.5-35B-A3B-Q4_K_M.gguf -p 64 -n 32 -ngl 99 -fa 1
ggml_vulkan: 0 = Intel(R) Arc(tm) Graphics (MTL) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | bf16: 0
| model                              |    size |  params | backend | ngl | fa | test | t/s              |
| qwen35moe 35B.A3B Q4_K - Medium    | 20.49 GiB | 34.66 B | Vulkan |  99 |  1 | pp64 | 28.32 ± 0.95     |
| qwen35moe 35B.A3B Q4_K - Medium    | 20.49 GiB | 34.66 B | Vulkan |  99 |  1 | tg32 |  6.02 ± 0.01     |

Step 2 — Working case (Vulkan, -ngl 99)

$ build/bin/llama-simple -m Qwen3.5-35B-A3B-Q4_K_M.gguf -n 40 -ngl 99 "The capital of France is"

Output (correct):

The capital of France is Paris.
The capital of France is Paris.
The capital of France is Paris.
...

Perf: prompt 13.04 tok/s, gen 6.03 tok/s.

Also working on Vulkan:

  • "北京的著名景点有""故宫、天坛、颐和园、长城..."
  • "def fibonacci(n):" → valid Python continuation
  • "Quantum computing is" → coherent English prose

Step 3 — Failing case (Vulkan, -ngl 99), deterministic garbage

$ build/bin/llama-simple -m Qwen3.5-35B-A3B-Q4_K_M.gguf -n 40 -ngl 99 "Question: What is the Fourier transform? Answer:"

Output (garbage, reproduced 3+ times):

Question: What is the Fourier transform? Answer:!"!!!!!!!!!"""""""""""""""""""""""""""""

Perf:

  • prompt eval: 0.48 tok/s / 2075 ms per token (vs ~13 tok/s on the working prompt — a ~27× regression)
  • gen eval: 6.00 tok/s (normal)

Also fails on Vulkan with the same garbage pattern (a loop that re-runs the full prompt matrix follows this list):

  • "用一句话解释什么是傅里叶变换:" (Chinese)
  • "Question: What is the Fourier transform? Answer in one sentence."

Step 4 — Control: CPU (-ngl 0) with the same failing prompt

$ build/bin/llama-simple -m Qwen3.5-35B-A3B-Q4_K_M.gguf -n 20 -ngl 0 "Question: What is the Fourier transform? Answer:"

Output (correct):

Question: What is the Fourier transform? Answer: The Fourier transform is a mathematical operation that decomposes a function into its constituent frequencies. It is used

Perf: 0.39 tok/s decode (expected for 35B-A3B on 18-thread CPU).

Same binary, same model, same quant, same prompt — only the -ngl value differs. CPU produces correct output; Vulkan produces garbage.
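
A minimal A/B sketch of this control, with everything held constant except -ngl. llama-simple decodes greedily, so the runs are deterministic and a non-empty diff indicates backend divergence:

MODEL=Qwen3.5-35B-A3B-Q4_K_M.gguf
PROMPT="Question: What is the Fourier transform? Answer:"
build/bin/llama-simple -m "$MODEL" -n 20 -ngl 99 "$PROMPT" > out.vulkan.txt 2>/dev/null
build/bin/llama-simple -m "$MODEL" -n 20 -ngl 0  "$PROMPT" > out.cpu.txt    2>/dev/null
diff out.cpu.txt out.vulkan.txt   # non-empty diff = CPU/Vulkan divergence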

Attempted workarounds (all failed to fix this bug)

GGML_VK_DISABLE_F16=1

$ GGML_VK_DISABLE_F16=1 build/bin/llama-simple -m ... -n 40 -ngl 99 "Question: What is the Fourier transform? Answer:"
Question: What is the Fourier transform? Answer:!!!!!!!!!!!!!!!""""""""!!!!"""""""""""""
  • Device caps report fp16: 0 after setting the env var, and llama-bench pp64 rises from 28.32 to 41.45 tok/s (a ~46% gain), confirming the variable is honored
  • The working prompts still work; the failing prompts still fail with the same pattern

Additional observations

  1. The bug pattern is deterministic and prompt-specific — same prompt produces same garbage across runs.
  2. The garbage is always !/"/# or similar punctuation tokens (lowest-index ASCII-like vocab IDs), suggesting logits collapse toward a fixed set of tokens — consistent with NaN/Inf or very negative activations propagating through softmax.
  3. The ~27× prompt-eval slowdown on failing prompts (without workaround) is notable — it implies the compute graph for these prompts is taking a different (slow) path, possibly a CPU fallback for a specific op that also corrupts hidden state on the GPU↔CPU boundary.
  4. With GGML_VK_DISABLE_F16=1, the slow prompt-eval path still happens (0.27 tok/s for Fourier prompt), so it's not only f16acc — some kernel is still falling back or stalling.
  5. vulkaninfo shows 3 GPU entries for the same device (GPU0/GPU1/GPU2 with identical UUIDs); llama-cli --list-devices correctly shows a single Vulkan0: Intel(R) Arc(tm) Graphics (MTL) (23711 MiB, 21339 MiB free).
  6. ssm_conv.comp, ssm_scan.comp, and gated_delta_net.comp compute shaders are all present in ggml/src/ggml-vulkan/vulkan-shaders/ in the build tree.

Why I think this is a fresh issue, not a duplicate

  • Unlike #18969, GGML_VK_DISABLE_F16=1 does not change the garbage output here.
  • Unlike #20610, the failure is prompt-specific: other prompts are correct on the same build, model, and quant.
  • The prompt-specific ~27× prompt-eval slowdown points at a single kernel / fallback path (plausibly the SSM / Gated Delta Net ops touched by #19957 and #20354) rather than a general Vulkan regression.

Attachments / follow-ups

  • Full llama-simple stderr of the failing run (available on request)
  • Full llama-bench output with and without GGML_VK_DISABLE_F16=1 (shown in repro section)
  • Happy to provide additional logs, run with LLAMA_LOG_DEBUG=1, bisect the specific compute op (e.g. dump tensor state at each layer to narrow down which kernel corrupts output), or test patches on this hardware.
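
In particular, the per-op bisection could start with llama.cpp's own test-backend-ops harness, which compares each backend op against the CPU reference. A sketch (the target is not built by my cmake invocation above, and the -o op names are my guesses at the SSM-related ops, so the exact spellings may differ):

$ cmake --build build --config Release --target test-backend-ops
$ build/bin/test-backend-ops test -o SSM_CONV
$ build/bin/test-backend-ops test -o SSM_SCAN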
