Eval bug: Qwen3.5-35B-A3B produces garbage output on Vulkan backend for specific prompts (Intel Arc MTL iGPU), correct on CPU #21888

@spf1983

Description

Summary

On the Intel Arc Graphics (Meteor Lake, Xe-LPG) integrated GPU with Mesa ANV 25.2.8, the qwen3_5moe / Qwen3.5-35B-A3B Q4_K_M model produces valid output for short, high-frequency greedy completions (e.g. "The capital of France is" → "Paris"), but emits deterministic repeating-punctuation garbage (!"!"""...) for other prompts — notably anything with "Fourier transform" — while the same model + same quant + same prompt on CPU (-ngl 0) works correctly. The GGML_VK_DISABLE_F16=1 workaround from #18969 does not fix this.

Accompanying symptom: when a prompt reproduces the bug, prompt eval drops from ~13 tok/s to ~0.48 tok/s (roughly CPU speed), strongly suggesting that a specific kernel hits a correctness failure and/or a CPU fallback for these token sequences.

This may be related to #19957 (SSM kernels), #20354 (Gated Delta Net shader), #20610 (Qwen3.5-27B garbage since b8184), or #18969 (f16 acc), but is distinct from all of them in behavior. Reporting separately in case it's a residual GDN / hybrid-memory path bug.

Name and Version

version: 1 (e21cdc1)
built with GNU 13.3.0 for Linux x86_64

Built from commit e21cdc1 (tip of master at the time of testing, which already contains the ssm_conv.comp, ssm_scan.comp, and gated_delta_net.comp shaders).

Build command:

cmake -B build -DGGML_VULKAN=ON -DBUILD_SHARED_LIBS=OFF -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc) --target llama-cli llama-bench llama-simple
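
As a quick sanity check that the resulting binary actually sees the iGPU (standard Mesa vulkaninfo plus llama.cpp's own device listing; nothing here is specific to this bug):

$ vulkaninfo --summary
$ build/bin/llama-cli --list-devices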

Operating systems

Linux

GGML backends

Vulkan

Hardware

  • CPU: Intel Core Ultra 5 125H (Meteor Lake, 18 threads)
  • iGPU: Intel Arc Graphics (MTL), PCI [8086:7d55], Xe-LPG
    • Kernel driver: i915 (xe also loaded)
    • Mesa 25.2.8-0ubuntu0.24.04.1 (ANV / Intel open-source Mesa driver)
    • Vulkan API 1.4.318, driver 25.2.8
    • uma: 1, fp16: 1, bf16: 0, warp size 32, no matrix cores
  • RAM: 32 GiB DDR5-4800 (2×16G)
  • Kernel: 6.8.1-t6-generic (Ubuntu 24.04 derivative)
  • /dev/dri/renderD128 accessible (user in render + video groups)

vulkaninfo --summary excerpt:

GPU0:
  apiVersion         = 1.4.318
  driverVersion      = 25.2.8
  vendorID           = 0x8086
  deviceID           = 0x7d55
  deviceType         = PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
  deviceName         = Intel(R) Arc(tm) Graphics (MTL)
  driverID           = DRIVER_ID_INTEL_OPEN_SOURCE_MESA
  driverInfo         = Mesa 25.2.8-0ubuntu0.24.04.1

Models

unsloth/Qwen3.5-35B-A3B-GGUF, file Qwen3.5-35B-A3B-Q4_K_M.gguf

  • Size: 22016023168 bytes (20.49 GiB)
  • SHA256: 3b46d1066bc91cc2d613e3bc22ce691dd77e6f0d33c9060690d24ce6de494375
  • llama.cpp architecture label: qwen35moe 35B.A3B Q4_K - Medium, 34.66 B params
  • Loader reports use of llama_memory_recurrent, fused Gated Delta Net (autoregressive/chunked) enabled

Reproduction

Step 1 — Load the model and confirm Vulkan full offload works

$ build/bin/llama-bench -m Qwen3.5-35B-A3B-Q4_K_M.gguf -p 64 -n 32 -ngl 99 -fa 1
ggml_vulkan: 0 = Intel(R) Arc(tm) Graphics (MTL) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | bf16: 0
| model                              |    size |  params | backend | ngl | fa | test | t/s              |
| qwen35moe 35B.A3B Q4_K - Medium    | 20.49 GiB | 34.66 B | Vulkan |  99 |  1 | pp64 | 28.32 ± 0.95     |
| qwen35moe 35B.A3B Q4_K - Medium    | 20.49 GiB | 34.66 B | Vulkan |  99 |  1 | tg32 |  6.02 ± 0.01     |

Step 2 — Working case (Vulkan, -ngl 99)

$ build/bin/llama-simple -m Qwen3.5-35B-A3B-Q4_K_M.gguf -n 40 -ngl 99 "The capital of France is"

Output (correct):

The capital of France is Paris.
The capital of France is Paris.
The capital of France is Paris.
...

Perf: prompt 13.04 tok/s, gen 6.03 tok/s.

Also working on Vulkan:

  • "北京的著名景点有""故宫、天坛、颐和园、长城..."
  • "def fibonacci(n):" → valid Python continuation
  • "Quantum computing is" → coherent English prose

Step 3 — Failing case (Vulkan, -ngl 99), deterministic garbage

$ build/bin/llama-simple -m Qwen3.5-35B-A3B-Q4_K_M.gguf -n 40 -ngl 99 "Question: What is the Fourier transform? Answer:"

Output (garbage, reproduced 3+ times):

Question: What is the Fourier transform? Answer:!"!!!!!!!!!"""""""""""""""""""""""""""""

Perf:

  • prompt eval: 0.48 tok/s / 2075 ms per token (vs ~13 tok/s on the working prompt — a ~27× regression)
  • gen eval: 6.00 tok/s (normal)

Also fails on Vulkan with the same garbage pattern (a loop that re-runs the full prompt matrix follows this list):

  • "用一句话解释什么是傅里叶变换:" (Chinese)
  • "Question: What is the Fourier transform? Answer in one sentence."

Step 4 — Control: CPU (-ngl 0) with the same failing prompt

$ build/bin/llama-simple -m Qwen3.5-35B-A3B-Q4_K_M.gguf -n 20 -ngl 0 "Question: What is the Fourier transform? Answer:"

Output (correct):

Question: What is the Fourier transform? Answer: The Fourier transform is a mathematical operation that decomposes a function into its constituent frequencies. It is used

Perf: 0.39 tok/s decode (expected for 35B-A3B on 18-thread CPU).

Same binary, same model, same quant, same prompt — only the -ngl value differs. CPU produces correct output; Vulkan produces garbage.
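
A minimal A/B sketch of this control, with everything held constant except -ngl. llama-simple decodes greedily, so the runs are deterministic and a non-empty diff indicates backend divergence:

MODEL=Qwen3.5-35B-A3B-Q4_K_M.gguf
PROMPT="Question: What is the Fourier transform? Answer:"
build/bin/llama-simple -m "$MODEL" -n 20 -ngl 99 "$PROMPT" > out.vulkan.txt 2>/dev/null
build/bin/llama-simple -m "$MODEL" -n 20 -ngl 0  "$PROMPT" > out.cpu.txt    2>/dev/null
diff out.cpu.txt out.vulkan.txt   # non-empty diff = CPU/Vulkan divergence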

Attempted workarounds (all failed to fix this bug)

GGML_VK_DISABLE_F16=1

$ GGML_VK_DISABLE_F16=1 build/bin/llama-simple -m ... -n 40 -ngl 99 "Question: What is the Fourier transform? Answer:"
Question: What is the Fourier transform? Answer:!!!!!!!!!!!!!!!""""""""!!!!"""""""""""""
  • Device caps report fp16: 0 after setting the env var, and llama-bench pp64 rises from 28.32 to 41.45 tok/s (a ~46% gain), confirming the variable is honored
  • The working prompts still work; the failing prompts still fail with the same pattern

Additional observations

  1. The bug pattern is deterministic and prompt-specific — same prompt produces same garbage across runs.
  2. The garbage is always !/"/# or similar punctuation tokens (lowest-index ASCII-like vocab IDs), suggesting logits collapse toward a fixed set of tokens — consistent with NaN/Inf or very negative activations propagating through softmax.
  3. The ~27× prompt-eval slowdown on failing prompts (without workaround) is notable — it implies the compute graph for these prompts is taking a different (slow) path, possibly a CPU fallback for a specific op that also corrupts hidden state on the GPU↔CPU boundary.
  4. With GGML_VK_DISABLE_F16=1, the slow prompt-eval path still happens (0.27 tok/s for Fourier prompt), so it's not only f16acc — some kernel is still falling back or stalling.
  5. vulkaninfo shows 3 GPU entries for the same device (GPU0/GPU1/GPU2 with identical UUIDs); llama-cli --list-devices correctly shows a single Vulkan0: Intel(R) Arc(tm) Graphics (MTL) (23711 MiB, 21339 MiB free).
  6. ssm_conv.comp, ssm_scan.comp, and gated_delta_net.comp compute shaders are all present in ggml/src/ggml-vulkan/vulkan-shaders/ in the build tree.

Why I think this is a fresh issue, not a duplicate

  • Unlike #18969, GGML_VK_DISABLE_F16=1 does not change the garbage output here.
  • Unlike #20610, the failure is prompt-specific: other prompts are correct on the same build, model, and quant.
  • The prompt-specific ~27× prompt-eval slowdown points at a single kernel / fallback path (plausibly the SSM / Gated Delta Net ops touched by #19957 and #20354) rather than a general Vulkan regression.

Attachments / follow-ups

  • Full llama-simple stderr of the failing run (available on request)
  • Full llama-bench output with and without GGML_VK_DISABLE_F16=1 (shown in repro section)
  • Happy to provide additional logs, run with LLAMA_LOG_DEBUG=1, bisect the specific compute op (e.g. dump tensor state at each layer to narrow down which kernel corrupts output), or test patches on this hardware.
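
In particular, the per-op bisection could start with llama.cpp's own test-backend-ops harness, which compares each backend op against the CPU reference. A sketch (the target is not built by my cmake invocation above, and the -o op names are my guesses at the SSM-related ops, so the exact spellings may differ):

$ cmake --build build --config Release --target test-backend-ops
$ build/bin/test-backend-ops test -o SSM_CONV
$ build/bin/test-backend-ops test -o SSM_SCAN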
