
Misc. [Bug]: Vulkan Backend - NaN on Intel iGPU when injecting Embeddings (f16 acc overflow) #18969

@HaujetZhao

Description

Name and Version

load_backend: loaded RPC backend from C:\Users\Haujet\Downloads\Compressed\bin\ggml-rpc.dll
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) 140T GPU (16GB) (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 32768 | int dot: 1 | matrix cores: none
ggml_vulkan: 1 = NVIDIA GeForce RTX 5050 Laptop GPU (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
load_backend: loaded Vulkan backend from C:\Users\Haujet\Downloads\Compressed\bin\ggml-vulkan.dll
load_backend: loaded CPU backend from C:\Users\Haujet\Downloads\Compressed\bin\ggml-cpu-alderlake.dll
version: 7761 (a89002f)
built with Clang 19.1.5 for Windows x86_64

Operating systems

Windows

Which llama.cpp modules do you know to be affected?

libllama (core library)

Command line

Problem description & steps to reproduce

Issue Description

I am using llama.cpp as the decoder of an ASR (speech-to-text) system. Specifically, Fun-ASR-nano uses a Qwen3-0.6B LLM as its audio-feature decoder, and the idea is to use llama.cpp to accelerate its decoding.

We are injecting embeddings directly (batch.embd) instead of token IDs.
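
For context, a minimal sketch of how the embeddings are injected (assuming ctx is an initialized llama_context, feats holds n_tokens * n_embd audio-feature floats, and n_embd is the model's hidden size; these names come from our harness, not from llama.cpp):

// Sketch: passing a non-zero `embd` to llama_batch_init allocates batch.embd
// (floats) instead of batch.token, so llama_decode consumes raw embeddings.
// Requires <cstring> and <cstdio>.
llama_batch batch = llama_batch_init(n_tokens, n_embd, 1);
batch.n_tokens = n_tokens;
for (int i = 0; i < n_tokens; ++i) {
    memcpy(batch.embd + (size_t) i * n_embd, feats + (size_t) i * n_embd,
           n_embd * sizeof(float));
    batch.pos[i]       = i;
    batch.n_seq_id[i]  = 1;
    batch.seq_id[i][0] = 0;
    batch.logits[i]    = (i == n_tokens - 1); // only need logits at the last position
}
if (llama_decode(ctx, batch) != 0) {
    fprintf(stderr, "llama_decode failed\n");
}
llama_batch_free(batch);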

On Intel iGPUs (specifically the Intel Arc 140T) using the Vulkan backend, this operation produces NaNs (and sometimes crashes) when the batch size exceeds 8.

This appears to be caused by the Vulkan backend defaulting to F16 Accumulation (f16acc) for mul_mat_mat operations on quantized models, which overflows given the magnitude of the injected audio embeddings.

Environment

  • OS: Windows 10 22H2
  • Hardware: Intel Arc 140T (iGPU)
  • Backend: Vulkan (ggml-vulkan)
  • Model: Qwen3-0.6B (Quantized: fp16)
  • Llama.cpp Version: Release b7783

Reproduction Steps

  1. Initialize a llama_context with a Quantized Model (e.g., fp16).
  2. Prepare a batch of embeddings (batch.embd) representing audio features.
    • Magnitude: Values range from approx -150 to +1200.
    • Sequence Length: ~300 tokens (e.g., 10-20 seconds of audio).
  3. Call llama_decode(ctx, batch) and check the returned logits for NaN (see the sketch below).
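
The failure in step 3 is detected by scanning the logits returned for the last position; a minimal sketch of the check (n_vocab is assumed to be the model's vocabulary size):

// Sketch: after llama_decode, inspect the last-position logits for NaN.
// Requires <cmath> and <cstdio>.
const float * logits = llama_get_logits_ith(ctx, batch.n_tokens - 1);
bool has_nan = false;
for (int i = 0; i < n_vocab; ++i) {
    if (std::isnan(logits[i])) { has_nan = true; break; }
}
printf(has_nan ? "[FAIL] Logits contain NaN!\n" : "logits look sane\n");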

Observed Behavior

  • If Batch Size > 8 (e.g., 512):
    • The Vulkan backend selects the mul_mat_mat pipeline (Matrix-Matrix multiplication).
    • Result: The output logits become NaN.
    • Cause: The highly optimized mul_mat_mat kernel for quantized types defaults to F16 accumulation if the device supports F16. The dot product of our embeddings (magnitude ~1000) with the weights exceeds the F16 dynamic range (max 65504), leading to overflow; see the worked example below.
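
A rough worked example of the overflow (the weight magnitude used here is an assumption for illustration, not measured from the model):

// Illustration: with hidden size 1024 and embedding values of magnitude ~1000,
// even modest weights push the running dot product past the f16 maximum (65504).
const int   n_embd = 1024;     // Qwen3-0.6B hidden size (matches the (155, 1024) input)
const float x_mag  = 1000.0f;  // observed embedding magnitude
const float w_mag  = 0.1f;     // assumed typical weight magnitude
float acc = 0.0f;
for (int i = 0; i < n_embd; ++i) {
    acc += x_mag * w_mag;      // partial sums reach ~102400 in f32
}
// If the kernel keeps `acc` in f16 instead, it saturates to +inf partway
// through the loop, and downstream ops turn the inf into NaN.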

Investigation & Evidence

I have verified this hypothesis through extensive testing:

  1. Workaround 1: Force Vector Path

    • If we limit the injection chunk size to 8 (processing 8 tokens at a time), llama.cpp falls back to the mul_mat_vec (matrix-vector) kernel.
    • Result: Success. Inference is stable and accurate; ggml_vk_mul_mat_vec apparently handles accumulation more safely. A sketch of this chunked decode follows the list below.
  2. Workaround 2: Use FP32 Model

    • If we use a full FP32 (non-quantized) GGUF model.
    • Result: Success. The backend selects the f32acc pipeline because source types are F32.
  3. Workaround 3: NVIDIA / CPU

    • The exact same code/inputs work perfectly on NVIDIA GPUs (Vulkan & CUDA) and CPU.
    • This suggests Intel iGPUs are stricter or handle F16 overflow/NaNs differently in the specific mul_mat_mat shader used by llama.cpp.
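
A sketch of Workaround 1 (chunked injection), reusing the names from the injection sketch above; the threshold of 8 is what we observed, not a documented constant:

// Decode the embeddings at most 8 positions at a time so the Vulkan backend
// stays on the mul_mat_vec path instead of mul_mat_mat.
const int chunk = 8;
for (int start = 0; start < n_tokens; start += chunk) {
    const int n = (n_tokens - start < chunk) ? n_tokens - start : chunk;
    llama_batch batch = llama_batch_init(n, n_embd, 1);
    batch.n_tokens = n;
    for (int i = 0; i < n; ++i) {
        memcpy(batch.embd + (size_t) i * n_embd,
               feats + (size_t) (start + i) * n_embd, n_embd * sizeof(float));
        batch.pos[i]       = start + i;   // positions continue across chunks
        batch.n_seq_id[i]  = 1;
        batch.seq_id[i][0] = 0;
        batch.logits[i]    = (start + i == n_tokens - 1);
    }
    if (llama_decode(ctx, batch) != 0) { /* handle error */ }
    llama_batch_free(batch);
}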

Request / Proposed Solution

We are using llama.cpp for a specialized workflow (Dual-Encoder / Multi-modal injection), not just standard text generation.

Could you please provide a way to Disable F16 Accumulation (force_f32_acc) for the Vulkan backend at runtime?

  • Env Variable: e.g., GGML_VK_FORCE_F32_ACC=1
  • Context Param: A flag in llama_context_params.

Currently, the selection logic in ggml_vk_get_mul_mat_mat_pipeline seems hardcoded to prefer f16acc if the device claims F16 support. A switch to force f32acc would solve this crash for Intel iGPU users doing high-precision or embedding-injection tasks.

Code Context (ggml-vulkan.cpp)

The issue likely lies here:

// ggml-vulkan.cpp ~line 5700
if (prec == GGML_PREC_DEFAULT && ctx->device->fp16 ...) {
    if (src0_type == GGML_TYPE_F16 && src1_type == GGML_TYPE_F32) {
        return ctx->device->pipeline_matmul_f16_f32.f16acc; // <--- OVERFLOWS 
    }
}

We need a way to force the .f32acc path without recompiling the project.
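
A rough sketch of what the requested override could look like (this is a proposal, not existing llama.cpp code; GGML_VK_FORCE_F32_ACC is the variable name suggested above):

// Proposed: let an environment variable veto the f16acc pipeline selection.
static bool ggml_vk_force_f32_acc() {
    const char * env = getenv("GGML_VK_FORCE_F32_ACC");
    return env != nullptr && env[0] == '1';
}

// inside the pipeline selection (~line 5700):
if (prec == GGML_PREC_DEFAULT && ctx->device->fp16 && !ggml_vk_force_f32_acc()) {
    if (src0_type == GGML_TYPE_F16 && src1_type == GGML_TYPE_F32) {
        return ctx->device->pipeline_matmul_f16_f32.f16acc;
    }
}
// otherwise fall through to the f32acc pipelines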

reproduce-crash.zip.zip

First Bad Commit

No response

Relevant log output

Logs

For Nvidia GPU:

Loading embeddings from embedding_slice_0_160000.pkl...
Input Shape: (155, 1024)
Stats: Min=-127.06, Max=1250.23, Mean=0.47
Injecting with Chunk Size: 512 (Expect failure/NaN on iGPU Matrix Path)
  Decoding chunk 0:155...
Injection Done in 0.14s

=== Generation Test (Streaming) ===
  Output: 是星期日,欢迎收看一千零四起事件消息,请静静介绍话题。去年十月十九日,九百六十七期节目说到委内瑞拉问题,我们 
回顾一下你当时的评。<|im_end|><|im_end|>五。<|im_end|><|im_end|><|im_end|>

[FINISHED]

For Intel iGPU:

Loading embeddings from embedding_slice_0_160000.pkl...
Input Shape: (155, 1024)
Stats: Min=-127.06, Max=1250.23, Mean=0.47
Injecting with Chunk Size: 512 (Expect failure/NaN on iGPU Matrix Path)
  Decoding chunk 0:155...
Injection Done in 0.02s

=== Generation Test (Streaming) ===
  Output:
  [FAIL] Logits contain NaN! (Issue Reproduced)


[FINISHED]
