Name and Version
load_backend: loaded RPC backend from C:\Users\Haujet\Downloads\Compressed\bin\ggml-rpc.dll
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) 140T GPU (16GB) (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 32768 | int dot: 1 | matrix cores: none
ggml_vulkan: 1 = NVIDIA GeForce RTX 5050 Laptop GPU (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
load_backend: loaded Vulkan backend from C:\Users\Haujet\Downloads\Compressed\bin\ggml-vulkan.dll
load_backend: loaded CPU backend from C:\Users\Haujet\Downloads\Compressed\bin\ggml-cpu-alderlake.dll
version: 7761 (a89002f)
built with Clang 19.1.5 for Windows x86_64
Operating systems
Windows
Which llama.cpp modules do you know to be affected?
libllama (core library)
Command line
Problem description & steps to reproduce
Issue Description
I am using llama.cpp as the decoder of an ASR (speech-to-text) system. Specifically, Fun-ASR-nano uses a Qwen3-0.6B LLM decoder to decode audio features, and I had the idea of using llama.cpp to accelerate its decoding.
We are injecting embeddings directly (batch.embd) instead of token IDs.
On Intel iGPUs (specifically the Intel Arc 140T) using the Vulkan backend, this operation fails with NaNs/crashes when the batch size exceeds 8.
This appears to be caused by the Vulkan backend defaulting to F16 Accumulation (f16acc) for mul_mat_mat operations on quantized models, which overflows given the magnitude of the injected audio embeddings.
Environment
- OS: Windows 10 22H2
- Hardware: Intel Arc 140T (iGPU)
- Backend: Vulkan (ggml-vulkan)
- Model: Qwen3-0.6B (Quantized: fp16)
- Llama.cpp Version: Release b7783
Reproduction Steps
- Initialize a llama_context with a quantized model (e.g., fp16).
- Prepare a batch of embeddings (batch.embd) representing audio features.
  - Magnitude: values range from approx. -150 to +1200.
  - Sequence length: ~300 tokens (e.g., 10-20 seconds of audio).
- Call llama_decode(ctx, batch). A minimal sketch of this injection path follows below.
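For reference, a minimal sketch of the injection path we use (our own helper, not llama.cpp code; it assumes a llama_context * ctx has already been created for the fp16 GGUF, and feats holds n_pos rows of n_embd floats of audio features):

    #include "llama.h"
    #include <cstring>

    // Fill a llama_batch with raw embeddings (batch.embd) instead of token IDs
    // and decode it in a single call. With n_pos > 8 this is where the NaNs
    // appear on the Intel iGPU.
    static int inject_embeddings(llama_context * ctx, const float * feats,
                                 int32_t n_pos, int32_t n_embd) {
        // embd != 0 makes llama_batch_init allocate batch.embd instead of batch.token
        llama_batch batch = llama_batch_init(n_pos, n_embd, 1);
        batch.n_tokens = n_pos;
        for (int32_t i = 0; i < n_pos; ++i) {
            memcpy(batch.embd + (size_t) i * n_embd,
                   feats      + (size_t) i * n_embd, sizeof(float) * n_embd);
            batch.pos[i]       = i;
            batch.n_seq_id[i]  = 1;
            batch.seq_id[i][0] = 0;
            batch.logits[i]    = (i == n_pos - 1); // only the last position needs logits
        }
        const int ret = llama_decode(ctx, batch);
        llama_batch_free(batch);
        return ret;
    }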
Observed Behavior
- If batch size > 8 (e.g., 512):
  - The Vulkan backend selects the mul_mat_mat pipeline (matrix-matrix multiplication).
  - Result: the output logits become NaN.
  - Cause: the highly optimized mul_mat_mat kernel for quantized types defaults to F16 accumulation if the device supports F16. The dot product of our embeddings (magnitude ~1000) and the weights exceeds the F16 dynamic range (65504), leading to overflow (see the back-of-the-envelope sketch after this list).
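A quick back-of-the-envelope illustration of that overflow (our own toy numbers: the 1024-wide hidden size matches the (155, 1024) feature shape in the logs below, the 0.07 weight magnitude is assumed, and half-precision rounding is only crudely modelled as overflow-to-inf):

    #include <cmath>
    #include <cstdio>

    // Crude model of an F16 accumulator: anything past the largest finite
    // binary16 value (65504) overflows to +/-inf; precision loss is ignored.
    static float as_f16(float x) {
        return std::fabs(x) > 65504.0f ? std::copysign(INFINITY, x) : x;
    }

    int main() {
        const int n_embd = 1024;
        float acc16 = 0.0f, acc32 = 0.0f;
        for (int i = 0; i < n_embd; ++i) {
            const float prod = 1000.0f * 0.07f; // embedding ~1e3 times an assumed weight
            acc16 = as_f16(acc16 + prod);       // F16-style accumulation: hits +inf partway through
            acc32 += prod;                      // F32 accumulation: ~71680, no problem
        }
        std::printf("f16-style acc = %f, f32 acc = %f\n", acc16, acc32);
        return 0;
    }

Once the F16 accumulator saturates to +inf, any later operation mixing infinities (e.g., normalization or softmax) produces NaN, which matches the logits we observe.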
Investigation & Evidence
I have verified this hypothesis through extensive testing:
- Workaround 1: Force Vector Path
  - If we limit the injection chunk size to 8 (processing 8 tokens at a time), llama.cpp falls back to the mul_mat_vec (matrix-vector) kernel.
  - Result: success. Inference is stable and accurate. Reference: ggml_vk_mul_mat_vec handles accumulation differently/more safely. A chunked-decode sketch follows after this list.
- Workaround 2: Use FP32 Model
  - If we use a full FP32 (non-quantized) GGUF model, inference also succeeds.
  - Result: success. The backend selects the f32acc pipeline because the source types are F32.
- Workaround 3: NVIDIA / CPU
  - The exact same code/inputs work perfectly on NVIDIA GPUs (Vulkan & CUDA) and on the CPU.
  - This suggests Intel iGPUs are stricter, or handle F16 overflow/NaNs differently, in the specific mul_mat_mat shader used by llama.cpp.
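A sketch of Workaround 1 (our own code, same assumptions as the reproduction sketch above): submit at most 8 positions per llama_decode call so the backend stays on the mul_mat_vec path.

    #include "llama.h"
    #include <algorithm>
    #include <cstring>

    static int inject_embeddings_chunked(llama_context * ctx, const float * feats,
                                         int32_t n_pos, int32_t n_embd) {
        const int32_t chunk = 8; // anything larger selects the mul_mat_mat / f16acc pipeline on the iGPU
        llama_batch batch = llama_batch_init(chunk, n_embd, 1);
        int ret = 0;
        for (int32_t i0 = 0; i0 < n_pos && ret == 0; i0 += chunk) {
            const int32_t n = std::min(chunk, n_pos - i0);
            batch.n_tokens = n;
            for (int32_t i = 0; i < n; ++i) {
                memcpy(batch.embd + (size_t) i * n_embd,
                       feats + (size_t) (i0 + i) * n_embd, sizeof(float) * n_embd);
                batch.pos[i]       = i0 + i;   // positions continue across chunks
                batch.n_seq_id[i]  = 1;
                batch.seq_id[i][0] = 0;
                batch.logits[i]    = (i0 + i == n_pos - 1);
            }
            ret = llama_decode(ctx, batch);    // the KV cache carries the earlier chunks
        }
        llama_batch_free(batch);
        return ret;
    }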
Request / Proposed Solution
We are using llama.cpp for a specialized workflow (dual-encoder / multi-modal injection), not just standard text generation.
Could you please provide a way to disable F16 accumulation (force_f32_acc) for the Vulkan backend at runtime? For example:
- Env variable: e.g., GGML_VK_FORCE_F32_ACC=1
- Context param: a flag in llama_context_params.
Currently, the selection logic in ggml_vk_get_mul_mat_mat_pipeline appears hardcoded to prefer f16acc whenever the device claims F16 support. A switch to force f32acc would resolve this crash for Intel iGPU users doing high-precision or embedding-injection tasks.
Code Context (ggml-vulkan.cpp)
The issue likely lies here:
    // ggml-vulkan.cpp ~line 5700
    if (prec == GGML_PREC_DEFAULT && ctx->device->fp16 ...) {
        if (src0_type == GGML_TYPE_F16 && src1_type == GGML_TYPE_F32) {
            return ctx->device->pipeline_matmul_f16_f32.f16acc; // <--- OVERFLOWS
        }
    }
We need a way to force the .f32acc path without recompiling the project.
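For illustration, one possible shape of such an override (purely hypothetical sketch: the helper name and the GGML_VK_FORCE_F32_ACC variable are our proposal, not existing ggml code; only the surrounding condition is copied from the excerpt above):

    #include <cstdlib>

    // Hypothetical helper: read the proposed env var once and cache the result.
    static bool ggml_vk_force_f32_acc() {
        static const bool force = std::getenv("GGML_VK_FORCE_F32_ACC") != nullptr;
        return force;
    }

    // ggml-vulkan.cpp ~line 5700, with the extra guard added:
    if (prec == GGML_PREC_DEFAULT && ctx->device->fp16 && !ggml_vk_force_f32_acc()) {
        if (src0_type == GGML_TYPE_F16 && src1_type == GGML_TYPE_F32) {
            return ctx->device->pipeline_matmul_f16_f32.f16acc;
        }
    }
    // when GGML_VK_FORCE_F32_ACC is set, fall through to the .f32acc pipeline instead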
reproduce-crash.zip.zip
First Bad Commit
No response
Relevant log output
Logs
For Nvidia GPU:
Loading embeddings from embedding_slice_0_160000.pkl...
Input Shape: (155, 1024)
Stats: Min=-127.06, Max=1250.23, Mean=0.47
Injecting with Chunk Size: 512 (Expect failure/NaN on iGPU Matrix Path)
Decoding chunk 0:155...
Injection Done in 0.14s
=== Generation Test (Streaming) ===
Output: 是星期日,欢迎收看一千零四起事件消息,请静静介绍话题。去年十月十九日,九百六十七期节目说到委内瑞拉问题,我们
回顾一下你当时的评。<|im_end|><|im_end|>五。<|im_end|><|im_end|><|im_end|>
[FINISHED]
For Intel iGPU:
Loading embeddings from embedding_slice_0_160000.pkl...
Input Shape: (155, 1024)
Stats: Min=-127.06, Max=1250.23, Mean=0.47
Injecting with Chunk Size: 512 (Expect failure/NaN on iGPU Matrix Path)
Decoding chunk 0:155...
Injection Done in 0.02s
=== Generation Test (Streaming) ===
Output:
[FAIL] Logits contain NaN! (Issue Reproduced)
[FINISHED]