Name and Version
load_backend: loaded RPC backend from C:\Users\Haujet\Downloads\Compressed\bin\ggml-rpc.dll
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) 140T GPU (16GB) (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 32768 | int dot: 1 | matrix cores: none
ggml_vulkan: 1 = NVIDIA GeForce RTX 5050 Laptop GPU (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
load_backend: loaded Vulkan backend from C:\Users\Haujet\Downloads\Compressed\bin\ggml-vulkan.dll
load_backend: loaded CPU backend from C:\Users\Haujet\Downloads\Compressed\bin\ggml-cpu-alderlake.dll
version: 7761 (a89002f)
built with Clang 19.1.5 for Windows x86_64
Operating systems
Windows
Which llama.cpp modules do you know to be affected?
libllama (core library)
Command line
Problem description & steps to reproduce
Issue Description
I am using llama.cpp as the decoder of an ASR (speech-to-text) system. Specifically, Fun-ASR-nano uses a Qwen3-0.6B LLM decoder to decode audio features, and I had the idea of using llama.cpp to accelerate its decoding.
We are injecting embeddings directly (batch.embd) instead of token IDs.
On Intel iGPUs (specifically the Intel Arc 140T) using the Vulkan backend, this operation fails with NaNs/crashes when the batch size exceeds 8.
This appears to be caused by the Vulkan backend defaulting to F16 Accumulation (f16acc) for mul_mat_mat operations on quantized models, which overflows given the magnitude of the injected audio embeddings.
Environment
- OS: Windows 10 22H2
- Hardware: Intel Arc 140T (iGPU)
- Backend: Vulkan (ggml-vulkan)
- Model: Qwen3-0.6B (Quantized: fp16)
- Llama.cpp Version: Release b7783
Reproduction Steps
- Initialize a llama_context with a quantized model (e.g., fp16).
- Prepare a batch of embeddings (batch.embd) representing audio features.
  - Magnitude: values range from approx. -150 to +1200.
  - Sequence length: ~300 tokens (e.g., 10-20 seconds of audio).
- Call llama_decode(ctx, batch). A minimal sketch of this injection path follows below.
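For reference, a minimal sketch of the injection path we use (our own helper, not llama.cpp code; it assumes a llama_context * ctx has already been created for the fp16 GGUF, and feats holds n_pos rows of n_embd floats of audio features):

    #include "llama.h"
    #include <cstring>

    // Fill a llama_batch with raw embeddings (batch.embd) instead of token IDs
    // and decode it in a single call. With n_pos > 8 this is where the NaNs
    // appear on the Intel iGPU.
    static int inject_embeddings(llama_context * ctx, const float * feats,
                                 int32_t n_pos, int32_t n_embd) {
        // embd != 0 makes llama_batch_init allocate batch.embd instead of batch.token
        llama_batch batch = llama_batch_init(n_pos, n_embd, 1);
        batch.n_tokens = n_pos;
        for (int32_t i = 0; i < n_pos; ++i) {
            memcpy(batch.embd + (size_t) i * n_embd,
                   feats      + (size_t) i * n_embd, sizeof(float) * n_embd);
            batch.pos[i]       = i;
            batch.n_seq_id[i]  = 1;
            batch.seq_id[i][0] = 0;
            batch.logits[i]    = (i == n_pos - 1); // only the last position needs logits
        }
        const int ret = llama_decode(ctx, batch);
        llama_batch_free(batch);
        return ret;
    }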
Observed Behavior
- If batch size > 8 (e.g., 512):
  - The Vulkan backend selects the mul_mat_mat pipeline (matrix-matrix multiplication).
  - Result: the output logits become NaN.
  - Cause: the highly optimized mul_mat_mat kernel for quantized types defaults to F16 accumulation if the device supports F16. The dot product of our embeddings (magnitude ~1000) and the weights exceeds the F16 dynamic range (65504), leading to overflow (see the back-of-the-envelope sketch after this list).
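A quick back-of-the-envelope illustration of that overflow (our own toy numbers: the 1024-wide hidden size matches the (155, 1024) feature shape in the logs below, the 0.07 weight magnitude is assumed, and half-precision rounding is only crudely modelled as overflow-to-inf):

    #include <cmath>
    #include <cstdio>

    // Crude model of an F16 accumulator: anything past the largest finite
    // binary16 value (65504) overflows to +/-inf; precision loss is ignored.
    static float as_f16(float x) {
        return std::fabs(x) > 65504.0f ? std::copysign(INFINITY, x) : x;
    }

    int main() {
        const int n_embd = 1024;
        float acc16 = 0.0f, acc32 = 0.0f;
        for (int i = 0; i < n_embd; ++i) {
            const float prod = 1000.0f * 0.07f; // embedding ~1e3 times an assumed weight
            acc16 = as_f16(acc16 + prod);       // F16-style accumulation: hits +inf partway through
            acc32 += prod;                      // F32 accumulation: ~71680, no problem
        }
        std::printf("f16-style acc = %f, f32 acc = %f\n", acc16, acc32);
        return 0;
    }

Once the F16 accumulator saturates to +inf, any later operation mixing infinities (e.g., normalization or softmax) produces NaN, which matches the logits we observe.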
Investigation & Evidence
I have verified this hypothesis through extensive testing:
- Workaround 1: Force Vector Path
  - If we limit the injection chunk size to 8 (processing 8 tokens at a time), llama.cpp falls back to the mul_mat_vec (matrix-vector) kernel.
  - Result: success. Inference is stable and accurate. Reference: ggml_vk_mul_mat_vec handles accumulation differently/more safely. A chunked-decode sketch follows after this list.
- Workaround 2: Use FP32 Model
  - If we use a full FP32 (non-quantized) GGUF model, inference also succeeds.
  - Result: success. The backend selects the f32acc pipeline because the source types are F32.
- Workaround 3: NVIDIA / CPU
  - The exact same code/inputs work perfectly on NVIDIA GPUs (Vulkan & CUDA) and on the CPU.
  - This suggests Intel iGPUs are stricter, or handle F16 overflow/NaNs differently, in the specific mul_mat_mat shader used by llama.cpp.
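A sketch of Workaround 1 (our own code, same assumptions as the reproduction sketch above): submit at most 8 positions per llama_decode call so the backend stays on the mul_mat_vec path.

    #include "llama.h"
    #include <algorithm>
    #include <cstring>

    static int inject_embeddings_chunked(llama_context * ctx, const float * feats,
                                         int32_t n_pos, int32_t n_embd) {
        const int32_t chunk = 8; // anything larger selects the mul_mat_mat / f16acc pipeline on the iGPU
        llama_batch batch = llama_batch_init(chunk, n_embd, 1);
        int ret = 0;
        for (int32_t i0 = 0; i0 < n_pos && ret == 0; i0 += chunk) {
            const int32_t n = std::min(chunk, n_pos - i0);
            batch.n_tokens = n;
            for (int32_t i = 0; i < n; ++i) {
                memcpy(batch.embd + (size_t) i * n_embd,
                       feats + (size_t) (i0 + i) * n_embd, sizeof(float) * n_embd);
                batch.pos[i]       = i0 + i;   // positions continue across chunks
                batch.n_seq_id[i]  = 1;
                batch.seq_id[i][0] = 0;
                batch.logits[i]    = (i0 + i == n_pos - 1);
            }
            ret = llama_decode(ctx, batch);    // the KV cache carries the earlier chunks
        }
        llama_batch_free(batch);
        return ret;
    }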
Request / Proposed Solution
We are using llama.cpp for a specialized workflow (dual-encoder / multi-modal injection), not just standard text generation.
Could you please provide a way to disable F16 accumulation (force_f32_acc) for the Vulkan backend at runtime? For example:
- Env variable: e.g., GGML_VK_FORCE_F32_ACC=1
- Context param: a flag in llama_context_params.
Currently, the selection logic in ggml_vk_get_mul_mat_mat_pipeline appears hardcoded to prefer f16acc whenever the device claims F16 support. A switch to force f32acc would resolve this crash for Intel iGPU users doing high-precision or embedding-injection tasks.
Code Context (ggml-vulkan.cpp)
The issue likely lies here:
    // ggml-vulkan.cpp ~line 5700
    if (prec == GGML_PREC_DEFAULT && ctx->device->fp16 ...) {
        if (src0_type == GGML_TYPE_F16 && src1_type == GGML_TYPE_F32) {
            return ctx->device->pipeline_matmul_f16_f32.f16acc; // <--- OVERFLOWS
        }
    }
We need a way to force the .f32acc path without recompiling the project.
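For illustration, one possible shape of such an override (purely hypothetical sketch: the helper name and the GGML_VK_FORCE_F32_ACC variable are our proposal, not existing ggml code; only the surrounding condition is copied from the excerpt above):

    #include <cstdlib>

    // Hypothetical helper: read the proposed env var once and cache the result.
    static bool ggml_vk_force_f32_acc() {
        static const bool force = std::getenv("GGML_VK_FORCE_F32_ACC") != nullptr;
        return force;
    }

    // ggml-vulkan.cpp ~line 5700, with the extra guard added:
    if (prec == GGML_PREC_DEFAULT && ctx->device->fp16 && !ggml_vk_force_f32_acc()) {
        if (src0_type == GGML_TYPE_F16 && src1_type == GGML_TYPE_F32) {
            return ctx->device->pipeline_matmul_f16_f32.f16acc;
        }
    }
    // when GGML_VK_FORCE_F32_ACC is set, fall through to the .f32acc pipeline instead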
reproduce-crash.zip.zip
First Bad Commit
No response
Relevant log output
Logs
For Nvidia GPU:
Loading embeddings from embedding_slice_0_160000.pkl...
Input Shape: (155, 1024)
Stats: Min=-127.06, Max=1250.23, Mean=0.47
Injecting with Chunk Size: 512 (Expect failure/NaN on iGPU Matrix Path)
Decoding chunk 0:155...
Injection Done in 0.14s
=== Generation Test (Streaming) ===
Output: 是星期日,欢迎收看一千零四起事件消息,请静静介绍话题。去年十月十九日,九百六十七期节目说到委内瑞拉问题,我们
回顾一下你当时的评。<|im_end|><|im_end|>五。<|im_end|><|im_end|><|im_end|>
[FINISHED]
For Intel iGPU:
Loading embeddings from embedding_slice_0_160000.pkl...
Input Shape: (155, 1024)
Stats: Min=-127.06, Max=1250.23, Mean=0.47
Injecting with Chunk Size: 512 (Expect failure/NaN on iGPU Matrix Path)
Decoding chunk 0:155...
Injection Done in 0.02s
=== Generation Test (Streaming) ===
Output:
[FAIL] Logits contain NaN! (Issue Reproduced)
[FINISHED]