Description
Name and Version
load_backend: loaded RPC backend from C:\Users\metal\Downloads\llama-b7066-bin-win-cpu-x64\ggml-rpc.dll
load_backend: loaded CPU backend from C:\Users\metal\Downloads\llama-b7066-bin-win-cpu-x64\ggml-cpu-alderlake.dll
version: 7066 (1568d13)
built with clang version 19.1.5 for x86_64-pc-windows-msvc
Operating systems
Windows
GGML backends
CPU, CUDA
Hardware
Intel i5-12500H + RTX 3050M
Models
LFM2-VL-1.6B-Q8_0.gguf
mmproj-LFM2-VL-1.6B-Q8_0.gguf
Problem description & steps to reproduce
When running LFM2-VL-1.6B on an image where one side is 1024 px, the outputs differ vastly from the original PyTorch model's. Not just different, but totally erroneous.
Example
llama-mtmd-cli.exe -m LFM2-VL-1.6B-Q8_0.gguf --mmproj mmproj-LFM2-VL-1.6B-Q8_0.gguf -n 30 --image onefoul_512.png -p "OCR" -ngl 99 -t 0.0 -s 72
Outputs:
Guis mod est que motor modi modi doluplis deison et que que nosen-
That's the 512x64 image. Okay results so far. Now here's the 1024x64 image:
llama-mtmd-cli.exe -m LFM2-VL-1.6B-Q8_0.gguf --mmproj mmproj-LFM2-VL-1.6B-Q8_0.gguf -n 30 --image onefoul_1024.png -p "OCR." -ngl 99 -t 0.0 -s 72
Outputs:
Il semble que vous ayez une phrase ou un texte qui a été déformé ou distordu par une distorsion d'image ou une manipulation ("It seems you have a phrase or text that has been deformed or distorted by an image distortion or a manipulation")
Now if we use the original PyTorch model on the 1024x64 image, we get this output:
Cius modus est que modi dolipsis desion et que que nonsens
Much more relevant to the input image, unlike the GGUF.
Other odd behavior shows up on images with a 1024 px side as well. I'm currently fine-tuning OCR variants of LFM2-VL, and I've noticed that even the unquantized GGUFs give broken responses for other 1024 px images.
It's as if the model isn't "seeing" the image properly anymore. If I understand correctly, LFM2-VL has some sort of tiled preprocessing and a global thumbnail for large images. Is all that logic fully implemented in llama.cpp for LFM2-VL?
Using GPU and/or disabling flash attention seems to make no difference. Images at 512 pixels or under have no issue.
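As a side note, the token counts in the two runs below (n_tokens_batch = 69 for the 512x64 image, 64 for the 1024x64 image) are consistent with a simple pixel-budget resize derived from the logged hparams (patch_size = 16, n_merge = 2, image_min_pixels = 65536, image_max_pixels = 262144). This is only a sketch of that arithmetic, not llama.cpp's actual preprocessing; the exact rounding and any tiling in its clip implementation may differ:

```python
import math

# Hyperparameters taken from the clip_model_loader / load_hparams log below.
PATCH = 16           # patch_size
N_MERGE = 2          # 2x2 patch merging -> one token per 32x32 px block
MIN_PIXELS = 65536   # image_min_pixels (= 256 * 256)
MAX_PIXELS = 262144  # image_max_pixels (= 512 * 512)
ALIGN = PATCH * N_MERGE  # assume dimensions snap to 32 px multiples

def vision_tokens(w: int, h: int) -> tuple[int, int, int]:
    """Scale (w, h) into the pixel budget, snap to ALIGN, count tokens.

    Sketch of the common "smart resize" pattern; the rounding mode
    llama.cpp actually uses is an assumption here.
    """
    area = w * h
    scale = 1.0
    if area < MIN_PIXELS:
        scale = math.sqrt(MIN_PIXELS / area)   # upscale small images
    elif area > MAX_PIXELS:
        scale = math.sqrt(MAX_PIXELS / area)   # downscale large images
    w2 = max(ALIGN, round(w * scale / ALIGN) * ALIGN)
    h2 = max(ALIGN, round(h * scale / ALIGN) * ALIGN)
    return w2, h2, (w2 // ALIGN) * (h2 // ALIGN)

print(vision_tokens(512, 64))    # -> (736, 96, 69):  matches n_tokens_batch = 69
print(vision_tokens(1024, 64))   # -> (1024, 64, 64): matches n_tokens_batch = 64
```

Note that 1024 x 64 = 65536 lands exactly on image_min_pixels, so under this arithmetic the image passes through unscaled, while the 512 x 64 image is upscaled by sqrt(2) to meet the minimum.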
I also tested build b6195, shortly after LFM2-VL support was introduced; sadly, similar behavior can be observed there too.
First Bad Commit
No response
Relevant log output
llama-mtmd-cli.exe -m LFM2-VL-1.6B-Q8_0.gguf --mmproj mmproj-LFM2-VL-1.6B-Q8_0.gguf -n 30 --image onefoul_1024.png -p "OCR." -ngl 99 -t 0.0 -s 72
...
clip_model_loader: model name: LFM2 VL 1.6B
clip_model_loader: description:
clip_model_loader: GGUF version: 3
clip_model_loader: alignment: 32
clip_model_loader: n_tensors: 427
clip_model_loader: n_kv: 27
clip_model_loader: has vision encoder
clip_ctx: CLIP using CPU backend
load_hparams: projector: lfm2
load_hparams: n_embd: 1152
load_hparams: n_head: 16
load_hparams: n_ff: 4304
load_hparams: n_layer: 26
load_hparams: ffn_op: gelu
load_hparams: projection_dim: 2048
--- vision hparams ---
load_hparams: image_size: 256
load_hparams: patch_size: 16
load_hparams: has_llava_proj: 0
load_hparams: minicpmv_version: 0
load_hparams: n_merge: 2
load_hparams: n_wa_pattern: 0
load_hparams: image_min_pixels: 65536
load_hparams: image_max_pixels: 262144
load_hparams: model size: 537.96 MiB
load_hparams: metadata size: 0.15 MiB
alloc_compute_meta: warmup with image size = 512 x 512
alloc_compute_meta: CPU compute buffer size = 30.31 MiB
alloc_compute_meta: graph splits = 1, nodes = 840
warmup: flash attention is enabled
main: loading model: LFM2-VL-1.6B-Q8_0.gguf
encoding image slice...
image slice encoded in 666 ms
decoding image batch 1/1, n_tokens_batch = 69
image decoded (batch 1/1) in 310 ms
"Guis mod est que motor modi modi doluplis deison et que que nosen-"
llama_perf_context_print: load time = 318.87 ms
llama_perf_context_print: prompt eval time = 1076.26 ms / 80 tokens ( 13.45 ms per token, 74.33 tokens per second)
llama_perf_context_print: eval time = 935.66 ms / 24 runs ( 38.99 ms per token, 25.65 tokens per second)
llama_perf_context_print: total time = 2403.07 ms / 104 tokens
llama_perf_context_print: graphs reused = 0
llama-mtmd-cli.exe -m LFM2-VL-1.6B-Q8_0.gguf --mmproj mmproj-LFM2-VL-1.6B-Q8_0.gguf -n 30 --image onefoul_1024.png -p "OCR" -ngl 99 -t 0.0 -s 72
...
clip_model_loader: model name: LFM2 VL 1.6B
clip_model_loader: description:
clip_model_loader: GGUF version: 3
clip_model_loader: alignment: 32
clip_model_loader: n_tensors: 427
clip_model_loader: n_kv: 27
clip_model_loader: has vision encoder
clip_ctx: CLIP using CPU backend
load_hparams: projector: lfm2
load_hparams: n_embd: 1152
load_hparams: n_head: 16
load_hparams: n_ff: 4304
load_hparams: n_layer: 26
load_hparams: ffn_op: gelu
load_hparams: projection_dim: 2048
--- vision hparams ---
load_hparams: image_size: 256
load_hparams: patch_size: 16
load_hparams: has_llava_proj: 0
load_hparams: minicpmv_version: 0
load_hparams: n_merge: 2
load_hparams: n_wa_pattern: 0
load_hparams: image_min_pixels: 65536
load_hparams: image_max_pixels: 262144
load_hparams: model size: 537.96 MiB
load_hparams: metadata size: 0.15 MiB
alloc_compute_meta: warmup with image size = 512 x 512
alloc_compute_meta: CPU compute buffer size = 30.31 MiB
alloc_compute_meta: graph splits = 1, nodes = 840
warmup: flash attention is enabled
main: loading model: LFM2-VL-1.6B-Q8_0.gguf
encoding image slice...
image slice encoded in 621 ms
decoding image batch 1/1, n_tokens_batch = 64
image decoded (batch 1/1) in 346 ms
Il semble que vous ayez une phrase ou un texte qui a été inversé ou décalé. Pourriez-vous me donner plus de contexte
llama_perf_context_print: load time = 316.52 ms
llama_perf_context_print: prompt eval time = 1047.88 ms / 75 tokens ( 13.97 ms per token, 71.57 tokens per second)
llama_perf_context_print: eval time = 929.30 ms / 29 runs ( 32.04 ms per token, 31.21 tokens per second)
llama_perf_context_print: total time = 2395.44 ms / 104 tokens
llama_perf_context_print: graphs reused = 0