Description
Name and Version
load_backend: loaded RPC backend from C:\Users\metal\Downloads\llama-b7066-bin-win-cpu-x64\ggml-rpc.dll
load_backend: loaded CPU backend from C:\Users\metal\Downloads\llama-b7066-bin-win-cpu-x64\ggml-cpu-alderlake.dll
version: 7066 (1568d13)
built with clang version 19.1.5 for x86_64-pc-windows-msvc
Operating systems
Windows
GGML backends
CPU, CUDA
Hardware
Intel i5-12500H + RTX 3050M
Models
LFM2-VL-1.6B-Q8_0.gguf
mmproj-LFM2-VL-1.6B-Q8_0.gguf
Problem description & steps to reproduce
When running LFM2-VL-1.6B on an image where one side is 1024 px, the outputs differ vastly from the original PyTorch model's. Not just different, but totally erroneous.
Example
llama-mtmd-cli.exe -m LFM2-VL-1.6B-Q8_0.gguf --mmproj mmproj-LFM2-VL-1.6B-Q8_0.gguf -n 30 --image onefoul_512.png -p "OCR" -ngl 99 -t 0.0 -s 72
Outputs:
Guis mod est que motor modi modi doluplis deison et que que nosen-
That's the 512x64 image. Okay results so far. Now here's the 1024x64 image:
llama-mtmd-cli.exe -m LFM2-VL-1.6B-Q8_0.gguf --mmproj mmproj-LFM2-VL-1.6B-Q8_0.gguf -n 30 --image onefoul_1024.png -p "OCR." -ngl 99 -t 0.0 -s 72
Outputs:
Il semble que vous ayez une phrase ou un texte qui a été déformé ou distordu par une distorsion d'image ou une manipulation ("It seems you have a phrase or text that has been deformed or distorted by an image distortion or a manipulation")
Now if we use the original PyTorch model on the 1024x64 image, we get this output:
Cius modus est que modi dolipsis desion et que que nonsens
Much more relevant to the input image, unlike the GGUF.
Other odd behavior shows up on images with a 1024 px side as well. I'm currently fine-tuning OCR variants of LFM2-VL, and I've noticed that even the unquantized GGUFs give broken responses for other 1024 px images.
It's as if the model isn't "seeing" the image properly anymore. If I understand correctly, LFM2-VL has some sort of tiled preprocessing and a global thumbnail for large images. Is all that logic fully implemented in llama.cpp for LFM2-VL?
Using GPU and/or disabling flash attention seems to make no difference. Images at 512 pixels or under have no issue.
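As a side note, the token counts in the two runs below (n_tokens_batch = 69 for the 512x64 image, 64 for the 1024x64 image) are consistent with a simple pixel-budget resize derived from the logged hparams (patch_size = 16, n_merge = 2, image_min_pixels = 65536, image_max_pixels = 262144). This is only a sketch of that arithmetic, not llama.cpp's actual preprocessing; the exact rounding and any tiling in its clip implementation may differ:

```python
import math

# Hyperparameters taken from the clip_model_loader / load_hparams log below.
PATCH = 16           # patch_size
N_MERGE = 2          # 2x2 patch merging -> one token per 32x32 px block
MIN_PIXELS = 65536   # image_min_pixels (= 256 * 256)
MAX_PIXELS = 262144  # image_max_pixels (= 512 * 512)
ALIGN = PATCH * N_MERGE  # assume dimensions snap to 32 px multiples

def vision_tokens(w: int, h: int) -> tuple[int, int, int]:
    """Scale (w, h) into the pixel budget, snap to ALIGN, count tokens.

    Sketch of the common "smart resize" pattern; the rounding mode
    llama.cpp actually uses is an assumption here.
    """
    area = w * h
    scale = 1.0
    if area < MIN_PIXELS:
        scale = math.sqrt(MIN_PIXELS / area)   # upscale small images
    elif area > MAX_PIXELS:
        scale = math.sqrt(MAX_PIXELS / area)   # downscale large images
    w2 = max(ALIGN, round(w * scale / ALIGN) * ALIGN)
    h2 = max(ALIGN, round(h * scale / ALIGN) * ALIGN)
    return w2, h2, (w2 // ALIGN) * (h2 // ALIGN)

print(vision_tokens(512, 64))    # -> (736, 96, 69):  matches n_tokens_batch = 69
print(vision_tokens(1024, 64))   # -> (1024, 64, 64): matches n_tokens_batch = 64
```

Note that 1024 x 64 = 65536 lands exactly on image_min_pixels, so under this arithmetic the image passes through unscaled, while the 512 x 64 image is upscaled by sqrt(2) to meet the minimum.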
I also tested build b6195, shortly after LFM2-VL support was introduced; sadly, similar behavior can be observed there too.
First Bad Commit
No response
Relevant log output
llama-mtmd-cli.exe -m LFM2-VL-1.6B-Q8_0.gguf --mmproj mmproj-LFM2-VL-1.6B-Q8_0.gguf -n 30 --image onefoul_1024.png -p "OCR." -ngl 99 -t 0.0 -s 72
...
clip_model_loader: model name: LFM2 VL 1.6B
clip_model_loader: description:
clip_model_loader: GGUF version: 3
clip_model_loader: alignment: 32
clip_model_loader: n_tensors: 427
clip_model_loader: n_kv: 27
clip_model_loader: has vision encoder
clip_ctx: CLIP using CPU backend
load_hparams: projector: lfm2
load_hparams: n_embd: 1152
load_hparams: n_head: 16
load_hparams: n_ff: 4304
load_hparams: n_layer: 26
load_hparams: ffn_op: gelu
load_hparams: projection_dim: 2048
--- vision hparams ---
load_hparams: image_size: 256
load_hparams: patch_size: 16
load_hparams: has_llava_proj: 0
load_hparams: minicpmv_version: 0
load_hparams: n_merge: 2
load_hparams: n_wa_pattern: 0
load_hparams: image_min_pixels: 65536
load_hparams: image_max_pixels: 262144
load_hparams: model size: 537.96 MiB
load_hparams: metadata size: 0.15 MiB
alloc_compute_meta: warmup with image size = 512 x 512
alloc_compute_meta: CPU compute buffer size = 30.31 MiB
alloc_compute_meta: graph splits = 1, nodes = 840
warmup: flash attention is enabled
main: loading model: LFM2-VL-1.6B-Q8_0.gguf
encoding image slice...
image slice encoded in 666 ms
decoding image batch 1/1, n_tokens_batch = 69
image decoded (batch 1/1) in 310 ms
"Guis mod est que motor modi modi doluplis deison et que que nosen-"
llama_perf_context_print: load time = 318.87 ms
llama_perf_context_print: prompt eval time = 1076.26 ms / 80 tokens ( 13.45 ms per token, 74.33 tokens per second)
llama_perf_context_print: eval time = 935.66 ms / 24 runs ( 38.99 ms per token, 25.65 tokens per second)
llama_perf_context_print: total time = 2403.07 ms / 104 tokens
llama_perf_context_print: graphs reused = 0
llama-mtmd-cli.exe -m LFM2-VL-1.6B-Q8_0.gguf --mmproj mmproj-LFM2-VL-1.6B-Q8_0.gguf -n 30 --image onefoul_1024.png -p "OCR" -ngl 99 -t 0.0 -s 72
...
clip_model_loader: model name: LFM2 VL 1.6B
clip_model_loader: description:
clip_model_loader: GGUF version: 3
clip_model_loader: alignment: 32
clip_model_loader: n_tensors: 427
clip_model_loader: n_kv: 27
clip_model_loader: has vision encoder
clip_ctx: CLIP using CPU backend
load_hparams: projector: lfm2
load_hparams: n_embd: 1152
load_hparams: n_head: 16
load_hparams: n_ff: 4304
load_hparams: n_layer: 26
load_hparams: ffn_op: gelu
load_hparams: projection_dim: 2048
--- vision hparams ---
load_hparams: image_size: 256
load_hparams: patch_size: 16
load_hparams: has_llava_proj: 0
load_hparams: minicpmv_version: 0
load_hparams: n_merge: 2
load_hparams: n_wa_pattern: 0
load_hparams: image_min_pixels: 65536
load_hparams: image_max_pixels: 262144
load_hparams: model size: 537.96 MiB
load_hparams: metadata size: 0.15 MiB
alloc_compute_meta: warmup with image size = 512 x 512
alloc_compute_meta: CPU compute buffer size = 30.31 MiB
alloc_compute_meta: graph splits = 1, nodes = 840
warmup: flash attention is enabled
main: loading model: LFM2-VL-1.6B-Q8_0.gguf
encoding image slice...
image slice encoded in 621 ms
decoding image batch 1/1, n_tokens_batch = 64
image decoded (batch 1/1) in 346 ms
Il semble que vous ayez une phrase ou un texte qui a été inversé ou décalé. Pourriez-vous me donner plus de contexte
llama_perf_context_print: load time = 316.52 ms
llama_perf_context_print: prompt eval time = 1047.88 ms / 75 tokens ( 13.97 ms per token, 71.57 tokens per second)
llama_perf_context_print: eval time = 929.30 ms / 29 runs ( 32.04 ms per token, 31.21 tokens per second)
llama_perf_context_print: total time = 2395.44 ms / 104 tokens
llama_perf_context_print: graphs reused = 0