
Conversation

@tdakhran
Contributor

Debugging of #17290 revealed multiple issues with LFM2-VL.

This PR fixes the following issues and makes the output of llama.cpp equivalent to PyTorch:

  • surround image embeddings with <|image_start|> and <|image_end|> tokens
  • use round_by_factor to calculate the target width and height in "smart resize"
  • stretch the image to the width and height calculated by smart resize instead of padding it
  • place image embeddings before user prompt
  • resize positional embedding with antialiasing enabled
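A hedged sketch of the smart-resize step described above, assuming a Qwen2-VL-style scheme where `round_by_factor` snaps dimensions to a multiple of the patch size and the pixel count is clamped to a budget; the names and clamping details are assumptions, not the exact llama.cpp code:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Round to the nearest multiple of `factor` (assumed behavior of round_by_factor).
static int round_by_factor(double value, int factor) {
    return std::max(factor, (int) std::round(value / factor) * factor);
}

// Sketch of "smart resize": pick a target size that is a multiple of the
// patch size and keeps the pixel count within [min_pixels, max_pixels].
// The image is then stretched (not padded) to exactly this size.
static void smart_resize(int w, int h, int factor, long min_pixels, long max_pixels,
                         int & out_w, int & out_h) {
    out_w = round_by_factor(w, factor);
    out_h = round_by_factor(h, factor);
    if ((long) out_w * out_h > max_pixels) {
        double scale = std::sqrt((double) w * h / (double) max_pixels);
        out_w = round_by_factor(w / scale, factor);
        out_h = round_by_factor(h / scale, factor);
    } else if ((long) out_w * out_h < min_pixels) {
        double scale = std::sqrt((double) min_pixels / ((double) w * h));
        out_w = round_by_factor(w * scale, factor);
        out_h = round_by_factor(h * scale, factor);
    }
}
```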

The central issue was the resizing of positional embeddings. The Siglip2 implementation in PyTorch uses F.interpolate(..., mode="bilinear", align_corners=False, antialias=True). Antialiasing only contributes during downscaling, so when the image width or height is less than 256, the scaling of positional embeddings in llama.cpp produced numerically different results from PyTorch.
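The downscale-only effect of antialiasing can be sketched with a 1-D tent filter, mirroring how PyTorch widens the filter support by the inverse scale factor when downscaling; this is a sketch under assumptions, not ggml's exact kernel:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Triangle (tent) kernel used by bilinear interpolation.
static float triangle(float x) {
    return std::max(1.0f - std::fabs(x), 0.0f);
}

// Sketch of how antialiasing changes the bilinear filter weight for a source
// pixel at distance dx (in source pixels) from the sample point: when
// downscaling (scale < 1) the kernel support grows to 1/scale source pixels,
// so several source pixels are averaged; when upscaling the support stays at
// 1 pixel and the result is identical to plain bilinear.
static float bilinear_weight(float dx, float scale, bool antialias) {
    // filter support and inverse-scale factor: minimum 1 pixel for bilinear
    const float invscale = (antialias && scale < 1.0f) ? scale : 1.0f;
    // the invscale factor keeps the weights roughly normalized; real
    // implementations divide by the exact sum of weights afterwards
    return triangle(dx * invscale) * invscale;
}
```

With `antialias=true` and a 2x downscale, a source pixel one step away still contributes (weight 0.25), whereas plain bilinear ignores it, which is exactly the numeric divergence seen for images under 256 px.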

A new flag, GGML_SCALE_FLAG_ANTIALIAS, has been added for the upscale function, with implementations for CPU and CUDA.
Now the outputs match. For the test image siglip_1024.png:

PyTorch (fp32):

> For the vision tower, LF2-M2-VL uses Sigilip2 NaFlex encoders to convert input images into token sequences. Two variants are implemented:

this PR (`bin/llama-mtmd-cli -m $CKPT/LFM2-VL-1.6B-F32.gguf --mmproj $CKPT/mmproj-LFM2-VL-1.6B-F32.gguf -n 64 -t 4 --image /data/playground/issue_17290/siglip_1024.png -p "OCR." --temp 0.0 --top-k 1`):

> For the vision tower, LF2-M2-VL uses Sigilip2 NaFlex encoders to convert input images into token sequences. Two variants are implemented:

}
}
}
} else if (mode == GGML_SCALE_MODE_BILINEAR && (mode_flags & GGML_SCALE_FLAG_ANTIALIAS)) {
Contributor Author

@tdakhran tdakhran Nov 28, 2025

I was debating whether to introduce a new scaling mode, GGML_SCALE_MODE_BILINEAR_ANTIALIAS, or a flag; I would like to hear your feedback on this.
Another question: where should the correctness tests be placed?

Collaborator

I think it's fine to introduce this GGML_SCALE_FLAG_ANTIALIAS as a flag. Btw, I think we would need an explicit GGML_ASSERT that antialias is only supported with bilinear (maybe add a TODO to implement it in the other modes).
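The suggested guard could be sketched as follows; the enum and flag names below are stand-ins for ggml's actual definitions, and plain C++ stands in for the GGML_ASSERT macro:

```cpp
#include <cassert>

// Stand-ins for ggml's scale mode enum and flag bits (assumed layout).
enum scale_mode { SCALE_MODE_NEAREST, SCALE_MODE_BILINEAR };
const unsigned SCALE_FLAG_ANTIALIAS = 1u << 8;

// Sketch: antialias is only valid together with bilinear; every other mode
// should be rejected (TODO in the real code: implement it for other modes).
static bool scale_flags_valid(scale_mode mode, unsigned mode_flags) {
    if (mode_flags & SCALE_FLAG_ANTIALIAS) {
        return mode == SCALE_MODE_BILINEAR;
    }
    return true;
}
```

In ggml itself this check would be a GGML_ASSERT at graph-build time, so an unsupported combination fails loudly instead of silently producing wrong pixels.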

return std::max(1.0f - fabsf(x), 0.0f);
};

// support and invscale, maximum 1 pixel for bilinear
Contributor Author

"maximum" shall be "minimum" in the comment

const float y = ((float)i11_dst + pixel_offset) / sf1;
const float x = ((float)i10_dst + pixel_offset) / sf0;

// support and invscale, maximum 1 pixel for bilinear
Contributor Author

"maximum" shall be "minimum" in the comment

if (mode == GGML_SCALE_MODE_NEAREST) {
upscale_f32_cuda(src0_d, dst_d, src0->nb[0], src0->nb[1], src0->nb[2], src0->nb[3], dst->ne[0], dst->ne[1], dst->ne[2], dst->ne[3], sf0, sf1, sf2, sf3, stream);
} else if (mode == GGML_SCALE_MODE_BILINEAR) {
bool antialias = (mode_flags & GGML_SCALE_FLAG_ANTIALIAS);
Contributor Author

make const

get_u32(KEY_PROJ_SCALE_FACTOR, hparams.n_merge, false);
// ref: https://huggingface.co/LiquidAI/LFM2-VL-3B/blob/main/preprocessor_config.json
hparams.set_limit_image_tokens(64, 256);
hparams.set_limit_image_tokens(64, 1024);
Contributor Author

Add a comment that in the config, the number of tokens is after downsampling, while here it is before.
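A hedged sketch of the relationship that comment would describe, assuming n_merge is the factor of the downsampling projector (so the config's post-downsampling token counts are 1/n_merge² of the pre-downsampling counts clamped here):

```cpp
#include <cassert>

// Sketch: the preprocessor config specifies image-token limits *after* the
// n_merge x n_merge downsampling projector, while the clamp in clip.cpp
// applies to the number of vision-tower patches *before* downsampling.
static int tokens_before_downsampling(int tokens_after, int n_merge) {
    return tokens_after * n_merge * n_merge;
}
```

With n_merge = 2, a config limit of 256 tokens after downsampling corresponds to the 1024 pre-downsampling limit used above.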

};

// insert mtmd_default_marker() into given string, position depends on the projector
std::string mtmd_add_default_marker(mtmd_context *ctx, const std::string &str);
Collaborator

I'd prefer to remove this API because:

  • It's not compatible with a pure-C ABI
  • The ordering is actually controlled by users via the API. This function only changes llama-mtmd-cli, but makes no changes to llama-server

Collaborator

A function like mtmd_get_image_placement could be a better solution; it returns one of these 3 values, which should cover all possible use cases:

IMAGE_PLACEMENT_NONE, // place images freely inside the message
IMAGE_PLACEMENT_BEGIN, // place images at the beginning of the message
IMAGE_PLACEMENT_END, // place images at the end of the message

But IMO this is better done as a dedicated refactoring
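The proposed pure-C API could look roughly like this; the names follow the suggestion above and nothing here is merged code:

```cpp
#include <cassert>

// Sketch of the proposed pure-C API: a projector-dependent hint telling the
// caller where image embeddings must be placed relative to the text prompt.
enum mtmd_image_placement {
    MTMD_IMAGE_PLACEMENT_NONE,  // images may be placed freely inside the message
    MTMD_IMAGE_PLACEMENT_BEGIN, // images go at the beginning of the message
    MTMD_IMAGE_PLACEMENT_END,   // images go at the end of the message
};

// Hypothetical accessor: a real version would read the projector type from
// the mtmd context; here a stub illustrates the LFM2-VL case (images first).
static mtmd_image_placement get_placement_for_lfm2_vl() {
    return MTMD_IMAGE_PLACEMENT_BEGIN;
}
```

Being a plain enum-returning function, this stays compatible with the pure-C ABI, and both llama-mtmd-cli and llama-server could query it instead of hard-coding marker positions.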

Contributor Author

I'll remove it from this PR. For now, passing a placeholder directly in -p "__media__>OCR." achieves the same. But I wanted the placement to be correct by default.

Contributor Author

My understanding is that for the server, the order follows the order of the data in the request; the CLI was using content += mtmd_default_marker();.

Collaborator

For CLI, it was done this way because you can actually add multiple images in CLI mode, something like this:

> This is the first step:
> /image step1.png
> Then the next step:
> /image step2.png
> What do you see?

In this case, we expect the image and text prompts to follow exactly the same order in the input.

@github-actions bot added labels on Nov 28, 2025: testing (Everything test related), Nvidia GPU (Issues specific to Nvidia GPUs), examples, ggml (changes relating to the ggml tensor library for machine learning)
Collaborator

@ngxson ngxson left a comment

I think we probably need to update ggml_backend_*_supports_op across backends, to avoid a backend silently running the non-antialias kernel, which would produce wrong results.

For backends that do not support this mode, the op will fall back to CPU
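The fallback mechanism discussed here could be sketched like this; the function name and constants are hypothetical stand-ins, not llama.cpp's actual backend API:

```cpp
#include <cassert>

// Stand-ins for ggml's scale mode enum and flag bits (assumed layout).
enum scale_mode { SCALE_MODE_NEAREST, SCALE_MODE_BILINEAR };
const unsigned SCALE_FLAG_ANTIALIAS = 1u << 8;

// Sketch: a backend that has no antialias kernel must report the op as
// unsupported, so the scheduler falls back to the CPU implementation
// instead of silently running the non-antialias kernel and producing
// numerically wrong output.
static bool backend_supports_upscale(scale_mode mode, unsigned flags,
                                     bool has_antialias_kernel) {
    if (flags & SCALE_FLAG_ANTIALIAS) {
        return mode == SCALE_MODE_BILINEAR && has_antialias_kernel;
    }
    return true; // plain nearest/bilinear are assumed supported everywhere
}
```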

@SmartestWashingMachine

SmartestWashingMachine commented Nov 28, 2025

On my end (CPU) the outputs of fp32 and bf16 450M looked good, tested on a variety of small images (< 16/32px one side).

Also checked a few personal tuned 1.6B Q3s (which should be more sensitive) and the outputs were great - it didn't go into a repetitive "breaking" state like before!

It couldn't have been easy to figure this issue out... Thank you guys for looking into this!
