
Conversation

@tdakhran
Contributor

Debugging of #17290 revealed multiple issues with LFM2-VL.

This PR fixes the following issues and makes the output of llama.cpp equivalent to PyTorch:

  • surround image embeddings with <|image_start|> and <|image_end|> tokens
  • use round_by_factor to calculate the target width and height in "smart resize"
  • stretch the image to the width and height calculated by smart resize instead of padding it
  • place image embeddings before user prompt
  • resize positional embedding with antialiasing enabled
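A hedged sketch of the smart-resize step described above, assuming a Qwen2-VL-style scheme where `round_by_factor` snaps dimensions to a multiple of the patch size and the pixel count is clamped to a budget; the names and clamping details are assumptions, not the exact llama.cpp code:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Round to the nearest multiple of `factor` (assumed behavior of round_by_factor).
static int round_by_factor(double value, int factor) {
    return std::max(factor, (int) std::round(value / factor) * factor);
}

// Sketch of "smart resize": pick a target size that is a multiple of the
// patch size and keeps the pixel count within [min_pixels, max_pixels].
// The image is then stretched (not padded) to exactly this size.
static void smart_resize(int w, int h, int factor, long min_pixels, long max_pixels,
                         int & out_w, int & out_h) {
    out_w = round_by_factor(w, factor);
    out_h = round_by_factor(h, factor);
    if ((long) out_w * out_h > max_pixels) {
        double scale = std::sqrt((double) w * h / (double) max_pixels);
        out_w = round_by_factor(w / scale, factor);
        out_h = round_by_factor(h / scale, factor);
    } else if ((long) out_w * out_h < min_pixels) {
        double scale = std::sqrt((double) min_pixels / ((double) w * h));
        out_w = round_by_factor(w * scale, factor);
        out_h = round_by_factor(h * scale, factor);
    }
}
```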

The central issue was the resizing of positional embeddings. The Siglip2 implementation in PyTorch uses F.interpolate(..., mode="bilinear", align_corners=False, antialias=True). Antialiasing only contributes during downscaling, so when the image width or height is less than 256, the scaling of positional embeddings in llama.cpp produced numerically different results from PyTorch.
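The downscale-only effect of antialiasing can be sketched with a 1-D tent filter, mirroring how PyTorch widens the filter support by the inverse scale factor when downscaling; this is a sketch under assumptions, not ggml's exact kernel:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Triangle (tent) kernel used by bilinear interpolation.
static float triangle(float x) {
    return std::max(1.0f - std::fabs(x), 0.0f);
}

// Sketch of how antialiasing changes the bilinear filter weight for a source
// pixel at distance dx (in source pixels) from the sample point: when
// downscaling (scale < 1) the kernel support grows to 1/scale source pixels,
// so several source pixels are averaged; when upscaling the support stays at
// 1 pixel and the result is identical to plain bilinear.
static float bilinear_weight(float dx, float scale, bool antialias) {
    // filter support and inverse-scale factor: minimum 1 pixel for bilinear
    const float invscale = (antialias && scale < 1.0f) ? scale : 1.0f;
    // the invscale factor keeps the weights roughly normalized; real
    // implementations divide by the exact sum of weights afterwards
    return triangle(dx * invscale) * invscale;
}
```

With `antialias=true` and a 2x downscale, a source pixel one step away still contributes (weight 0.25), whereas plain bilinear ignores it, which is exactly the numeric divergence seen for images under 256 px.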

A new flag, GGML_SCALE_FLAG_ANTIALIAS, has been added for the upscale function, with implementations for CPU and CUDA.
Now the outputs match. For the test image siglip_1024.png:

PyTorch (fp32):

> For the vision tower, LF2-M2-VL uses Sigilip2 NaFlex encoders to convert input images into token sequences. Two variants are implemented:

this PR (`bin/llama-mtmd-cli -m $CKPT/LFM2-VL-1.6B-F32.gguf --mmproj $CKPT/mmproj-LFM2-VL-1.6B-F32.gguf -n 64 -t 4 --image /data/playground/issue_17290/siglip_1024.png -p "OCR." --temp 0.0 --top-k 1`):

> For the vision tower, LF2-M2-VL uses Sigilip2 NaFlex encoders to convert input images into token sequences. Two variants are implemented:

}
}
}
} else if (mode == GGML_SCALE_MODE_BILINEAR && (mode_flags & GGML_SCALE_FLAG_ANTIALIAS)) {
Contributor Author

@tdakhran tdakhran Nov 28, 2025

I was debating whether to introduce a new scaling mode, GGML_SCALE_MODE_BILINEAR_ANTIALIAS, or a flag; I would like to hear your feedback on this.
Another question: where should the correctness tests be placed?

Collaborator

I think it's fine to introduce this GGML_SCALE_FLAG_ANTIALIAS as a flag. Btw, I think we would need an explicit GGML_ASSERT that antialias is only supported with bilinear (maybe add a TODO to implement it in the other modes).
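The suggested guard could be sketched as follows; the enum and flag names below are stand-ins for ggml's actual definitions, and plain C++ stands in for the GGML_ASSERT macro:

```cpp
#include <cassert>

// Stand-ins for ggml's scale mode enum and flag bits (assumed layout).
enum scale_mode { SCALE_MODE_NEAREST, SCALE_MODE_BILINEAR };
const unsigned SCALE_FLAG_ANTIALIAS = 1u << 8;

// Sketch: antialias is only valid together with bilinear; every other mode
// should be rejected (TODO in the real code: implement it for other modes).
static bool scale_flags_valid(scale_mode mode, unsigned mode_flags) {
    if (mode_flags & SCALE_FLAG_ANTIALIAS) {
        return mode == SCALE_MODE_BILINEAR;
    }
    return true;
}
```

In ggml itself this check would be a GGML_ASSERT at graph-build time, so an unsupported combination fails loudly instead of silently producing wrong pixels.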

return std::max(1.0f - fabsf(x), 0.0f);
};

// support and invscale, maximum 1 pixel for bilinear
Contributor Author

"maximum" shall be "minimum" in the comment

const float y = ((float)i11_dst + pixel_offset) / sf1;
const float x = ((float)i10_dst + pixel_offset) / sf0;

// support and invscale, maximum 1 pixel for bilinear
Contributor Author

"maximum" shall be "minimum" in the comment

if (mode == GGML_SCALE_MODE_NEAREST) {
upscale_f32_cuda(src0_d, dst_d, src0->nb[0], src0->nb[1], src0->nb[2], src0->nb[3], dst->ne[0], dst->ne[1], dst->ne[2], dst->ne[3], sf0, sf1, sf2, sf3, stream);
} else if (mode == GGML_SCALE_MODE_BILINEAR) {
bool antialias = (mode_flags & GGML_SCALE_FLAG_ANTIALIAS);
Contributor Author

make const

get_u32(KEY_PROJ_SCALE_FACTOR, hparams.n_merge, false);
// ref: https://huggingface.co/LiquidAI/LFM2-VL-3B/blob/main/preprocessor_config.json
hparams.set_limit_image_tokens(64, 256);
hparams.set_limit_image_tokens(64, 1024);
Contributor Author

Add a comment that in the config, the number of tokens is after downsampling, while here it is before.
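A hedged sketch of the relationship that comment would describe, assuming n_merge is the factor of the downsampling projector (so the config's post-downsampling token counts are 1/n_merge² of the pre-downsampling counts clamped here):

```cpp
#include <cassert>

// Sketch: the preprocessor config specifies image-token limits *after* the
// n_merge x n_merge downsampling projector, while the clamp in clip.cpp
// applies to the number of vision-tower patches *before* downsampling.
static int tokens_before_downsampling(int tokens_after, int n_merge) {
    return tokens_after * n_merge * n_merge;
}
```

With n_merge = 2, a config limit of 256 tokens after downsampling corresponds to the 1024 pre-downsampling limit used above.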

};

// insert mtmd_default_marker() into given string, position depends on the projector
std::string mtmd_add_default_marker(mtmd_context *ctx, const std::string &str);
Collaborator

I'd prefer to remove this API because:

  • It's not compatible with a pure-C ABI
  • The ordering is actually controlled by users via the API. This function only changes llama-mtmd-cli, but makes no changes to llama-server

Collaborator

A function like mtmd_get_image_placement could be a better solution; it returns one of these 3 values, which should cover all possible use cases:

IMAGE_PLACEMENT_NONE, // place images freely inside the message
IMAGE_PLACEMENT_BEGIN, // place images at the beginning of the message
IMAGE_PLACEMENT_END, // place images at the end of the message

But IMO this is better done as a dedicated refactoring
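The proposed pure-C API could look roughly like this; the names follow the suggestion above and nothing here is merged code:

```cpp
#include <cassert>

// Sketch of the proposed pure-C API: a projector-dependent hint telling the
// caller where image embeddings must be placed relative to the text prompt.
enum mtmd_image_placement {
    MTMD_IMAGE_PLACEMENT_NONE,  // images may be placed freely inside the message
    MTMD_IMAGE_PLACEMENT_BEGIN, // images go at the beginning of the message
    MTMD_IMAGE_PLACEMENT_END,   // images go at the end of the message
};

// Hypothetical accessor: a real version would read the projector type from
// the mtmd context; here a stub illustrates the LFM2-VL case (images first).
static mtmd_image_placement get_placement_for_lfm2_vl() {
    return MTMD_IMAGE_PLACEMENT_BEGIN;
}
```

Being a plain enum-returning function, this stays compatible with the pure-C ABI, and both llama-mtmd-cli and llama-server could query it instead of hard-coding marker positions.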

Contributor Author

I'll remove it from this PR. For now, passing a placeholder directly in -p "__media__>OCR." achieves the same. But I wanted the placement to be correct by default.

Contributor Author

My understanding is that for the server, the order follows the order of the data in the request; the CLI was using content += mtmd_default_marker();.

Collaborator

For CLI, it was done this way because you can actually add multiple images in CLI mode, something like this:

> This is the first step:
> /image step1.png
> Then the next step:
> /image step2.png
> What do you see?

In this case, we expect the image and text prompts to follow exactly the same order in the input.

@github-actions bot added labels on Nov 28, 2025: testing (Everything test related), Nvidia GPU (Issues specific to Nvidia GPUs), examples, ggml (changes relating to the ggml tensor library for machine learning)
Collaborator

@ngxson ngxson left a comment

I think we probably need to update ggml_backend_*_supports_op across backends, to avoid a backend silently running the non-antialias kernel, which would produce wrong results.

For backends that do not support this mode, the op will fall back to CPU
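The fallback mechanism discussed here could be sketched like this; the function name and constants are hypothetical stand-ins, not llama.cpp's actual backend API:

```cpp
#include <cassert>

// Stand-ins for ggml's scale mode enum and flag bits (assumed layout).
enum scale_mode { SCALE_MODE_NEAREST, SCALE_MODE_BILINEAR };
const unsigned SCALE_FLAG_ANTIALIAS = 1u << 8;

// Sketch: a backend that has no antialias kernel must report the op as
// unsupported, so the scheduler falls back to the CPU implementation
// instead of silently running the non-antialias kernel and producing
// numerically wrong output.
static bool backend_supports_upscale(scale_mode mode, unsigned flags,
                                     bool has_antialias_kernel) {
    if (flags & SCALE_FLAG_ANTIALIAS) {
        return mode == SCALE_MODE_BILINEAR && has_antialias_kernel;
    }
    return true; // plain nearest/bilinear are assumed supported everywhere
}
```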

@SmartestWashingMachine

SmartestWashingMachine commented Nov 28, 2025

On my end (CPU) the outputs of fp32 and bf16 450M looked good, tested on a variety of small images (< 16/32px one side).

Also checked a few personal tuned 1.6B Q3s (which should be more sensitive) and the outputs were great - it didn't go into a repetitive "breaking" state like before!

It couldn't have been easy to figure this issue out... Thank you guys for looking into this!
