model: LFM2-VL fixes #17577
Conversation
```cpp
            }
        }
    }
} else if (mode == GGML_SCALE_MODE_BILINEAR && (mode_flags & GGML_SCALE_FLAG_ANTIALIAS)) {
```
I was debating whether to introduce a new scaling mode `GGML_SCALE_MODE_BILINEAR_ANTIALIAS` or a flag. I would like to hear your feedback on this.
Another question: where should the correctness tests be placed?
I think it's fine to introduce `GGML_SCALE_FLAG_ANTIALIAS` as a flag. Btw, I think we would need to explicitly `GGML_ASSERT` that antialias is only supported by bilinear (maybe add a TODO to implement it in other modes).
```cpp
    return std::max(1.0f - fabsf(x), 0.0f);
};

// support and invscale, maximum 1 pixel for bilinear
```
This should say "minimum" in the comment.
```cpp
const float y = ((float)i11_dst + pixel_offset) / sf1;
const float x = ((float)i10_dst + pixel_offset) / sf0;

// support and invscale, maximum 1 pixel for bilinear
```
This should say "minimum" in the comment.
```cpp
if (mode == GGML_SCALE_MODE_NEAREST) {
    upscale_f32_cuda(src0_d, dst_d, src0->nb[0], src0->nb[1], src0->nb[2], src0->nb[3],
                     dst->ne[0], dst->ne[1], dst->ne[2], dst->ne[3], sf0, sf1, sf2, sf3, stream);
} else if (mode == GGML_SCALE_MODE_BILINEAR) {
    bool antialias = (mode_flags & GGML_SCALE_FLAG_ANTIALIAS);
```
make const
```diff
  get_u32(KEY_PROJ_SCALE_FACTOR, hparams.n_merge, false);
  // ref: https://huggingface.co/LiquidAI/LFM2-VL-3B/blob/main/preprocessor_config.json
- hparams.set_limit_image_tokens(64, 256);
+ hparams.set_limit_image_tokens(64, 1024);
```
Add a comment that in the config, the number of tokens is after downsampling, while here it is before.
```cpp
};

// insert mtmd_default_marker() into given string, position depends on the projector
std::string mtmd_add_default_marker(mtmd_context *ctx, const std::string &str);
```
I'd prefer to remove this API because:
- It's not compatible with a pure-C ABI
- The ordering is actually controlled by users via the API. This function only changes `llama-mtmd-cli`, but makes no changes to `llama-server`.
A function like `mtmd_get_image_placement` could be a better solution. It returns one of these 3 values, which should cover all possible use cases:

```cpp
IMAGE_PLACEMENT_NONE,  // place images freely inside the message
IMAGE_PLACEMENT_BEGIN, // place images at the beginning of the message
IMAGE_PLACEMENT_END,   // place images at the end
```

But IMO this can better be a dedicated refactoring.
I'll remove it from this PR. For now, passing the placeholder directly in `-p "__media__>OCR."` achieves the same. But I wanted the placement to be correct by default.
My understanding is that for the server, the order follows the order of data in the request; for the CLI, it was using `content += mtmd_default_marker();`.
For CLI, it was done this way because you can actually add multiple images in CLI mode, something like this:
> This is the first step:
> /image step1.png
> Then the next step:
> /image step2.png
> What do you see?
In this case, we expect the image and text prompts to follow exactly the same order in the input.
ngxson left a comment:
I think we probably need to update `ggml_backend_*_supports_op` across backends, to avoid some backends falling back to the non-antialias kernel, which would produce wrong results. A backend that does not support this mode will then fall back to CPU.
On my end (CPU) the outputs of fp32 and bf16 450M looked good, tested on a variety of small images (< 16/32px on one side). Also checked a few personally tuned 1.6B Q3 quants (which should be more sensitive) and the outputs were great: it didn't go into a repetitive "breaking" state like before! It couldn't have been easy to figure this issue out... Thank you guys for looking into this!
Debugging of #17290 revealed multiple issues with LFM2-VL.
This PR fixes the following issues and makes the output of `llama.cpp` equivalent to PyTorch. It uses `round_by_factor` to calculate the target `width` and `height` in "smart resize".

The central issue was the resizing of positional embeddings. The Siglip2 implementation in PyTorch uses `F.interpolate(..., mode="bilinear", align_corners=False, antialias=True)`. Antialiasing only contributes during downscaling. When the image width or height is less than 256, the scaling of positional embeddings in `llama.cpp` produced numerically different results from PyTorch.

A new flag, `GGML_SCALE_FLAG_ANTIALIAS`, has been added for the upscale function, with implementations for CPU and CUDA.

Now outputs match:
PyTorch (fp32)
this PR (`bin/llama-mtmd-cli -m $CKPT/LFM2-VL-1.6B-F32.gguf --mmproj $CKPT/mmproj-LFM2-VL-1.6B-F32.gguf -n 64 -t 4 --image /data/playground/issue_17290/siglip_1024.png -p "OCR." --temp 0.0 --top-k 1`)