UPSTREAM PR #17577: model: LFM2-VL fixes #350
Conversation
Explore the complete analysis inside the Version Insights.

**Performance Analysis Summary: PR #350 - LFM2-VL Antialiasing Implementation**

**Overview**

PR #350 implements antialiased bilinear interpolation for LFM2-VL vision model positional embedding resizing to match PyTorch numerical output. The changes add 165 lines across 8 files, introducing a new …

**Key Findings**

Performance-Critical Function Impact:

- libggml-cpu.so upscale operator (ops.cpp:7467)
- libggml-cpu.so second upscale operator (ops.cpp:7526)

**Inference Impact**

Tokens per Second: no impact. The core inference functions (llama_decode, llama_encode, llama_tokenize) show no changes in response time or throughput. The upscale operations affected by this PR execute during vision encoder preprocessing for positional embedding resizing, not during token generation. Based on the reference that a 2 ms slower llama_decode results in 7% fewer tokens per second, the 87 ns increase in upscale operations represents about 0.004% of that threshold and has negligible impact on inference throughput. The changes affect vision model preprocessing only, specifically when processing images smaller than 256 pixels, where positional embeddings require downscaling. Token generation speed remains unchanged.

**Power Consumption Analysis**

- libggml-cpu.so: power consumption increased by 1.27% (+1,464 nJ)
- libmtmd.so: power consumption increased by 1.05% (+1,361 nJ)

All other binaries (libggml-base.so, llama-bench, llama-run, libllama.so, etc.) show zero power consumption change.

**Implementation Context**

The performance regression is a correctness-focused trade-off. The antialiased implementation matches PyTorch's …
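As a quick sanity check on the figures quoted above (the 2 ms / 7% reference and the 87 ns regression are taken from the analysis; the script itself is just an illustrative calculation):

```python
# Compare the measured upscale regression against the cited
# llama_decode slowdown threshold. Values come from the analysis above.
threshold_ns = 2e-3 * 1e9   # 2 ms llama_decode slowdown, in nanoseconds
regression_ns = 87          # measured upscale-operator regression
fraction = regression_ns / threshold_ns
print(f"{fraction:.6f}")    # ~0.00004, i.e. roughly 0.004% of the threshold
```

This confirms the stated ratio: 87 ns is about four hundred-thousandths of the 2 ms threshold, so no measurable tokens-per-second effect is expected.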
Force-pushed from e4a4e1d to d0b408b.
Force-pushed from 50ba22e to 2385ecf.
Explore the complete analysis inside the Version Insights.

**Performance Analysis Summary: PR #350**

PR Context: Introduces antialiasing support for bilinear interpolation in vision model processing (LFM2-VL fixes). Changes span 12 files with 162 additions and 13 deletions.

**Overview**

This PR adds a new …

**Key Findings**

Most-Impacted Functions

1. Upscale Lambda Functions (libggml-cpu.so)

The primary regression occurs in … The regression stems from the new antialiasing code path, which replaces simple 4-point bilinear sampling with variable-range triangle-filtered sampling. The new implementation adds 59 lines of code, including nested loops with dynamic range calculation, triangle-filter weight computation, and weighted pixel accumulation. For downscaling operations, the sampling region expands from a fixed 2x2 grid to a variable NxN grid based on the scale factor, resulting in an 8-10x instruction count increase per output pixel.

2. STL Container Accessors (libmtmd.so)

These changes reflect compiler inlining inconsistencies unrelated to the PR's functional changes. The …

3. Miniaudio I/O Functions (libmtmd.so)

Multiple I/O callback functions show 17-22 ns response time increases: … These regressions appear to stem from enhanced validation or error handling in the VFS abstraction layer, with 80-92% of execution time spent in the functions' own logic rather than in actual I/O operations.

**Inference Performance Impact**

Tokens per Second: no impact expected. The affected functions (…

**Power Consumption Analysis**

Impacted Binaries: … Combined power regression across both binaries totals 2,716 nJ, representing a 1.1% increase in cumulative execution energy for vision and audio processing operations.

**Code Change Analysis**

The antialiasing implementation is functionally correct and achieves the stated goal of matching PyTorch's …

Backend support is implemented for CPU and CUDA, while Metal, SYCL, OpenCL, Vulkan, and CANN explicitly reject operations with the antialias flag, forcing CPU fallback. The CUDA implementation mirrors the CPU algorithm with one thread per output pixel. Additional changes include the smart resize calculation switching from …

The performance regression is acceptable given the operation's infrequency (once per image during preprocessing) and the correctness requirement for vision model numerical accuracy.
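The variable-range, triangle-filtered sampling described above can be illustrated in one dimension. This is a hypothetical Python sketch of the general technique, not the actual ops.cpp code; the helper name `antialiased_resize_1d` and the border-clamping choice are assumptions for illustration:

```python
import math

def antialiased_resize_1d(src, dst_len):
    """Illustrative triangle-filtered 1-D resize (hypothetical helper,
    not the ops.cpp implementation). When downscaling, the filter
    support widens with the scale factor, so each output sample
    averages a variable-width neighborhood instead of 2 fixed points."""
    src_len = len(src)
    scale = src_len / dst_len             # > 1 means downscaling
    support = max(scale, 1.0)             # triangle half-width; 1.0 when upscaling
    out = []
    for i in range(dst_len):
        center = (i + 0.5) * scale - 0.5  # align_corners=False sample position
        lo = math.floor(center - support)
        hi = math.ceil(center + support)
        acc = wsum = 0.0
        for j in range(lo, hi + 1):
            w = max(0.0, 1.0 - abs(j - center) / support)  # triangle weight
            if w > 0.0:
                acc += w * src[min(max(j, 0), src_len - 1)]  # clamp at edges
                wsum += w
        out.append(acc / wsum)
    return out

# Downscaling a constant signal preserves it; each output here averages
# ~5 weighted source samples instead of 2, matching the cost increase
# the analysis describes.
print(antialiased_resize_1d([3.0] * 8, 4))  # four values, each ~3.0
```

Note how the inner loop's bounds depend on `scale`: for a 2x downscale the support doubles, which is exactly the variable NxN expansion (and instruction-count growth) the regression analysis attributes to the new code path. When `scale <= 1` the support stays at 1 and the weights reduce to ordinary bilinear interpolation, which is why antialiasing only contributes during downscaling.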
Force-pushed from f96421a to 1854a53.
Mirrored from ggml-org/llama.cpp#17577
Debugging of #17290 revealed multiple issues with LFM2-VL. This PR fixes the following issues and makes the output of `llama.cpp` equivalent to PyTorch:

- Use `round_by_factor` to calculate target `width` and `height` in "smart resize".
- The central issue was the resizing of positional embeddings. The Siglip2 implementation in PyTorch uses `F.interpolate(..., mode="bilinear", align_corners=False, antialias=True)`. Antialiasing only contributes during downscaling. When the image width or height is less than 256, the scaling of positional embeddings in `llama.cpp` produced numerically different results from PyTorch. A new flag, `GGML_SCALE_FLAG_ANTIALIAS`, has been added for the upscale function, with implementations for CPU and CUDA.
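For context, the first fix can be sketched as a round-to-nearest-multiple helper. The name `round_by_factor` follows the PR description, but the body, the factor value, and the example dimensions below are illustrative assumptions, not code from `llama.cpp`:

```python
def round_by_factor(n: int, factor: int) -> int:
    # Round n to the nearest multiple of factor (sketch of a
    # round_by_factor-style smart-resize target calculation).
    return round(n / factor) * factor

# Hypothetical smart-resize target: snap both dimensions to the model's
# patch grid. The factor of 32 is illustrative, not taken from the PR.
factor = 32
width, height = 250, 500
target = (round_by_factor(width, factor), round_by_factor(height, factor))
print(target)  # (256, 512)
```

Rounding to the nearest multiple (rather than always flooring or ceiling) keeps the resized image as close as possible to the original aspect ratio while still landing on the patch grid.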

Now outputs match:
PyTorch (fp32)

this PR:

```shell
bin/llama-mtmd-cli -m $CKPT/LFM2-VL-1.6B-F32.gguf --mmproj $CKPT/mmproj-LFM2-VL-1.6B-F32.gguf \
  -n 64 -t 4 --image /data/playground/issue_17290/siglip_1024.png -p "OCR." --temp 0.0 --top-k 1
```