
@loci-dev

Mirrored from ggml-org/llama.cpp#17577

Debugging of #17290 revealed multiple issues with LFM2-VL.

This PR fixes the following issues and makes the output of llama.cpp equivalent to PyTorch:

  • surround image embeddings with <|image_start|> and <|image_end|> tokens
  • use round_by_factor to calculate the target width and height in "smart resize"
  • stretch the image to the width and height calculated by smart resize instead of padding
  • place image embeddings before the user prompt
  • resize positional embeddings with antialiasing enabled
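
For reference, the round_by_factor-based "smart resize" step can be sketched as follows. This is a hedged illustration: the function names follow the Qwen2-VL-style smart_resize convention, and the factor and pixel bounds are placeholders, not the actual LFM2-VL constants.

```python
import math

def round_by_factor(x: int, factor: int) -> int:
    """Round x to the nearest multiple of factor (vs. ceil_by_factor,
    which always rounds up and can distort the aspect ratio)."""
    return round(x / factor) * factor

def smart_resize(height: int, width: int, factor: int = 16,
                 min_pixels: int = 4 * 16 * 16,
                 max_pixels: int = 1024 * 16 * 16) -> tuple[int, int]:
    """Pick a target size whose sides are multiples of the patch size.

    Illustrative sketch: rounding to the nearest multiple keeps the
    target closest to the original shape; the pixel-count bounds then
    rescale proportionally if the result is too large or too small.
    """
    h = max(round_by_factor(height, factor), factor)
    w = max(round_by_factor(width, factor), factor)
    if h * w > max_pixels:
        shrink = math.sqrt(height * width / max_pixels)
        h = round_by_factor(int(height / shrink), factor)
        w = round_by_factor(int(width / shrink), factor)
    elif h * w < min_pixels:
        grow = math.sqrt(min_pixels / (height * width))
        h = round_by_factor(int(height * grow), factor)
        w = round_by_factor(int(width * grow), factor)
    return h, w

print(smart_resize(100, 100))  # rounds 100 down to the nearest multiple of 16
```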

The central issue was the resizing of positional embeddings. The Siglip2 implementation in PyTorch uses F.interpolate(..., mode="bilinear", align_corners=False, antialias=True). Antialiasing only contributes during downscaling, so when the image width or height is less than 256, the scaling of positional embeddings in llama.cpp produced numerically different results from PyTorch.
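
The downscaling-only behavior is easy to reproduce directly with PyTorch (the tensor shape here is an arbitrary stand-in for a positional-embedding grid):

```python
import torch
import torch.nn.functional as F

# Stand-in for a positional-embedding grid: [batch, channels, H, W].
pos = torch.randn(1, 8, 16, 16)

def resize(x, size, antialias):
    return F.interpolate(x, size=size, mode="bilinear",
                         align_corners=False, antialias=antialias)

# Upscaling: the antialias flag has essentially no effect.
up_diff = (resize(pos, (32, 32), True) -
           resize(pos, (32, 32), False)).abs().max()

# Downscaling: antialias widens the filter and changes the result --
# the case llama.cpp previously failed to reproduce for small images.
down_diff = (resize(pos, (8, 8), True) -
             resize(pos, (8, 8), False)).abs().max()

print(f"upscale max diff:   {up_diff:.2e}")
print(f"downscale max diff: {down_diff:.2e}")
```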

A new flag, GGML_SCALE_FLAG_ANTIALIAS, has been added for the upscale function, with implementations for CPU and CUDA.
Now the outputs match for siglip_1024:

PyTorch (fp32):

  For the vision tower, LFM2-VL uses Siglip2 NaFlex encoders to convert input images into token sequences. Two variants are implemented:

this PR (bin/llama-mtmd-cli -m $CKPT/LFM2-VL-1.6B-F32.gguf --mmproj $CKPT/mmproj-LFM2-VL-1.6B-F32.gguf -n 64 -t 4 --image /data/playground/issue_17290/siglip_1024.png -p "OCR." --temp 0.0 --top-k 1):

  For the vision tower, LFM2-VL uses Siglip2 NaFlex encoders to convert input images into token sequences. Two variants are implemented:

@loci-agentic-ai

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #350 - LFM2-VL Antialiasing Implementation

Overview

PR #350 implements antialiased bilinear interpolation for LFM2-VL vision model positional embedding resizing to match PyTorch numerical output. The changes add 165 lines across 8 files, introducing a new GGML_SCALE_FLAG_ANTIALIAS flag with CPU and CUDA backend implementations.

Key Findings

Performance-Critical Function Impact

libggml-cpu.so upscale operator (ops.cpp:7467)

  • Response time increased by 87 ns (49 ns → 135 ns)
  • Throughput time increased by 40 ns (49 ns → 88 ns)
  • The new antialiased path implements triangle filter weighting with variable-size neighborhood sampling
  • During downscaling (scale < 1.0), the support region exceeds 1 pixel, requiring multiple source-pixel samples per output pixel
  • The implementation adds nested loops for weighted accumulation, per-pixel triangle filter evaluations, and normalization
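
The bullets above correspond roughly to the following 1D sketch of triangle-filtered resampling (an illustration of the algorithm, not the actual ops.cpp code):

```python
import math

def triangle(x: float) -> float:
    """Triangle (tent) filter kernel."""
    return max(0.0, 1.0 - abs(x))

def resample_1d(src: list[float], dst_len: int) -> list[float]:
    """Antialiased bilinear resampling along one axis,
    align_corners=False style. Illustrative sketch only."""
    scale = len(src) / dst_len       # > 1.0 means downscaling
    stretch = max(scale, 1.0)        # filter widens only when downscaling
    out = []
    for i in range(dst_len):
        center = (i + 0.5) * scale   # output pixel center in source coords
        lo = max(0, math.floor(center - stretch))
        hi = min(len(src) - 1, math.ceil(center + stretch))
        acc = wsum = 0.0
        for j in range(lo, hi + 1):  # variable-size neighborhood
            w = triangle((j + 0.5 - center) / stretch)
            acc += w * src[j]
            wsum += w
        out.append(acc / wsum)       # normalization
    return out

# Downscaling a constant signal must preserve it exactly.
print(resample_1d([1.0] * 8, 4))
```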

libggml-cpu.so second upscale operator (ops.cpp:7526)

  • Response time decreased by 20 ns (69 ns → 49 ns)
  • Throughput time decreased by 20 ns (69 ns → 49 ns)
  • This improvement partially offsets the regression in the first operator

Inference Impact

Tokens per Second: No Impact

The core inference functions (llama_decode, llama_encode, llama_tokenize) show no changes in response time or throughput. The upscale operations affected by this PR execute during vision encoder preprocessing for positional embedding resizing, not during token generation. Using the reference point that a 2 ms slowdown in llama_decode costs about 7% of tokens per second, the 87 ns increase in the upscale operation is roughly 0.004% of that threshold, so the impact on inference throughput is negligible.
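
As a quick sanity check on that arithmetic (using the 2 ms reference point quoted above):

```python
regression_s = 87e-9   # 87 ns regression in the upscale operator
threshold_s = 2e-3     # reference: 2 ms slower llama_decode ~= 7% fewer tok/s

fraction_pct = regression_s / threshold_s * 100
print(f"{fraction_pct:.4f}% of the 2 ms threshold")  # ~0.004%
```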

The changes affect vision model preprocessing only, specifically when processing images smaller than 256 pixels where positional embeddings require downscaling. Token generation speed remains unchanged.

Power Consumption Analysis

libggml-cpu.so: Power consumption increased by 1.27% (+1,464 nJ)

  • Driven by the antialiased bilinear implementation requiring additional computation for triangle filter weighting and multi-pixel sampling

libmtmd.so: Power consumption increased by 1.05% (+1,361 nJ)

  • Attributed to synchronization primitive changes (mutex, semaphore operations showing +20-23 ns increases) unrelated to this PR

All other binaries (libggml-base.so, llama-bench, llama-run, libllama.so, etc.) show zero power consumption change.

Implementation Context

The performance regression is a correctness-focused trade-off. The antialiased implementation matches PyTorch's F.interpolate(..., mode="bilinear", antialias=True) behavior, which is required for LFM2-VL numerical accuracy. The antialiasing path only activates during downscaling operations when the GGML_SCALE_FLAG_ANTIALIAS flag is set. Most vision model inference involves upscaling positional embeddings, where the support region remains 1 pixel and performance impact is minimal.

@loci-dev loci-dev force-pushed the main branch 10 times, most recently from e4a4e1d to d0b408b on November 30, 2025 02:46
@loci-dev loci-dev force-pushed the upstream-PR17577-branch_Liquid4All-tarek/feat/upstream_17290 branch from 50ba22e to 2385ecf on November 30, 2025 10:36
@loci-agentic-ai

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #350

PR Context: Introduces antialiasing support for bilinear interpolation in vision model processing (LFM2-VL fixes). Changes span 12 files with 162 additions and 13 deletions.

Overview

This PR adds a new GGML_SCALE_FLAG_ANTIALIAS flag and implements triangle-filter-based antialiasing for bilinear upscaling operations. The implementation targets numerical accuracy for LFM2-VL vision model positional embedding resizing to match PyTorch reference behavior.

Key Findings

Most-Impacted Functions

1. Upscale Lambda Functions (libggml-cpu.so)

The primary regression occurs in ggml_compute_forward_upscale_f32 lambda operators within ggml/src/ggml-cpu/ops.cpp:

  • Lambda operator (line 7467): Response time increased by 87 ns (from 49 ns to 135 ns), throughput increased by 40 ns (from 49 ns to 88 ns)
  • Lambda operator (line 7526): Response time decreased by 20 ns (from 69 ns to 49 ns), throughput decreased by 20 ns (improvement)

The regression stems from the new antialiasing code path, which replaces simple 4-point bilinear sampling with variable-range triangle-filtered sampling. The new implementation adds 59 lines of code, including nested loops with dynamic range calculation, triangle filter weight computation, and weighted pixel accumulation. For downscaling operations, the sampling region expands from a fixed 2x2 grid to a variable NxN grid based on the scale factor, resulting in an 8-10x increase in instruction count per output pixel.
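
The NxN growth can be quantified with a back-of-the-envelope formula: a triangle filter touches roughly ceil(2 * support) source samples per axis, where support widens to 1/scale during downscaling. This is an illustrative approximation, not the exact count used in ops.cpp.

```python
import math

def taps_per_axis(scale: float) -> int:
    """Approximate source samples per output pixel along one axis
    for a triangle filter (hedged approximation, not the ops.cpp math)."""
    support = max(1.0, 1.0 / scale)  # widens only when scale < 1.0
    return math.ceil(2 * support)

for s in (2.0, 1.0, 0.5, 0.25):
    # Squaring gives the 2D sample count per output pixel.
    print(f"scale {s}: {taps_per_axis(s) ** 2} samples per output pixel")
```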

2. STL Container Accessors (libmtmd.so)

  • std::vector::cbegin: Response time increased by 111 ns (from 84 ns to 195 ns), throughput increased by 132 ns (from 62 ns to 195 ns)
  • std::vector::end: Response time decreased by 114 ns (from 195 ns to 81 ns), throughput decreased by 135 ns (improvement)

These changes reflect compiler inlining inconsistencies unrelated to the PR's functional changes. The cbegin function now includes explicit function prologue/epilogue overhead, while end benefits from successful inlining.

3. Miniaudio I/O Functions (libmtmd.so)

Multiple I/O callback functions show 15-32 ns response time increases:

  • ma_dr_flac__on_tell_stdio: +31 ns response time
  • ma_default_vfs_close__stdio: +15 ns response time
  • ma_dr_mp3__on_read: +32 ns response time
  • ma_dr_flac__on_tell_memory: +24 ns response time

These regressions appear to stem from enhanced validation or error handling in the VFS abstraction layer, with 80-92% of execution time in the functions' own logic rather than actual I/O operations.

Inference Performance Impact

Tokens per Second: No impact expected. The affected functions (ggml_compute_forward_upscale_f32 and related vision processing operations) are not in the tokenization or text inference path. Functions like llama_decode, llama_encode, and llama_tokenize show no changes in this PR. The upscale operation executes once per image during vision model preprocessing, not during token generation. Text-only inference workloads remain unaffected.

Power Consumption Analysis

Impacted Binaries:

  • libggml-cpu.so: Power consumption increased by 1,464 nJ (+1.27%), from 115,347 nJ to 116,811 nJ. The increase is driven by the upscale operation regressions (87 ns response time increase) and cross-entropy loss computation (+12 ns throughput).

  • libmtmd.so: Power consumption increased by 1,252 nJ (+0.96%), from 130,247 nJ to 131,499 nJ. Primary contributors include STL container operations (cbegin +132 ns throughput) and audio decoding functions (ma_dr_flac__read_int8 +91 ns throughput), partially offset by the end improvement (-135 ns throughput).

Combined power regression across both binaries totals 2,716 nJ, representing a 1.1% increase in cumulative execution energy for vision and audio processing operations.

Code Change Analysis

The antialiasing implementation is functionally correct and achieves the stated goal of matching PyTorch's F.interpolate(..., mode="bilinear", antialias=True) behavior. The algorithm uses a triangle filter with dynamic support region calculation: for upscaling (scale factor > 1.0), support remains at 1.0 pixel (standard bilinear); for downscaling (scale factor < 1.0), support expands to 1.0/scale_factor pixels to prevent aliasing artifacts.
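
The support-region rule described here reduces to a one-liner (a hedged sketch; the actual llama.cpp code works in its own coordinate conventions):

```python
def filter_support(scale_factor: float) -> float:
    """Triangle-filter support in source pixels.

    Upscaling (scale_factor >= 1.0): plain bilinear, support = 1 px.
    Downscaling (scale_factor < 1.0): widen to 1/scale_factor px so
    neighboring source pixels are averaged in (anti-aliasing).
    """
    return 1.0 if scale_factor >= 1.0 else 1.0 / scale_factor

print(filter_support(2.0))   # 1.0 (upscale: standard bilinear)
print(filter_support(0.25))  # 4.0 (downscale: 4-px neighborhood)
```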

Backend support is implemented for CPU and CUDA, while Metal, SYCL, OpenCL, Vulkan, and CANN explicitly reject operations with the antialias flag, forcing CPU fallback. The CUDA implementation mirrors the CPU algorithm with one thread per output pixel.

Additional changes include switching the smart resize calculation from ceil_by_factor to round_by_factor, stretching images in the LFM2 projector instead of padding them, and updated image boundary tokens (<|image_start|> and <|image_end|>).

The performance regression is acceptable given the operation's infrequency (once per image during preprocessing) and the correctness requirement for vision model numerical accuracy.

@loci-dev loci-dev force-pushed the main branch 2 times, most recently from f96421a to 1854a53 on November 30, 2025 13:13