
@loci-dev

Mirrored from ggml-org/llama.cpp#17577

Debugging of #17290 revealed multiple issues with LFM2-VL.

This PR fixes the following issues and makes the output of llama.cpp equivalent to PyTorch:

  • surround image embeddings with <|image_start|> and <|image_end|> tokens
  • use round_by_factor to calculate the target width and height in "smart resize"
  • stretch the image to the width and height calculated by smart resize instead of padding
  • place image embeddings before the user prompt
  • resize positional embeddings with antialiasing enabled
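
For reference, the round_by_factor-based "smart resize" step can be sketched as follows. This is a hedged illustration: the function names follow the Qwen2-VL-style smart_resize convention, and the factor and pixel bounds are placeholders, not the actual LFM2-VL constants.

```python
import math

def round_by_factor(x: int, factor: int) -> int:
    """Round x to the nearest multiple of factor (vs. ceil_by_factor,
    which always rounds up and can distort the aspect ratio)."""
    return round(x / factor) * factor

def smart_resize(height: int, width: int, factor: int = 16,
                 min_pixels: int = 4 * 16 * 16,
                 max_pixels: int = 1024 * 16 * 16) -> tuple[int, int]:
    """Pick a target size whose sides are multiples of the patch size.

    Illustrative sketch: rounding to the nearest multiple keeps the
    target closest to the original shape; the pixel-count bounds then
    rescale proportionally if the result is too large or too small.
    """
    h = max(round_by_factor(height, factor), factor)
    w = max(round_by_factor(width, factor), factor)
    if h * w > max_pixels:
        shrink = math.sqrt(height * width / max_pixels)
        h = round_by_factor(int(height / shrink), factor)
        w = round_by_factor(int(width / shrink), factor)
    elif h * w < min_pixels:
        grow = math.sqrt(min_pixels / (height * width))
        h = round_by_factor(int(height * grow), factor)
        w = round_by_factor(int(width * grow), factor)
    return h, w

print(smart_resize(100, 100))  # rounds 100 down to the nearest multiple of 16
```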

The central issue was the resizing of positional embeddings. The Siglip2 implementation in PyTorch uses F.interpolate(..., mode="bilinear", align_corners=False, antialias=True). Antialiasing only contributes during downscaling, so when the image width or height is less than 256, the scaling of positional embeddings in llama.cpp produced numerically different results from PyTorch.
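
The downscaling-only behavior is easy to reproduce directly with PyTorch (the tensor shape here is an arbitrary stand-in for a positional-embedding grid):

```python
import torch
import torch.nn.functional as F

# Stand-in for a positional-embedding grid: [batch, channels, H, W].
pos = torch.randn(1, 8, 16, 16)

def resize(x, size, antialias):
    return F.interpolate(x, size=size, mode="bilinear",
                         align_corners=False, antialias=antialias)

# Upscaling: the antialias flag has essentially no effect.
up_diff = (resize(pos, (32, 32), True) -
           resize(pos, (32, 32), False)).abs().max()

# Downscaling: antialias widens the filter and changes the result --
# the case llama.cpp previously failed to reproduce for small images.
down_diff = (resize(pos, (8, 8), True) -
             resize(pos, (8, 8), False)).abs().max()

print(f"upscale max diff:   {up_diff:.2e}")
print(f"downscale max diff: {down_diff:.2e}")
```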

A new flag, GGML_SCALE_FLAG_ANTIALIAS, has been added for the upscale function, with implementations for CPU and CUDA.
Now the outputs match for siglip_1024:

PyTorch (fp32):

  For the vision tower, LFM2-VL uses Siglip2 NaFlex encoders to convert input images into token sequences. Two variants are implemented:

this PR (bin/llama-mtmd-cli -m $CKPT/LFM2-VL-1.6B-F32.gguf --mmproj $CKPT/mmproj-LFM2-VL-1.6B-F32.gguf -n 64 -t 4 --image /data/playground/issue_17290/siglip_1024.png -p "OCR." --temp 0.0 --top-k 1):

  For the vision tower, LFM2-VL uses Siglip2 NaFlex encoders to convert input images into token sequences. Two variants are implemented:

@loci-agentic-ai

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #350 - LFM2-VL Antialiasing Implementation

Overview

PR #350 implements antialiased bilinear interpolation for LFM2-VL vision model positional embedding resizing to match PyTorch numerical output. The changes add 165 lines across 8 files, introducing a new GGML_SCALE_FLAG_ANTIALIAS flag with CPU and CUDA backend implementations.

Key Findings

Performance-Critical Function Impact

libggml-cpu.so upscale operator (ops.cpp:7467)

  • Response time increased by 87 ns (49 ns → 135 ns)
  • Throughput time increased by 40 ns (49 ns → 88 ns)
  • The new antialiased path implements triangle filter weighting with variable-size neighborhood sampling
  • During downscaling (scale < 1.0), the support region exceeds 1 pixel, requiring multiple source-pixel samples per output pixel
  • The implementation adds nested loops for weighted accumulation, per-pixel triangle filter evaluations, and normalization
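
The bullets above correspond roughly to the following 1D sketch of triangle-filtered resampling (an illustration of the algorithm, not the actual ops.cpp code):

```python
import math

def triangle(x: float) -> float:
    """Triangle (tent) filter kernel."""
    return max(0.0, 1.0 - abs(x))

def resample_1d(src: list[float], dst_len: int) -> list[float]:
    """Antialiased bilinear resampling along one axis,
    align_corners=False style. Illustrative sketch only."""
    scale = len(src) / dst_len       # > 1.0 means downscaling
    stretch = max(scale, 1.0)        # filter widens only when downscaling
    out = []
    for i in range(dst_len):
        center = (i + 0.5) * scale   # output pixel center in source coords
        lo = max(0, math.floor(center - stretch))
        hi = min(len(src) - 1, math.ceil(center + stretch))
        acc = wsum = 0.0
        for j in range(lo, hi + 1):  # variable-size neighborhood
            w = triangle((j + 0.5 - center) / stretch)
            acc += w * src[j]
            wsum += w
        out.append(acc / wsum)       # normalization
    return out

# Downscaling a constant signal must preserve it exactly.
print(resample_1d([1.0] * 8, 4))
```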

libggml-cpu.so second upscale operator (ops.cpp:7526)

  • Response time decreased by 20 ns (69 ns → 49 ns)
  • Throughput time decreased by 20 ns (69 ns → 49 ns)
  • This improvement partially offsets the regression in the first operator

Inference Impact

Tokens per Second: No Impact

The core inference functions (llama_decode, llama_encode, llama_tokenize) show no changes in response time or throughput. The upscale operations affected by this PR execute during vision encoder preprocessing for positional embedding resizing, not during token generation. Using the reference point that a 2 ms slowdown in llama_decode costs about 7% of tokens per second, the 87 ns increase in the upscale operation is roughly 0.004% of that threshold, so the impact on inference throughput is negligible.
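
As a quick sanity check on that arithmetic (using the 2 ms reference point quoted above):

```python
regression_s = 87e-9   # 87 ns regression in the upscale operator
threshold_s = 2e-3     # reference: 2 ms slower llama_decode ~= 7% fewer tok/s

fraction_pct = regression_s / threshold_s * 100
print(f"{fraction_pct:.4f}% of the 2 ms threshold")  # ~0.004%
```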

The changes affect vision model preprocessing only, specifically when processing images smaller than 256 pixels where positional embeddings require downscaling. Token generation speed remains unchanged.

Power Consumption Analysis

libggml-cpu.so: Power consumption increased by 1.27% (+1,464 nJ)

  • Driven by the antialiased bilinear implementation requiring additional computation for triangle filter weighting and multi-pixel sampling

libmtmd.so: Power consumption increased by 1.05% (+1,361 nJ)

  • Attributed to synchronization primitive changes (mutex, semaphore operations showing +20-23 ns increases) unrelated to this PR

All other binaries (libggml-base.so, llama-bench, llama-run, libllama.so, etc.) show zero power consumption change.

Implementation Context

The performance regression is a correctness-focused trade-off. The antialiased implementation matches PyTorch's F.interpolate(..., mode="bilinear", antialias=True) behavior, which is required for LFM2-VL numerical accuracy. The antialiasing path only activates during downscaling operations when the GGML_SCALE_FLAG_ANTIALIAS flag is set. Most vision model inference involves upscaling positional embeddings, where the support region remains 1 pixel and performance impact is minimal.

@loci-dev loci-dev force-pushed the main branch 10 times, most recently from e4a4e1d to d0b408b on November 30, 2025 02:46
@loci-dev loci-dev force-pushed the upstream-PR17577-branch_Liquid4All-tarek/feat/upstream_17290 branch from 50ba22e to 2385ecf on November 30, 2025 10:36
@loci-agentic-ai

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #350

PR Context: Introduces antialiasing support for bilinear interpolation in vision model processing (LFM2-VL fixes). Changes span 12 files with 162 additions and 13 deletions.

Overview

This PR adds a new GGML_SCALE_FLAG_ANTIALIAS flag and implements triangle-filter-based antialiasing for bilinear upscaling operations. The implementation targets numerical accuracy for LFM2-VL vision model positional embedding resizing to match PyTorch reference behavior.

Key Findings

Most-Impacted Functions

1. Upscale Lambda Functions (libggml-cpu.so)

The primary regression occurs in ggml_compute_forward_upscale_f32 lambda operators within ggml/src/ggml-cpu/ops.cpp:

  • Lambda operator (line 7467): Response time increased by 87 ns (from 49 ns to 135 ns), throughput increased by 40 ns (from 49 ns to 88 ns)
  • Lambda operator (line 7526): Response time decreased by 20 ns (from 69 ns to 49 ns), throughput decreased by 20 ns (improvement)

The regression stems from the new antialiasing code path, which replaces simple 4-point bilinear sampling with variable-range triangle-filtered sampling. The new implementation adds 59 lines of code, including nested loops with dynamic range calculation, triangle filter weight computation, and weighted pixel accumulation. For downscaling operations, the sampling region expands from a fixed 2x2 grid to a variable NxN grid based on the scale factor, resulting in an 8-10x increase in instruction count per output pixel.
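
The NxN growth can be quantified with a back-of-the-envelope formula: a triangle filter touches roughly ceil(2 * support) source samples per axis, where support widens to 1/scale during downscaling. This is an illustrative approximation, not the exact count used in ops.cpp.

```python
import math

def taps_per_axis(scale: float) -> int:
    """Approximate source samples per output pixel along one axis
    for a triangle filter (hedged approximation, not the ops.cpp math)."""
    support = max(1.0, 1.0 / scale)  # widens only when scale < 1.0
    return math.ceil(2 * support)

for s in (2.0, 1.0, 0.5, 0.25):
    # Squaring gives the 2D sample count per output pixel.
    print(f"scale {s}: {taps_per_axis(s) ** 2} samples per output pixel")
```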

2. STL Container Accessors (libmtmd.so)

  • std::vector::cbegin: Response time increased by 111 ns (from 84 ns to 195 ns), throughput increased by 132 ns (from 62 ns to 195 ns)
  • std::vector::end: Response time decreased by 114 ns (from 195 ns to 81 ns), throughput decreased by 135 ns (improvement)

These changes reflect compiler inlining inconsistencies unrelated to the PR's functional changes. The cbegin function now includes explicit function prologue/epilogue overhead, while end benefits from successful inlining.

3. Miniaudio I/O Functions (libmtmd.so)

Multiple I/O callback functions show 15-32 ns response time increases:

  • ma_dr_flac__on_tell_stdio: +31 ns response time
  • ma_default_vfs_close__stdio: +15 ns response time
  • ma_dr_mp3__on_read: +32 ns response time
  • ma_dr_flac__on_tell_memory: +24 ns response time

These regressions appear to stem from enhanced validation or error handling in the VFS abstraction layer, with 80-92% of execution time in the functions' own logic rather than actual I/O operations.

Inference Performance Impact

Tokens per Second: No impact expected. The affected functions (ggml_compute_forward_upscale_f32 and related vision processing operations) are not in the tokenization or text inference path. Functions like llama_decode, llama_encode, and llama_tokenize show no changes in this PR. The upscale operation executes once per image during vision model preprocessing, not during token generation. Text-only inference workloads remain unaffected.

Power Consumption Analysis

Impacted Binaries:

  • libggml-cpu.so: Power consumption increased by 1,464 nJ (+1.27%), from 115,347 nJ to 116,811 nJ. The increase is driven by the upscale operation regressions (87 ns response time increase) and cross-entropy loss computation (+12 ns throughput).

  • libmtmd.so: Power consumption increased by 1,252 nJ (+0.96%), from 130,247 nJ to 131,499 nJ. Primary contributors include STL container operations (cbegin +132 ns throughput) and audio decoding functions (ma_dr_flac__read_int8 +91 ns throughput), partially offset by the end improvement (-135 ns throughput).

Combined power regression across both binaries totals 2,716 nJ, representing a 1.1% increase in cumulative execution energy for vision and audio processing operations.

Code Change Analysis

The antialiasing implementation is functionally correct and achieves the stated goal of matching PyTorch's F.interpolate(..., mode="bilinear", antialias=True) behavior. The algorithm uses a triangle filter with dynamic support region calculation: for upscaling (scale factor > 1.0), support remains at 1.0 pixel (standard bilinear); for downscaling (scale factor < 1.0), support expands to 1.0/scale_factor pixels to prevent aliasing artifacts.
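
The support-region rule described here reduces to a one-liner (a hedged sketch; the actual llama.cpp code works in its own coordinate conventions):

```python
def filter_support(scale_factor: float) -> float:
    """Triangle-filter support in source pixels.

    Upscaling (scale_factor >= 1.0): plain bilinear, support = 1 px.
    Downscaling (scale_factor < 1.0): widen to 1/scale_factor px so
    neighboring source pixels are averaged in (anti-aliasing).
    """
    return 1.0 if scale_factor >= 1.0 else 1.0 / scale_factor

print(filter_support(2.0))   # 1.0 (upscale: standard bilinear)
print(filter_support(0.25))  # 4.0 (downscale: 4-px neighborhood)
```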

Backend support is implemented for CPU and CUDA, while Metal, SYCL, OpenCL, Vulkan, and CANN explicitly reject operations with the antialias flag, forcing CPU fallback. The CUDA implementation mirrors the CPU algorithm with one thread per output pixel.

Additional changes include switching the smart resize calculation from ceil_by_factor to round_by_factor, stretching images in the LFM2 projector instead of padding them, and updated image boundary tokens (<|image_start|> and <|image_end|>).

The performance regression is acceptable given the operation's infrequency (once per image during preprocessing) and the correctness requirement for vision model numerical accuracy.

@loci-dev loci-dev force-pushed the main branch 2 times, most recently from f96421a to 1854a53 on November 30, 2025 13:13