Skip to content

Conversation

@loci-dev
Copy link

Mirrored from ggml-org/llama.cpp#17597

The SWA pattern key was originally introduced in #14400.

This PR changes the behavior so that llama.cpp now reads the sliding window attention pattern directly from the GGUF file (e.g. attention.sliding_window_pattern metadata when present).

This brings the following benefits:

  • Allows newer models to ship their correct SWA pattern without requiring a code change in llama.cpp
  • Keeps backward compatibility with existing models
  • Still permits specific model overrides in code (as before) for cases where the metadata is absent or needs to be corrected

The loading precedence is:

  • Model-specific hard-coded override (if any)
  • Value from GGUF metadata

@loci-agentic-ai
Copy link

Explore the complete analysis inside the Version Insights

Performance Analysis Summary

Overview

This PR introduces metadata-driven Sliding Window Attention (SWA) pattern loading from GGUF files. The changes span 5 files with 12 additions, enabling per-layer SWA configuration through model metadata rather than hardcoded values.

Key Findings

Performance-Critical Areas Impact

Model Loading Path:

  • llama_model::load_hparams() adds metadata lookups for SWA configuration
  • Added operations: one key lookup, one conditional branch, one array load, two assignments
  • Estimated overhead: 30-60 ns per model load
  • This is a cold path executed once during initialization

Inference Path:

  • No direct modifications to tokenization or inference functions
  • llama_decode, llama_encode, llama_tokenize remain unchanged
  • Per-layer SWA checks use simple array access: hparams.swa_layers[layer_idx]
  • Estimated per-layer overhead: 1-2 ns for array lookup

Tokens Per Second Impact

Analysis:
The reference shows 7% tokens per second reduction correlates with 2 ms slower llama_decode.

This PR:

  • No modifications to llama_decode, llama_encode, or llama_tokenize
  • Added logic is in model loading, not inference hot path
  • Per-layer SWA pattern checks add ~50-100 ns total across all layers per forward pass
  • Expected tokens per second impact: <0.001% (negligible)

Impacted Functions:

  • None of the core inference functions show measurable changes
  • Model loading functions affected but not performance-critical for inference

Power Consumption Analysis

Binary-Level Impact:

  • Changes affect llama-cli and llama-server binaries through linked model loading code
  • Added code: template instantiation, metadata parsing, conditional logic
  • Estimated power impact: <0.01% increase in model loading phase
  • No measurable impact during inference phase
  • Impacted binaries: All binaries linking llama-model-loader and llama-model modules

Conclusion:
The implementation adds necessary flexibility for SWA configuration with negligible performance impact. The changes are isolated to model initialization, leaving inference paths unaffected. No measurable degradation in tokens per second or power consumption during inference is expected.

@loci-dev loci-dev force-pushed the main branch 6 times, most recently from 1854a53 to 1b177fe Compare November 30, 2025 15:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants