UPSTREAM PR #17597: Load Sliding Window Attention (SWA) pattern from GGUF metadata #367

loci-dev · 2025-11-29T18:40:16Z

The SWA pattern key was originally introduced in #14400.

This PR changes the behavior so that llama.cpp now reads the sliding window attention pattern directly from the GGUF file (e.g. attention.sliding_window_pattern metadata when present).

This brings the following benefits:

Allows newer models to ship their correct SWA pattern without requiring a code change in llama.cpp
Keeps backward compatibility with existing models
Still permits specific model overrides in code (as before) for cases where the metadata is absent or needs to be corrected

The loading precedence is:

Model-specific hard-coded override (if any)
Value from GGUF metadata

loci-agentic-ai · 2025-11-29T19:23:48Z

Explore the complete analysis inside the Version Insights

Performance Analysis Summary

Overview

This PR introduces metadata-driven Sliding Window Attention (SWA) pattern loading from GGUF files. The changes span 5 files with 12 additions, enabling per-layer SWA configuration through model metadata rather than hardcoded values.

Key Findings

Performance-Critical Areas Impact

Model Loading Path:

llama_model::load_hparams() adds metadata lookups for SWA configuration
Added operations: one key lookup, one conditional branch, one array load, two assignments
Estimated overhead: 30-60 ns per model load
This is a cold path executed once during initialization

Inference Path:

No direct modifications to tokenization or inference functions
llama_decode, llama_encode, llama_tokenize remain unchanged
Per-layer SWA checks use simple array access: hparams.swa_layers[layer_idx]
Estimated per-layer overhead: 1-2 ns for array lookup

Tokens Per Second Impact

Analysis:
The reference shows 7% tokens per second reduction correlates with 2 ms slower llama_decode.

This PR:

No modifications to llama_decode, llama_encode, or llama_tokenize
Added logic is in model loading, not inference hot path
Per-layer SWA pattern checks add ~50-100 ns total across all layers per forward pass
Expected tokens per second impact: <0.001% (negligible)

Impacted Functions:

None of the core inference functions show measurable changes
Model loading functions affected but not performance-critical for inference

Power Consumption Analysis

Binary-Level Impact:

Changes affect llama-cli and llama-server binaries through linked model loading code
Added code: template instantiation, metadata parsing, conditional logic
Estimated power impact: <0.01% increase in model loading phase
No measurable impact during inference phase
Impacted binaries: All binaries linking llama-model-loader and llama-model modules

Conclusion:
The implementation adds necessary flexibility for SWA configuration with negligible performance impact. The changes are isolated to model initialization, leaving inference paths unaffected. No measurable degradation in tokens per second or power consumption during inference is expected.

taylorchu added 2 commits November 29, 2025 09:35

read swa pattern for all models

29878b2

add missing template

5a0c228

loci-dev temporarily deployed to PROD__AL_DEMO November 29, 2025 18:40 — with GitHub Actions Inactive

loci-dev force-pushed the main branch 6 times, most recently from 1854a53 to 1b177fe Compare November 30, 2025 15:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

UPSTREAM PR #17597: Load Sliding Window Attention (SWA) pattern from GGUF metadata #367

UPSTREAM PR #17597: Load Sliding Window Attention (SWA) pattern from GGUF metadata #367

Uh oh!

loci-dev commented Nov 29, 2025

Uh oh!

loci-agentic-ai bot commented Nov 29, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

UPSTREAM PR #17597: Load Sliding Window Attention (SWA) pattern from GGUF metadata #367

Are you sure you want to change the base?

UPSTREAM PR #17597: Load Sliding Window Attention (SWA) pattern from GGUF metadata #367

Uh oh!

Conversation

loci-dev commented Nov 29, 2025

Uh oh!

loci-agentic-ai bot commented Nov 29, 2025

Performance Analysis Summary

Overview

Key Findings

Performance-Critical Areas Impact

Tokens Per Second Impact

Power Consumption Analysis

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants