Conversation

@taylorchu

@taylorchu taylorchu commented Nov 29, 2025

The SWA pattern key was originally introduced in #14400.

This PR changes the behavior so that llama.cpp now reads the sliding window attention pattern directly from the GGUF file (e.g. attention.sliding_window_pattern metadata when present).

This brings the following benefits:

  • Allows newer models to ship their correct SWA pattern without requiring a code change in llama.cpp
  • Keeps backward compatibility with existing models
  • Still permits specific model overrides in code (as before) for cases where the metadata is absent or needs to be corrected

The loading precedence is:

  • Model-specific hard-coded override (if any)
  • Value from GGUF metadata

@taylorchu
Author

taylorchu commented Nov 29, 2025

The tests pass locally.

@CISC
Collaborator

CISC commented Nov 29, 2025

I'm not sure why the sliding_window_pattern metadata was introduced in that PR (@ngxson?) and never used, but it was most likely a mistake. The same thing can be achieved by adding hparams.n_swa_pattern as in #15641, which can then be passed to set_swa_pattern.

@ngxson
Collaborator

ngxson commented Nov 29, 2025

@CISC It was initially made to reflect the layer_types in the config: https://huggingface.co/google/gemma-3n-E2B-it/blob/main/config.json#L140

IIRC it was a last-minute change because ollama engine requires this KV (so that users can freely use the same GGUF on either ollama or llama.cpp)

Around that time, another model was also using this metadata, but it never made it to the public. Neither did the code to handle it.

So honestly I don't have a strong opinion on whether we should merge this PR or not. It seems like adding it doesn't break anything anyway (because for most models the pattern will be overwritten by set_swa_pattern), so it's probably fine to merge and wait until some models actually use it.

@taylorchu
Author

I was thinking between:

  1. set_swa_pattern with that metadata
  2. just passing the raw swa_types

I chose option 2 because swa_types already exists, and we don't have to choose full attention first or last as a separate param.

@ngxson
Collaborator

ngxson commented Nov 29, 2025

realistically, no model to this day uses more than one type of swa_type (excluding the NONE value)

so swa_types is not a good option

@taylorchu
Author

yeah, the type is a bool (sliding attention or full attention). Only LLM_ARCH_LLAMA4 uses chunked attention. The trend seems to be linear attention rather than SWA in the future, so I am going to keep the bool as is instead of an enum.


    ml.get_key(LLM_KV_ATTENTION_SLIDING_WINDOW, hparams.n_swa, false);
    if (hparams.n_swa > 0) {
        hparams.swa_type = LLAMA_SWA_TYPE_STANDARD;
Collaborator

I don't think we can assume anything about swa_type here. Please remove this line.

Author

This line is required; otherwise SWA won't activate. A similar structure is used below.

Collaborator

I don't get it. How does SWA not get activated?

swa_type should be set per-model. Please remove this line.

Author

        case LLAMA_SWA_TYPE_STANDARD:
            {
                if (p1 - p0 >= (int32_t) n_swa) {
                    return true;
                }
            } break;

SWA requires specific masking to work. Without it, the model will generate gibberish. The default semantic is "standard"; specific models can set it to chunked or symmetric.

Comment on lines 562 to 564
    } else {
        hparams.swa_type = LLAMA_SWA_TYPE_NONE;
    }
Collaborator

Remove this branch too, it's redundant

Author

I can remove this because it is LLAMA_SWA_TYPE_NONE by default, but it seems like LLAMA_SWA_TYPE_NONE is also added below. I added it just in case.

@ngxson
Collaborator

ngxson commented Nov 29, 2025

Can you also test this PR with one of the models using SWA (for example, gemma3n) to confirm that it does not break anything?

@ngxson
Collaborator

ngxson commented Nov 29, 2025

hmm, on second thought, I think swa_layers should only be read by models that use it, the same as the existing n_swa logic.

I don't quite like this change as it reads everything at the global level. This affects all models and potentially breaks things, because we cannot test all the GGUFs in the wild.

so unless you can argue otherwise, I'd prefer to close this PR and only implement this when a new model actually needs it

@taylorchu
Author

I tested with gemma3n: https://huggingface.co/unsloth/gemma-3n-E2B-it-GGUF/blob/main/gemma-3n-E2B-it-Q4_0.gguf

llama_model_loader: - kv  23:           gemma3n.attention.sliding_window u32              = 512
llama_model_loader: - kv  24:            gemma3n.attention.head_count_kv u32              = 2
llama_model_loader: - kv  25:                   gemma3n.altup.active_idx u32              = 0
llama_model_loader: - kv  26:                   gemma3n.altup.num_inputs u32              = 4
llama_model_loader: - kv  27:   gemma3n.embedding_length_per_layer_input u32              = 256
llama_model_loader: - kv  28:         gemma3n.attention.shared_kv_layers u32              = 10
llama_model_loader: - kv  29:          gemma3n.activation_sparsity_scale arr[f32,30]      = [1.644854, 1.644854, 1.644854, 1.6448...
llama_model_loader: - kv  30:   gemma3n.attention.sliding_window_pattern arr[bool,30]     = [true, true, true, true, false, true,...
...
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer   0 assigned to device Metal, is_swa = 1
load_tensors: layer   1 assigned to device Metal, is_swa = 1
load_tensors: layer   2 assigned to device Metal, is_swa = 1
load_tensors: layer   3 assigned to device Metal, is_swa = 1
load_tensors: layer   4 assigned to device Metal, is_swa = 0
load_tensors: layer   5 assigned to device Metal, is_swa = 1
load_tensors: layer   6 assigned to device Metal, is_swa = 1
load_tensors: layer   7 assigned to device Metal, is_swa = 1
load_tensors: layer   8 assigned to device Metal, is_swa = 1
load_tensors: layer   9 assigned to device Metal, is_swa = 0
load_tensors: layer  10 assigned to device Metal, is_swa = 1
load_tensors: layer  11 assigned to device Metal, is_swa = 1
load_tensors: layer  12 assigned to device Metal, is_swa = 1
load_tensors: layer  13 assigned to device Metal, is_swa = 1
load_tensors: layer  14 assigned to device Metal, is_swa = 0
load_tensors: layer  15 assigned to device Metal, is_swa = 1
load_tensors: layer  16 assigned to device Metal, is_swa = 1
load_tensors: layer  17 assigned to device Metal, is_swa = 1
load_tensors: layer  18 assigned to device Metal, is_swa = 1
load_tensors: layer  19 assigned to device Metal, is_swa = 0
load_tensors: layer  20 assigned to device Metal, is_swa = 1
load_tensors: layer  21 assigned to device Metal, is_swa = 1
load_tensors: layer  22 assigned to device Metal, is_swa = 1
load_tensors: layer  23 assigned to device Metal, is_swa = 1
load_tensors: layer  24 assigned to device Metal, is_swa = 0
load_tensors: layer  25 assigned to device Metal, is_swa = 1
load_tensors: layer  26 assigned to device Metal, is_swa = 1
load_tensors: layer  27 assigned to device Metal, is_swa = 1
load_tensors: layer  28 assigned to device Metal, is_swa = 1
load_tensors: layer  29 assigned to device Metal, is_swa = 0
load_tensors: layer  30 assigned to device Metal, is_swa = 0

@taylorchu
Author

taylorchu commented Nov 29, 2025

This should be low-risk because we only read swa_pattern from the metadata when n_swa > 0. If the swa_pattern metadata is not present, each layer defaults to false (full attention).

@taylorchu
Author

@ngxson I technically only need qwen3. I updated the PR to address your concern.

@taylorchu taylorchu changed the title Load Sliding Window Attention (SWA) pattern from GGUF metadata Load Sliding Window Attention (SWA) pattern from GGUF metadata (qwen3) Nov 30, 2025
@github-actions github-actions bot added the model Model specific label Nov 30, 2025