Load Sliding Window Attention (SWA) pattern from GGUF metadata (qwen3) #17597
base: master
Conversation
The tests pass locally.
@CISC IIRC it was a last-minute change, because the ollama engine requires this KV (so that users can freely use the same GGUF on either ollama or llama.cpp). Around that time, another model was also using this metadata, but it was never made public, and neither was the code to handle it. So honestly I don't have a strong opinion on whether we should merge this PR or not. Adding it doesn't seem to break anything anyway (because for most models, the pattern will be overwritten with …)
I was thinking between:
I chose option 2 because swa_types already exists, and we don't have to choose full attention first or last as a separate param.
realistically, there are no models to this day that use more than 1 type of SWA, so …
yeah, the type is a bool (sliding attention or full attention). Only LLM_ARCH_LLAMA4 uses chunked attention. The trend seems to be linear attention instead of SWA going forward, so I am going to keep the bool as-is instead of an enum.
src/llama-model.cpp
Outdated
ml.get_key(LLM_KV_ATTENTION_SLIDING_WINDOW, hparams.n_swa, false);
if (hparams.n_swa > 0) {
    hparams.swa_type = LLAMA_SWA_TYPE_STANDARD;
I don't think we can assume anything about swa_type here. Please remove this line.
This line is required; otherwise SWA won't activate. A similar structure is used below.
I don't get it; how does SWA not get activated?
swa_type should be set per-model. Please remove this line.
case LLAMA_SWA_TYPE_STANDARD:
{
if (p1 - p0 >= (int32_t) n_swa) {
return true;
}
} break;
SWA requires specific masking to work; without it, the model generates gibberish. The default semantic is "standard". I think specific models can set it to chunked or symmetric.
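For context, the masking rule being discussed can be sketched in a few lines of Python. This is an illustration of the semantics, not the actual llama.cpp implementation; the chunked rule in particular is an assumption:

```python
def is_masked_swa(swa_type: str, n_swa: int, p0: int, p1: int) -> bool:
    """Return True if key position p0 is masked for query position p1."""
    if swa_type == "standard":
        # standard sliding window: only the last n_swa positions are visible
        return p1 - p0 >= n_swa
    if swa_type == "chunked":
        # assumed chunked rule: positions before the start of the query's
        # n_swa-sized chunk are masked
        return p0 < (p1 // n_swa) * n_swa
    return False  # "none": full attention, nothing is masked

# with n_swa = 4, a query at position 10 can only see positions 7..10
visible = [p0 for p0 in range(11) if not is_masked_swa("standard", 4, p0, 10)]
print(visible)  # → [7, 8, 9, 10]
```

This is why skipping the mask produces gibberish: without it, every position would attend to the full context even though the model was trained to see only the window.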
src/llama-model.cpp
Outdated
} else {
    hparams.swa_type = LLAMA_SWA_TYPE_NONE;
}
Remove this branch too, it's redundant
I can remove this because it is LLAMA_SWA_TYPE_NONE by default, but it seems like LLAMA_SWA_TYPE_NONE is also set below. I added it just in case.
Can you also test this PR with one of the models using SWA (for example, gemma3n) to confirm that it does not break anything?
hmm, on second thought, I don't quite like this change as it reads everything at the global level. This affects all models and could potentially break things, because we cannot test every GGUF in the wild. So unless you can show otherwise, I prefer closing this PR and only implementing this when a new model actually needs it.
I tested with gemma3n: https://huggingface.co/unsloth/gemma-3n-E2B-it-GGUF/blob/main/gemma-3n-E2B-it-Q4_0.gguf
This should be low-risk because we only read swa_pattern from the metadata if n_swa > 0. If the swa_pattern metadata is not present, it defaults to false (full attention).
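The guard can be sketched like this (key names follow the GGUF convention mentioned in the PR description; the function, defaults, and dict layout are illustrative, not the actual llama.cpp code):

```python
SWA_TYPE_NONE = 0
SWA_TYPE_STANDARD = 1

def load_swa_hparams(metadata: dict) -> dict:
    """Sketch: swa_pattern is only consulted when a sliding window is set."""
    hp = {"n_swa": 0, "swa_type": SWA_TYPE_NONE, "swa_pattern": False}
    hp["n_swa"] = metadata.get("attention.sliding_window", 0)
    if hp["n_swa"] > 0:
        hp["swa_type"] = SWA_TYPE_STANDARD
        # an absent key falls back to False (full attention), so models
        # without the metadata are unaffected
        hp["swa_pattern"] = metadata.get("attention.sliding_window_pattern", False)
    return hp

# a model with no SWA keys keeps all defaults, even if a stray pattern key exists
print(load_swa_hparams({"attention.sliding_window_pattern": 2}))
```

Because the pattern read is nested under the n_swa > 0 check, models without sliding-window metadata can never pick up a pattern value by accident.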
@ngxson I technically only need qwen3. I updated the PR to address your concern.
The SWA pattern key was originally introduced in #14400.
This PR changes the behavior so that llama.cpp now reads the sliding window attention pattern directly from the GGUF file (i.e. the attention.sliding_window_pattern metadata, when present).
This brings the following benefits:
The loading precedence is:
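As an aside, one common way such a pattern value is interpreted per layer can be sketched as follows. The exact convention is model-specific, so treat this as an assumption rather than the definitive mapping used by every architecture:

```python
def layer_uses_swa(il: int, n_pattern: int) -> bool:
    """Assumed convention: every n_pattern-th layer uses full attention,
    the remaining layers use the sliding window. 0 disables SWA entirely."""
    if n_pattern == 0:
        return False
    return il % n_pattern < n_pattern - 1

# pattern 4: layers 3, 7, ... are full attention, the rest are SWA
print([layer_uses_swa(il, 4) for il in range(8)])
# → [True, True, True, False, True, True, True, False]
```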