fix wrong template in GLM4-0414 #13099


Closed

Conversation

matteoserva
Contributor

@matteoserva matteoserva commented Apr 24, 2025

GLM4-0414 models were using the wrong legacy template, leading to a missing [gMASK]<sop> preamble.
The old code was returning LLM_CHAT_TEMPLATE_GLMEDGE.

As a workaround, you needed to launch llama-server with --chat-template chatglm4.

After the patch, the GGUF should be regenerated or edited manually.
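
To check whether an existing GGUF still carries the old metadata, one option is to read the stored chat template and BOS token id with the gguf Python package. This is only a rough sketch, not part of the PR: the file path is a placeholder and the field decoding follows GGUFReader's parts/data layout.

import sys
from gguf import GGUFReader

reader = GGUFReader(sys.argv[1])  # path to the GGUF file (placeholder argument)

def read_string_field(key: str) -> str:
    field = reader.get_field(key)
    # String fields keep the value bytes in the part indexed by data[0].
    return field.parts[field.data[0]].tobytes().decode("utf-8")

# The GLM4-0414 template should contain the [gMASK]<sop> preamble.
print(read_string_field("tokenizer.chat_template")[:200])

bos = reader.get_field("tokenizer.ggml.bos_token_id")
print(int(bos.parts[bos.data[0]][0]))  # BOS token id currently stored in the metadata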

EDIT:
Added workaround explanation

@sultanqasim

Perhaps the CHATGML constants should be renamed CHATGLM?

@matteoserva matteoserva marked this pull request as draft April 24, 2025 16:34
@matteoserva
Contributor Author

Perhaps the CHATGML constants should be renamed CHATGLM?

I don't want to touch code that is outside of this PR

@matteoserva matteoserva marked this pull request as ready for review April 24, 2025 17:02
@github-actions github-actions bot added the python python script changes label Apr 24, 2025
@kallewoof
Contributor

+1 - see SillyTavern/SillyTavern#3906

@diaopal

diaopal commented Apr 25, 2025

Why did you change the bos token to <|endoftext|>?

@kallewoof
Contributor

Why did you change the bos token to <|endoftext|>?

Agree, BOS token should not change.

Also I am not seeing the BOS token added when using the model, as you would expect when comparing to e.g. AutoTokenizer:

slot update_slots: id  0 | task 0 | prompt token   0: 151333 '<sop>'
slot update_slots: id  0 | task 0 | prompt token   1:    198 '
'
slot update_slots: id  0 | task 0 | prompt token   2: 151335 '<|system|>'
slot update_slots: id  0 | task 0 | prompt token   3:    198 '
'
slot update_slots: id  0 | task 0 | prompt token   4:   2610 'You'
slot update_slots: id  0 | task 0 | prompt token   5:   2299 ''re'
slot update_slots: id  0 | task 0 | prompt token   6:  47773 ' Coding'
slot update_slots: id  0 | task 0 | prompt token   7:  45950 ' Sense'

vs transformers:

>>> tokenizer.apply_chat_template([{"role":"user","content":"hi"},{"role":"assistant","content":"ye?"}], tokenize=False)
'[gMASK]<sop><|user|>\nhi<|assistant|>\nye?'

Should add_bos_token be set to true? What is the reason for it not being set? Sorry if I missed that discussion.

@matteoserva
Contributor Author

  1. You can either use add_bos_token=true or have the bos token in the template, not both. Otherwise you get duplicate bos tokens.
  2. The jinja template already includes the bos token. For compatibility with the jinja one, the legacy template (LLM_CHAT_TEMPLATE_CHATGML_4) also adds the bos token.
    This is why add_bos_token is false.
  3. The bos token cannot be a prefix of the jinja template; that's why I changed it to eos, as was done in similar models.

@kallewoof You have to regenerate the gguf with the patched python script and also run it with the patched llama-server.

Correct behavior after patch and regenerating the GGUF:

slot update_slots: id  0 | task 0 | prompt token   0: 151331 '[gMASK]'
slot update_slots: id  0 | task 0 | prompt token   1: 151333 '<sop>'
slot update_slots: id  0 | task 0 | prompt token   2: 151336 '<|user|>'
slot update_slots: id  0 | task 0 | prompt token   3:    198 '
'
slot update_slots: id  0 | task 0 | prompt token   4:  14978 'hello'
slot update_slots: id  0 | task 0 | prompt token   5: 151337 '<|assistant|>'

Before patch without --jinja:

slot update_slots: id  0 | task 0 | prompt token   0: 151336 '<|user|>'
slot update_slots: id  0 | task 0 | prompt token   1:    198 '
'
slot update_slots: id  0 | task 0 | prompt token   2:  14978 'hello'
slot update_slots: id  0 | task 0 | prompt token   3: 151337 '<|assistant|>'

Before patch with --jinja:

slot update_slots: id  0 | task 0 | prompt token   0: 151333 '<sop>'
slot update_slots: id  0 | task 0 | prompt token   1: 151336 '<|user|>'
slot update_slots: id  0 | task 0 | prompt token   2:    198 '
'
slot update_slots: id  0 | task 0 | prompt token   3:  14978 'hello'
slot update_slots: id  0 | task 0 | prompt token   4: 151337 '<|assistant|>'

@kallewoof
Contributor

kallewoof commented Apr 25, 2025

  1. You can either use add_bos_token=true or have the bos token in the template, not both. Otherwise you get duplicate bos tokens.
  2. The jinja template already includes the bos token. For compatibility with the jinja one, the legacy template (LLM_CHAT_TEMPLATE_CHATGML_4) also adds the bos token.
    This is why add_bos_token is false.
  3. The bos token cannot be a prefix of the jinja template; that's why I changed it to eos, as was done in similar models.

@kallewoof You have to regenerate the gguf with the patched python script and also run it with the patched llama-server.

Aha, thanks. I did indeed regenerate the GGUF and did not get the expected results. I am using llama-server though, and connecting to it from a different front end. The prompt being sent is

<sop><system>
sysmessage
...

but from what you are saying it looks like I should not include <sop> either. But I'm not sure how this relates to llama-server and requests via the API. I assumed it would work the same, but I don't think it does. I'll double-check, though.

Edit: no, this is chat completion vs completion stuff. For chat completion your way probably works fine. For completion, I need to send [gMASK]<sop> from the client side, yes?

@matteoserva
Contributor Author

Edit: no, this is chat completion vs completion stuff. For chat completion your way probably works fine. For completion, I need to send [gMASK]<sop> from the client side, yes?

Yes, the change affects the chat completion API (Open WebUI and similar frontends).
In completion mode you have to begin the prompt with [gMASK]<sop>, as specified in the official template.
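
For completion mode, a minimal client-side sketch against llama-server's /completion endpoint might look like the following (the host, port, and n_predict value are assumptions, not part of this PR; only the [gMASK]<sop> prefix is the point):

import requests

# Prompt starts with the [gMASK]<sop> preamble, followed by the GLM4 turn format
# shown in the token dumps above.
prompt = "[gMASK]<sop><|user|>\nhello<|assistant|>"
resp = requests.post(
    "http://localhost:8080/completion",         # assumed default llama-server address
    json={"prompt": prompt, "n_predict": 128},  # n_predict caps the generated tokens
)
print(resp.json()["content"])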

@Mushoz

Mushoz commented Apr 25, 2025

slot update_slots: id 0 | task 0 | prompt token 0: 151331 '[gMASK]'
slot update_slots: id 0 | task 0 | prompt token 1: 151333 '<sop>'
slot update_slots: id 0 | task 0 | prompt token 2: 151336 '<|user|>'
slot update_slots: id 0 | task 0 | prompt token 3: 198 '
'
slot update_slots: id 0 | task 0 | prompt token 4: 14978 'hello'
slot update_slots: id 0 | task 0 | prompt token 5: 151337 '<|assistant|>'

Sorry for the slight off-topic, but this is really interesting to read! Can you please teach me how to inspect what the actual template that was used looks like? Is this llama-server's output with some additional switches? There have been many LLM releases with broken or incorrect templates, so being able to inspect which tokens the LLM actually ended up processing, as opposed to looking at the static template, would be a really useful skill. Thanks!

Edit: Found it myself! For those interested, add the --verbose flag to llama-server.

@Mushoz

Mushoz commented Apr 25, 2025

One more thing: I also found this output while investigating the template issue and this proposed fix. I am not sure whether this means the tokenizer has a bug that needs fixing, but I just wanted to put this out there in case nobody else has noticed it so far:

llama.cpp-1  | init_tokenizer: initializing tokenizer for type 2
llama.cpp-1  | load: control token: 151342 '<|end_of_video|>' is not marked as EOG
llama.cpp-1  | load: control token: 151341 '<|begin_of_video|>' is not marked as EOG
llama.cpp-1  | load: control token: 151338 '<|observation|>' is not marked as EOG
llama.cpp-1  | load: control token: 151333 '<sop>' is not marked as EOG
llama.cpp-1  | load: control token: 151331 '[gMASK]' is not marked as EOG
llama.cpp-1  | load: control token: 151330 '[MASK]' is not marked as EOG
llama.cpp-1  | load: control token: 151337 '<|assistant|>' is not marked as EOG
llama.cpp-1  | load: control token: 151332 '[sMASK]' is not marked as EOG
llama.cpp-1  | load: control token: 151334 '<eop>' is not marked as EOG
llama.cpp-1  | load: control token: 151335 '<|system|>' is not marked as EOG
llama.cpp-1  | load: control token: 151336 '<|user|>' is not marked as EOG
llama.cpp-1  | load: control token: 151340 '<|end_of_image|>' is not marked as EOG
llama.cpp-1  | load: control token: 151339 '<|begin_of_image|>' is not marked as EOG
llama.cpp-1  | load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect

@kallewoof
Contributor

kallewoof commented Apr 25, 2025

Edit: no, this is chat completion vs completion stuff. For chat completion your way probably works fine. For completion, I need to send [gMASK]<sop> from the client side, yes?

Yes, the change affects the chat completion api. (open webUI and similar frontends) In the completion mode you have to begin the prompt with [gMASK]<sop> as specified in the official template.

That works. The one thing I would suggest is to enable add_bos_token and to remove the BOS token from the chat template. We now have:

  1. Models that do not take a BOS token at the start.
  2. Models that take a BOS token at the start, which is added by llama.cpp during chat completion and during completion.
  3. Models that take a BOS token at the start, which is added by llama.cpp during chat completion but not during completion.

The alternative would be to never include the BOS token for completion, but there's the backwards compatibility issue there...

Edit: another alternative would be to add an assume_bos_start flag which would put a BOS token at the start if none is found. That would address the compatibility issues.

@matteoserva
Contributor Author

That works. The one thing I would suggest is to enable add_bos_token and to remove the BOS token from the chat template.

Yes, this is the ideal outcome. The problem is that we don't decide the chat template; it's provided by the company that releases the model. There are discussions here: #13021 (comment) and here: THUDM/GLM-Edge#2

@Mushoz I noticed that as well but ignored it as outside the scope of this PR. If needed, I'll open another PR.

@kallewoof
Contributor

kallewoof commented Apr 25, 2025

Yes, this is the ideal outcome. The problem is that we don't decide the chat template; it's provided by the company that releases the model. There are discussions here: #13021 (comment) and here: THUDM/GLM-Edge#2

Thanks for the context. Yes, I understand you don't have control over the chat template -- my suggestion is to modify it to handle this case, either for GLM-4 specifically or for any case where the chat template starts with the BOS token explicitly. I don't see a reason why llama.cpp should juggle both cases when a one-time snip can be done trivially.

@diaopal

diaopal commented Apr 25, 2025

I agree. There shouldn't be a need to change the BOS token just to make the Jinja template (--jinja) work. If anything, the fact that the Jinja template doesn't just work suggests that there's a bug that's been overlooked, patched over with hacky solutions like changing the BOS token.

The bos token cannot be a prefix of the jinja template, that's why I changed it to eos as was done in similar models.

Why can't the BOS token serve as the prefix? At that point, you're not even using the original Jinja template if it's being silently modified on the fly like that. Since add_bos_token isn't enabled for this model, there wouldn't be any issues anyway.

Also, I realized this PR makes it behave as if you were specifying --chat-template chatglm4 in llama-server.

@matteoserva
Contributor Author

I agree. There shouldn't be a need to change the BOS token just to make the Jinja template (--jinja) work. If anything, the fact that the Jinja template doesn't just work suggests that there's a bug that's been overlooked, patched over with hacky solutions like changing the BOS token.

I think it's the opposite?

The previous value of the bos token was a hack introduced in #13021 to fix a crash. It fixed the issue but it introduced new bugs.

I found out that the problems were not caused by the bos token but by the wrong chat template being loaded. So this PR removes the bos token hack and fixes the chat template loading. This should more closely match the official config: https://huggingface.co/THUDM/GLM-4-32B-0414/blob/main/tokenizer_config.json
which has no bos token and add_bos_token=false.
In GLM4, <|endoftext|> is a padding token, so it should be appropriate to put it in place of the bos token.

@moo-v

moo-v commented Apr 25, 2025

Is their template even correct? chat_template.jinja has newlines before the roles, and the chat_template in tokenizer_config.json doesn't.

@matteoserva
Contributor Author

From my understanding, chat_template.jinja has extra formatting to make it human-readable, but it's ignored by the code.

The template in tokenizer_config.json is the correct one. The only problem I see is that it's missing the final newline when add_generation_prompt=true, so it's instead inserted at the beginning of each assistant response, breaking multi-turn conversations.
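
This is easy to check against the upstream tokenizer. A small sketch (the model id is the official THUDM repo linked above; any GLM4-0414 checkpoint with the same tokenizer_config.json should render identically):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("THUDM/GLM-4-32B-0414")
msgs = [{"role": "user", "content": "hi"}]
rendered = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
print(repr(rendered))
# Expected per the tokenizer_config.json template: '[gMASK]<sop><|user|>\nhi<|assistant|>'
# with no trailing '\n' after the final <|assistant|>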

@@ -5153,7 +5153,7 @@ def set_vocab(self):
         special_vocab._set_special_token("eos", tokenizer.get_added_vocab()["<|endoftext|>"])
         special_vocab._set_special_token("eot", tokenizer.get_added_vocab()["<|user|>"])
         special_vocab._set_special_token("unk", tokenizer.get_added_vocab()["<|endoftext|>"])
-        special_vocab._set_special_token("bos", tokenizer.get_added_vocab()["[gMASK]"])
+        special_vocab._set_special_token("bos", tokenizer.get_added_vocab()["<|endoftext|>"])
Collaborator

Instead of changing BOS, I think what should be done is to set add_add_bos(False) to prevent the tokenize function from adding BOS to the sentence.

@@ -155,7 +158,7 @@ llm_chat_template llm_chat_detect_template(const std::string & tmpl) {
     } else if (tmpl_contains("[gMASK]sop")) {
         // chatglm3-6b
         return LLM_CHAT_TEMPLATE_CHATGML_3;
-    } else if (tmpl_contains("[gMASK]<sop>")) {
+    } else if (tmpl_contains("[gMASK]<sop>")) { /* old GLM4 models */
Collaborator

It's better to simply move this code block above the else if (tmpl_contains("<|assistant|>") && tmpl_contains("<|user|>")) check, so the GLM condition has higher priority.

@Mushoz

Mushoz commented Apr 27, 2025

Why was this closed? Is this no longer needed?

@matteoserva
Contributor Author

matteoserva commented Apr 27, 2025

The PR is still needed. I ran the wrong git command and broke my repo. I reopened the PR here: #13140

Labels
python python script changes

7 participants