fix wrong template in GLM4-0414 #13099


Closed

Conversation

matteoserva
Contributor

@matteoserva matteoserva commented Apr 24, 2025

GLM4-0414 models were using the wrong legacy template, leading to a missing [gMASK]<sop> preamble.
The old code was returning LLM_CHAT_TEMPLATE_GLMEDGE.

As a workaround, you needed to launch llama-server with --chat-template chatglm4.

After the patch, the GGUF should be regenerated or edited manually.
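
To check whether an existing GGUF still carries the old metadata, one option is to read the stored chat template and BOS token id with the gguf Python package. This is only a rough sketch, not part of the PR: the file path is a placeholder and the field decoding follows GGUFReader's parts/data layout.

import sys
from gguf import GGUFReader

reader = GGUFReader(sys.argv[1])  # path to the GGUF file (placeholder argument)

def read_string_field(key: str) -> str:
    field = reader.get_field(key)
    # String fields keep the value bytes in the part indexed by data[0].
    return field.parts[field.data[0]].tobytes().decode("utf-8")

# The GLM4-0414 template should contain the [gMASK]<sop> preamble.
print(read_string_field("tokenizer.chat_template")[:200])

bos = reader.get_field("tokenizer.ggml.bos_token_id")
print(int(bos.parts[bos.data[0]][0]))  # BOS token id currently stored in the metadata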

EDIT:
Added workaround explanation

@sultanqasim

Perhaps the CHATGML constants should be renamed CHATGLM?

@matteoserva matteoserva marked this pull request as draft April 24, 2025 16:34
@matteoserva
Contributor Author

Perhaps the CHATGML constants should be renamed CHATGLM?

I don't want to touch code that is outside of this PR

@matteoserva matteoserva marked this pull request as ready for review April 24, 2025 17:02
@github-actions github-actions bot added the python python script changes label Apr 24, 2025
@kallewoof
Contributor

+1 - see SillyTavern/SillyTavern#3906

@diaopal

diaopal commented Apr 25, 2025

Why did you change the bos token to <|endoftext|>?

@kallewoof
Contributor

Why did you change the bos token to <|endoftext|>?

Agree, BOS token should not change.

Also I am not seeing the BOS token added when using the model, as you would expect when comparing to e.g. AutoTokenizer:

slot update_slots: id  0 | task 0 | prompt token   0: 151333 '<sop>'
slot update_slots: id  0 | task 0 | prompt token   1:    198 '
'
slot update_slots: id  0 | task 0 | prompt token   2: 151335 '<|system|>'
slot update_slots: id  0 | task 0 | prompt token   3:    198 '
'
slot update_slots: id  0 | task 0 | prompt token   4:   2610 'You'
slot update_slots: id  0 | task 0 | prompt token   5:   2299 ''re'
slot update_slots: id  0 | task 0 | prompt token   6:  47773 ' Coding'
slot update_slots: id  0 | task 0 | prompt token   7:  45950 ' Sense'

vs transformers:

>>> tokenizer.apply_chat_template([{"role":"user","content":"hi"},{"role":"assistant","content":"ye?"}], tokenize=False)
'[gMASK]<sop><|user|>\nhi<|assistant|>\nye?'

Should add_bos_token be set to true? What is the reason for it not being set? Sorry if I missed that discussion.

@matteoserva
Contributor Author

  1. You can either use add_bos_token=true or have the bos token in the template, not both. Otherwise you get duplicate bos tokens.
  2. The jinja template already includes the bos token. For compatibility with the jinja one, the legacy template (LLM_CHAT_TEMPLATE_CHATGML_4) also adds the bos token.
    This is why add_bos_token is false.
  3. The bos token cannot be a prefix of the jinja template; that's why I changed it to eos, as was done in similar models.

@kallewoof You have to regenerate the gguf with the patched python script and also run it with the patched llama-server.

Correct behavior after patch and regenerating the GGUF:

slot update_slots: id  0 | task 0 | prompt token   0: 151331 '[gMASK]'
slot update_slots: id  0 | task 0 | prompt token   1: 151333 '<sop>'
slot update_slots: id  0 | task 0 | prompt token   2: 151336 '<|user|>'
slot update_slots: id  0 | task 0 | prompt token   3:    198 '
'
slot update_slots: id  0 | task 0 | prompt token   4:  14978 'hello'
slot update_slots: id  0 | task 0 | prompt token   5: 151337 '<|assistant|>'

Before patch without --jinja:

slot update_slots: id  0 | task 0 | prompt token   0: 151336 '<|user|>'
slot update_slots: id  0 | task 0 | prompt token   1:    198 '
'
slot update_slots: id  0 | task 0 | prompt token   2:  14978 'hello'
slot update_slots: id  0 | task 0 | prompt token   3: 151337 '<|assistant|>'

Before patch with --jinja:

slot update_slots: id  0 | task 0 | prompt token   0: 151333 '<sop>'
slot update_slots: id  0 | task 0 | prompt token   1: 151336 '<|user|>'
slot update_slots: id  0 | task 0 | prompt token   2:    198 '
'
slot update_slots: id  0 | task 0 | prompt token   3:  14978 'hello'
slot update_slots: id  0 | task 0 | prompt token   4: 151337 '<|assistant|>'

@kallewoof
Contributor

kallewoof commented Apr 25, 2025

  1. You can either use add_bos_token=true or have the bos token in the template, not both. Otherwise you get duplicate bos tokens.
  2. The jinja template already includes the bos token. For compatibility with the jinja one, the legacy template (LLM_CHAT_TEMPLATE_CHATGML_4) also adds the bos token.
    This is why add_bos_token is false.
  3. The bos token cannot be a prefix of the jinja template; that's why I changed it to eos, as was done in similar models.

@kallewoof You have to regenerate the gguf with the patched python script and also run it with the patched llama-server.

Aha, thanks. I did indeed regenerate the GGUF and did not get the expected results. I am using llama-server though, and connecting to it from a different front end. The prompt being sent is

<sop><system>
sysmessage
...

but from what you are saying it looks like I should not include <sop> either. But I'm not sure how this relates to llama-server and requests via the API. I assumed it would work the same, but I don't think it does. I'll double-check, though.

Edit: no, this is chat completion vs completion stuff. For chat completion your way probably works fine. For completion, I need to send [gMASK]<sop> from the client side, yes?

@matteoserva
Contributor Author

Edit: no, this is chat completion vs completion stuff. For chat completion your way probably works fine. For completion, I need to send [gMASK]<sop> from the client side, yes?

Yes, the change affects the chat completion API (Open WebUI and similar frontends).
In completion mode you have to begin the prompt with [gMASK]<sop>, as specified in the official template.
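
For completion mode, a minimal client-side sketch against llama-server's /completion endpoint might look like the following (the host, port, and n_predict value are assumptions, not part of this PR; only the [gMASK]<sop> prefix is the point):

import requests

# Prompt starts with the [gMASK]<sop> preamble, followed by the GLM4 turn format
# shown in the token dumps above.
prompt = "[gMASK]<sop><|user|>\nhello<|assistant|>"
resp = requests.post(
    "http://localhost:8080/completion",         # assumed default llama-server address
    json={"prompt": prompt, "n_predict": 128},  # n_predict caps the generated tokens
)
print(resp.json()["content"])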

@Mushoz

Mushoz commented Apr 25, 2025

slot update_slots: id 0 | task 0 | prompt token 0: 151331 '[gMASK]'
slot update_slots: id 0 | task 0 | prompt token 1: 151333 '<sop>'
slot update_slots: id 0 | task 0 | prompt token 2: 151336 '<|user|>'
slot update_slots: id 0 | task 0 | prompt token 3: 198 '
'
slot update_slots: id 0 | task 0 | prompt token 4: 14978 'hello'
slot update_slots: id 0 | task 0 | prompt token 5: 151337 '<|assistant|>'

Sorry for the slight off-topic, but this is really interesting to read! Can you please teach me how to inspect what the actual template that was used looks like? Is this llama-server's output with some additional switches? There have been many LLM releases with broken or incorrect templates, so being able to inspect which tokens the LLM actually ended up processing, as opposed to looking at the static template, would be a really useful skill. Thanks!

Edit: Found it myself! For those interested, add the --verbose flag to llama-server.

@Mushoz

Mushoz commented Apr 25, 2025

One more thing: I also found this output while investigating the template issue and this proposed fix. I am not sure whether this means the tokenizer has a bug that needs fixing, but I just wanted to put this out there in case nobody else has noticed it so far:

llama.cpp-1  | init_tokenizer: initializing tokenizer for type 2
llama.cpp-1  | load: control token: 151342 '<|end_of_video|>' is not marked as EOG
llama.cpp-1  | load: control token: 151341 '<|begin_of_video|>' is not marked as EOG
llama.cpp-1  | load: control token: 151338 '<|observation|>' is not marked as EOG
llama.cpp-1  | load: control token: 151333 '<sop>' is not marked as EOG
llama.cpp-1  | load: control token: 151331 '[gMASK]' is not marked as EOG
llama.cpp-1  | load: control token: 151330 '[MASK]' is not marked as EOG
llama.cpp-1  | load: control token: 151337 '<|assistant|>' is not marked as EOG
llama.cpp-1  | load: control token: 151332 '[sMASK]' is not marked as EOG
llama.cpp-1  | load: control token: 151334 '<eop>' is not marked as EOG
llama.cpp-1  | load: control token: 151335 '<|system|>' is not marked as EOG
llama.cpp-1  | load: control token: 151336 '<|user|>' is not marked as EOG
llama.cpp-1  | load: control token: 151340 '<|end_of_image|>' is not marked as EOG
llama.cpp-1  | load: control token: 151339 '<|begin_of_image|>' is not marked as EOG
llama.cpp-1  | load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect

@kallewoof
Contributor

kallewoof commented Apr 25, 2025

Edit: no, this is chat completion vs completion stuff. For chat completion your way probably works fine. For completion, I need to send [gMASK]<sop> from the client side, yes?

Yes, the change affects the chat completion api. (open webUI and similar frontends) In the completion mode you have to begin the prompt with [gMASK]<sop> as specified in the official template.

That works. The one thing I would suggest is to enable add_bos_token and to remove the BOS token from the chat template. We now have:

  1. Models that do not take a BOS token at the start.
  2. Models that take a BOS token at the start, which is added by llama.cpp during chat completion and during completion.
  3. Models that take a BOS token at the start, which is added by llama.cpp during chat completion but not during completion.

The alternative would be to never include the BOS token for completion, but there's the backwards compatibility issue there...

Edit: another alternative would be to add an assume_bos_start flag which would put a BOS token at the start if none is found. That would address the compatibility issues.

@matteoserva
Contributor Author

That works. The one thing I would suggest is to enable add_bos_token and to remove the BOS token from the chat template.

Yes, this is the ideal outcome. The problem is that we don't decide the chat template; it's provided by the company that releases the model. There are discussions here: #13021 (comment) and here: THUDM/GLM-Edge#2

@Mushoz I noticed that as well but ignored it as outside the scope of this PR. If needed, I'll open another PR.

@kallewoof
Contributor

kallewoof commented Apr 25, 2025

Yes, this is the ideal outcome. The problem is that we don't decide the chat template; it's provided by the company that releases the model. There are discussions here: #13021 (comment) and here: THUDM/GLM-Edge#2

Thanks for the context. Yes, I understand you don't have control over the chat template -- my suggestion is to modify it to handle this case, either for GLM-4 specifically or for any case where the chat template starts with the BOS token explicitly. I don't see a reason why llama.cpp should juggle both cases when a one-time snip can be done trivially.

@diaopal

diaopal commented Apr 25, 2025

I agree. There shouldn't be a need to change the BOS token just to make the Jinja template (--jinja) work. If anything, the fact that the Jinja template doesn't just work suggests that there's a bug that's been overlooked, patched over with hacky solutions like changing the BOS token.

The bos token cannot be a prefix of the jinja template, that's why I changed it to eos as was done in similar models.

Why can't the BOS token serve as the prefix? At that point, you're not even using the original Jinja template if it's being silently modified on the fly like that. Since add_bos_token isn't enabled for this model, there wouldn't be any issues anyway.

Also, I realized this PR makes it behave as if you were specifying --chat-template chatglm4 in llama-server.

@matteoserva
Contributor Author

I agree. There shouldn't be a need to change the BOS token just to make the Jinja template (--jinja) work. If anything, the fact that the Jinja template doesn't just work suggests that there's a bug that's been overlooked, patched over with hacky solutions like changing the BOS token.

I think it's the opposite?

The previous value of the bos token was a hack introduced in #13021 to fix a crash. It fixed the issue but it introduced new bugs.

I found out that the problems were not caused by the bos token but by the wrong chat template being loaded. So this PR removes the bos token hack and fixes the chat template loading. This should more closely match the official config: https://huggingface.co/THUDM/GLM-4-32B-0414/blob/main/tokenizer_config.json
which has no bos token and add_bos_token=false.
In GLM4, <|endoftext|> is a padding token, so it should be appropriate to put it in place of the bos token.

@moo-v

moo-v commented Apr 25, 2025

Is their template even correct? chat_template.jinja has newlines before the roles, and the chat_template in tokenizer_config.json doesn't.

@matteoserva
Contributor Author

From my understanding, chat_template.jinja has extra formatting to make it human-readable, but it's ignored by the code.

The template in tokenizer_config.json is the correct one. The only problem I see is that it's missing the final newline when add_generation_prompt=true, so it's instead inserted at the beginning of each assistant response, breaking multi-turn conversations.
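
This is easy to check against the upstream tokenizer. A small sketch (the model id is the official THUDM repo linked above; any GLM4-0414 checkpoint with the same tokenizer_config.json should render identically):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("THUDM/GLM-4-32B-0414")
msgs = [{"role": "user", "content": "hi"}]
rendered = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
print(repr(rendered))
# Expected per the tokenizer_config.json template: '[gMASK]<sop><|user|>\nhi<|assistant|>'
# with no trailing '\n' after the final <|assistant|>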

@@ -5153,7 +5153,7 @@ def set_vocab(self):
         special_vocab._set_special_token("eos", tokenizer.get_added_vocab()["<|endoftext|>"])
         special_vocab._set_special_token("eot", tokenizer.get_added_vocab()["<|user|>"])
         special_vocab._set_special_token("unk", tokenizer.get_added_vocab()["<|endoftext|>"])
-        special_vocab._set_special_token("bos", tokenizer.get_added_vocab()["[gMASK]"])
+        special_vocab._set_special_token("bos", tokenizer.get_added_vocab()["<|endoftext|>"])
Collaborator

Instead of changing BOS, I think what should be done is to set add_add_bos(False) to prevent the tokenize function from adding BOS to the sentence.

@@ -155,7 +158,7 @@ llm_chat_template llm_chat_detect_template(const std::string & tmpl) {
     } else if (tmpl_contains("[gMASK]sop")) {
         // chatglm3-6b
         return LLM_CHAT_TEMPLATE_CHATGML_3;
-    } else if (tmpl_contains("[gMASK]<sop>")) {
+    } else if (tmpl_contains("[gMASK]<sop>")) { /* old GLM4 models */
Collaborator

It's better to simply move this code block above the else if (tmpl_contains("<|assistant|>") && tmpl_contains("<|user|>")) check, so the GLM condition has higher priority.

@Mushoz

Mushoz commented Apr 27, 2025

Why was this closed? Is this no longer needed?

@matteoserva
Contributor Author

matteoserva commented Apr 27, 2025

The PR is still needed. I ran the wrong git command and broke my repo. I reopened the PR here: #13140

Labels
python python script changes

7 participants