fix wrong template in GLM4-0414 #13099
Conversation
Perhaps the CHATGML constants should be renamed CHATGLM?
I don't want to touch code that is outside of this PR.
+1 - see SillyTavern/SillyTavern#3906
Why did you change the BOS token?
Agree, the BOS token should not change. Also, I am not seeing the BOS token added when using the model, as you would expect when comparing to e.g. AutoTokenizer:

vs transformers:
>>> tokenizer.apply_chat_template([{"role":"user","content":"hi"},{"role":"assistant","content":"ye?"}], tokenize=False)
'[gMASK]<sop><|user|>\nhi<|assistant|>\nye?'

Should
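A quick way to double-check this on the transformers side (a sketch only; the model id is taken from the tokenizer_config.json linked later in the thread, and the output is whatever your local tokenizer produces):

```python
from transformers import AutoTokenizer

# Local sanity check (hypothetical): see what transformers considers the BOS token
# and whether [gMASK]<sop> comes from the chat template itself rather than from BOS.
tok = AutoTokenizer.from_pretrained("THUDM/GLM-4-32B-0414")

print("bos_token:", tok.bos_token, "bos_token_id:", tok.bos_token_id)

ids = tok.apply_chat_template([{"role": "user", "content": "hi"}], tokenize=True)
print(tok.convert_ids_to_tokens(ids))  # lists the special tokens the template inserts
```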
@kallewoof You have to regenerate the GGUF with the patched Python script and also run it with the patched llama-server.

Correct behavior after the patch and after regenerating the GGUF:

Before the patch, without --jinja:

Before the patch, with --jinja:
Aha, thanks. I did indeed regenerate the GGUF and did not get the expected results. I am using llama-server though, and connecting to it from a different front end. The prompt being sent is

but from what you are saying it looks like I should not include

Edit: no, this is chat completion vs. completion stuff. For chat completion your way probably works fine. For completion, I need to send
Yes, the change affects the chat completion API (Open WebUI and similar frontends).
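To make the distinction concrete, here is a rough sketch of the two request styles against llama-server (the host/port, prompt string, and sampling fields are placeholders, not values from this PR):

```python
import requests

BASE = "http://localhost:8080"  # assumed llama-server address

# Chat completion: the server renders the chat template itself,
# so the client only sends role/content messages.
chat = requests.post(f"{BASE}/v1/chat/completions", json={
    "messages": [{"role": "user", "content": "hi"}],
    "max_tokens": 32,
}).json()

# Raw completion: the client owns the full prompt, including any
# [gMASK]<sop> preamble it wants the model to see.
raw = requests.post(f"{BASE}/completion", json={
    "prompt": "[gMASK]<sop><|user|>\nhi<|assistant|>\n",
    "n_predict": 32,
}).json()

print(chat, raw)
```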
Sorry for the slight off-topic, but this is really interesting to read! Can you please teach me how to inspect what the actual template that was used looks like? Is this llama-server's output with some additional switches? There have been many LLM releases with broken or incorrect templates, so being able to inspect what tokens the LLM actually ended up processing, as opposed to looking at the static template, would be a really useful skill to have. Thanks!

Edit: Found it myself! For those interested, add
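Another way to see which template a given GGUF actually carries (independent of llama-server switches) is to read the tokenizer.chat_template metadata with gguf-py. A sketch, with a made-up file path; the byte decoding follows what I believe gguf's own dump script does:

```python
from gguf import GGUFReader  # pip install gguf

reader = GGUFReader("GLM-4-32B-0414.gguf")  # hypothetical path

field = reader.fields.get("tokenizer.chat_template")
if field is None:
    print("no chat template embedded in this GGUF")
else:
    # For string fields, the last part holds the raw UTF-8 bytes.
    print(bytes(field.parts[-1]).decode("utf-8"))
```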
One more thing: I also found this output while investigating the template issue and this proposed fix. I am not sure if this means the tokenizer has a bug that needs fixing, but I just wanted to put it out there in case nobody else has noticed it so far:
That works. The one thing I would suggest is to enable

The alternative would be to never include the BOS token for completion, but there's the backwards-compatibility issue there...

Edit: another alternative would be to add a
Yes, this is the ideal outcome. The problem is that we don't decide the chat template; it's provided by the company that releases the model. There are discussions here: #13021 (comment) and here: THUDM/GLM-Edge#2

@Mushoz I noticed that as well but ignored it as outside the scope of this PR. If needed, I'll open another PR.
Thanks for the context. Yes, I understand you don't have control over the chat template; my suggestion is to modify it to handle this case, either for GLM-4 specifically or for any case where the chat template starts with the BOS token explicitly. I don't see a reason why llama.cpp should juggle both cases when a one-time snip can be done trivially.
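For illustration, the "snip" could also be thought of as a dedup rule rather than hand-editing each template; a hypothetical sketch (names invented, not llama.cpp code):

```python
def should_auto_add_bos(rendered_prompt: str, bos_token: str | None) -> bool:
    """Hypothetical rule: if the chat template already emitted the BOS string
    at the start of the rendered prompt, skip auto-adding it."""
    if not bos_token:
        return False
    return not rendered_prompt.startswith(bos_token)


# With the values discussed in this thread:
print(should_auto_add_bos("[gMASK]<sop><|user|>\nhi", "[gMASK]"))  # False -> don't add BOS again
```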
I agree. There shouldn't be a need to change the BOS token just to make the Jinja template (
Why can't the BOS token serve as the prefix? At that point, you're not even using the original Jinja template if it's being silently modified on the fly like that. Since

Also, I realized this PR makes it as if you were specifying
I think it's the opposite? The previous value of the BOS token was a hack introduced in #13021 to fix a crash. It fixed the issue, but it introduced new bugs. I found out that the problems were not caused by the BOS token but by the wrong chat template being loaded. So this PR removes the BOS token hack and fixes the chat template loading. This should more closely match the official config: https://huggingface.co/THUDM/GLM-4-32B-0414/blob/main/tokenizer_config.json
Is their template even correct?
From my understanding, the chat_template.jinja has extra formatting to make it human-readable, but it's ignored by the code. The template in tokenizer_config.json is the correct one. The only problem I see is that it's missing the final newline when add_generation_prompt=true, so it's instead inserted at the beginning of each assistant response, breaking multi-turn conversations.
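This is easy to check locally with the HF tokenizer (a sketch; the model id comes from the config linked above, and the output depends on the template version you have cached):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("THUDM/GLM-4-32B-0414")

# With add_generation_prompt=True, check whether the rendered prompt ends with
# '<|assistant|>' or '<|assistant|>\n'; per the comment above, the
# tokenizer_config.json template reportedly omits the final newline.
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "hi"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(repr(prompt))
```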
convert_hf_to_gguf.py (outdated)
@@ -5153,7 +5153,7 @@ def set_vocab(self):
         special_vocab._set_special_token("eos", tokenizer.get_added_vocab()["<|endoftext|>"])
         special_vocab._set_special_token("eot", tokenizer.get_added_vocab()["<|user|>"])
         special_vocab._set_special_token("unk", tokenizer.get_added_vocab()["<|endoftext|>"])
-        special_vocab._set_special_token("bos", tokenizer.get_added_vocab()["[gMASK]"])
+        special_vocab._set_special_token("bos", tokenizer.get_added_vocab()["<|endoftext|>"])
Instead of changing BOS, I think what should be done is to set add_add_bos(False) to prevent the tokenize function from adding BOS to the sentence.
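In convert_hf_to_gguf.py terms, that might look roughly like the following (a sketch only, not the actual patch; `add_add_bos_token` is the GGUFWriter method I believe corresponds to "add_add_bos", and `special_vocab`/`self.gguf_writer` refer to the names visible in the diff above):

```python
# Sketch of the suggestion: keep [gMASK] as BOS,
# but mark the model so runtimes do not prepend BOS automatically.
special_vocab._set_special_token("bos", tokenizer.get_added_vocab()["[gMASK]"])

# Assumed GGUFWriter helper writing tokenizer.ggml.add_bos_token = false;
# if SpecialVocab already emits this key from tokenizer_config.json,
# it would have to be set there instead to avoid writing it twice.
self.gguf_writer.add_add_bos_token(False)
```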
src/llama-chat.cpp (outdated)
@@ -155,7 +158,7 @@ llm_chat_template llm_chat_detect_template(const std::string & tmpl) {
     } else if (tmpl_contains("[gMASK]sop")) {
         // chatglm3-6b
         return LLM_CHAT_TEMPLATE_CHATGML_3;
-    } else if (tmpl_contains("[gMASK]<sop>")) {
+    } else if (tmpl_contains("[gMASK]<sop>")) { /* old GLM4 models */
It's better to simply move this code block above the `else if (tmpl_contains("<|assistant|>") && tmpl_contains("<|user|>"))` check, so the condition for GLM will have higher priority.
Why was this closed? Is this no longer needed?
The PR is still needed. I ran the wrong git command and broke my repo. I reopened the PR here: #13140
GLM4-0414 models were using the wrong legacy template, leading to a missing [gMASK]<sop> preamble. The old code was returning LLM_CHAT_TEMPLATE_GLMEDGE.

As a workaround, you needed to launch llama-server with --chat-template chatglm4.

After the patch, the GGUF should be regenerated or edited manually.

EDIT: Added workaround explanation.