Description
Command-R support was recently merged here: #6033
This issue is also discussed here, where I initially thought it might be a bug on the HF implementation side: https://huggingface.co/CohereForAI/c4ai-command-r-v01/discussions/27
The model uses BPE; however, the tokenization is not exactly the same as the HF implementation's. I don't think it has any major impact on output quality, but it does lead to the two implementations disagreeing slightly on the top logits in some of my tests.
To test Command-R tokens, we can use this with the HF model:
```python
#!/usr/bin/env python3
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "CohereForAI/c4ai-command-r-v01"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

test_string = """
This is a sentence.

### Sentence
"""

print(tokenizer.encode(test_string))
# -> [5, 206, 4184, 1801, 1671, 27281, 21, 206, 206, 2680, 195143, 206]
```
Llama.cpp comparison. (I hacked `tokenize` to read the string from a file whose path is given as argv[2], instead of tokenizing argv[2] itself... do any of the CLI tools print the tokens without having to do that? A possible workaround is sketched after the output below.)
```
$ tokenize ~/text-generation-webui/models/commandr_dev_f16.gguf tokens_test2
<omitted output until token list>
5 -> ''
206 -> '
'
4184 -> 'This'
1801 -> ' is'
1671 -> ' a'
27281 -> ' sentence'
21 -> '.'
2126 -> '
'
2680 -> '###'
195143 -> ' Sentence'
206 -> '
'
```
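For what it's worth, an alternative that avoids patching the C++ tool might be the llama-cpp-python bindings, which can print the token IDs directly. A minimal sketch, assuming llama-cpp-python is installed and that its `Llama.tokenize()` takes `add_bos`/`special` flags (the GGUF path is the same local file as above):

```python
# Sketch only: assumes llama-cpp-python is installed and that vocab_only /
# tokenize(add_bos=..., special=...) behave as described.
from llama_cpp import Llama

llm = Llama(
    model_path="commandr_dev_f16.gguf",  # adjust to your local GGUF path
    vocab_only=True,                     # only the vocab is needed for tokenizing
)

test_string = "\nThis is a sentence.\n\n### Sentence\n"
ids = llm.tokenize(test_string.encode("utf-8"), add_bos=True, special=True)
print(ids)
```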
To put the token lists side by side for readability:
```
HF:        [5, 206, 4184, 1801, 1671, 27281, 21, 206, 206, 2680, 195143, 206]
llama.cpp: [5, 206, 4184, 1801, 1671, 27281, 21, 2126,     2680, 195143, 206]
```
The part that's different is two 206s vs. one 2126 (206 = '\n', 2126 = '\n\n').
As far as I can tell, both implementations always decode back to the original string exactly.
Still, the tokenizers don't behave exactly the same: llama.cpp is more eager to emit 2126 for \n\n than the HF version is.
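A quick way to sanity-check both of these claims (what 206 and 2126 decode to, and that the two ID lists decode to the same text) is to reuse the HF tokenizer from the script above; a minimal sketch using the token lists from this issue:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01", trust_remote_code=True)

print(repr(tok.decode([206])))   # expect '\n'
print(repr(tok.decode([2126])))  # expect '\n\n'

hf_ids  = [5, 206, 4184, 1801, 1671, 27281, 21, 206, 206, 2680, 195143, 206]
cpp_ids = [5, 206, 4184, 1801, 1671, 27281, 21, 2126, 2680, 195143, 206]

# Both lists should decode to exactly the same text (BOS included in both).
print(tok.decode(hf_ids) == tok.decode(cpp_ids))  # expect True
```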
I verified with Cohere that their implementation is correct (https://huggingface.co/CohereForAI/c4ai-command-r-v01/discussions/27); I initially thought llama.cpp was correct and theirs was buggy.
The model might be slightly smarter if we match the tokenization, since that would match how the model was trained. From empirical testing I really don't think this impacts output quality in any material way, but it can influence the ordering of the top tokens a bit, which can be noticeable. In a llama.cpp vs. HF comparison on a test prompt of about 2200 tokens, where 7 tokens diverged (all of them two 206s vs. one 2126), I saw the top logits reorder themselves. Maybe with particular kinds of prompts the tokenization divergence would be much greater and the output much more different.
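If anyone wants to quantify this on their own prompts, here is a rough sketch (not the exact script I used for the numbers above, just an illustration using Python's difflib) for finding the spans where two token ID lists disagree:

```python
import difflib

def diverging_spans(a_ids, b_ids):
    """Return (a_chunk, b_chunk) pairs for the spans where the two ID lists disagree."""
    matcher = difflib.SequenceMatcher(a=a_ids, b=b_ids, autojunk=False)
    return [
        (a_ids[i1:i2], b_ids[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

# With the lists from this issue, the only divergence is [206, 206] vs [2126]:
hf_ids  = [5, 206, 4184, 1801, 1671, 27281, 21, 206, 206, 2680, 195143, 206]
cpp_ids = [5, 206, 4184, 1801, 1671, 27281, 21, 2126, 2680, 195143, 206]
print(diverging_spans(hf_ids, cpp_ids))  # -> [([206, 206], [2126])]
```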
I'm offering to investigate and do a PR, with an ETA of some time next week when I can invest more time. As of opening this issue, I haven't read the tokenization code on either the HF or the llama.cpp side yet.