Description
Command-R support was recently merged here: #6033
This issue is also discussed here, where I initially thought it might be a bug on the HF implementation side: https://huggingface.co/CohereForAI/c4ai-command-r-v01/discussions/27
The model uses BPE; however, the tokenization is not exactly the same as the HF implementation's. I don't think it has any major impact on output quality, but it does lead to the two implementations disagreeing slightly on the top logits in some of my tests.
To test Command-R tokens, we can use this with the HF model:
```python
#!/usr/bin/env python3
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "CohereForAI/c4ai-command-r-v01"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

test_string = """
This is a sentence.

### Sentence
"""

print(tokenizer.encode(test_string))
# -> [5, 206, 4184, 1801, 1671, 27281, 21, 206, 206, 2680, 195143, 206]
```
Llama.cpp comparison. (I hacked `tokenize` to read the string from a file whose path is given as argv[2], instead of tokenizing argv[2] itself... do any of the CLI tools print the tokens without having to do that? A possible workaround is sketched after the output below.)
```
$ tokenize ~/text-generation-webui/models/commandr_dev_f16.gguf tokens_test2
<omitted output until token list>
5 -> ''
206 -> '
'
4184 -> 'This'
1801 -> ' is'
1671 -> ' a'
27281 -> ' sentence'
21 -> '.'
2126 -> '
'
2680 -> '###'
195143 -> ' Sentence'
206 -> '
'
```
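For what it's worth, an alternative that avoids patching the C++ tool might be the llama-cpp-python bindings, which can print the token IDs directly. A minimal sketch, assuming llama-cpp-python is installed and that its `Llama.tokenize()` takes `add_bos`/`special` flags (the GGUF path is the same local file as above):

```python
# Sketch only: assumes llama-cpp-python is installed and that vocab_only /
# tokenize(add_bos=..., special=...) behave as described.
from llama_cpp import Llama

llm = Llama(
    model_path="commandr_dev_f16.gguf",  # adjust to your local GGUF path
    vocab_only=True,                     # only the vocab is needed for tokenizing
)

test_string = "\nThis is a sentence.\n\n### Sentence\n"
ids = llm.tokenize(test_string.encode("utf-8"), add_bos=True, special=True)
print(ids)
```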
To put the token lists side by side for readability:
```
HF:        [5, 206, 4184, 1801, 1671, 27281, 21, 206, 206, 2680, 195143, 206]
llama.cpp: [5, 206, 4184, 1801, 1671, 27281, 21, 2126,     2680, 195143, 206]
```
The part that's different is two 206s vs. one 2126 (206 = '\n', 2126 = '\n\n').
As far as I can tell, both implementations always decode back to the original string exactly.
Still, the tokenizers don't behave exactly the same: llama.cpp is more eager to emit 2126 for \n\n than the HF version is.
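A quick way to sanity-check both of these claims (what 206 and 2126 decode to, and that the two ID lists decode to the same text) is to reuse the HF tokenizer from the script above; a minimal sketch using the token lists from this issue:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01", trust_remote_code=True)

print(repr(tok.decode([206])))   # expect '\n'
print(repr(tok.decode([2126])))  # expect '\n\n'

hf_ids  = [5, 206, 4184, 1801, 1671, 27281, 21, 206, 206, 2680, 195143, 206]
cpp_ids = [5, 206, 4184, 1801, 1671, 27281, 21, 2126, 2680, 195143, 206]

# Both lists should decode to exactly the same text (BOS included in both).
print(tok.decode(hf_ids) == tok.decode(cpp_ids))  # expect True
```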
I verified with Cohere that their implementation is correct (https://huggingface.co/CohereForAI/c4ai-command-r-v01/discussions/27); I initially thought llama.cpp was correct and theirs was buggy.
The model might be slightly smarter if we match the tokenization, since that would match how the model was trained. From empirical testing I really don't think this impacts output quality in any material way, but it can influence the ordering of the top tokens a bit, which can be noticeable. In a llama.cpp vs. HF comparison on a test prompt of about 2200 tokens, where 7 tokens diverged (all of them two 206s vs. one 2126), I saw the top logits reorder themselves. Maybe with particular kinds of prompts the tokenization divergence would be much greater and the output much more different.
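If anyone wants to quantify this on their own prompts, here is a rough sketch (not the exact script I used for the numbers above, just an illustration using Python's difflib) for finding the spans where two token ID lists disagree:

```python
import difflib

def diverging_spans(a_ids, b_ids):
    """Return (a_chunk, b_chunk) pairs for the spans where the two ID lists disagree."""
    matcher = difflib.SequenceMatcher(a=a_ids, b=b_ids, autojunk=False)
    return [
        (a_ids[i1:i2], b_ids[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

# With the lists from this issue, the only divergence is [206, 206] vs [2126]:
hf_ids  = [5, 206, 4184, 1801, 1671, 27281, 21, 206, 206, 2680, 195143, 206]
cpp_ids = [5, 206, 4184, 1801, 1671, 27281, 21, 2126, 2680, 195143, 206]
print(diverging_spans(hf_ids, cpp_ids))  # -> [([206, 206], [2126])]
```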
I'm offering to investigate and do a PR, with an ETA of some time next week when I can invest more time. As of opening this issue, I haven't read the tokenization code on either the HF or the llama.cpp side yet.