Can llama.cpp/convert.py support tokenizers other than 'spm', 'bpe', 'hfft'? #6690

Closed
@woodx9

Description

I am trying to convert deepseek-ai/deepseek-coder-1.3b-base using llama.cpp/convert.py.

Command

python llama.cpp/convert.py codes-hf \
  --outfile codes-1b.gguf \
  --outtype q8_0

Output:

Loading model file codes-hf/pytorch_model.bin
params = Params(n_vocab=32256, n_embd=2048, n_layer=24, n_ctx=16384, n_ff=5504, n_head=16, n_head_kv=16, n_experts=None, n_experts_used=None, f_norm_eps=1e-06, rope_scaling_type=<RopeScalingType.LINEAR: 'linear'>, f_rope_freq_base=100000, f_rope_scale=4.0, n_orig_ctx=None, rope_finetuned=None, ftype=<GGMLFileType.MostlyQ8_0: 7>, path_model=PosixPath('codes-hf'))
Traceback (most recent call last):
File "/home/woodx/Workspace/llamacpp/llama.cpp/convert.py", line 1548, in <module>
main()
File "/home/woodx/Workspace/llamacpp/llama.cpp/convert.py", line 1515, in main
vocab, special_vocab = vocab_factory.load_vocab(vocab_types, model_parent_path)
File "/home/woodx/Workspace/llamacpp/llama.cpp/convert.py", line 1417, in load_vocab
vocab = self._create_vocab_by_path(vocab_types)
File "/home/woodx/Workspace/llamacpp/llama.cpp/convert.py", line 1407, in _create_vocab_by_path
raise FileNotFoundError(f"Could not find a tokenizer matching any of {vocab_types}")
FileNotFoundError: Could not find a tokenizer matching any of ['spm', 'hfft']

The model's tokenizer config has "tokenizer_class": "LlamaTokenizerFast". Is there a way to support it?
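For context, the lookup that fails here can be sketched roughly like this. This is a simplified reconstruction, not convert.py's actual code; the marker-file mapping below is an assumption for illustration only (each vocab type detected by one file in the model directory):

```python
import json
import tempfile
from pathlib import Path

# Assumed marker files for each vocab type (illustrative, not authoritative).
VOCAB_FILES = {
    "spm": "tokenizer.model",  # SentencePiece model file
    "bpe": "vocab.json",       # GPT-2 style BPE vocabulary
    "hfft": "tokenizer.json",  # HuggingFace fast-tokenizer file
}

def create_vocab_by_path(model_dir: Path, vocab_types: list[str]) -> str:
    """Return the first vocab type whose marker file exists in model_dir."""
    for vtype in vocab_types:
        if (model_dir / VOCAB_FILES[vtype]).exists():
            return vtype
    # Mirrors the error message seen in the traceback above.
    raise FileNotFoundError(
        f"Could not find a tokenizer matching any of {vocab_types}")

# A LlamaTokenizerFast checkpoint normally ships a tokenizer.json, so once
# that file is present the default ['spm', 'hfft'] search matches 'hfft'.
with tempfile.TemporaryDirectory() as d:
    model_dir = Path(d)
    (model_dir / "tokenizer.json").write_text(json.dumps({"version": "1.0"}))
    print(create_vocab_by_path(model_dir, ["spm", "hfft"]))  # prints "hfft"
```

Under this reading, the error suggests none of the expected tokenizer files were found next to pytorch_model.bin, so checking that the model directory actually contains the tokenizer files from the Hugging Face repo may be a first step.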
