Eval bug: medgemma causes a "Jinja Exception" #21879

@chhil

Description

Name and Version

version: 8783 (e21cdc1)
built with AppleClang 15.0.0.15000309 for Darwin arm64

Operating systems

Mac

GGML backends

Metal

Hardware

Mac with Apple M4 Pro

Models

medgemma-27b-it-GGUF

Problem description & steps to reproduce

Setup: opencode --> llama-swap --> llama.cpp

Starting a session and sending a simple "hello" produces the errors below.
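For reference, a minimal direct request against the server port from the logs (bypassing opencode and llama-swap) should hit the same template check. This is a sketch: the payload with two consecutive user turns is an assumption about the kind of conversation shape that trips the alternation guard, not a capture of what opencode actually sends.

```python
# Hypothetical direct reproduction against llama-server on the port
# shown in the log (http://0.0.0.0:5804). Any messages array whose
# roles do not strictly alternate user/assistant should return the
# same 500 "Conversation roles must alternate" error.
import json
import urllib.error
import urllib.request

payload = {
    # The model field is mainly used by llama-swap for routing;
    # llama-server itself serves the single loaded model.
    "model": "medgemma-27b-it",
    "messages": [
        {"role": "user", "content": "hello"},
        {"role": "user", "content": "are you still there?"},
    ],
}

req = urllib.request.Request(
    "http://127.0.0.1:5804/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

try:
    with urllib.request.urlopen(req) as resp:
        print(resp.read().decode())
except urllib.error.HTTPError as e:
    # Expecting HTTP 500 with the Jinja alternation error in the body.
    print(e.code, e.read().decode())
```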

Logs:

The first warning below, I believe, relates to the recently added handling of </s> tokens:

W load: control-looking token:    212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
W srv    operator(): got exception: {"error":{"code":500,"message":"\n------------\nWhile executing CallExpression at line 19, column 27 in source:\n...% 2 == 0) -%}↵        {{ raise_exception(\"Conversation roles must alternate user...\n                                           ^\nError: Jinja Exception: Conversation roles must alternate user/assistant/user/assistant/...","type":"server_error"}}
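The failing check is the role-alternation guard in the model's Jinja chat template (the raise_exception call at line 19, column 27 in the error). A minimal sketch reproducing the same exception with Python's jinja2; the template body is paraphrased from the fragment visible in the error message, not copied from the GGUF:

```python
# Reproduce the alternation guard outside llama.cpp with jinja2.
from jinja2 import Environment

def raise_exception(message):
    raise RuntimeError("Jinja Exception: " + message)

env = Environment()
env.globals["raise_exception"] = raise_exception

# Paraphrase of the guard seen in the error output: even-indexed turns
# must have role "user", odd-indexed turns must not.
template = env.from_string(
    "{%- for message in messages -%}"
    "{%- if (message['role'] == 'user') != (loop.index0 % 2 == 0) -%}"
    "{{ raise_exception('Conversation roles must alternate "
    "user/assistant/user/assistant/...') }}"
    "{%- endif -%}"
    "{%- endfor -%}"
)

# Two consecutive user turns (or a role the template does not expect,
# e.g. "tool") trip the guard.
template.render(messages=[
    {"role": "user", "content": "hello"},
    {"role": "user", "content": "are you there?"},
])
```

llama.cpp renders the template with its own engine (minja) rather than jinja2, but the failing condition is the same.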

First Bad Commit

No response

Relevant log output

Logs
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.014 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 55662.79 MB
I build_info: b8783-e21cdc11a
I system_info: n_threads = 8 (n_threads_batch = 8) / 14 | MTL : EMBED_LIBRARY = 1 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | DOTPROD = 1 | LLAMAFILE = 1 | ACCELERATE = 1 | REPACK = 1 | 
I Running without SSL
I init: using 13 threads for HTTP server
I Web UI is disabled
I start: binding port with default address family
I main: loading model
I srv    load_model: loading model '/Users/chillum/Library/Caches/llama.cpp/medgemma-27b-it-UD-Q8_K_XL.gguf'
I common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
I llama_params_fit_impl: projected to use 41716 MiB of device memory vs. 53083 MiB of free device memory
I llama_params_fit_impl: will leave 11366 >= 1024 MiB of free device memory, no changes needed
I llama_params_fit: successfully fit params to free device memory
I llama_params_fit: fitting params to free memory took 0.15 seconds
I llama_model_load_from_file_impl: using device MTL0 (Apple M4 Pro) (unknown id) - 53083 MiB free
I llama_model_loader: loaded meta data with 50 key-value pairs and 808 tensors from /Users/chillum/Library/Caches/llama.cpp/medgemma-27b-it-UD-Q8_K_XL.gguf (version GGUF V3 (latest))
I llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
I llama_model_loader: - kv   0:                       general.architecture str              = gemma3
I llama_model_loader: - kv   1:                               general.type str              = model
I llama_model_loader: - kv   2:                               general.name str              = Medgemma-27B-It
I llama_model_loader: - kv   3:                           general.finetune str              = it
I llama_model_loader: - kv   4:                           general.basename str              = Medgemma-27B-It
I llama_model_loader: - kv   5:                       general.quantized_by str              = Unsloth
I llama_model_loader: - kv   6:                         general.size_label str              = 27B
I llama_model_loader: - kv   7:                            general.license str              = other
I llama_model_loader: - kv   8:                       general.license.name str              = health-ai-developer-foundations
I llama_model_loader: - kv   9:                       general.license.link str              = https://developers.google.com/health-...
I llama_model_loader: - kv  10:                           general.repo_url str              = https://huggingface.co/unsloth
I llama_model_loader: - kv  11:                   general.base_model.count u32              = 1
I llama_model_loader: - kv  12:                  general.base_model.0.name str              = Medgemma 27b It
I llama_model_loader: - kv  13:          general.base_model.0.organization str              = Google
I llama_model_loader: - kv  14:              general.base_model.0.repo_url str              = https://huggingface.co/google/medgemm...
I llama_model_loader: - kv  15:                               general.tags arr[str,3]       = ["medical", "unsloth - x-ray - pathol...
I llama_model_loader: - kv  16:                          general.languages arr[str,1]       = ["en"]
I llama_model_loader: - kv  17:                      gemma3.context_length u32              = 131072
I llama_model_loader: - kv  18:                    gemma3.embedding_length u32              = 5376
I llama_model_loader: - kv  19:                         gemma3.block_count u32              = 62
I llama_model_loader: - kv  20:                 gemma3.feed_forward_length u32              = 21504
I llama_model_loader: - kv  21:                gemma3.attention.head_count u32              = 32
I llama_model_loader: - kv  22:    gemma3.attention.layer_norm_rms_epsilon f32              = 0.000001
I llama_model_loader: - kv  23:                gemma3.attention.key_length u32              = 128
I llama_model_loader: - kv  24:              gemma3.attention.value_length u32              = 128
I llama_model_loader: - kv  25:                      gemma3.rope.freq_base f32              = 1000000.000000
I llama_model_loader: - kv  26:            gemma3.attention.sliding_window u32              = 1024
I llama_model_loader: - kv  27:             gemma3.attention.head_count_kv u32              = 16
I llama_model_loader: - kv  28:                   gemma3.rope.scaling.type str              = linear
I llama_model_loader: - kv  29:                 gemma3.rope.scaling.factor f32              = 8.000000
I llama_model_loader: - kv  30:                       tokenizer.ggml.model str              = llama
I llama_model_loader: - kv  31:                         tokenizer.ggml.pre str              = default
I llama_model_loader: - kv  32:                      tokenizer.ggml.tokens arr[str,262208]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
I llama_model_loader: - kv  33:                      tokenizer.ggml.scores arr[f32,262208]  = [-1000.000000, -1000.000000, -1000.00...
I llama_model_loader: - kv  34:                  tokenizer.ggml.token_type arr[i32,262208]  = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
I llama_model_loader: - kv  35:                tokenizer.ggml.bos_token_id u32              = 2
I llama_model_loader: - kv  36:                tokenizer.ggml.eos_token_id u32              = 106
I llama_model_loader: - kv  37:            tokenizer.ggml.unknown_token_id u32              = 3
I llama_model_loader: - kv  38:            tokenizer.ggml.padding_token_id u32              = 0
I llama_model_loader: - kv  39:               tokenizer.ggml.add_bos_token bool             = true
I llama_model_loader: - kv  40:               tokenizer.ggml.add_sep_token bool             = false
I llama_model_loader: - kv  41:               tokenizer.ggml.add_eos_token bool             = false
I llama_model_loader: - kv  42:                    tokenizer.chat_template str              = {{ bos_token }}\n{%- if messages[0]['r...
I llama_model_loader: - kv  43:            tokenizer.ggml.add_space_prefix bool             = false
I llama_model_loader: - kv  44:               general.quantization_version u32              = 2
I llama_model_loader: - kv  45:                          general.file_type u32              = 7
I llama_model_loader: - kv  46:                      quantize.imatrix.file str              = medgemma-27b-it-GGUF/imatrix_unsloth.dat
I llama_model_loader: - kv  47:                   quantize.imatrix.dataset str              = unsloth_calibration_medgemma-27b-it.txt
I llama_model_loader: - kv  48:             quantize.imatrix.entries_count u32              = 434
I llama_model_loader: - kv  49:              quantize.imatrix.chunks_count u32              = 663
I llama_model_loader: - type  f32:  373 tensors
I llama_model_loader: - type  f16:   26 tensors
I llama_model_loader: - type q8_0:  409 tensors
I print_info: file format = GGUF V3 (latest)
I print_info: file type   = Q8_0
I print_info: file size   = 29.62 GiB (9.42 BPW) 
I load: 6242 unused tokens
W load: control-looking token:    212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
I load: printing all EOG tokens:
I load:   - 1 ('<eos>')
I load:   - 106 ('<end_of_turn>')
I load:   - 212 ('</s>')
I load: special tokens cache size = 6415
I load: token to piece cache size = 1.9446 MB
I print_info: arch                  = gemma3
I print_info: vocab_only            = 0
I print_info: no_alloc              = 0
I print_info: n_ctx_train           = 131072
I print_info: n_embd                = 5376
I print_info: n_embd_inp            = 5376
I print_info: n_layer               = 62
I print_info: n_head                = 32
I print_info: n_head_kv             = 16
I print_info: n_rot                 = 128
I print_info: n_swa                 = 1024
I print_info: is_swa_any            = 1
I print_info: n_embd_head_k         = 128
I print_info: n_embd_head_v         = 128
I print_info: n_gqa                 = 2
I print_info: n_embd_k_gqa          = 2048
I print_info: n_embd_v_gqa          = 2048
I print_info: f_norm_eps            = 0.0e+00
I print_info: f_norm_rms_eps        = 1.0e-06
I print_info: f_clamp_kqv           = 0.0e+00
I print_info: f_max_alibi_bias      = 0.0e+00
I print_info: f_logit_scale         = 0.0e+00
I print_info: f_attn_scale          = 7.7e-02
I print_info: n_ff                  = 21504
I print_info: n_expert              = 0
I print_info: n_expert_used         = 0
I print_info: n_expert_groups       = 0
I print_info: n_group_used          = 0
I print_info: causal attn           = 1
I print_info: pooling type          = -1
I print_info: rope type             = 2
I print_info: rope scaling          = linear
I print_info: freq_base_train       = 1000000.0
I print_info: freq_scale_train      = 0.125
I print_info: freq_base_swa         = 10000.0
I print_info: freq_scale_swa        = 1
I print_info: n_embd_head_k_swa     = 128
I print_info: n_embd_head_v_swa     = 128
I print_info: n_rot_swa             = 128
I print_info: n_ctx_orig_yarn       = 131072
I print_info: rope_yarn_log_mul     = 0.0000
I print_info: rope_finetuned        = unknown
I print_info: model type            = 27B
I print_info: model params          = 27.01 B
I print_info: general.name          = Medgemma-27B-It
I print_info: vocab type            = SPM
I print_info: n_vocab               = 262208
I print_info: n_merges              = 0
I print_info: BOS token             = 2 '<bos>'
I print_info: EOS token             = 106 '<end_of_turn>'
I print_info: EOT token             = 106 '<end_of_turn>'
I print_info: UNK token             = 3 '<unk>'
I print_info: PAD token             = 0 '<pad>'
I print_info: LF token              = 248 '<0x0A>'
I print_info: EOG token             = 1 '<eos>'
I print_info: EOG token             = 106 '<end_of_turn>'
I print_info: EOG token             = 212 '</s>'
I print_info: max token length      = 48
I load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
I load_tensors: offloading output layer to GPU
I load_tensors: offloading 61 repeating layers to GPU
I load_tensors: offloaded 63/63 layers to GPU
I load_tensors:   CPU_Mapped model buffer size =  2688.66 MiB
I load_tensors:  MTL0_Mapped model buffer size = 30330.15 MiB
......................................................................................
I common_init_result: added <eos> logit bias = -inf
I common_init_result: added <end_of_turn> logit bias = -inf
I common_init_result: added </s> logit bias = -inf
I llama_context: constructing llama_context
I llama_context: n_seq_max     = 1
I llama_context: n_ctx         = 131072
I llama_context: n_ctx_seq     = 131072
I llama_context: n_batch       = 4096
I llama_context: n_ubatch      = 512
I llama_context: causal_attn   = 1
I llama_context: flash_attn    = enabled
I llama_context: kv_unified    = false
I llama_context: freq_base     = 1000000.0
I llama_context: freq_scale    = 0.125
I ggml_metal_init: allocating
I ggml_metal_init: found device: Apple M4 Pro
I ggml_metal_init: picking default device: Apple M4 Pro
I ggml_metal_init: use fusion         = true
I ggml_metal_init: use concurrency    = true
I ggml_metal_init: use graph optimize = true
I llama_context:        CPU  output buffer size =     1.00 MiB
I llama_kv_cache_iswa: creating non-SWA KV cache, size = 131072 cells
I llama_kv_cache:       MTL0 KV buffer size = 10240.00 MiB
I llama_kv_cache: size = 10240.00 MiB (131072 cells,  10 layers,  1/1 seqs), K (f16): 5120.00 MiB, V (f16): 5120.00 MiB
I llama_kv_cache: attn_rot_k = 0, n_embd_head_k_all = 128
I llama_kv_cache: attn_rot_v = 0, n_embd_head_k_all = 128
I llama_kv_cache_iswa: creating     SWA KV cache, size = 1536 cells
I llama_kv_cache:       MTL0 KV buffer size =   624.00 MiB
I llama_kv_cache: size =  624.00 MiB (  1536 cells,  52 layers,  1/1 seqs), K (f16):  312.00 MiB, V (f16):  312.00 MiB
I llama_kv_cache: attn_rot_k = 0, n_embd_head_k_all = 128
I llama_kv_cache: attn_rot_v = 0, n_embd_head_k_all = 128
I sched_reserve: reserving ...
I sched_reserve: resolving fused Gated Delta Net support:
I sched_reserve: fused Gated Delta Net (autoregressive) enabled
I sched_reserve: fused Gated Delta Net (chunked) enabled
I sched_reserve:       MTL0 compute buffer size =   522.62 MiB
I sched_reserve:        CPU compute buffer size =   280.02 MiB
I sched_reserve: graph nodes  = 2489
I sched_reserve: graph splits = 2
I sched_reserve: reserve took 5.82 ms, sched copies = 1
W common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
I srv    load_model: initializing slots, n_slots = 1
W no implementations specified for speculative decoding
I slot   load_model: id  0 | task -1 | speculative decoding context not initialized
I slot   load_model: id  0 | task -1 | new slot, n_ctx = 131072
W srv    load_model: prompt cache is enabled, size limit: 8192 MiB
W srv    load_model: use `--cache-ram 0` to disable the prompt cache
W srv    load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
W srv          init: init: --clear-idle requires --kv-unified, disabling
I init: chat template, example_format: '<start_of_turn>user
You are a helpful assistant

Hello<end_of_turn>
<start_of_turn>model
Hi there<end_of_turn>
<start_of_turn>user
How are you?<end_of_turn>
<start_of_turn>model
'
I srv          init: init: chat template, thinking = 0
I main: model loaded
I main: server is listening on http://0.0.0.0:5804
I main: starting the main loop...
I srv  update_slots: all slots are idle
W srv    operator(): got exception: {"error":{"code":500,"message":"\n------------\nWhile executing CallExpression at line 19, column 27 in source:\n...% 2 == 0) -%}↵        {{ raise_exception(\"Conversation roles must alternate user...\n                                           ^\nError: Jinja Exception: Conversation roles must alternate user/assistant/user/assistant/...","type":"server_error"}}
I srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 500
W srv    operator(): got exception: {"error":{"code":500,"message":"\n------------\nWhile executing CallExpression at line 19, column 27 in source:\n...% 2 == 0) -%}↵        {{ raise_exception(\"Conversation roles must alternate user...\n                                           ^\nError: Jinja Exception: Conversation roles must alternate user/assistant/user/assistant/...","type":"server_error"}}
I srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 500
W srv    operator(): got exception: {"error":{"code":500,"message":"\n------------\nWhile executing CallExpression at line 19, column 27 in source:\n...% 2 == 0) -%}↵        {{ raise_exception(\"Conversation roles must alternate user...\n                                           ^\nError: Jinja Exception: Conversation roles must alternate user/assistant/user/assistant/...","type":"server_error"}}
I srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 500
W srv    operator(): got exception: {"error":{"code":500,"message":"\n------------\nWhile executing CallExpression at line 19, column 27 in source:\n...% 2 == 0) -%}↵        {{ raise_exception(\"Conversation roles must alternate user...\n                                           ^\nError: Jinja Exception: Conversation roles must alternate user/assistant/user/assistant/...","type":"server_error"}}
I srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 500
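Not a fix for the server-side behavior, but a possible client-side workaround while this is investigated, assuming the failures come from consecutive same-role turns: merge adjacent messages with the same role before they reach llama.cpp. A sketch (hypothetical helper, not part of any of the tools above):

```python
# Hypothetical preprocessing step: collapse consecutive messages that
# share a role so the rendered conversation alternates again. This only
# helps if the failure is caused by repeated roles, not by roles the
# template does not recognize at all (e.g. "tool").
def merge_consecutive_roles(messages):
    merged = []
    for msg in messages:
        if merged and merged[-1]["role"] == msg["role"]:
            # Join same-role turns into one message.
            merged[-1]["content"] += "\n\n" + msg["content"]
        else:
            merged.append(dict(msg))
    return merged

print(merge_consecutive_roles([
    {"role": "user", "content": "hello"},
    {"role": "user", "content": "are you there?"},
    {"role": "assistant", "content": "Hi!"},
]))
```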
