ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.014 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name: MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9 (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4 (5002)
ggml_metal_device_init: simdgroup reduction = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory = true
ggml_metal_device_init: has bfloat = true
ggml_metal_device_init: has tensor = false
ggml_metal_device_init: use residency sets = true
ggml_metal_device_init: use shared buffers = true
ggml_metal_device_init: recommendedMaxWorkingSetSize = 55662.79 MB
I build_info: b8783-e21cdc11a
I system_info: n_threads = 8 (n_threads_batch = 8) / 14 | MTL : EMBED_LIBRARY = 1 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | DOTPROD = 1 | LLAMAFILE = 1 | ACCELERATE = 1 | REPACK = 1 |
I Running without SSL
I init: using 13 threads for HTTP server
I Web UI is disabled
I start: binding port with default address family
I main: loading model
I srv load_model: loading model '/Users/chillum/Library/Caches/llama.cpp/medgemma-27b-it-UD-Q8_K_XL.gguf'
I common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
I llama_params_fit_impl: projected to use 41716 MiB of device memory vs. 53083 MiB of free device memory
I llama_params_fit_impl: will leave 11366 >= 1024 MiB of free device memory, no changes needed
I llama_params_fit: successfully fit params to free device memory
I llama_params_fit: fitting params to free memory took 0.15 seconds
I llama_model_load_from_file_impl: using device MTL0 (Apple M4 Pro) (unknown id) - 53083 MiB free
I llama_model_loader: loaded meta data with 50 key-value pairs and 808 tensors from /Users/chillum/Library/Caches/llama.cpp/medgemma-27b-it-UD-Q8_K_XL.gguf (version GGUF V3 (latest))
I llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
I llama_model_loader: - kv 0: general.architecture str = gemma3
I llama_model_loader: - kv 1: general.type str = model
I llama_model_loader: - kv 2: general.name str = Medgemma-27B-It
I llama_model_loader: - kv 3: general.finetune str = it
I llama_model_loader: - kv 4: general.basename str = Medgemma-27B-It
I llama_model_loader: - kv 5: general.quantized_by str = Unsloth
I llama_model_loader: - kv 6: general.size_label str = 27B
I llama_model_loader: - kv 7: general.license str = other
I llama_model_loader: - kv 8: general.license.name str = health-ai-developer-foundations
I llama_model_loader: - kv 9: general.license.link str = https://developers.google.com/health-...
I llama_model_loader: - kv 10: general.repo_url str = https://huggingface.co/unsloth
I llama_model_loader: - kv 11: general.base_model.count u32 = 1
I llama_model_loader: - kv 12: general.base_model.0.name str = Medgemma 27b It
I llama_model_loader: - kv 13: general.base_model.0.organization str = Google
I llama_model_loader: - kv 14: general.base_model.0.repo_url str = https://huggingface.co/google/medgemm...
I llama_model_loader: - kv 15: general.tags arr[str,3] = ["medical", "unsloth - x-ray - pathol...
I llama_model_loader: - kv 16: general.languages arr[str,1] = ["en"]
I llama_model_loader: - kv 17: gemma3.context_length u32 = 131072
I llama_model_loader: - kv 18: gemma3.embedding_length u32 = 5376
I llama_model_loader: - kv 19: gemma3.block_count u32 = 62
I llama_model_loader: - kv 20: gemma3.feed_forward_length u32 = 21504
I llama_model_loader: - kv 21: gemma3.attention.head_count u32 = 32
I llama_model_loader: - kv 22: gemma3.attention.layer_norm_rms_epsilon f32 = 0.000001
I llama_model_loader: - kv 23: gemma3.attention.key_length u32 = 128
I llama_model_loader: - kv 24: gemma3.attention.value_length u32 = 128
I llama_model_loader: - kv 25: gemma3.rope.freq_base f32 = 1000000.000000
I llama_model_loader: - kv 26: gemma3.attention.sliding_window u32 = 1024
I llama_model_loader: - kv 27: gemma3.attention.head_count_kv u32 = 16
I llama_model_loader: - kv 28: gemma3.rope.scaling.type str = linear
I llama_model_loader: - kv 29: gemma3.rope.scaling.factor f32 = 8.000000
I llama_model_loader: - kv 30: tokenizer.ggml.model str = llama
I llama_model_loader: - kv 31: tokenizer.ggml.pre str = default
I llama_model_loader: - kv 32: tokenizer.ggml.tokens arr[str,262208] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
I llama_model_loader: - kv 33: tokenizer.ggml.scores arr[f32,262208] = [-1000.000000, -1000.000000, -1000.00...
I llama_model_loader: - kv 34: tokenizer.ggml.token_type arr[i32,262208] = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
I llama_model_loader: - kv 35: tokenizer.ggml.bos_token_id u32 = 2
I llama_model_loader: - kv 36: tokenizer.ggml.eos_token_id u32 = 106
I llama_model_loader: - kv 37: tokenizer.ggml.unknown_token_id u32 = 3
I llama_model_loader: - kv 38: tokenizer.ggml.padding_token_id u32 = 0
I llama_model_loader: - kv 39: tokenizer.ggml.add_bos_token bool = true
I llama_model_loader: - kv 40: tokenizer.ggml.add_sep_token bool = false
I llama_model_loader: - kv 41: tokenizer.ggml.add_eos_token bool = false
I llama_model_loader: - kv 42: tokenizer.chat_template str = {{ bos_token }}\n{%- if messages[0]['r...
I llama_model_loader: - kv 43: tokenizer.ggml.add_space_prefix bool = false
I llama_model_loader: - kv 44: general.quantization_version u32 = 2
I llama_model_loader: - kv 45: general.file_type u32 = 7
I llama_model_loader: - kv 46: quantize.imatrix.file str = medgemma-27b-it-GGUF/imatrix_unsloth.dat
I llama_model_loader: - kv 47: quantize.imatrix.dataset str = unsloth_calibration_medgemma-27b-it.txt
I llama_model_loader: - kv 48: quantize.imatrix.entries_count u32 = 434
I llama_model_loader: - kv 49: quantize.imatrix.chunks_count u32 = 663
I llama_model_loader: - type f32: 373 tensors
I llama_model_loader: - type f16: 26 tensors
I llama_model_loader: - type q8_0: 409 tensors
I print_info: file format = GGUF V3 (latest)
I print_info: file type = Q8_0
I print_info: file size = 29.62 GiB (9.42 BPW)
I load: 6242 unused tokens
W load: control-looking token: 212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
I load: printing all EOG tokens:
I load: - 1 ('<eos>')
I load: - 106 ('<end_of_turn>')
I load: - 212 ('</s>')
I load: special tokens cache size = 6415
I load: token to piece cache size = 1.9446 MB
I print_info: arch = gemma3
I print_info: vocab_only = 0
I print_info: no_alloc = 0
I print_info: n_ctx_train = 131072
I print_info: n_embd = 5376
I print_info: n_embd_inp = 5376
I print_info: n_layer = 62
I print_info: n_head = 32
I print_info: n_head_kv = 16
I print_info: n_rot = 128
I print_info: n_swa = 1024
I print_info: is_swa_any = 1
I print_info: n_embd_head_k = 128
I print_info: n_embd_head_v = 128
I print_info: n_gqa = 2
I print_info: n_embd_k_gqa = 2048
I print_info: n_embd_v_gqa = 2048
I print_info: f_norm_eps = 0.0e+00
I print_info: f_norm_rms_eps = 1.0e-06
I print_info: f_clamp_kqv = 0.0e+00
I print_info: f_max_alibi_bias = 0.0e+00
I print_info: f_logit_scale = 0.0e+00
I print_info: f_attn_scale = 7.7e-02
I print_info: n_ff = 21504
I print_info: n_expert = 0
I print_info: n_expert_used = 0
I print_info: n_expert_groups = 0
I print_info: n_group_used = 0
I print_info: causal attn = 1
I print_info: pooling type = -1
I print_info: rope type = 2
I print_info: rope scaling = linear
I print_info: freq_base_train = 1000000.0
I print_info: freq_scale_train = 0.125
I print_info: freq_base_swa = 10000.0
I print_info: freq_scale_swa = 1
I print_info: n_embd_head_k_swa = 128
I print_info: n_embd_head_v_swa = 128
I print_info: n_rot_swa = 128
I print_info: n_ctx_orig_yarn = 131072
I print_info: rope_yarn_log_mul = 0.0000
I print_info: rope_finetuned = unknown
I print_info: model type = 27B
I print_info: model params = 27.01 B
I print_info: general.name = Medgemma-27B-It
I print_info: vocab type = SPM
I print_info: n_vocab = 262208
I print_info: n_merges = 0
I print_info: BOS token = 2 '<bos>'
I print_info: EOS token = 106 '<end_of_turn>'
I print_info: EOT token = 106 '<end_of_turn>'
I print_info: UNK token = 3 '<unk>'
I print_info: PAD token = 0 '<pad>'
I print_info: LF token = 248 '<0x0A>'
I print_info: EOG token = 1 '<eos>'
I print_info: EOG token = 106 '<end_of_turn>'
I print_info: EOG token = 212 '</s>'
I print_info: max token length = 48
I load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
I load_tensors: offloading output layer to GPU
I load_tensors: offloading 61 repeating layers to GPU
I load_tensors: offloaded 63/63 layers to GPU
I load_tensors: CPU_Mapped model buffer size = 2688.66 MiB
I load_tensors: MTL0_Mapped model buffer size = 30330.15 MiB
......................................................................................
I common_init_result: added <eos> logit bias = -inf
I common_init_result: added <end_of_turn> logit bias = -inf
I common_init_result: added </s> logit bias = -inf
I llama_context: constructing llama_context
I llama_context: n_seq_max = 1
I llama_context: n_ctx = 131072
I llama_context: n_ctx_seq = 131072
I llama_context: n_batch = 4096
I llama_context: n_ubatch = 512
I llama_context: causal_attn = 1
I llama_context: flash_attn = enabled
I llama_context: kv_unified = false
I llama_context: freq_base = 1000000.0
I llama_context: freq_scale = 0.125
I ggml_metal_init: allocating
I ggml_metal_init: found device: Apple M4 Pro
I ggml_metal_init: picking default device: Apple M4 Pro
I ggml_metal_init: use fusion = true
I ggml_metal_init: use concurrency = true
I ggml_metal_init: use graph optimize = true
I llama_context: CPU output buffer size = 1.00 MiB
I llama_kv_cache_iswa: creating non-SWA KV cache, size = 131072 cells
I llama_kv_cache: MTL0 KV buffer size = 10240.00 MiB
I llama_kv_cache: size = 10240.00 MiB (131072 cells, 10 layers, 1/1 seqs), K (f16): 5120.00 MiB, V (f16): 5120.00 MiB
I llama_kv_cache: attn_rot_k = 0, n_embd_head_k_all = 128
I llama_kv_cache: attn_rot_v = 0, n_embd_head_k_all = 128
I llama_kv_cache_iswa: creating SWA KV cache, size = 1536 cells
I llama_kv_cache: MTL0 KV buffer size = 624.00 MiB
I llama_kv_cache: size = 624.00 MiB ( 1536 cells, 52 layers, 1/1 seqs), K (f16): 312.00 MiB, V (f16): 312.00 MiB
I llama_kv_cache: attn_rot_k = 0, n_embd_head_k_all = 128
I llama_kv_cache: attn_rot_v = 0, n_embd_head_k_all = 128
I sched_reserve: reserving ...
I sched_reserve: resolving fused Gated Delta Net support:
I sched_reserve: fused Gated Delta Net (autoregressive) enabled
I sched_reserve: fused Gated Delta Net (chunked) enabled
I sched_reserve: MTL0 compute buffer size = 522.62 MiB
I sched_reserve: CPU compute buffer size = 280.02 MiB
I sched_reserve: graph nodes = 2489
I sched_reserve: graph splits = 2
I sched_reserve: reserve took 5.82 ms, sched copies = 1
W common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
I srv load_model: initializing slots, n_slots = 1
W no implementations specified for speculative decoding
I slot load_model: id 0 | task -1 | speculative decoding context not initialized
I slot load_model: id 0 | task -1 | new slot, n_ctx = 131072
W srv load_model: prompt cache is enabled, size limit: 8192 MiB
W srv load_model: use `--cache-ram 0` to disable the prompt cache
W srv load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
W srv init: init: --clear-idle requires --kv-unified, disabling
I init: chat template, example_format: '<start_of_turn>user
You are a helpful assistant
Hello<end_of_turn>
<start_of_turn>model
Hi there<end_of_turn>
<start_of_turn>user
How are you?<end_of_turn>
<start_of_turn>model
'
I srv init: init: chat template, thinking = 0
I main: model loaded
I main: server is listening on http://0.0.0.0:5804
I main: starting the main loop...
I srv update_slots: all slots are idle
W srv operator(): got exception: {"error":{"code":500,"message":"\n------------\nWhile executing CallExpression at line 19, column 27 in source:\n...% 2 == 0) -%}↵ {{ raise_exception(\"Conversation roles must alternate user...\n ^\nError: Jinja Exception: Conversation roles must alternate user/assistant/user/assistant/...","type":"server_error"}}
I srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 500
[the same exception and 500 response repeat verbatim for three more identical requests]
Name and Version
version: 8783 (e21cdc1)
built with AppleClang 15.0.0.15000309 for Darwin arm64
Operating systems
Mac
GGML backends
Metal
Hardware
Mac with Apple M4 Pro
Models
medgemma-27b-it-GGUF (medgemma-27b-it-UD-Q8_K_XL.gguf, quantized by Unsloth)
Problem description & steps to reproduce
Setup: opencode → llama-swap → llama.cpp

Starting the chain up and sending a simple "hello" through opencode immediately returns the 500 Jinja exception shown at the end of the log above ("Conversation roles must alternate user/assistant/user/assistant/...").
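For what it's worth, here is a minimal request sketch that should trip the same template check. The exact message sequence opencode sends is not captured in the log, so the non-alternating role sequence below is an assumption about the failure mode; the port 5804 is taken from the log above.

```python
import json
import urllib.error
import urllib.request

# Hypothetical payload: a second consecutive "user" turn breaks the
# template's user/assistant alternation and should yield the 500 above.
payload = {
    "messages": [
        {"role": "user", "content": "hello"},
        {"role": "user", "content": "hello again"},
    ],
}

req = urllib.request.Request(
    "http://127.0.0.1:5804/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

try:
    urllib.request.urlopen(req)
except urllib.error.HTTPError as e:
    # Prints 500 plus the Jinja "Conversation roles must alternate..." body.
    print(e.code, e.read().decode("utf-8"))
```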
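For context, the exception comes from the alternation check inside the model's Jinja chat template (the `(loop.index0 % 2 == 0)` expression is visible in the error excerpt). A rough Python equivalent of that check, assuming the standard Gemma template structure where a leading system message is hoisted before the loop:

```python
# Hypothetical rendering of the check performed by the chat template
# (the raise_exception at line 19 of the template source in the error above).
def check_alternation(messages: list[dict]) -> None:
    # A leading system message is handled separately by the template.
    if messages and messages[0]["role"] == "system":
        messages = messages[1:]
    for i, msg in enumerate(messages):
        # Even positions must be user turns, odd positions assistant turns.
        if (msg["role"] == "user") != (i % 2 == 0):
            raise ValueError(
                "Conversation roles must alternate user/assistant/user/assistant/..."
            )

# Alternating roles pass:
check_alternation([
    {"role": "user", "content": "hello"},
    {"role": "assistant", "content": "hi"},
])

# Two consecutive user turns raise, matching the server's 500:
try:
    check_alternation([
        {"role": "user", "content": "hello"},
        {"role": "user", "content": "are you there?"},
    ])
except ValueError as e:
    print(e)
```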
This one I believe is related to some recent handling of `</s>` (note the `control-looking token: 212 '</s>' was not control-type` warning in the log above).

First Bad Commit
No response
Relevant log output
See the full log output at the top of this report.