Name and Version
llama-server
version: 8792 (f4b5bf2)
built with GNU 14.2.0 for Linux x86_64
NVIDIA-SMI version: 595.58.03
NVML version: 595.58
DRIVER version: 595.58.03
CUDA Version: 13.2
Operating systems
Linux
GGML backends
CUDA
Hardware
NVIDIA RTX PRO 6000 Blackwell Workstation Edition
AMD EPYC 7543P 32-Core Processor
Models
Qwen3.5-397B-A17B-UD-Q5_K_XL (https://huggingface.co/unsloth/Qwen3.5-397B-A17B-GGUF)
Gemma-4-31B-it-BF16 (https://huggingface.co/bartowski/google_gemma-4-31B-it-GGUF)
MiniMax-M2.7-Q8_0 (https://huggingface.co/bartowski/MiniMaxAI_MiniMax-M2.7-GGUF)
Problem description & steps to reproduce
llama-server crashes or hangs when loading or serving some recent models.
The Linux kernel logs AMD IOMMU I/O page faults (IO_PAGE_FAULT) followed by a GPU MMU fault (Xid 31).
Git bisect
Bisected between commit 85d482e (good) and commit f4b5bf2 (bad).
d6f3030 is the first bad commit.
At each git bisect step llama-cpp was re-built with:
git clean -xdfq
cmake -B build -DGGML_ZENDNN=ON -DGGML_CUDA=ON
cmake --build build --config Release -j 32
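The bisect loop above can be sketched as a `git bisect run` script. This is a hypothetical automation of the manual steps described; `run_repro.sh` is an assumed helper (not part of this report) that starts llama-server, sends one request, and returns non-zero on a crash or hang:

```shell
# Automate the bisect between the good/bad commits named above
# (assumes a llama.cpp checkout on a commit range containing both).
git bisect start f4b5bf2 85d482e
git bisect run bash -c '
  git clean -xdfq
  cmake -B build -DGGML_ZENDNN=ON -DGGML_CUDA=ON  || exit 125
  cmake --build build --config Release -j 32      || exit 125
  # run_repro.sh (hypothetical): launch llama-server, issue a single
  # request, exit non-zero if the server aborts or exceeds a timeout.
  # Exit code 125 tells git bisect to skip commits that fail to build.
  ./run_repro.sh
'
```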
Linux kernel (dmesg) logs are the same (see below) for all these models, while llama-server behaves differently depending on the model:
Qwen3.5-397B
Crashes on first request.
Gemma-4-31B
Crashes on server start before first request.
MiniMax-M2.7
llama-server hangs on the first request without logging any errors; dmesg nevertheless shows the same IO_PAGE_FAULT entries.
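For reference, a minimal first request that triggers the failure in each case can be sent with curl against the server's OpenAI-compatible endpoint. The host and port below are taken from the server commands in the logs; the payload is an assumed minimal example, not the exact request used:

```shell
# Hypothetical minimal repro request; any first completion request
# is enough to trigger the crash/hang described above.
curl http://192.168.88.249:14128/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hi"}],"max_tokens":16}'
```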
First Bad Commit
d6f3030 is the first bad commit
Relevant log output
Qwen3.5-397B
llama-server --host 192.168.88.249 --port 14128 --threads 8 --threads-http 2 --ctx-size 131072 --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.00 --batch-size 2048 --ubatch-size 2048 -fa off --no-mmap --model /srv/d/.cache/hf/models--unsloth--Qwen3.5-397B-A17B-GGUF/snapshots/da33c16fa4440f831149fcf53b98a22bc07785e5/UD-Q5_K_XL/Qwen3.5-397B-A17B-UD-Q5_K_XL-00001-of-00008.gguf
(...)
srv update_slots: all slots are idle
srv params_from_: Chat format: peg-native
slot get_availabl: id 3 | task -1 | selected slot by LRU, t_last = -1
srv get_availabl: updating prompt cache
srv load: - looking for better prompt, base f_keep = -1.000, sim = 0.000
srv update: - cache state: 0 prompts, 0.000 MiB (limits: 8192.000 MiB, 131072 tokens, 8589934592 est)
srv get_availabl: prompt cache update took 0.01 ms
slot launch_slot_: id 3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id 3 | task 0 | processing task, is_child = 0
slot update_slots: id 3 | task 0 | new prompt, n_ctx_slot = 131072, n_keep = 0, task.n_tokens = 1524
slot update_slots: id 3 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 1520, batch.n_tokens = 1520, progress = 0.997375
/srv/d/p/llm/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:97: CUDA error
CUDA error: an illegal memory access was encountered
current device: 0, in function ggml_backend_cuda_synchronize at /srv/d/p/llm/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:3005
cudaStreamSynchronize(cuda_ctx->stream())
/srv/d/p/llm/llama.cpp/build/bin/libggml-base.so.0(+0x172a5) [0x7fe132c8b2a5]
/srv/d/p/llm/llama.cpp/build/bin/libggml-base.so.0(ggml_print_backtrace+0x1df) [0x7fe132c8b66f]
/srv/d/p/llm/llama.cpp/build/bin/libggml-base.so.0(ggml_abort+0x11e) [0x7fe132c8b7fe]
/srv/d/p/llm/llama.cpp/build/bin/libggml-cuda.so.0(+0x1b34f3) [0x7fe12f5b34f3]
/srv/d/p/llm/llama.cpp/build/bin/libggml-cuda.so.0(+0x1b46d0) [0x7fe12f5b46d0]
/srv/d/p/llm/llama.cpp/build/bin/libggml-base.so.0(ggml_backend_sched_graph_compute_async+0x815) [0x7fe132ca7bb5]
/srv/d/p/llm/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context13graph_computeEP11ggml_cgraphb+0xa1) [0x7fe1322afad1]
/srv/d/p/llm/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context14process_ubatchERK12llama_ubatch14llm_graph_typeP22llama_memory_context_iR11ggml_status+0xe4) [0x7fe1322b2084]
/srv/d/p/llm/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context6decodeERK11llama_batch+0x34f) [0x7fe1322b7c8f]
/srv/d/p/llm/llama.cpp/build/bin/libllama.so.0(llama_decode+0xb) [0x7fe1322b96bb]
llama-server(+0x160af3) [0x558c4c542af3]
llama-server(+0x1e5a2f) [0x558c4c5c7a2f]
llama-server(+0xba4aa) [0x558c4c49c4aa]
/lib/x86_64-linux-gnu/libc.so.6(+0x29ca8) [0x7fe131c35ca8]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7fe131c35d65]
llama-server(+0xc14f1) [0x558c4c4a34f1]
LIBXSMM_VERSION: feature_print_bw-1.17-3780 (25693892)
LIBXSMM_TARGET: hsw [AMD EPYC 7543P 32-Core Processor]
Registry and code: 13 MB
Command: llama-server --host 192.168.88.249 --port 14128 --threads 8 --threads-http 2 --ctx-size 131072 --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.00 --batch-size 2048 --ubatch-size 2048 -fa off --no-mmap --model /srv/d/.cache/hf/models--unsloth--Qwen3.5-397B-A17B-GGUF/snapshots/da33c16fa4440f831149fcf53b98a22bc07785e5/UD-Q5_K_XL/Qwen3.5-397B-A17B-UD-Q5_K_XL-00001-of-00008.gguf
Uptime: 288.167928 s
Aborted
Gemma-4-31B
llama-server -v --device CUDA0 --host 192.168.88.249 --port 14128 --threads 8 --threads-http 2 --ctx-size 131072 --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.00 --batch-size 2048 --ubatch-size 2048 -fa off --no-mmap --model /srv/d/.cache/llama.cpp/models--bartowski--google_gemma-4-31B-it-GGUF/snapshots/a83d9d3efca681556dbd277fc50fde765041a3b0/google_gemma-4-31B-it-bf16/google_gemma-4-31B-it-bf16-00001-of-00002.gguf
double free or corruption (!prev)
double free or corruption (!prev)
LIBXSMM_VERSION: feature_print_bw-1.17-3780 (25693892)double free or corruption (!prev)
Aborted
dmesg
[ 7460.483909] amd_iommu_report_page_fault: 6 callbacks suppressed
[ 7460.483919] nvidia 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001c address=0xd0826240 flags=0x0020]
[ 7460.485503] nvidia 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001c address=0x2bee0314000 flags=0x0020]
[ 7460.486330] nvidia 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001c address=0x2bee0310000 flags=0x0020]
[ 7460.487121] nvidia 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001c address=0x2bee0314c00 flags=0x0020]
[ 7460.487883] nvidia 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001c address=0x2bee0311000 flags=0x0020]
[ 7460.488624] nvidia 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001c address=0x2bee0315000 flags=0x0020]
[ 7460.489330] nvidia 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001c address=0x2bee0316000 flags=0x0020]
[ 7460.490004] nvidia 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001c address=0x2bee0312000 flags=0x0020]
[ 7460.490676] nvidia 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001c address=0x2bee0315e00 flags=0x0020]
[ 7460.491352] nvidia 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001c address=0x2bee0312100 flags=0x0020]
[ 7474.898124] amd_iommu_report_page_fault: 77 callbacks suppressed
[ 7474.898133] nvidia 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001c address=0x2bee0314000 flags=0x0020]
[ 7474.899573] nvidia 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001c address=0xd0826240 flags=0x0020]
[ 7474.900302] nvidia 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001c address=0x2bee0310000 flags=0x0020]
[ 7474.901005] nvidia 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001c address=0x2bee0311100 flags=0x0020]
[ 7474.901705] nvidia 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001c address=0x2bee0315100 flags=0x0020]
[ 7474.902399] nvidia 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001c address=0x2bee0314400 flags=0x0020]
[ 7474.903082] nvidia 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001c address=0x2bee0310200 flags=0x0020]
[ 7474.903873] nvidia 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001c address=0x2bee0311c00 flags=0x0020]
[ 7474.904546] nvidia 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001c address=0x2bee0315000 flags=0x0020]
[ 7474.905221] nvidia 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001c address=0x2bee0316100 flags=0x0020]
[ 7474.908632] AMD-Vi: IOMMU Event log restarting
[ 7475.278724] amd_iommu_report_page_fault: 2564 callbacks suppressed
[ 7475.278730] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000b address=0xf4826240 flags=0x0020]
[ 7475.279811] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000b address=0x7fbe0014000 flags=0x0020]
[ 7475.280652] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000b address=0x7fbe0010000 flags=0x0020]
[ 7475.281418] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000b address=0x7fbe0015000 flags=0x0020]
[ 7475.282116] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000b address=0x7fbe0011100 flags=0x0020]
[ 7475.282814] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000b address=0x7fbe0010d00 flags=0x0020]
[ 7475.283502] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000b address=0x7fbe0016100 flags=0x0020]
[ 7475.284175] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000b address=0x7fbe0015600 flags=0x0020]
[ 7475.284841] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000b address=0x7fbe0012100 flags=0x0020]
[ 7475.285503] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000b address=0x7fbe0011600 flags=0x0020]
[ 7475.287354] AMD-Vi: IOMMU Event log restarting
[ 7475.292157] AMD-Vi: IOMMU Event log restarting
[ 7475.293352] AMD-Vi: IOMMU Event log restarting
[ 7475.299240] AMD-Vi: IOMMU Event log restarting
[ 7475.305167] AMD-Vi: IOMMU Event log restarting
[ 7475.353664] NVRM: Xid (PCI:0000:01:00): 31, pid=86045, name=llama-server, channel 0x00000003, intr 00000000. MMU Fault: ENGINE GRAPHICS GPC11 GPCCLIENT_T1_11 faulted @ 0x5fd0_24000000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ