Title: Fatal crash in ggml_cuda_init: "peer mapping resources exhausted" with 10 GPUs
What happened
llama-server crashes unconditionally on startup, even with --help, before any arguments are parsed:
```
ggml_cuda_init: found 10 CUDA devices (Total VRAM: 229059 MiB):
  Device 0: Tesla P40, compute capability 6.1, VMM: yes, VRAM: 22905 MiB
  ...
CUDA error: peer mapping resources exhausted
  current device: 0, in function ggml_cuda_init at ggml/src/ggml-cuda/ggml-cuda.cu:336
  cudaDeviceEnablePeerAccess(id_other, 0)
Aborted (core dumped)
```
System
- Build: b8783 (e21cdc1), GNU 11.4.0, Linux x86_64
- Previous working build: b8688 (71a81f6)
- GPUs: 10× NVIDIA Tesla P40 (GP102, compute 6.1) connected via ASM2824 PCIe switch on a Gigabyte G431-MM0
- CPU: AMD EPYC 3151 (Zen 1)
- cmake flags:
-DGGML_CUDA=ON -DGGML_CUDA_FORCE_MMQ=ON -DCMAKE_CUDA_ARCHITECTURES=61
Root cause
The peer access loop in ggml_cuda_init (ggml-cuda.cu lines 326–339) iterates over all device pairs. With 10 GPUs that is 10 × 9 = 90 directional cudaDeviceEnablePeerAccess calls. cudaDeviceCanAccessPeer returns true for many pairs, but the CUDA driver has a hard limit on concurrent peer mappings per device (the CUDA Runtime documentation states a system-wide maximum of eight peer connections per device, so 10 GPUs each needing 9 peers cannot all be mapped). When that limit is exceeded, cudaDeviceEnablePeerAccess fails with peer mapping resources exhausted and the CUDA_CHECK() wrapper aborts the process.
```cpp
for (int id = 0; id < info.device_count; ++id) {
    ggml_cuda_set_device(id);
    for (int id_other = 0; id_other < info.device_count; ++id_other) {
        if (id == id_other) {
            continue;
        }
        int can_access_peer;
        CUDA_CHECK(cudaDeviceCanAccessPeer(&can_access_peer, id, id_other));
        if (can_access_peer) {
            CUDA_CHECK(cudaDeviceEnablePeerAccess(id_other, 0)); // fatal here
        }
    }
}
```
Workaround
Replace the fatal CUDA_CHECK with a best-effort call:
```cpp
if (can_access_peer) {
    cudaDeviceEnablePeerAccess(id_other, 0);
    cudaGetLastError(); // clear the error flag; peer access is optional
}
```
This lets startup proceed. P2P mappings that succeed still work; those that fail silently fall back to staged host copies.
Suggested fix
- Make cudaDeviceEnablePeerAccess non-fatal, since peer access is a performance hint, not a correctness requirement
- Optionally add an environment variable (e.g. GGML_CUDA_NO_PEER_ACCESS=1) to skip the loop entirely
Notes
- This likely affects any setup with 8+ GPUs where the driver reports peer capability but cannot map all O(n²) pairs simultaneously. That is common with PLX/ASM2824 switch topologies and potentially some NVLink configurations.
- On PCIe switch topologies, P2P "success" still routes through the host root complex anyway, so the performance benefit is negligible. Failing gracefully has no practical downside.
- GGML_CUDA_NO_PEER_COPY=1 has no effect, as the init code doesn't check any environment variable before the peer access loop.