Eval bug: Fatal crash in ggml_cuda_init: "peer mapping resources exhausted" with 10 NVIDIA P40 GPUs #21883

@david-bue

Description

What happened

llama-server crashes unconditionally on startup, even when invoked with --help, because CUDA initialization runs before any arguments are parsed:

ggml_cuda_init: found 10 CUDA devices (Total VRAM: 229059 MiB):
  Device 0: Tesla P40, compute capability 6.1, VMM: yes, VRAM: 22905 MiB
  ...
CUDA error: peer mapping resources exhausted
  current device: 0, in function ggml_cuda_init at ggml/src/ggml-cuda/ggml-cuda.cu:336
  cudaDeviceEnablePeerAccess(id_other, 0)
Aborted (core dumped)

System

  • Build: b8783 (e21cdc1), GNU 11.4.0, Linux x86_64
  • Previous working build: b8688 (71a81f6)
  • GPUs: 10× NVIDIA Tesla P40 (GP102, compute 6.1) connected via ASM2824 PCIe switch on a Gigabyte G431-MM0
  • CPU: AMD EPYC 3151 (Zen 1)
  • cmake flags: -DGGML_CUDA=ON -DGGML_CUDA_FORCE_MMQ=ON -DCMAKE_CUDA_ARCHITECTURES=61

Root cause

The peer access loop in ggml_cuda_init (ggml-cuda.cu lines 326–339) iterates over all ordered device pairs. With 10 GPUs that's 90 directional cudaDeviceEnablePeerAccess calls, 9 per device. cudaDeviceCanAccessPeer returns true for many pairs, but the CUDA driver caps the number of concurrent peer mappings per device (the CUDA programming guide documents a system-wide maximum of eight peer connections per device). Once that cap is exceeded, cudaDeviceEnablePeerAccess fails with peer mapping resources exhausted and the CUDA_CHECK() wrapper aborts the process.

for (int id = 0; id < info.device_count; ++id) {
    ggml_cuda_set_device(id);
    for (int id_other = 0; id_other < info.device_count; ++id_other) {
        if (id == id_other) { continue; }
        int can_access_peer;
        CUDA_CHECK(cudaDeviceCanAccessPeer(&can_access_peer, id, id_other));
        if (can_access_peer) {
            CUDA_CHECK(cudaDeviceEnablePeerAccess(id_other, 0)); // fatal here
        }
    }
}

Workaround

Replace the fatal CUDA_CHECK with a best-effort call:

if (can_access_peer) {
    cudaDeviceEnablePeerAccess(id_other, 0);
    cudaGetLastError(); // clear error flag. Peer access is optional
}

This lets startup proceed. P2P mappings that succeed still work; those that fail silently fall back to staged host copies.

Suggested fix

  1. Make cudaDeviceEnablePeerAccess non-fatal, since peer access is a performance optimization, not a correctness requirement
  2. Optionally add an environment variable (e.g. GGML_CUDA_NO_PEER_ACCESS=1) to skip the loop entirely

Notes

  • This likely affects any setup with 8+ GPUs where the driver reports peer capability but can't map all O(n²) pairs simultaneously, which is common with PLX/ASM2824 switch topologies and potentially some NVLink configurations.
  • On PCIe switch topologies, P2P "success" still routes through the host root complex anyway, so the performance benefit is negligible. Failing gracefully has no practical downside.
  • GGML_CUDA_NO_PEER_COPY=1 does not help here, as the init code doesn't check any environment variable before running the peer access loop.
