Title: Fatal crash in ggml_cuda_init: "peer mapping resources exhausted" with 10 GPUs
What happened
llama-server crashes unconditionally on startup, even with --help, before any arguments are parsed:
```
ggml_cuda_init: found 10 CUDA devices (Total VRAM: 229059 MiB):
  Device 0: Tesla P40, compute capability 6.1, VMM: yes, VRAM: 22905 MiB
  ...
CUDA error: peer mapping resources exhausted
  current device: 0, in function ggml_cuda_init at ggml/src/ggml-cuda/ggml-cuda.cu:336
  cudaDeviceEnablePeerAccess(id_other, 0)
Aborted (core dumped)
```
System
- Build: b8783 (e21cdc1), GNU 11.4.0, Linux x86_64
- Previous working build: b8688 (71a81f6)
- GPUs: 10× NVIDIA Tesla P40 (GP102, compute 6.1) connected via ASM2824 PCIe switch on a Gigabyte G431-MM0
- CPU: AMD EPYC 3151 (Zen 1)
- cmake flags:
-DGGML_CUDA=ON -DGGML_CUDA_FORCE_MMQ=ON -DCMAKE_CUDA_ARCHITECTURES=61
Root cause
The peer access loop in ggml_cuda_init (ggml-cuda.cu lines 326–339) iterates over all device pairs. With 10 GPUs that is 10 × 9 = 90 directional cudaDeviceEnablePeerAccess calls. cudaDeviceCanAccessPeer returns true for many pairs, but the CUDA driver has a hard limit on concurrent peer mappings per device (the CUDA Runtime documentation states a system-wide maximum of eight peer connections per device, so 10 GPUs each needing 9 peers cannot all be mapped). When that limit is exceeded, cudaDeviceEnablePeerAccess fails with peer mapping resources exhausted and the CUDA_CHECK() wrapper aborts the process.
```cpp
for (int id = 0; id < info.device_count; ++id) {
    ggml_cuda_set_device(id);
    for (int id_other = 0; id_other < info.device_count; ++id_other) {
        if (id == id_other) {
            continue;
        }
        int can_access_peer;
        CUDA_CHECK(cudaDeviceCanAccessPeer(&can_access_peer, id, id_other));
        if (can_access_peer) {
            CUDA_CHECK(cudaDeviceEnablePeerAccess(id_other, 0)); // fatal here
        }
    }
}
```
Workaround
Replace the fatal CUDA_CHECK with a best-effort call:
```cpp
if (can_access_peer) {
    cudaDeviceEnablePeerAccess(id_other, 0);
    cudaGetLastError(); // clear the error flag; peer access is optional
}
```
This lets startup proceed. P2P mappings that succeed still work; those that fail silently fall back to staged host copies.
Suggested fix
- Make cudaDeviceEnablePeerAccess non-fatal, since peer access is a performance hint, not a correctness requirement
- Optionally add an environment variable (e.g. GGML_CUDA_NO_PEER_ACCESS=1) to skip the loop entirely
Notes
- This likely affects any setup with 8+ GPUs where the driver reports peer capability but cannot map all O(n²) pairs simultaneously. That is common with PLX/ASM2824 switch topologies and potentially some NVLink configurations.
- On PCIe switch topologies, P2P "success" still routes through the host root complex anyway, so the performance benefit is negligible. Failing gracefully has no practical downside.
- GGML_CUDA_NO_PEER_COPY=1 has no effect, as the init code doesn't check any environment variable before the peer access loop.