I’m trying to understand how CUDA_VISIBLE_DEVICES
behaves in a system with multiple A100 GPUs, some of which have MIG enabled.
Here’s the situation:
Tue Jul 15 16:01:30 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-80GB On | 00000000:07:00.0 Off | On |
| N/A 31C P0 51W / 400W | 87MiB / 81920MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM4-80GB On | 00000000:0B:00.0 Off | On |
| N/A 30C P0 47W / 400W | 87MiB / 81920MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM4-80GB On | 00000000:48:00.0 Off | On |
| N/A 43C P0 166W / 400W | 2195MiB / 81920MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM4-80GB On | 00000000:4C:00.0 Off | On |
| N/A 32C P0 47W / 400W | 87MiB / 81920MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-SXM4-80GB On | 00000000:88:00.0 Off | 0 |
| N/A 29C P0 57W / 400W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA A100-SXM4-80GB On | 00000000:8B:00.0 Off | 0 |
| N/A 32C P0 60W / 400W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA A100-SXM4-80GB On | 00000000:C8:00.0 Off | 0 |
| N/A 32C P0 63W / 400W | 1517MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA A100-SXM4-80GB On | 00000000:CB:00.0 Off | 0 |
| N/A 32C P0 66W / 400W | 22419MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+--------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+================================+===========+=======================|
| 2 1 0 0 | 2145MiB / 40192MiB | 42 0 | 3 0 2 0 0 |
| | 4MiB / 65535MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
When I set CUDA_VISIBLE_DEVICES to a single index, the process ends up using:
- CUDA_VISIBLE_DEVICES=0 → physical GPU 4
- CUDA_VISIBLE_DEVICES=1 → physical GPU 5
- CUDA_VISIBLE_DEVICES=2 → physical GPU 6
- CUDA_VISIBLE_DEVICES=3 → physical GPU 7
- CUDA_VISIBLE_DEVICES=4 → a MIG instance on physical GPU 2
- CUDA_VISIBLE_DEVICES=5 or higher → an error
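For context on why these ordinals are slippery: CUDA_VISIBLE_DEVICES accepts not only integer ordinals but also GPU-<uuid> and MIG-<uuid> entries (the UUIDs that `nvidia-smi -L` prints), and the UUID forms sidestep enumeration order entirely. A rough sketch of the accepted entry forms, as a toy classifier of my own (the function name and example values are made up, and this is not the driver's actual parser):

```python
def classify_visible_devices(value):
    """Classify each comma-separated CUDA_VISIBLE_DEVICES entry.

    Illustrative only: mirrors the documented entry forms
    (ordinal, GPU-<uuid>, MIG-<uuid>), not NVIDIA's real parsing code.
    """
    kinds = []
    for entry in value.split(","):
        entry = entry.strip()
        if entry.startswith("MIG-"):
            kinds.append("mig-uuid")   # pins a specific MIG instance
        elif entry.startswith("GPU-"):
            kinds.append("gpu-uuid")   # pins a specific physical GPU
        elif entry.isdigit():
            kinds.append("ordinal")    # depends on CUDA's enumeration order
        else:
            kinds.append("invalid")
    return kinds

# Hypothetical UUID suffixes, for illustration only:
print(classify_visible_devices("0,1,MIG-abc,GPU-def"))
# -> ['ordinal', 'ordinal', 'mig-uuid', 'gpu-uuid']
```

So pinning work by MIG UUID (e.g., CUDA_VISIBLE_DEVICES=MIG-<uuid>) avoids depending on whatever ordinal the driver happens to assign.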
What I find particularly confusing is the last working mapping:
CUDA_VISIBLE_DEVICES=4
→ a MIG instance on physical GPU 2
especially since GPUs 0, 1, and 3 also have MIG mode enabled, yet (per the MIG-devices table above) only GPU 2 has an instance created, and only that one shows up. Is there a specific rule or logic that determines this kind of mapping?