Understanding How CUDA_VISIBLE_DEVICES Works

I’m trying to understand how CUDA_VISIBLE_DEVICES behaves in a system with multiple A100 GPUs, some of which have MIG enabled.

Here’s the situation:

Tue Jul 15 16:01:30 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  | 00000000:07:00.0 Off |                   On |
| N/A   31C    P0              51W / 400W |     87MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  | 00000000:0B:00.0 Off |                   On |
| N/A   30C    P0              47W / 400W |     87MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-80GB          On  | 00000000:48:00.0 Off |                   On |
| N/A   43C    P0             166W / 400W |   2195MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-80GB          On  | 00000000:4C:00.0 Off |                   On |
| N/A   32C    P0              47W / 400W |     87MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM4-80GB          On  | 00000000:88:00.0 Off |                    0 |
| N/A   29C    P0              57W / 400W |      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM4-80GB          On  | 00000000:8B:00.0 Off |                    0 |
| N/A   32C    P0              60W / 400W |      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM4-80GB          On  | 00000000:C8:00.0 Off |                    0 |
| N/A   32C    P0              63W / 400W |   1517MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM4-80GB          On  | 00000000:CB:00.0 Off |                    0 |
| N/A   32C    P0              66W / 400W |  22419MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+



+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  2    1   0   0  |            2145MiB / 40192MiB  | 42      0 |  3   0    2    0    0 |
|                  |               4MiB / 65535MiB  |           |                       |

When I set CUDA_VISIBLE_DEVICES, the processes end up using:

  • CUDA_VISIBLE_DEVICES=0 → physical GPU 4
  • CUDA_VISIBLE_DEVICES=1 → physical GPU 5
  • CUDA_VISIBLE_DEVICES=2 → physical GPU 6
  • CUDA_VISIBLE_DEVICES=3 → physical GPU 7
  • CUDA_VISIBLE_DEVICES=4 → a MIG instance on physical GPU 2
  • Beyond that (e.g., CUDA_VISIBLE_DEVICES=5 or higher), I get an error.

What I find particularly confusing is this part:

  • CUDA_VISIBLE_DEVICES=4 → a MIG instance on physical GPU 2

Is there a specific rule or logic that determines this kind of mapping?

I have no experience in this area, and I find the definition of CUDA_VISIBLE_DEVICES somewhat unclear.

I wonder if your ordering starts at GPU 4 because GPUs 0-3 all have MIG enabled, even though only one MIG instance is configured, on GPU 2.

So the integer ordering starts with the non-MIG devices and then moves on to the MIG instances. Perhaps you need to use the UUID method to differentiate them?

I can target a specific MIG instance by using CUDA_VISIBLE_DEVICES=[MIG-UUID]. Additionally, I can target a specific physical GPU by using CUDA_VISIBLE_DEVICES=[GPU-UUID]. However, it’s not possible to target a MIG-enabled GPU using CUDA_VISIBLE_DEVICES=[GPU-UUID]. It seems that GPUs with MIG enabled cannot be specified directly by CUDA_VISIBLE_DEVICES. They are likely only recognized at the level of their individual MIG instances.
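For anyone who wants to script the UUID approach, here is a minimal sketch that pulls GPU and MIG UUIDs out of the text printed by nvidia-smi -L and builds an environment with CUDA_VISIBLE_DEVICES set to one of them. The function names and the sample text are my own, the UUIDs are made-up placeholders, and the 3g.40gb profile name is an assumption; the parser expects the usual "GPU N: ... (UUID: ...)" and indented "MIG ... Device N: (UUID: ...)" line shapes, which may differ between driver versions.

```python
import os
import re

# Matches lines such as "GPU 2: NVIDIA A100-SXM4-80GB (UUID: GPU-...)"
GPU_LINE = re.compile(r"^GPU (\d+): (.+?) \(UUID: (GPU-[^)]+)\)")
# Matches indented lines such as "  MIG 3g.40gb     Device  0: (UUID: MIG-...)"
MIG_LINE = re.compile(r"^\s+MIG (\S+)\s+Device\s+(\d+): \(UUID: (MIG-[^)]+)\)")


def parse_nvidia_smi_list(text):
    """Return (kind, parent_gpu_index, uuid) tuples from `nvidia-smi -L` text."""
    devices = []
    parent = None  # index of the most recent physical GPU line
    for line in text.splitlines():
        gpu = GPU_LINE.match(line)
        if gpu:
            parent = int(gpu.group(1))
            devices.append(("gpu", parent, gpu.group(3)))
            continue
        mig = MIG_LINE.match(line)
        if mig:
            devices.append(("mig", parent, mig.group(3)))
    return devices


def env_for_device(uuid, base_env=None):
    """Copy an environment and restrict CUDA to the device with this UUID."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = uuid
    return env


# Hypothetical sample text; the UUIDs are placeholders, not from this thread.
SAMPLE = """\
GPU 2: NVIDIA A100-SXM4-80GB (UUID: GPU-11111111-2222-3333-4444-555555555555)
  MIG 3g.40gb     Device  0: (UUID: MIG-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee)
GPU 4: NVIDIA A100-SXM4-80GB (UUID: GPU-66666666-7777-8888-9999-000000000000)
"""

if __name__ == "__main__":
    for kind, parent, uuid in parse_nvidia_smi_list(SAMPLE):
        print(kind, parent, uuid)
```

On a real machine you would feed it the output of subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout and pass the resulting environment as env= when launching the CUDA process.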

Regarding the relationship between integer indices and GPU recognition, it appears they are assigned sequentially after listing all available physical GPUs and MIG instances. While my tests show this behavior, I haven’t been able to find any official references to confirm it.
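To make the pattern I observed concrete, here is a small model of it in Python. This is only a sketch of the behavior reported in this thread (non-MIG GPUs first, then MIG instances), not documented or guaranteed behavior; the function name and the tuple layout are my own invention.

```python
def observed_ordinals(gpus):
    """Model the ordinal mapping seen in this thread (an assumption, not a spec).

    gpus: list of (physical_index, mig_enabled, mig_instance_count).
    Returns the devices in the order the integer ordinals appeared to follow.
    """
    # GPUs without MIG came first, in physical order.
    full = [("gpu", idx) for idx, mig_on, _ in gpus if not mig_on]
    # MIG instances followed; a MIG-enabled GPU with no instances adds nothing.
    mig = [
        ("mig", idx)
        for idx, mig_on, count in gpus
        if mig_on
        for _ in range(count)
    ]
    return full + mig


# The system from the nvidia-smi output above: GPUs 0-3 have MIG enabled,
# and only GPU 2 has a MIG instance configured.
THREAD_SYSTEM = [
    (0, True, 0), (1, True, 0), (2, True, 1), (3, True, 0),
    (4, False, 0), (5, False, 0), (6, False, 0), (7, False, 0),
]

if __name__ == "__main__":
    for ordinal, device in enumerate(observed_ordinals(THREAD_SYSTEM)):
        print(f"CUDA_VISIBLE_DEVICES={ordinal} ->", device)
```

Under this model the list has five entries, which matches ordinals 0-4 working and 5 or higher failing.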

The MIG document gives examples of selecting specific MIG devices. It does not use or demonstrate ordinary device ordinals for this, which I think is expected. To quote:

CUDA_VISIBLE_DEVICES has been extended to add support for MIG. Depending on the driver versions being used, two formats are supported:

  1. With drivers >= R470 (470.42.01+), each MIG device is assigned a GPU UUID starting with MIG-<UUID>.
  2. With drivers < R470 (for example, R450 and R460), each MIG device is enumerated by specifying the CI and the corresponding parent GI. The format follows this convention: MIG-<GPU-UUID>/<GPU instance ID>/<compute instance ID>.

(emphasis added)

There is no indication that using a device ordinal method is supported for proper selection of MIG devices.
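The two formats in the quote can be sketched as a small helper that picks the identifier string based on the driver version. The function names and the version comparison are my own assumptions built from the quoted text (470.42.01 as the cutoff); the UUIDs in the example are made-up placeholders.

```python
R470_MIN = (470, 42, 1)  # first driver version quoted as using MIG-<UUID>


def parse_driver_version(version):
    """Turn a string like '535.129.03' into a comparable tuple (535, 129, 3)."""
    return tuple(int(part) for part in version.split("."))


def mig_identifier(driver_version, mig_uuid=None, gpu_uuid=None, gi=None, ci=None):
    """Build a CUDA_VISIBLE_DEVICES value for one MIG device.

    Newer drivers: return the MIG UUID as reported (it already starts with 'MIG-').
    Older drivers: return MIG-<GPU-UUID>/<GPU instance ID>/<compute instance ID>.
    """
    if parse_driver_version(driver_version) >= R470_MIN:
        if mig_uuid is None:
            raise ValueError("mig_uuid is required for drivers >= 470.42.01")
        return mig_uuid
    if gpu_uuid is None or gi is None or ci is None:
        raise ValueError("gpu_uuid, gi and ci are required for drivers < R470")
    return f"MIG-{gpu_uuid}/{gi}/{ci}"


if __name__ == "__main__":
    # Placeholder UUIDs for illustration only.
    print(mig_identifier("535.129.03", mig_uuid="MIG-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"))
    print(mig_identifier("460.32.03", gpu_uuid="GPU-11111111-2222-3333-4444-555555555555", gi=1, ci=0))
```

With the 535.129.03 driver shown in the nvidia-smi output above, only the first format applies.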
