NCP-AII Dumps Help You Pass NVIDIA Certified Professional AI Infrastructure Exam

The document provides detailed information about the NVIDIA Certified Professional AI Infrastructure (NCP-AII) exam, including sample questions and answers covering various topics related to NVIDIA technologies. It offers insights into exam preparation materials, discounts on exam dumps, and specific technical scenarios related to GPU management, network configurations, and deep learning frameworks. Additionally, it highlights the importance of proper configurations and troubleshooting techniques for optimizing performance in NVIDIA environments.


Exam Code: NCP-AII

Exam Name: NVIDIA Certified Professional AI Infrastructure

Associate Certification: NVIDIA-Certified Professional

Samples: 48 Q&As

Save 40% on full NCP-AII exam dumps with coupon code "40PASS".

NCP-AII exam dumps provide the most effective material to study and
review all key NVIDIA Certified Professional AI Infrastructure topics. By
thoroughly practicing with NCP-AII exam dumps, you can build confidence
and pass the exam in a shorter time.

Practice NCP-AII exam online questions below.

1. When deploying BlueField OS using PXE boot, which of the following files on the PXE server
is responsible for specifying the kernel, initrd, and device tree files to be loaded by the client?
A. dhcpd.conf
B. pxelinux.cfg/default
C. tftpboot/pxelinux.0
D. /boot/grub/grub.cfg
E. tftpboot/lpxelinux.0
Answer: B
Explanation:
The ‘pxelinux.cfg/default’ file (or a similar configuration file selected by the client’s MAC
address or IP address) contains the configuration directives for the PXE bootloader, including
the kernel, initrd, and device tree files to load. ‘dhcpd.conf’ is the DHCP server configuration,
‘pxelinux.0’ is the PXE bootloader binary itself, and ‘/boot/grub/grub.cfg’ is a GRUB
configuration file, usually on the client’s disk.
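
For reference, a minimal sketch of such a file; all paths and file names below are illustrative placeholders, not taken from an actual BlueField image:

    # pxelinux.cfg/default -- example only; exact directives depend on the bootloader build
    DEFAULT bluefield
    LABEL bluefield
        KERNEL bluefield/Image           # kernel image served over TFTP
        INITRD bluefield/initramfs.img   # initial ramdisk
        FDT bluefield/bluefield.dtb      # device tree blob (extlinux-style parsers)
        APPEND console=ttyAMA0 root=/dev/ram0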

2. After upgrading the NGC CLI using ‘pip install --upgrade nvidia-cli’, some commands are no
longer working as expected, producing errors related to missing modules.
What is the most likely reason for this issue and how can you resolve it?
A. The upgrade process might have corrupted the NGC CLI installation. Reinstall the package
using ‘pip install --force-reinstall nvidia-cli’.
B. The NGC CLI upgrade introduced breaking changes. Review the NGC CLI release notes and
update your scripts accordingly.
C. The Python environment used by the NGC CLI might be broken or inconsistent. Create a
new virtual environment and reinstall the NGC CLI in the new environment.
D. The system’s PATH variable has not been updated to reflect the new NGC CLI installation
location. Update the PATH variable accordingly.
E. The host’s operating system must be re-imaged.
Answer: B,C
Explanation:
Breaking changes in the NGC CLI upgrade (B) are a possibility, requiring script updates. An
inconsistent Python environment (C) can also cause issues after an upgrade. Reinstalling (A) or
updating the PATH (D) might not resolve the issue if the environment itself is the problem. OS
re-imaging is highly unnecessary (E).
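
A minimal sketch of option C, assuming a Python 3 host and the package name used in the question:

    # Create a clean virtual environment and reinstall the CLI inside it
    python3 -m venv ~/ngc-venv
    source ~/ngc-venv/bin/activate
    pip install --upgrade nvidia-cli   # package name as given in the question
    ngc --version                      # confirm the CLI now loads its modules cleanly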

3. A user reports that their deep learning training job is crashing with a ‘CUDA out of memory’
error, even though ‘nvidia-smi’ shows plenty of free memory on the GPU. The job uses
TensorFlow.
What are the TWO most likely causes?
A. The TensorFlow version is incompatible with the installed NVIDIA driver.
B. TensorFlow is allocating memory on the CPU instead of the GPU.
C. TensorFlow is fragmenting GPU memory, making it difficult to allocate contiguous blocks.
D. The CUDA_VISIBLE_DEVICES environment variable is not set correctly.
E. The system’s swap space is full, preventing memory from being allocated.
Answer: C,D
Explanation:
‘CUDA out of memory’ errors, despite seemingly available GPU memory, often indicate memory
fragmentation or improper GPU assignment. TensorFlow can fragment GPU memory, leading to
allocation failures even if sufficient total memory is available. The ‘CUDA_VISIBLE_DEVICES’
variable controls which GPUs TensorFlow can access; if it is unset or set incorrectly,
TensorFlow might be trying to allocate memory on a non-existent or unavailable GPU. While
TensorFlow version incompatibilities can cause issues, they are less likely to manifest directly
as ‘CUDA out of memory’ errors. TensorFlow typically prioritizes GPU memory allocation if
configured correctly.
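
A quick way to rule out cause D before launching the job (GPU index 0 is just an example):

    # Pin the process to a known-good GPU and verify what TensorFlow actually sees
    export CUDA_VISIBLE_DEVICES=0
    python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"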

4. You are running a distributed training job on a multi-GPU server. After several hours, the job
fails with an NCCL (NVIDIA Collective Communications Library) error. The error message
indicates a failure in inter-GPU communication. ‘nvidia-smi’ shows all GPUs are healthy.
What is the MOST probable cause of this issue?
A. A bug in the NCCL library itself; downgrade to a previous version of NCCL.
B. Incorrect NCCL configuration, such as an invalid network interface or incorrect device affinity
settings.
C. Insufficient inter-GPU bandwidth; reduce the batch size to decrease communication
overhead.
D. A faulty network cable connecting the server to the rest of the cluster.
E. Driver incompatibility issue between NCCL and the installed NVIDIA driver version.
Answer: B,E
Explanation:
NCCL errors during inter-GPU communication often stem from configuration issues (B) or driver
incompatibilities (E). Incorrect network interface or device affinity settings can prevent proper
communication. Driver versions might not fully support the NCCL version being used. Reducing
batch size (C) might alleviate symptoms but doesn’t address the root cause. A faulty network
cable (D) would likely cause broader network issues beyond NCCL. Downgrading NCCL (A) is a
potential workaround but not the ideal first step.
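
When checking option B, NCCL’s own environment variables are the usual starting point; the interface name below is an assumption for your environment:

    # Turn on NCCL logging and pin the network interface explicitly
    export NCCL_DEBUG=INFO              # prints topology and transport decisions
    export NCCL_DEBUG_SUBSYS=INIT,NET   # focus on initialization and network subsystems
    export NCCL_SOCKET_IFNAME=ib0       # force a known-good interface (assumed name)
    python train.py                     # hypothetical training entry point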

5. Which of the following is a primary benefit of using a CLOS network topology (e.g., Spine-
Leaf) in a data center?
A. Reduced capital expenditure (CAPEX)
B. Increased network diameter
C. Improved scalability and bandwidth utilization
D. Simplified network management
E. Enhanced security
Answer: C
Explanation:
CLOS networks like Spine-Leaf provide excellent scalability due to their non-blocking
architecture, allowing for increased bandwidth utilization and easy expansion. CAPEX might be
higher due to more switches. The network diameter can be larger compared to traditional
topologies. While CLOS networks can be managed effectively, the management complexity can
be higher. Security benefits are not a primary characteristic of the CLOS topology itself.

6. You notice that one of the fans in your GPU server is running at a significantly higher RPM
than the others, even under minimal load. ‘ipmitool sensor’ output shows a normal temperature
for that GPU.
What could be the potential causes?
A. The fan’s PWM control signal is malfunctioning, causing it to run at full speed.
B. The fan bearing is wearing out, causing increased friction and requiring higher RPM to
maintain airflow.
C. The fan is attempting to compensate for restricted airflow due to dust buildup.
D. The server’s BMC (Baseboard Management Controller) has a faulty temperature sensor
reading, causing it to overcompensate.
E. A network connectivity issue is causing higher CPU utilization, leading to increased system-
wide heat.
Answer: A,B,C
Explanation:
A malfunctioning PWM control signal, worn fan bearings, or restricted airflow can all cause a fan
to run at higher RPMs. While a faulty BMC sensor could be a cause, the question states that
‘ipmitool sensor’ shows a normal temperature. Network connectivity issues are unlikely to
cause a single fan to run high when the GPU temperature is normal.
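
To compare the fan against its peers and against the reported temperatures, something like the following works on most BMCs:

    # List fan RPMs and temperatures from the BMC sensor repository
    ipmitool sensor | grep -i fan    # look for one fan far above the others
    ipmitool sensor | grep -i temp   # confirm the GPU temperature really is normal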

7. You’re setting up a BlueField-3 DPU to offload storage virtualization tasks. Specifically, you
want to use SPDK (Storage Performance Development Kit) on the DPU.
What are the MINIMUM required steps to enable SPDK on the BlueField-3 after the DPU has
been flashed with the appropriate OS image? (Select TWO)
A. Install the SPDK packages using the DPU’s package manager (e.g., ‘apt install spdk’).
B. Configure the Huge Pages settings in the DPU’s kernel to allocate sufficient memory for
SPDK.
C. Download and compile the SPDK source code directly on the DPU.
D. Enable the SPDK service using ‘systemctl enable spdk’ and ‘systemctl start spdk’.
E. Configure the network interfaces on the DPU to support RDMA or NVMe-oF, depending on
the desired storage protocol.
Answer: A,B
Explanation:
The minimum steps involve installing the SPDK packages using the DPU’s package manager,
assuming prebuilt packages are available, and configuring Huge Pages. SPDK relies heavily on
Huge Pages for memory management. While configuring network interfaces and starting
services are important, installing SPDK and configuring Huge Pages are required first steps.
Downloading and compiling from source might be necessary in some cases but not minimally
required if packages are available.
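
A sketch of the two minimum steps on a Debian-based DPU image (the package name and page count are assumptions):

    # Install SPDK from the distribution repository (assumes a packaged build exists)
    sudo apt install spdk
    # Reserve 2 MB huge pages for SPDK's memory pools; 1024 pages = 2 GB, an example sizing
    echo 1024 | sudo tee /proc/sys/vm/nr_hugepages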

8. Which of the following is the MOST critical consideration when planning the cooling strategy
for a server rack containing multiple NVIDIA A100 GPUs?
A. Ensuring the server room temperature is kept below 25 degrees Celsius.
B. Optimizing airflow to ensure hot air is efficiently exhausted from the rack and cool air is drawn
in.
C. Using liquid cooling for the CPUs, but air cooling for the GPUs.
D. Applying thermal paste to the GPU memory chips.
E. Increasing the fan speed of the server chassis fans to maximum.
Answer: B
Explanation:
While ambient temperature is important, optimized airflow is crucial for removing the heat
generated by the GPUs. Focusing solely on CPU cooling neglects the GPU heat. Applying
thermal paste to memory chips is generally unnecessary unless specifically recommended by
the manufacturer. Maximizing fan speed can help, but efficient airflow design is more effective.

9. You are tasked with configuring an NVIDIA NVLink Switch system. After physically
connecting the GPUs and the switch, what is the typical first step in the software configuration
process?
A. Installing the latest NVIDIA drivers on all connected GPUs.
B. Configuring the system BIOS to enable NVLink support.
C. Updating the firmware of the NVLink Switch.
D. Installing the NVLink Switch management software.
E. Running a memory bandwidth test between all connected GPUs.
Answer: C
Explanation:
Updating the NVLink Switch firmware is crucial for ensuring compatibility and stability with the
connected GPUs and the overall system. It addresses potential bugs, security vulnerabilities,
and performance issues. It should always be done first before any other software configuration.
BIOS settings should be checked beforehand, and the NVLink management software comes
after the firmware update.

10. You are deploying a BlueField-3 DPU within a secure environment. You are required to
enable secure boot to prevent unauthorized firmware from being loaded.
Which steps are typically involved in enabling secure boot on a BlueField-3 DPU, starting from a
factory- default state? (Select TWO)
A. Install the latest version of the Mellanox OFED drivers on the host server.
B. Generate cryptographic keys and enroll them in the DPU’s UEFI firmware using tools
provided by NVIDIA or Mellanox.
C. Configure the server’s BIOS to enable UEFI boot mode.
D. Flash the BlueField-3 DPU with a secure boot-enabled OS image that is signed with the
enrolled keys.
E. Update the DPU’s BMC firmware to the latest version.
Answer: B,D
Explanation:
The crucial steps are generating and enrolling cryptographic keys in the DPU’s UEFI firmware,
which establishes the chain of trust, and flashing the DPU with a secure boot-enabled OS
image signed with those keys. Without these, secure boot won’t function correctly. While
updating the BMC firmware (E) is generally a good practice, and enabling UEFI boot mode in
the server’s BIOS (C) is necessary for UEFI in general, they are not specific steps for enabling
secure boot on the DPU itself. Installing host drivers (A) is also irrelevant to secure boot.

11. Consider a scenario where you need to run two different deep learning models, Model A
and Model B, within separate Docker containers on the same NVIDIA GPU. Model A requires
CUDA 11.2, while Model B requires CUDA 11.6.
How can you achieve this while minimizing conflicts and ensuring each model has access to its
required CUDA version?
A. Install both CUDA 11.2 and CUDA 11.6 on the host system and use ‘CUDA_VISIBLE_DEVICES’
to isolate each model to a specific CUDA version.
B. Use separate Docker images for each model, each based on the appropriate ‘nvidia/cuda’
image (e.g., ‘nvidia/cuda:11.2-base-ubuntu20.04’ and ‘nvidia/cuda:11.6-base-ubuntu20.04’).
C. Install both CUDA 11.2 and CUDA 11.6 inside each Docker container and use
‘LD_LIBRARY_PATH’ to switch between the CUDA versions for each model.
D. Create a single Docker image with both CUDA versions and dynamically link the correct
CUDA libraries at runtime using environment variables.
E. Mount the CUDA libraries from the host machine into both containers using Docker volumes,
ensuring each container has access to both CUDA versions.
Answer: B
Explanation:
The recommended and most straightforward approach is to use separate Docker images (B),
each based on the specific ‘nvidia/cuda’ image version needed. This creates isolated
environments, avoiding conflicts and ensuring each model has the correct CUDA toolkit.
Installing multiple CUDA versions on the host (A) can lead to conflicts and isn’t necessary with
Docker. Installing multiple CUDA versions within a single container (C, D) adds complexity and
potential conflicts. Mounting CUDA libraries from the host (E) might work, but it’s less isolated
and can create dependency management issues.
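
Option B in practice looks like this; the exact tags available on Docker Hub/NGC may differ, and the host needs only the NVIDIA driver plus the NVIDIA Container Toolkit:

    # Each model gets its own image carrying the CUDA toolkit it needs
    docker run --rm --gpus all nvidia/cuda:11.2.2-base-ubuntu20.04 nvidia-smi
    docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi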

12. You are running a large-scale distributed training job on a cluster of AMD EPYC servers,
each equipped with multiple NVIDIA A100 GPUs. You are using Slurm for job scheduling. The
training process often fails with NCCL errors related to network connectivity.
What steps can you take to improve the reliability of the network communication for NCCL in
this environment? Choose the MOST appropriate answers.
A. Ensure that the InfiniBand or RoCE network is properly configured and that all servers can
communicate with each other over the network. Verify the network interface names and IP
addresses in the NCCL configuration.
B. Use the Slurm ‘srun’ command with the ‘--mpi=pmi2’ option to launch the training job. This
ensures that Slurm properly initializes the MPI environment and sets the NCCL environment
variables.
C. Increase the ‘NCCL_CONNECT_TIMEOUT’ and ‘NCCL_TIMEOUT’ environment variables to
allow for longer network delays.
D. Disable the firewall on all servers to allow unrestricted network communication.
E. Decrease the batch size to reduce the amount of data transferred over the network.
Answer: A,B,C
Explanation:
Ensuring network configuration is correct is the most important step. ‘srun’ with ‘--mpi=pmi2’
lets Slurm initialize the MPI environment and set the NCCL environment variables automatically
for proper connectivity.
Increasing timeouts allows for transient network issues to resolve without causing failures.
Disabling the firewall is a security risk. Decreasing the batch size will reduce the amount of data
but won’t fix the core network connectivity issues.
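
A minimal Slurm batch sketch combining options A-C; the interface name, sizing, and timeout value are illustrative, and the timeout variable is the one named in the question:

    #!/bin/bash
    #SBATCH --nodes=4 --ntasks-per-node=8 --gres=gpu:8   # example sizing
    export NCCL_DEBUG=INFO               # log transport selection for troubleshooting
    export NCCL_SOCKET_IFNAME=ib0        # assumed InfiniBand interface name
    export NCCL_CONNECT_TIMEOUT=120      # variable name as given in the question
    srun --mpi=pmi2 python train.py      # hypothetical training script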

13. You are tasked with ensuring optimal power efficiency for a GPU server running machine
learning workloads. You want to dynamically adjust the GPU’s power consumption based on its
utilization.
Which of the following methods is the MOST suitable for achieving this, assuming the server’s
BIOS and the NVIDIA drivers support it?
A. Manually set the GPU’s power limit using ‘nvidia-smi -pl’ and create a script to monitor
utilization and adjust the power limit periodically.
B. Configure the server’s BIOS/UEFI to use a power-saving profile, which will automatically
reduce the GPU’s power consumption when idle.
C. Enable Dynamic Boost in the NVIDIA Control Panel (if available), which will automatically
allocate power between the CPU and GPU based on their current needs.
D. Use NVIDIA’s Data Center GPU Manager (DCGM) to monitor GPU utilization and
dynamically adjust the power limit based on a predefined policy.
E. Disable ECC (Error Correcting Code) on the GPU to reduce power consumption.
Answer: D
Explanation:
DCGM provides the most comprehensive and automated solution for dynamic power
management. It can monitor GPU utilization in real time and adjust the power limit based on
predefined policies, ensuring optimal power efficiency without manual intervention. Manually
adjusting the power limit is possible but requires scripting and continuous monitoring. Dynamic
Boost is typically for laptops, and BIOS power profiles may not be fine-grained enough.
Disabling ECC reduces power but compromises data integrity.
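
The manual route (option A) and a DCGM monitoring query (toward option D) look roughly like this; the 200 W cap is an arbitrary example:

    # Option A: set a static power cap on GPU 0 (requires root)
    sudo nvidia-smi -i 0 -pl 200
    # Option D: sample power draw and utilization via DCGM and drive a policy from it
    dcgmi dmon -e 155,203   # DCGM field 155 = power usage, 203 = GPU utilization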

14. You are troubleshooting slow I/O performance in a deep learning training environment
utilizing BeeGFS parallel file system. You suspect the metadata operations are bottlenecking
the training process.
How can you optimize metadata handling in BeeGFS to potentially improve performance?
A. Increase the number of storage targets (OSTs) to distribute the data across more devices.
B. Implement data striping across multiple OSTs.
C. Increase the number of metadata servers (MDSs) and distribute the metadata load across
them.
D. Enable client-side caching of metadata on the training nodes.
E. Configure BeeGFS to use a different network protocol with lower overhead.
Answer: C
Explanation:
Metadata operations like file creation, deletion, and attribute modification can become a
bottleneck in parallel file systems. Increasing the number of metadata servers (MDSs) (option
C) and distributing the metadata load across them is the direct way to improve metadata
handling performance in BeeGFS.

15. You are tasked with validating the NVLink performance between GPUs in an NVIDIA DGX
A100 system.
Which tool is the most appropriate for measuring the bandwidth and latency of NVLink
interconnections under a synthetic workload?
A. nvidia-smi
B. NCCL tests (e.g., nccl-tests/net_send_recv)
C. iostat
D. memtest86+
E. dmesg
Answer: B
Explanation:
NCCL tests, specifically, are designed for benchmarking the communication performance
between GPUs using NVLink. ‘nvidia-smi’ provides GPU monitoring information but not detailed
bandwidth/latency tests. ‘iostat’ is for I/O statistics. ‘memtest86+’ tests system memory.
‘dmesg’ displays kernel messages.
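
Assuming the suite has been built from the NVIDIA/nccl-tests GitHub repository, a typical sweep looks like:

    # All-reduce bandwidth/latency across 8 GPUs, message sizes from 8 B to 256 MB
    ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 8
    # Point-to-point pattern, useful for isolating a single NVLink pair
    ./build/sendrecv_perf -b 8 -e 256M -f 2 -g 2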

16. You are troubleshooting a performance issue with a GPU-accelerated application running
inside a Docker container. The ‘nvidia-smi’ output inside the container shows the GPU is being
utilized, but the performance is significantly lower than expected.
Which of the following could be the cause of this performance bottleneck?
A. The host machine’s CPU is being heavily utilized, causing a bottleneck in data transfer to the
GPU.
B. The Docker container is not configured to use shared memory for data transfer with the GPU.
C. The version of the CUDA driver on the host is incompatible with the CUDA toolkit version
used in the container.
D. The application is performing frequent small memory transfers between the CPU and GPU.
E. The GPU is overheating, causing thermal throttling.
Answer: A,C,D,E
Explanation:
Several factors could contribute to reduced GPU performance within a Docker container, even if
the GPU is being utilized. A heavily loaded CPU (A) can bottleneck data transfer to the GPU.
Incompatible CUDA driver versions between host and container (C) can cause errors or
degraded performance. Frequent small memory transfers between
CPU and GPU (D) can be inefficient. Overheating (E) can cause the GPU to throttle its
performance. While shared memory optimization (B) can help, it’s not always the primary cause
of the initial performance drop.
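
Two quick host-side checks that cover causes A and E:

    # Is the GPU being throttled for thermal or power reasons?
    nvidia-smi -q -d PERFORMANCE   # inspect the clock throttle reasons section
    # Is the host CPU saturated while the container runs? (mpstat is in the sysstat package)
    mpstat 1 5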

17. Which of the following statements are correct regarding the use of NVIDIA GPUs with
Docker containers?
A. The NVIDIA Container Toolkit allows you to run GPU-accelerated applications in Docker
containers without modifying the container image.
B. You must install NVIDIA drivers inside the Docker container to enable GPU support.
C. The ‘nvidia-smi’ command can only be run on the host machine, not inside a Docker
container.
D. CUDA libraries are required inside the container if your application uses CUDA.
E. Using environment variables like ‘CUDA_VISIBLE_DEVICES’ within the container can
influence which GPUs are accessible to the application.
Answer: A,D,E
Explanation:
The NVIDIA Container Toolkit allows GPU-accelerated apps to run in Docker without altering
the image; the host’s drivers are leveraged. CUDA libraries are necessary inside the container
if your app uses CUDA. ‘CUDA_VISIBLE_DEVICES’ is used to control GPU visibility within the
container. Drivers are not needed inside the container because they’re managed by the host
(making B incorrect), and ‘nvidia-smi’ can be run inside containers if the NVIDIA Container
Toolkit is properly set up (making C incorrect).

18. You are tasked with creating a custom Docker image for a deep learning application that
requires a specific version of cuDNN. You want to minimize the image size while ensuring that
the cuDNN libraries are correctly installed and configured.
What is the most efficient way to achieve this?
A. Download the cuDNN archive from NVIDIA, extract the libraries, and manually copy them to
the appropriate locations within the Dockerfile.
B. Use a multi-stage Docker build, using a base image with the desired CUDA version for
building and then copying only the necessary cuDNN libraries to a smaller runtime image.
C. Install the entire CUDA toolkit within the Docker image, even if only cuDNN is needed.
D. Use the NVIDIA Container Toolkit to dynamically inject the cuDNN libraries into the container
at runtime.
E. Use a pre-built CUDA base image and install cuDNN during the container run.
Answer: B
Explanation:
A multi-stage Docker build (B) is the most efficient approach. It allows you to use a larger image
with the CUDA toolkit for building and then copy only the necessary cuDNN libraries to a smaller
runtime image, minimizing the final image size. Manually copying libraries (A) is tedious and
error-prone. Installing the entire CUDA toolkit (C) unnecessarily increases the image size. The
NVIDIA Container Toolkit (D) focuses on enabling GPU access, not dynamically injecting
specific libraries. Running an install of cuDNN during the container run is problematic since the
image should be self-contained.
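
A minimal sketch of the multi-stage approach; image tags and library paths are assumptions and will vary with the CUDA/cuDNN versions required:

    # Build a slim runtime image: compile against a devel image that carries cuDNN,
    # then copy only the cuDNN runtime libraries into a smaller base image
    cat > Dockerfile <<'EOF'
    FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04 AS build
    # ... build the application here ...
    FROM nvidia/cuda:11.8.0-base-ubuntu22.04
    COPY --from=build /usr/lib/x86_64-linux-gnu/libcudnn*.so* /usr/lib/x86_64-linux-gnu/
    EOF
    docker build -t myapp:slim .   # hypothetical image name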

19. You are using the BlueField DPU to offload encryption using IPsec. You want to ensure that
the cryptographic operations are being hardware accelerated.
Which command and output would BEST confirm that IPsec offload is active and being utilized?
A. ‘ip xfrm state’ - This command will output the current IPsec policy, but it doesn’t explicitly
show hardware acceleration.
B. ‘ipsec statusall’ - Shows IPsec connection status but not necessarily hardware acceleration.
C. ‘ethtool -k’ - Look for features like ‘tx-tcp-segmentation’ and ‘rx-checksumming’ being
offloaded to hardware, then correlate with IPsec configuration.
D. ‘dpdk-testpmd’ - Useful for testing DPDK-based applications, not directly indicative of IPsec
offload.
E. Examine ‘/proc/crypto’ after setting up IPsec - This can show details about the crypto
algorithms used and may indicate hardware acceleration if a hardware engine is listed.
Answer: E
Explanation:
Examining ‘/proc/crypto’ is the most direct method. After setting up IPsec, inspect this file to
see the details of the crypto algorithms being used. If hardware acceleration is active, the
output should show that a hardware crypto engine, rather than a generic software
implementation, is being utilized. ‘ip xfrm state’ and ‘ipsec statusall’ provide connection
information but not acceleration details. ‘ethtool -k’ shows general hardware offloads, but
you’d need to infer the IPsec connection. ‘dpdk-testpmd’ is irrelevant here.
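
A quick check along the lines of option E; the exact driver names vary by hardware, but a non-generic driver name suggests a hardware engine:

    # List AES implementations the kernel knows about and which driver backs them
    grep -A2 '^name.*aes' /proc/crypto   # the "driver" line shows the backing implementation
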
20. After installing the NGC CLI, you attempt to run ‘ngc config set’ and encounter the
following error: ‘Error: API key is invalid or missing’.
What are the most likely causes of this issue and how can you resolve them?
A. The NGC CLI is not properly installed. Reinstall the package using ‘pip install --upgrade
nvidia-cli’.
B. The NGC API key is incorrect or has expired. Verify the API key in your NVIDIA account and
update the configuration using ‘ngc config set’.
C. The NGC CLI configuration file is corrupted. Delete the file (‘~/.ngc/config.json’) and
reconfigure the CLI.
D. The NGC service is down. Check the NVIDIA NGC status page for any known outages.
E. The host does not have network access to NGC.
Answer: B,C,E
Explanation:
The most likely cause is an invalid API key (B) or a corrupted configuration file (C), or the host
lacks network access (E). Reinstalling the package (A) might not resolve the issue if the
problem lies with the API key or config file. While NGC service outages (D) are possible, they
are less common.
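
A sketch of the fixes for B, C, and E (the config path follows the question’s wording, and the endpoint is shown as an assumption; verify both on your installation):

    # B: re-enter a freshly generated API key
    ngc config set
    # C: remove a corrupted config file and reconfigure from scratch
    rm ~/.ngc/config.json && ngc config set
    # E: rule out missing network access to NGC
    curl -sI https://api.ngc.nvidia.com | head -1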

21. Which command is used to verify the installation and configuration of the NGC CLI after
initial setup?
A. ‘ngc --version’
B. ‘ngc config validate’
C. ‘ngc setup verify’
D. ‘ngc registry model list’
E. ‘ngc system status’
Answer: A,D
Explanation:
‘ngc --version’ confirms the CLI is installed and shows its version. ‘ngc registry model list’
verifies that the CLI can authenticate with NGC and access the registry.

22. Which of the following are key benefits of using NVIDIA Spectrum-X switches in an AI
infrastructure compared to traditional Ethernet switches? (Select THREE)
A. Lower cost per port.
B. Support for RoCE (RDMA over Converged Ethernet) and InfiniBand protocols, enabling
high-bandwidth, low-latency communication.
C. Advanced telemetry and monitoring capabilities for network performance optimization.
D. Hardware-based acceleration for collective communication operations used in distributed AI
training.
E. Native support for IPv6.
Answer: B,C,D
Explanation:
Spectrum-X switches are designed for high-performance computing and AI workloads. They
support RoCE and InfiniBand for low-latency communication, offer advanced telemetry for
network optimization, and include hardware-based acceleration for collective communication
operations, improving the efficiency of distributed AI training. While Spectrum-X supports IPv6,
this is also a common feature in modern Ethernet switches. Spectrum-X switches typically have
a higher cost per port compared to basic Ethernet switches due to their advanced features and
performance.

23. You have a deep learning application that requires a specific version of the CUDA toolkit
inside the container.
How should you best ensure that the correct CUDA version is available within the container,
considering the NVIDIA Container Toolkit is installed on the host?
A. Install the required CUDA toolkit version directly on the host operating system. The NVIDIA
Container Toolkit will automatically map it into the container.
B. Specify the desired CUDA version when running the container using the ‘--env’ flag. The
NVIDIA Container Toolkit will dynamically install the CUDA version during container startup.
C. Use a base image (e.g., from NVIDIA NGC) that already includes the desired CUDA toolkit
version. This approach provides a consistent and reproducible environment.
D. Manually copy the necessary CUDA libraries from the host into the container using ‘docker
cp’ before running the application.
E. Use the nvidia-container-cli to modify the existing image to install the proper cuda version.
Answer: C
Explanation:
The recommended approach is to use a base image that already contains the desired CUDA
version. NVIDIA provides pre-built images on NGC (NVIDIA GPU Cloud) that are specifically
designed for deep learning and include the appropriate CUDA versions and other
dependencies. Installing CUDA on the host and expecting it to be magically mapped (A) is not
reliable. The NVIDIA Container Toolkit doesn’t install CUDA on the fly (B). Manually copying
libraries (D) is error-prone and doesn’t handle dependencies well. While technically possible,
using nvidia-container-cli to modify the image (E) is more complex than using a base image.

24. You suspect a power supply issue is causing intermittent GPU failures in a server with four
NVIDIA A100 GPUs. The server is rated for a peak power consumption of 3000W. You have a
power meter available.
Which of the following methods provides the most accurate assessment of the server’s power
consumption under full GPU load?
A. Run ‘nvidia-smi’ and sum the reported power consumption for each GPU.
B. Use the power meter to measure the server’s power consumption at idle and multiply by
four.
C. Use the power meter to measure the server’s power consumption while running a synthetic
benchmark that fully utilizes all GPUs simultaneously.
D. Check the server’s BIOS for power consumption readings.
E. Add the maximum power rating of each GPU to the CPU’s TDP (Thermal Design Power).
Answer: C
Explanation:
Measuring power consumption with a power meter while running a synthetic benchmark
provides the most accurate assessment. ‘nvidia-smi’ reports GPU power consumption, but it
doesn’t account for the power draw of other components (CPU, memory, etc.). Idle power
measurements are irrelevant for assessing peak load. BIOS readings can be unreliable. Simply
adding up component power ratings doesn’t account for inefficiencies or dynamic power
management.
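
Logging per-GPU draw while the wall-plug meter records the whole server helps correlate the readings; the stress tool shown is a third-party example, not part of the NVIDIA stack:

    # Sample GPU power draw once per second during the benchmark
    nvidia-smi --query-gpu=index,power.draw,power.limit --format=csv -l 1 > gpu_power.csv &
    ./gpu_burn 300   # hypothetical synthetic load on all GPUs for 300 seconds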

25. You have installed an NVIDIA ConnectX-7 network adapter in an AI server and configured
RDMA over Converged Ethernet (RoCE). During validation, you observe very high latency
between two servers communicating over RoCE.
Which of the following are potential causes? (Choose two)
A. Incorrect MTU size configuration on the network interfaces.
B. The network switch does not support RoCE.
C. The network cables are damaged.
D. The GPU driver is outdated.
E. Insufficient memory on the network adapter.
Answer: A,B
Explanation:
RoCE requires specific switch support and a properly configured MTU. Damaged cables could
cause packet loss, but usually not consistently high latency. GPU drivers are irrelevant. Network
adapter memory is unlikely to cause high latency unless extremely undersized, a less likely
scenario than incorrect configuration or lack of RoCE support.

26. You’ve deployed a GPU-accelerated application in Kubernetes using the NVIDIA device
plugin. However, your pods are failing to start with an error indicating that they cannot find the
NVIDIA libraries.
Which of the following could be potential causes of this issue? (Multiple Answers)
A. The NVIDIA drivers are not installed on the host node.
B. The ‘nvidia-container-runtime’ is not configured as the default runtime for
Docker/containerd.
C. The NVIDIA device plugin is not properly configured in the Kubernetes cluster.
D. The application container image does not include the necessary NVIDIA libraries.
E. The GPU’s compute capability is not sufficient for the workload.
Answer: A,B,C,D
Explanation:
If pods cannot find NVIDIA libraries, it could be because the drivers are missing on the host, the
container runtime is not configured to use the NVIDIA runtime, the NVIDIA device plugin is
misconfigured preventing GPU discovery and allocation, or the application container image
does not include the NVIDIA libraries. E is likely incorrect; if the GPU’s compute capability were
insufficient, the app would likely start but then throw an error when trying to use the GPU.

27. Your AI inference server utilizes Triton Inference Server and experiences intermittent
latency spikes. Profiling reveals that the GPU is frequently stalling due to memory allocation
issues.
Which strategy or tool would be least effective in mitigating these memory allocation stalls?
A. Using CUDA memory pools to pre-allocate memory and reduce allocation overhead during
inference requests.
B. Enabling CUDA graph capture to reduce kernel launch overhead.
C. Reducing the model’s memory footprint by using quantization or pruning techniques.
D. Increasing the GPU’s TCC (Tesla Compute Cluster) mode priority.
E. Optimize the model using TensorRT.
Answer: D
Explanation:
CUDA memory pools directly address memory allocation overhead. CUDA graph capture
reduces kernel launch overhead, which can indirectly reduce memory pressure. Model
quantization/pruning reduces the overall memory footprint. Optimizing using TensorRT reduces
memory footprint. Increasing TCC priority primarily affects preemption behavior and doesn’t
directly address memory allocation issues. It will therefore have the least impact of the options listed.

28. You’re deploying a new cluster with multiple NVIDIA A100 GPUs per node. You want to
ensure optimal inter-GPU communication performance using NVLink.
Which of the following configurations are critical for achieving maximum NVLink bandwidth?
A. All GPUs within a node must be the same model and have identical firmware versions.
B. The motherboard must support PCIe Gen5 to maximize NVLink bandwidth.
C. GPUs should be physically installed in slots that maximize direct NVLink connections based
on the server’s architecture.
D. The NVIDIA driver must be configured to enable NVLink; it is disabled by default.
E. The server must use a specific CPU model to leverage NVLink capabilities.
Answer: A,C
Explanation:
For optimal NVLink performance, several conditions must be met. GPUs of the same model and
firmware ensure compatibility and prevent performance bottlenecks. Physical placement is
critical; GPUs must be installed in slots that maximize direct NVLink connections, as defined by
the server’s architecture and documentation. While PCIe Gen5 is beneficial for overall system
performance, it does not directly impact NVLink bandwidth. NVLink is typically enabled by
default. Some CPU models may be preferable, but it’s the motherboard’s NVLink topology that
is more important.
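
After installation, the physical placement from option C can be confirmed from the host; entries such as NV4 or NV12 in the matrix indicate direct NVLink connections between GPU pairs:

    # Print the GPU interconnect matrix (NVLink vs. PCIe paths)
    nvidia-smi topo -m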

29. You have a GPU-intensive application that requires the latest features of CUDA 12.
However, your host system’s NVIDIA driver is only compatible with CUDA 11.8.
What steps can you take to enable your application to use CUDA 12 within a Docker container,
without upgrading the host driver?
A. Upgrade the NVIDIA driver on the host system to the latest version compatible with CUDA
12.
B. Use a Docker image based on ‘nvidia/cuda:12.0-base-ubuntu20.04’. The NVIDIA Container
Toolkit will automatically handle the driver compatibility between the host and the container.
C. Install CUDA 12 inside the Docker container and set the ‘CUDA_DRIVER_VERSION’
environment variable to match the host driver version.
D. Mount the CUDA 12 libraries from a separate Docker volume into the container and configure
the library search paths accordingly.
E. Downgrade the application to use CUDA 11.8 to match the host’s driver version.
Answer: B
Explanation:
The NVIDIA Container Toolkit enables compatibility between the host driver and the CUDA
version within the container (B). Using a Docker image based on
‘nvidia/cuda:12.0-base-ubuntu20.04’ will allow your application to leverage CUDA 12 features.
Upgrading the host driver (A) is an option but not necessary and may introduce other
compatibility issues. Setting ‘CUDA_DRIVER_VERSION’ (C) is not a standard or reliable approach.
Mounting CUDA libraries from a volume (D) is complex and might not resolve driver version
mismatches. Downgrading the application (E) avoids the problem but sacrifices access to
CUDA 12 features. Because NVIDIA ensures a degree of backwards compatibility, the newer
toolkit in the container can often work with an older driver on the host.

30. You encounter an error during the BlueField OS flashing process using ‘bfboot’: ‘ERROR:
Could not detect a BlueField device’.
Which of the following steps is MOST likely to resolve the issue?
A. Ensure the BlueField device is powered on and properly connected to the host system via
PCIe.
B. Update the ‘bfboot’ utility to the latest version. Older versions may have compatibility issues.
C. Install the Mellanox OFED drivers on the host system. These drivers are required for
‘bfboot’ to function correctly.
D. Verify that the correct PCIe slot is being used. Some systems may have specific slots
designated for SmartNICs.
E. Check the system’s BIOS/UEFI to confirm that SR-IOV is enabled.
Answer: A
Explanation:
The most basic and common cause of this error is a physical connection issue. Ensuring the
device is powered and connected is the first step. While other options might be relevant in
specific scenarios, connectivity is the most probable cause.

31. You’re working with a large dataset of microscopy images stored as individual TIFF files.
The images are accessed randomly during a training job. The current storage solution is a
single HDD. You’re tasked with improving data loading performance.
Which of the following storage optimizations would provide the GREATEST performance
improvement in this specific scenario?
A. Implementing data deduplication on the storage volume.
B. Migrating the data to a large, sequential HDD.
C. Replacing the HDD with a RAID 5 array of HDDs.
D. Replacing the HDD with a single NVMe SSD.
E. Compressing the TIFF files using a lossless compression algorithm.
Answer: D
Explanation:
Random access to numerous small files is a classic use case where SSDs excel due to their
low latency. Replacing the HDD with an NVMe SSD (option D) will provide the most significant
performance improvement. Data deduplication (A) saves storage space but doesn’t directly
improve random access speed. Migrating to a sequential HDD (B) is counterproductive for
random access. RAID 5 (C) provides some performance improvement but not as much as an
SSD. Compression (E) can reduce storage space but adds overhead during decompression.

32. Run GPU diagnostics.


Answer: C
Explanation:
Checking temperature is crucial first to avoid damaging the GPU if it’s overheating. Reseating
addresses potential connectivity issues. Running diagnostics identifies hardware faults.
Updating the driver should be done after hardware checks to ensure the card isn’t faulty.

33. You are working with a BlueField-3 DPU and wish to programmatically control the PCIe link
speed and width.
Which interface exposes the most direct way to manage these low-level hardware settings on
the DPU?
A. The standard Linux ‘ethtool’ utility.
B. The NVIDIA Management Library (NVML).
C. Directly accessing the PCIe configuration space via ‘/sys/bus/pci/...’.
D. Using the Mellanox mlxconfig utility or its equivalent.
E. Modifying the device tree blob (DTB) and rebooting the DPU.
Answer: D
Explanation:
The ‘mlxconfig’ utility (or its equivalent for newer BlueField generations) is specifically
designed for configuring Mellanox/NVIDIA network adapters, including BlueField DPUs. It
provides access to a wide range of hardware settings, including PCIe link speed and width.
NVML is primarily for GPU management, not DPU configuration. ‘ethtool’ manages network
interface settings, but not PCIe link parameters. Accessing the PCIe configuration space directly
is possible but complex and risky. Modifying the DTB is a more permanent and low-level
approach, typically used during initial board configuration, not for runtime adjustments.

34. You are setting up a BlueField-2 SmartNIC and want to offload network functions.
Which of the following are valid methods for enabling hardware offload capabilities?
A. Using the ‘ethtool’ command to enable specific offload features like checksum offload, TCP
segmentation offload (TSO), and UDP fragmentation offload (UFO).
B. Modifying the device tree to enable specific hardware features.
C. Installing and configuring the appropriate Mellanox OFED drivers, which automatically enable
many hardware offload features.
D. Running a custom script that programs the hardware offload engines directly.
E. Recompiling the Linux Kernel with the correct compilation flags.
Answer: A,C
Explanation:
The ‘ethtool’ command is used to configure various network interface settings, including
enabling/disabling hardware offload features. Installing the correct Mellanox OFED drivers is
crucial, as they provide the necessary modules and tools to utilize the hardware offload
capabilities. While device tree modification can influence hardware behavior, it’s less common
and typically handled by driver configuration. A custom script that programs the hardware
offload engines directly is impractical, and kernel recompilation is rarely necessary with
default settings.
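
Option A in practice: query current offloads with lowercase ‘-k’ and change them with uppercase ‘-K’ (the interface name is assumed):

    # Show which offload features the interface currently enables
    ethtool -k enp3s0f0 | grep -E 'segmentation|checksum'
    # Enable TSO and TX/RX checksum offload on the SmartNIC port
    sudo ethtool -K enp3s0f0 tso on tx on rx on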

35. You are configuring a RoCEv2 (RDMA over Converged Ethernet) network using BlueField-2
DPUs. You are observing packet loss and performance degradation. You suspect that
Congestion Control is not working correctly.
What configuration parameter most directly impacts RoCEv2 congestion control behavior?
A. MTU size on the RoCEv2 interfaces.
B. PFC (Priority Flow Control) configuration on the switch ports.
C. ECN (Explicit Congestion Notification) configuration on the switch ports and DPU interfaces.
D. The number of RDMA queues configured on the DPU.
E. The IOMMU configuration for the DPU.
Answer: C
Explanation:
ECN is the key mechanism for RoCEv2 congestion control. It allows network devices to signal
congestion to the endpoints, which can then reduce their transmission rate. Proper ECN
configuration on both the switches and the DPU interfaces is essential for effective congestion
control. While PFC can prevent packet loss due to buffer overflow, it doesn’t address
congestion in the same way as ECN. The other options are less directly related to RoCEv2
congestion control.

36. You are deploying a new NVLink Switch based cluster. The GPUs are installed in different
servers, but need to be configured to utilize
NVLink interconnect.
Which of the following should be performed during the installation phase to confirm correct
configuration?
A. Run NCCL tests to verify the GPU-to-GPU bandwidth and latency between servers.
B. Verify that GPUDirect RDMA is enabled and functioning correctly.
C. Check that the ‘nvidia-smi’ command shows the correct NVLink topology.
D. Run standard TCP/IP network bandwidth tests to check inter-server communication.
E. All the GPUs are in the same IP subnet.
Answer: A,B,C
Explanation:
NCCL tests are specifically designed to test GPU-to-GPU communication. Ensuring GPUDirect
RDMA is functioning is essential for low-latency communication. ‘nvidia-smi’ should display the
NVLink topology. TCP/IP tests do not test the NVLink connection. It does not matter whether
GPUs on different servers are in the same IP subnet, as NVLink communication occurs directly
between the GPUs using RDMA mechanisms. Subnetting affects traditional network-layer
communication, not low-level device communication.

37. Reseat the GPU.


38. You are evaluating different parallel file systems for an AI training cluster. You need a file
system that supports POSIX compliance and offers high bandwidth and low latency.
Which of the following options are viable candidates?
A. BeeGFS
B. GlusterFS
C. Ceph
D. Lustre
E. NFS
Answer: A,D
Explanation:
BeeGFS and Lustre are designed for high-performance computing and AI workloads, offering
high bandwidth, low latency, and POSIX compliance. GlusterFS and Ceph are more
general-purpose distributed file systems. NFS is generally not suitable for demanding AI
workloads due to its performance limitations.

39. Which protocol is commonly used in Spine-Leaf architectures for dynamic routing and load
balancing across multiple paths?
A. STP (Spanning Tree Protocol)
B. OSPF (Open Shortest Path First)
C. VRRP (Virtual Router Redundancy Protocol)
D. ECMP (Equal-Cost Multi-Path)
E. BGP (Border Gateway Protocol)
Answer: D
Explanation:
ECMP (Equal-Cost Multi-Path) is crucial for efficiently utilizing the multiple paths available in a
Spine-Leaf architecture. It allows traffic to be distributed across these paths, improving
throughput and reducing congestion. OSPF and BGP can be used for routing but do not
inherently provide per-packet load balancing. STP is used to prevent loops, and VRRP provides
router redundancy, neither of which directly address load balancing across multiple equal-cost
paths.

40. You are tasked with installing the NGC CLI on a host that does not have direct internet
access. You have downloaded the NGC CLI package to a local repository.
Which of the following steps are required to successfully install and configure the NGC CLI in
this offline environment?
A. Transfer the NGC CLI package to the host and install it using ‘pip install <package>.whl’.
B. Configure the NGC CLI to point to your local package repository by setting the environment
variable.
C. Manually download and install all dependencies of the NGC CLI package using ‘pip install
--no-index --find-links=/path/to/dependencies <package>.whl’.
D. Run ‘ngc config set’ to configure the API key, pointing to a local configuration file.
E. Copying the .whl file alone is sufficient; NGC CLI dependencies are always local.
Answer: A,B,C,D
Explanation:
In an offline environment, you need to install the package locally (A), configure the CLI to know
where to find the package (B), manually install dependencies (C), and configure the API key
(D).
Option E is wrong because dependencies must be handled manually in the offline environment.
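
Putting A, C, and D together on the offline host (directory and file names are placeholders):

    # Install the CLI and its dependencies from a local directory only
    pip install --no-index --find-links=/opt/pkgs /opt/pkgs/<ngc-cli-package>.whl
    # Then configure the API key as usual
    ngc config set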

41. Reseat the GPU.


E. 1. Check the power supply connections.
42. You are tasked with optimizing an Intel Xeon scalable processor-based server running a
TensorFlow model with multiple NVIDIA GPUs.
You observe that the CPU utilization is low, but the GPU utilization is also not optimal. The
profiler shows significant time spent in ‘tf.data’ operations.
Which of the following actions would MOST likely improve performance?
A. Increase the number of threads used for CPU-bound operations in TensorFlow using
‘tf.config.threading.set_intra_op_parallelism_threads()’.
B. Enable XLA (Accelerated Linear Algebra) compilation in TensorFlow.
C. Use ‘tf.data.AUTOTUNE’ to allow TensorFlow to dynamically optimize the data pipeline.
D. Reduce the global batch size to improve memory utilization.
E. Upgrade the server’s network adapter to a faster interface, such as 100GbE.
Answer: C
Explanation:
‘tf.data’ performance issues often stem from inefficient data pipelines. ‘tf.data.AUTOTUNE’
allows TensorFlow to dynamically optimize the pipeline by adjusting parameters such as
prefetch buffer size and the number of parallel calls to transformation functions. XLA compilation
optimizes graph execution, but ‘tf.data’ issues need to be addressed first. Increasing CPU
threads might help, but ‘AUTOTUNE’ is more specific to the problem. A smaller batch size could
negatively impact GPU utilization. Network upgrades are irrelevant as the problem lies within
the server.

43. You want to limit the GPU memory available to a specific Docker container running a deep
learning model.
Which of the following ‘docker run’ commands using the NVIDIA Container Toolkit is the most
appropriate?
A. docker run --gpus 'device=0:memory=4g' my-image
B. docker run --gpus all --memory 4g my-image
C. docker run --gpus device=0 --memory 4g my-image
D. docker run --gpus 'device=GPU-UUID:memory=4g' my-image
E. docker run --gpus all,memory=4g my-image
Answer: D
Explanation:
The correct syntax to limit GPU memory for a Docker container using the NVIDIA Container
Toolkit involves specifying the GPU device using its UUID and then setting the memory limit.
‘docker run --gpus ‘"device=GPU-UUlD:memory=4g"‘ my-image’ is the correct way to achieve
this. The others are syntactically incorrect or do not utilize the intended functionality. You need
to find the GPU UUID using ‘nvidia-smi’ and replace ‘GPIJ-IJIJID’ with it.
44. You have a large dataset stored on a BeeGFS file system. The training job is single node
and uses data augmentation to generate more data on the fly. The data augmentation process
is CPU-bound, but you notice that the GPU is underutilized due to the training data not being
fed to the GPU fast enough.
How can you reduce the load on the CPU and improve the overall training throughput?
A. Move the training data to a local NVMe drive on the training node.
B. Increase the number of BeeGFS metadata servers (MDSs) to improve metadata
performance.
C. Implement asynchronous I/O in the data loading pipeline using a library like NVIDIA DALI to
offload data processing tasks from the CPU to the GPU.
D. Decrease the batch size of the training job to reduce the amount of data being processed at
each iteration.
E. Enable data compression on the BeeGFS file system to reduce the amount of data being
transferred over the network.
Answer: C
Explanation:
Using NVIDIA DALI (option C) allows you to offload data augmentation and preprocessing tasks
from the CPU to the GPU, freeing up CPU resources for other tasks and enabling faster data
loading. Moving to a local NVMe drive (A) bypasses BeeGFS but doesn’t address the CPU
bottleneck. Increasing MDSs (B) improves metadata performance but doesn’t directly help with
the CPU-bound data augmentation. Decreasing the batch size (D) reduces the workload but
doesn’t solve the underlying CPU bottleneck. Data compression (E) can increase CPU load
due to the decompression process.

45. An AI server exhibits frequent kernel panics under heavy GPU load. ‘dmesg’ reveals the
following error: ‘NVRM: Xid (PCI:0000:3B:00): 79, pid=..., name=..., GPU has fallen off the
bus.’
Which of the following is the least likely cause of this issue?
A. Insufficient power supply to the GPU, causing it to become unstable under load.
B. A loose or damaged PCle riser cable connecting the GPU to the motherboard.
C. A driver bug in the NVIDIA drivers, leading to GPU instability.
D. Overclocking the GPU beyond its stable limits.
E. A faulty CPU.
Answer: E
Explanation:
The error message ‘GPU has fallen off the bus’ strongly suggests a hardware-related issue with
the GPU’s connection to the motherboard or its power supply. Insufficient power, a loose riser
cable, driver bugs, and overclocking can all lead to this. A faulty CPU, while capable of causing
system instability, is less directly related to the GPU falling off the bus and is therefore the least
likely cause in this specific scenario.

46. You are tasked with setting up network fabric ports to connect several servers, each with
multiple NVIDIA GPUs, to an InfiniBand switch. Each server has two ConnectX-6 adapters.
What is the best strategy to maximize bandwidth and redundancy between the servers and the
InfiniBand fabric?
A. Connect only one adapter from each server to the switch to minimize cable clutter.
B. Connect both adapters from each server to the same switch, but do not configure link
aggregation.
C. Connect both adapters from each server to the same switch and configure link aggregation
(LACP or static LAG) on both the server and the switch.
D. Connect one adapter from each server to one switch, and the second adapter to a different
switch, without link aggregation.
E. Connect one adapter from each server to one switch, and the second adapter to a different
switch, and configure multi-pathing on the servers.
Answer: E
Explanation:
Connecting each adapter to a different switch and configuring multi-pathing provides the highest
level of bandwidth and redundancy. Link aggregation to the same switch improves bandwidth
but doesn’t provide redundancy if that switch fails. Connecting only one adapter obviously limits
bandwidth. Multi-pathing allows the servers to use both adapters simultaneously, increasing
bandwidth, and provides automatic failover if one of the switches or links fails.

47. A BlueField-3 DPU is configured to run both control plane and data plane functions. After a
recent software update, you notice that the data plane performance has significantly degraded,
but the control plane remains responsive.
What is the MOST likely cause, assuming the update didn’t introduce any code bugs, and what
is the BEST approach to diagnose this issue?
A. Resource contention; use ‘perf’ or ‘bpftrace’ to profile the data plane processes and identify
resource bottlenecks (CPU, memory, cache).
B. Driver incompatibility; Downgrade the Mellanox OFED drivers to the previous version.
C. Firmware corruption; re-flash the BlueField DPU with the latest firmware image.
D. Network misconfiguration; verify the MTU and QoS settings on the network interfaces.
E. Power throttling; Check the DPU’s power consumption and thermal status via the BMC.
Answer: A
Explanation:
Resource contention is the MOST likely cause, assuming no code bugs. The update may have
increased the resource demands of either the control or data plane, leading to contention.
Profiling the data plane processes with ‘perf’ or ‘bpftrace’ helps pinpoint the bottlenecks.
Downgrading drivers or re-flashing firmware are more drastic steps to take after confirming that
resource contention isn’t the issue.

48. You’re monitoring the storage I/O for an AI training workload and observe high disk
utilization but relatively low CPU utilization.
Which of the following actions is LEAST likely to improve the performance of the training job?
A. Switching from HDDs to NVMe SSDs.
B. Implementing data prefetching to load data into memory before it’s needed.
C. Increasing the batch size of the training job.
D. Adding more RAM to the system.
E. Reducing the number of parallel data loading threads.
Answer: E
Explanation:
High disk utilization and low CPU utilization indicate an I/O bottleneck. Switching to faster
storage (A), prefetching data (B), increasing the batch size (C), and adding more RAM (D) can
all help alleviate the I/O bottleneck. Reducing the number of parallel data loading threads (E)
would likely worsen the bottleneck by underutilizing the available I/O bandwidth.
