When I ran “nvidia-smi -q” on an A2000, I found these temperature fields:
Maximum Operating Temperature:
GPU Target Temperature:
What is the difference between them?
My understanding is:
Maximum Operating Temperature: GPU clocks drop when approaching this temperature.
GPU Target temperature: GPU fan is controlled to target this temperature.
You are mostly correct. For GPUs with active (fan) cooling, you will see these (example) outputs:
Temperature
GPU Current Temp : 30 C
GPU Shutdown Temp : 95 C
GPU Slowdown Temp : 92 C
GPU Max Operating Temp : 88 C
GPU Target Temperature : 83 C
Clock throttling happens at the Slowdown temp, while Target and Max Operating are the min and max values for the active cooling control. What that control does depends on your specific GPU, but it would, for example, set the fan speed. With newer drivers you can also set the target temperature (within the limits the GPU allows) through nvidia-smi. This should also be described in the nvidia-smi documentation.
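To make the relationship between the thresholds concrete, here is a minimal Python sketch that parses the Temperature block shown above and reports which threshold the current reading has crossed. The function names (parse_temperatures, zone) and the zone labels are made up for this illustration, and the ordering of the zones simply follows the explanation in this post; it is not an official NVIDIA interpretation.

```python
import re

# Sample text copied from the example output above.
SAMPLE = """\
Temperature
GPU Current Temp : 30 C
GPU Shutdown Temp : 95 C
GPU Slowdown Temp : 92 C
GPU Max Operating Temp : 88 C
GPU Target Temperature : 83 C
"""

def parse_temperatures(text):
    """Return {field name: degrees C, or None for N/A} for 'GPU ... : N C' lines."""
    temps = {}
    for line in text.splitlines():
        match = re.match(r"\s*(GPU [^:]+?)\s*:\s*(\d+ C|N/A)\s*$", line)
        if match:
            name, value = match.groups()
            temps[name] = None if value == "N/A" else int(value.split()[0])
    return temps

def zone(temps, current=None):
    """Label the highest threshold reached (labels are hypothetical)."""
    if current is None:
        current = temps["GPU Current Temp"]
    checks = [
        ("GPU Shutdown Temp", "at or above shutdown"),
        ("GPU Slowdown Temp", "at or above slowdown (HW clock throttling)"),
        ("GPU Max Operating Temp", "at or above max operating"),
        ("GPU Target Temperature", "at or above target"),
    ]
    for field, label in checks:
        limit = temps.get(field)
        if limit is not None and current >= limit:
            return label
    return "below target"

if __name__ == "__main__":
    temps = parse_temperatures(SAMPLE)
    print(temps)
    print(zone(temps))
```

Passing a different reading, for example zone(temps, current=90), shows where that temperature would fall relative to the same thresholds.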
I am getting instances of throttle reason HwThermal at 90 C, and there are no other indicators.
Running nvidia-smi -q on the A4000s gives:
Temperature
GPU Current Temp : 53 C
GPU T.Limit Temp : N/A
GPU Shutdown Temp : 103 C
GPU Slowdown Temp : 100 C
GPU Max Operating Temp : 98 C
GPU Target Temperature : 90 C
HW Slowdown
HW Slowdown (reducing the core clocks by a factor of 2 or more) is engaged.
This is an indicator of:
Temperature being too high
External Power Brake Assertion is triggered (e.g. by the system power supply)
Power draw is too high and Fast Trigger protection is reducing the clocks
Would the HW Slowdown correspond to the Slowdown Temp? If so, this seems to imply that there is throttling by less than a factor of 2 at temperatures below the Slowdown Temp.
Could you help me understand the output of nvidia-smi?
Hi @derek.lee and welcome to the NVIDIA developer forums.
With thermal throttling there is always some form of hysteresis. That means that by the time throttling is reported, the GPU hotspot may already have reached the Slowdown temp and cooled down again because of the throttling. With the help of the fans, the GPU will always try to stay at the target temperature and below the max operating temp.
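As a rough illustration of the hysteresis effect, the following Python sketch models a throttle that engages at one temperature and only releases at a lower one. The release threshold (89 C) and the sequence of readings are invented for the example; the actual release behaviour of the driver is not stated in this thread. Only the engage value (100 C) is taken from the Slowdown Temp in the A4000 output above.

```python
def throttle_states(readings, engage_at, release_at):
    """Return one bool per reading: True while the (modelled) throttle is active.

    The throttle turns on when a reading reaches engage_at and stays on
    until a reading falls to release_at or below. Both thresholds are
    parameters of this toy model, not values read from the driver.
    """
    throttled = False
    states = []
    for temp in readings:
        if not throttled and temp >= engage_at:
            throttled = True
        elif throttled and temp <= release_at:
            throttled = False
        states.append(throttled)
    return states

if __name__ == "__main__":
    readings = [96, 100, 95, 90, 88]  # hypothetical samples in degrees C
    states = throttle_states(readings, engage_at=100, release_at=89)
    for temp, state in zip(readings, states):
        print(f"{temp} C -> {'throttled' if state else 'not throttled'}")
```

In this model the 95 C and 90 C samples are still reported as throttled even though both are below the engage threshold, which is one way a throttle reason can show up alongside a temperature lower than the Slowdown temp.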
You should look into improving your cooling solution.
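To check whether a cooling change actually helps, it can be useful to log the temperature together with the thermal throttle flags. The sketch below is one possible way to do that from Python. It assumes the query fields temperature.gpu and clocks_throttle_reasons.* are available in your driver version (please verify with nvidia-smi --help-query-gpu, since field names have changed between driver releases), and it assumes the flags are printed as "Active" or "Not Active". The helper names and the sample row in the comment are hypothetical.

```python
import subprocess

# Query fields; assumed to exist in the installed driver (check --help-query-gpu).
FIELDS = [
    "temperature.gpu",
    "clocks_throttle_reasons.hw_slowdown",
    "clocks_throttle_reasons.hw_thermal_slowdown",
    "clocks_throttle_reasons.sw_thermal_slowdown",
]

def build_command(gpu_index=0):
    """Build the nvidia-smi argument list for one GPU."""
    return [
        "nvidia-smi",
        f"--id={gpu_index}",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]

def parse_row(row):
    """Turn one CSV row into a dict with an int temperature and bool flags."""
    values = [value.strip() for value in row.split(",")]
    record = dict(zip(FIELDS, values))
    record["temperature.gpu"] = int(record["temperature.gpu"])
    for name in FIELDS[1:]:
        record[name] = record[name] == "Active"
    return record

def sample(gpu_index=0):
    """Run nvidia-smi once and return the parsed record (needs a GPU and driver)."""
    result = subprocess.run(
        build_command(gpu_index), capture_output=True, text=True, check=True
    )
    return parse_row(result.stdout.strip().splitlines()[0])

# Example with a made-up row, so it runs without a GPU:
# parse_row("90, Not Active, Active, Not Active")
```

Calling sample() in a loop with a short sleep, before and after changing the cooling setup, would show whether the thermal flags still become active.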