normalization.h

LayerNorm and RMSNorm functions.

Enums

enum class NVTE_Norm_Type

Values:

enumerator LayerNorm

enumerator RMSNorm

Functions

void nvte_layernorm_fwd(const NVTETensor x, const NVTETensor gamma, const NVTETensor beta, const float epsilon, NVTETensor z, NVTETensor mu, NVTETensor rsigma, NVTETensor workspace, const int multiprocessorCount, const bool zero_centered_gamma, cudaStream_t stream)

Compute LayerNorm on the input.

The formula used:

\[ y = \frac{x - E[x]}{\sqrt{Var[x] + \varepsilon}} \gamma + \beta \]

Calling this function with workspace set to empty tensor will not perform the operation, but instead set the shape and type of the workspace tensor to the required values.

Parameters:

x – [in] Input tensor of shape [N, H].
gamma – [in] Gamma tensor of shape [H].
beta – [in] Beta tensor of shape [H].
epsilon – [in] Value added to denominator for numerical stability.
z – [inout] Output tensor of shape [N, H].
mu – [out] Mean of the input calculated over the last dimension. Shape: [N].
rsigma – [out] Inverse of the variance of the input calculated over the last dimension. Shape: [N].
workspace – [out] Workspace tensor.
multiprocessorCount – [in] Number of SMs in the device.
zero_centered_gamma – [in] Multiply normalized values by \( \gamma+1 \) instead of \( \gamma \)
stream – [in] CUDA stream used for the operation.

void nvte_layernorm_bwd(const NVTETensor dz, const NVTETensor x, const NVTETensor mu, const NVTETensor rsigma, const NVTETensor gamma, NVTETensor dx, NVTETensor dgamma, NVTETensor dbeta, NVTETensor workspace, const int multiprocessorCount, const bool zero_centered_gamma, cudaStream_t stream)

Compute backward of LayerNorm.

This function computes the gradient of function:

\[ y = \frac{x - E[x]}{\sqrt{Var[x] + \varepsilon}}\gamma + \beta \]

else with respect to \(x\), \(\gamma\) and \(\beta\).

Calling this function with workspace set to empty tensor will not perform the operation, but instead set the shape and type of these tensors to the required values.

Parameters:

dz – [in] Incoming gradient tensor of shape [N, H].
x – [in] Forward input tensor of shape [N, H].
mu – [in] Mean of the input calculated over the last dimension. Shape: [N].
rsigma – [in] Inverse of the variance of the input calculated over the last dimension. Shape: [N].
gamma – [in] Gamma tensor of shape [H].
dx – [out] Output gradient of shape [N, H].
dgamma – [out] Gradient for gamma tensor of shape [H].
dbeta – [out] Gradient for beta tensor of shape [H].
workspace – [out] Workspace tensor.
multiprocessorCount – [in] Number of SMs in the device.
zero_centered_gamma – [in] Multiply normalized values by \( \gamma+1 \) instead of \( \gamma \)
stream – [in] CUDA stream used for the operation.

void nvte_rmsnorm_fwd(const NVTETensor x, const NVTETensor gamma, const float epsilon, NVTETensor z, NVTETensor rsigma, NVTETensor workspace, const int multiprocessorCount, const bool zero_centered_gamma, cudaStream_t stream)

Compute RMSNorm.

The formula used:

\[ y = \frac{x}{RMS_\varepsilon(x)}\gamma \]

where

\[ RMS_\varepsilon(x) = \sqrt{\frac{1}{n}\sum_{i=0}^{n-1} x_i^2 + \varepsilon} \]

Calling this function with workspace and barrier set to empty tensor will not perform the operation, but instead set the shape and type of the workspace and barrier tensors to the required values.

Parameters:

x – [in] Input tensor of shape [N, H].
gamma – [in] Gamma tensor of shape [H].
epsilon – [in] Value added to denominator for numerical stability.
z – [inout] Output tensor of shape [N, H].
rsigma – [out] Reciprocal of the root mean square of the input calculated over the last dimension. Shape: [N].
workspace – [out] Workspace tensor.
multiprocessorCount – [in] Number of SMs in the device.
zero_centered_gamma – [in] Multiply normalized values by \( \gamma+1 \) instead of \( \gamma \)
stream – [in] CUDA stream used for the operation.

void nvte_rmsnorm_bwd(const NVTETensor dz, const NVTETensor x, const NVTETensor rsigma, const NVTETensor gamma, NVTETensor dx, NVTETensor dgamma, NVTETensor workspace, const int multiprocessorCount, const bool zero_centered_gamma, cudaStream_t stream)

Compute backward of RMSNorm.

This function computes the gradient of function:

\[ y = \frac{x}{RMS_\varepsilon(x)}\gamma \]

where

\[ RMS_\varepsilon(x) = \sqrt{\frac{1}{n}\sum_{i=0}^{n-1} x_i^2 + \varepsilon} \]

with respect to \(x\) and \(gamma\).

Calling this function with workspace, barrier, dgamma_part set to empty tensor will not perform the operation, but instead set the shape and type of these tensors to the required values.

Parameters:

dz – [in] Incoming gradient tensor of shape [N, H].
x – [in] Forward input tensor of shape [N, H].
rsigma – [in] Reciprocal of the root mean square of the input calculated over the last dimension. Shape: [N].
gamma – [in] Gamma tensor of shape [H].
dx – [out] Output gradient of shape [N, H].
dgamma – [out] Gradient for gamma tensor of shape [H].
workspace – [out] Workspace tensor.
multiprocessorCount – [in] Number of SMs in the device.
zero_centered_gamma – [in] Multiply normalized values by \( \gamma+1 \) instead of \( \gamma \)
stream – [in] CUDA stream used for the operation.

void nvte_enable_cudnn_norm_fwd(bool enable)

Helper to enable cuDNN backend for normalization.

Parameters:: bool – [in] Enable if True

void nvte_enable_cudnn_norm_bwd(bool enable)

void nvte_enable_zero_centered_gamma_in_weight_dtype(bool enable)

Control whether norm computes gamma += 1.0 for zero-centered gamma in weight dtype. If set to false, it will compute in compute dtype.

Currently this only applies to the CuDNN backend. If CuDNN is not used, this setting has no effect.

Parameters:: bool – [in] Enable if True