Layer Normalization

果壳小旋子

Last modified: 2025-06-23 13:24:17

License: CC 4.0 BY-SA

Categories: Restormer, Machine Learning. Tags: deep learning, artificial intelligence, transformer, LN, pytorch

First published: 2023-08-24 15:41:28

Article link: https://blog.csdn.net/m0_47867419/article/details/132472020

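This article introduces layer normalization (LN), a method for reducing the training time of deep neural networks that also applies to RNNs. Unlike batch normalization, LN does not depend on the batch size, so it handles small batches and online learning directly. It normalizes using the inputs to the neurons within a single hidden layer, which makes training more stable.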

For the full details, see the paper: Layer Normalization (Ba, Kiros, and Hinton, 2016)

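Training deep neural networks is computationally expensive. One effective way to reduce training time is to normalize the activities of the neurons, as batch normalization (BN) does. However, BN is sensitive to the mini-batch size and cannot be applied directly to recurrent neural networks (RNNs). Layer normalization (LN) was proposed to address these problems: it applies directly to RNNs and can substantially reduce training time. Unlike batch normalization, layer normalization estimates the normalization statistics directly from the summed inputs to the neurons within a hidden layer, so it does not introduce any new dependencies between training cases.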

🍎 Background

A feed-forward neural network is a non-linear mapping from an input pattern $\mathbf{x}$ to an output vector $\mathbf{y}$. Consider the $l^{\text{th}}$ hidden layer in a deep feed-forward neural network, and let $a^l$ be the vector representation of the summed inputs to the neurons in that layer; that is, $a_i^l$ is the linearly weighted sum feeding the $i^{\text{th}}$ neuron of layer $l$. The summed inputs are computed through a linear projection with the weight matrix $W^l$ and the bottom-up inputs $h^l$, given as follows:
$a_i^l={w_i^l}^{\top} h^l \quad h_i^{l+1}=f\left(a_i^l+b_i^l\right)$

where $f(\cdot)$ is an element-wise non-linear function (the activation function), $w_i^l$ is the vector of incoming weights to the $i^{\text{th}}$ hidden unit, and $b_i^l$ is the scalar bias parameter. The parameters of the neural network are learnt using gradient-based optimization algorithms, with the gradients computed by back-propagation.
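
To make the notation concrete, here is a minimal pure-Python sketch of one such layer. The function name `dense_layer`, the choice of `tanh` for $f$, and the toy numbers are assumptions made for illustration; they do not come from the paper.

```python
import math

def dense_layer(W, b, h, f=math.tanh):
    """Return (a, h_next) with a_i = w_i^T h and h_next_i = f(a_i + b_i)."""
    # Each row of W holds the incoming weights w_i of one hidden unit
    a = [sum(w_ij * h_j for w_ij, h_j in zip(w_i, h)) for w_i in W]
    h_next = [f(a_i + b_i) for a_i, b_i in zip(a, b)]
    return a, h_next

W = [[1.0, 0.0], [0.5, -0.5]]   # hypothetical weights for 2 hidden units
b = [0.0, 1.0]                  # hypothetical biases
h = [2.0, 4.0]                  # bottom-up inputs to the layer

a, h_next = dense_layer(W, b, h)
print(a)        # summed inputs, the quantities that BN and LN normalize
print(h_next)   # activations passed on to the next layer
```

The vector `a` is what the normalization methods in the following sections operate on, before the non-linearity is applied.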

🍌 Batch Normalization

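BN was proposed to reduce covariate shift. During training it normalizes the summed inputs to each hidden unit. For example, for the $i^{\text{th}}$ summed input $a_i^l$ in the $l^{\text{th}}$ layer, BN rescales it according to its distribution under the input data: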
$\bar{a}_i^l=\frac{g_i^l}{\sigma_i^l}\left(a_i^l-\mu_i^l\right) \quad \mu_i^l=\underset{\mathbf{x} \sim P(\mathbf{x})}{\mathbb{E}}\left[a_i^l\right] \quad \sigma_i^l=\sqrt{\underset{\mathbf{x} \sim P(\mathbf{x})}{\mathbb{E}}\left[\left(a_i^l-\mu_i^l\right)^2\right]}$

where $\bar{a}_i^l$ is the normalized summed input to the $i^{\text{th}}$ hidden unit in the $l^{\text{th}}$ layer and $g_i^l$ is a gain parameter that scales the normalized activation before the non-linear activation function.

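In practice, the true $\mu$ and $\sigma$ are not computed; they are estimated from the samples in a mini-batch instead, which is why BN requires that the batch size not be too small. However, some online learning tasks and very large distributed models often call for a very small batch size.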
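
The mini-batch estimate can be sketched in a few lines of plain Python. The helper name `batch_norm`, the default gain of 1, and the small `eps` added under the square root for numerical safety are assumptions for illustration, not part of the formula above. The second call shows why tiny batches are a problem: with a single sample, every feature equals its own batch mean, so the output collapses to zeros.

```python
import math

def batch_norm(batch, gain=None, eps=1e-5):
    """Normalize each feature (column) with statistics taken over the batch (rows)."""
    n, d = len(batch), len(batch[0])
    gain = gain if gain is not None else [1.0] * d
    # Per-feature mean and biased variance, estimated from this mini-batch only
    mu = [sum(row[i] for row in batch) / n for i in range(d)]
    var = [sum((row[i] - mu[i]) ** 2 for row in batch) / n for i in range(d)]
    return [
        [gain[i] * (row[i] - mu[i]) / math.sqrt(var[i] + eps) for i in range(d)]
        for row in batch
    ]

two_samples = batch_norm([[1.0, 2.0], [3.0, 6.0]])  # each column becomes roughly [-1, +1]
one_sample = batch_norm([[3.0, -1.0]])              # batch size 1: every entry is 0

print(two_samples)
print(one_sample)
```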

🍑 Layer Normalization

The layer normalization statistics are computed over all the hidden units in the same layer:
$\mu^l=\frac{1}{H} \sum_{i=1}^H a_i^l \quad \sigma^l=\sqrt{\frac{1}{H} \sum_{i=1}^H\left(a_i^l-\mu^l\right)^2}$

$H$ is the number of hidden units in a layer. Under LN, all the hidden units in a layer share the same $\mu$ and $\sigma$, but different training cases have different normalization terms. Unlike batch normalization, layer normalization does not impose any constraint on the size of a mini-batch, and it can be used in the pure online regime with batch size 1.
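
As a rough sketch of these statistics, the plain-Python function below normalizes a single vector of summed inputs. The name `layer_norm`, the optional `gain` and `bias` arguments, and the `eps` term are assumptions added for illustration. Because nothing outside the vector is used, it behaves the same for a batch of 1, and rescaling all summed inputs by a constant leaves the output essentially unchanged.

```python
import math

def layer_norm(a, gain=None, bias=None, eps=1e-5):
    """Normalize one sample's summed inputs a (length H) with its own mean and std."""
    H = len(a)
    gain = gain if gain is not None else [1.0] * H
    bias = bias if bias is not None else [0.0] * H
    mu = sum(a) / H
    sigma = math.sqrt(sum((a_i - mu) ** 2 for a_i in a) / H + eps)
    return [g * (a_i - mu) / sigma + b for a_i, g, b in zip(a, gain, bias)]

a = [1.0, 2.0, 3.0, 4.0]
out = layer_norm(a)
out_scaled = layer_norm([10.0 * a_i for a_i in a])  # same sample, inputs scaled by 10

print(out)
print(out_scaled)  # nearly identical to out; the tiny difference comes from eps
```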

In a standard RNN, the summed inputs in the recurrent layer are computed from the current input $\mathbf{x}^t$ and the previous vector of hidden states $\mathbf{h}^{t-1}$, as $\mathbf{a}^t=W_{h h} \mathbf{h}^{t-1}+W_{x h} \mathbf{x}^t$. The layer-normalized recurrent layer re-centers and re-scales its activations using the extra normalization terms:
$\mathbf{h}^t=f\left[\frac{\mathbf{g}}{\sigma^t} \odot\left(\mathbf{a}^t-\mu^t\right)+\mathbf{b}\right] \quad \mu^t=\frac{1}{H} \sum_{i=1}^H a_i^t \quad \sigma^t=\sqrt{\frac{1}{H} \sum_{i=1}^H\left(a_i^t-\mu^t\right)^2}$

where $W_{h h}$ are the recurrent hidden-to-hidden weights and $W_{x h}$ are the bottom-up input-to-hidden weights. $\odot$ is the element-wise multiplication between two vectors. $\mathbf{b}$ and $\mathbf{g}$ are the bias and gain parameters, of the same dimension as $\mathbf{h}^t$.
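
One step of this recurrence might look like the following pure-Python sketch. The function names `matvec` and `ln_rnn_step`, the use of `tanh` for $f$, the `eps` term, and all the numbers are assumptions for illustration. The second call multiplies both weight matrices by 100 to show that the normalized hidden state barely moves, which is the rescaling invariance LN provides.

```python
import math

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def ln_rnn_step(W_hh, W_xh, h_prev, x, g, b, eps=1e-5):
    """h^t = tanh(g / sigma^t * (a^t - mu^t) + b), a^t = W_hh h^{t-1} + W_xh x^t."""
    a = [u + v for u, v in zip(matvec(W_hh, h_prev), matvec(W_xh, x))]
    H = len(a)
    mu = sum(a) / H
    sigma = math.sqrt(sum((a_i - mu) ** 2 for a_i in a) / H + eps)
    return [math.tanh(g_i * (a_i - mu) / sigma + b_i) for a_i, g_i, b_i in zip(a, g, b)]

W_hh = [[0.5, -0.2], [0.1, 0.3]]   # hypothetical hidden-to-hidden weights
W_xh = [[1.0], [-1.0]]             # hypothetical input-to-hidden weights
h_prev, x = [0.1, -0.4], [2.0]
g, b = [1.0, 1.0], [0.0, 0.0]

h = ln_rnn_step(W_hh, W_xh, h_prev, x, g, b)

# Scale every weight by 100: the summed inputs grow 100x, the output does not
scale = lambda W: [[100.0 * w for w in row] for row in W]
h_scaled = ln_rnn_step(scale(W_hh), scale(W_xh), h_prev, x, g, b)

print(h)
print(h_scaled)
```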

Standard RNNs suffer from exploding and vanishing gradients; with LN applied, the hidden-state dynamics become more stable.

[Figure: two illustrative images from the original post (not available)]

For a video walkthrough, see: What is Layer Normalization? | Deep Learning Fundamentals

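🍟 A worked example

BatchNorm (batch normalization)

BN normalizes within a batch: along the same dimension, across different samples.

For example:

Suppose each sample has 5 metrics (features):

Height, weight, blood pressure, heart rate, blood glucose

A batch contains 4 people (4 samples), and the data might look like this: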

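| | Height | Weight | Blood pressure | Heart rate | Blood glucose |
|---|---|---|---|---|---|
| Person 1 | 170 | 65 | 120 | 75 | 5.2 |
| Person 2 | 180 | 72 | 130 | 68 | 4.8 |
| Person 3 | 160 | 60 | 110 | 80 | 5.0 |
| Person 4 | 175 | 70 | 125 | 72 | 5.1 |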

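For the height column, BatchNorm:

Takes the 4 values in that column (170, 180, 160, 175), computes their mean and variance, and normalizes them (subtract the mean, then divide by the standard deviation).
The weight column is handled the same way; every column is normalized separately.

In essence:

Normalization happens along one dimension (height, say), across different samples.
This reduces the instability caused by changes in the distribution within a batch.
However, it is sensitive to the batch size, and results degrade when the batch size is too small.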
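
The column-wise computation can be checked with Python's standard `statistics` module. This is only a sketch on the toy table: the variable names are made up for illustration, the population standard deviation (`pstdev`) is assumed, and the gain and the small epsilon used in real BN layers are left out.

```python
from statistics import fmean, pstdev

features = ["height", "weight", "blood pressure", "heart rate", "blood glucose"]
batch = [
    [170, 65, 120, 75, 5.2],  # person 1
    [180, 72, 130, 68, 4.8],  # person 2
    [160, 60, 110, 80, 5.0],  # person 3
    [175, 70, 125, 72, 5.1],  # person 4
]

# BatchNorm view: one mean and one standard deviation per column (feature),
# computed across all people in the batch
columns = list(zip(*batch))
bn_columns = [
    [(v - fmean(col)) / pstdev(col) for v in col]
    for col in columns
]

for name, col in zip(features, bn_columns):
    print(name, [round(v, 3) for v in col])
```

After this step every column has mean 0 and standard deviation 1, so each value says how a person compares with the rest of the batch on that one metric.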

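LayerNorm (layer normalization)

LN normalizes within each sample, across all of its features.

For example:

With the same data, each person is one sample (a vector), and LN normalizes each person separately.

Take person 1 ([170, 65, 120, 75, 5.2]):

Compute the mean and variance of these 5 values
Normalize all 5 values (subtract the mean from each of the person's 5 metrics, then divide by the standard deviation)

Person 2 is handled the same way, using only the 5 values [180, 72, 130, 68, 4.8].

In essence:

Normalization happens within one sample (person), across all of its features.
It does not depend on the batch size; each sample can be processed independently.
It suits NLP/Transformer models, time-series tasks, and settings where the batch size is small.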
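
The same check for the per-person computation, again as a sketch with the standard `statistics` module. The helper name `normalize_sample` is hypothetical, the population standard deviation is assumed, and the gain, bias, and epsilon are omitted.

```python
from statistics import fmean, pstdev

person_1 = [170, 65, 120, 75, 5.2]
person_2 = [180, 72, 130, 68, 4.8]

def normalize_sample(x):
    """LayerNorm view: statistics come from this one sample's own features."""
    mu, sigma = fmean(x), pstdev(x)
    return [(v - mu) / sigma for v in x]

ln_1 = normalize_sample(person_1)
ln_2 = normalize_sample(person_2)   # computed without looking at person 1 at all

print([round(v, 3) for v in ln_1])
print([round(v, 3) for v in ln_2])
```

In this toy table the features have different units, so the result mostly reflects their scales (blood glucose always comes out lowest). In a real network LN is applied to a layer's summed inputs, as in the formulas earlier, where the units of a layer are on comparable footing.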

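In one sentence:

BN: a cross-sample comparison, asking how one metric is distributed across a group of people
LN: a within-sample comparison, asking how one person's own metrics are distributed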

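📖 Code implementation

Here is the implementation of the LN layer from Restormer.

First, define two functions for reshaping. Going from 4D to 3D needs no extra arguments, because it only merges two existing dimensions; going from 3D to 4D does need arguments, because one dimension has to be split into two.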

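```python
from einops import rearrange

def to_3d(x):
    # (batch, channels, height, width) -> (batch, height*width, channels)
    return rearrange(x, 'b c h w -> b (h w) c')

def to_4d(x, h, w):
    # (batch, height*width, channels) -> (batch, channels, height, width)
    return rearrange(x, 'b (h w) c -> b c h w', h=h, w=w)
```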

Define an LN layer without a bias. The weight is a learnable parameter, so it is wrapped in `nn.Parameter`.

```python
import numbers

import torch
import torch.nn as nn

# LayerNorm layer without a bias term
class BiasFree_LayerNorm(nn.Module):
    def __init__(self, normalized_shape):
        super(BiasFree_LayerNorm, self).__init__()
        # If normalized_shape is a single integer, wrap it in a tuple
        if isinstance(normalized_shape, numbers.Integral):
            normalized_shape = (normalized_shape,)
        normalized_shape = torch.Size(normalized_shape)

        assert len(normalized_shape) == 1

        self.weight = nn.Parameter(torch.ones(normalized_shape))
        self.normalized_shape = normalized_shape

    def forward(self, x):
        # x has shape (batch_size, height * width, channels)
        # sigma has shape (batch_size, height * width, 1); it holds the variance
        sigma = x.var(-1, keepdim=True, unbiased=False)
        # The mean is not subtracted from x here; x is only rescaled
        return x / torch.sqrt(sigma+1e-5) * self.weight
```
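
One detail of the bias-free variant is easy to miss: it divides by the standard deviation (which is computed around the mean) but never subtracts the mean from `x`. The plain-Python sketch below mirrors that arithmetic on a single vector; the name `bias_free_norm` and the toy input are assumptions for illustration.

```python
import math

def bias_free_norm(x, weight, eps=1e-5):
    """Rescale x by its own standard deviation without subtracting the mean."""
    H = len(x)
    mu = sum(x) / H
    var = sum((v - mu) ** 2 for v in x) / H   # biased variance, like unbiased=False
    return [v / math.sqrt(var + eps) * w for v, w in zip(x, weight)]

x = [1.0, 2.0, 3.0, 4.0]
y = bias_free_norm(x, weight=[1.0] * 4)

print(y)
print(sum(y) / len(y))   # not zero: the mean of x is still present, only rescaled
```

So the output has unit variance but keeps a rescaled copy of the input's mean, unlike the with-bias variant, which centers the values first.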

Define an LN layer with a bias. Here too, both the weight and the bias are learnable parameters.

```python
class WithBias_LayerNorm(nn.Module):
    def __init__(self, normalized_shape):
        super(WithBias_LayerNorm, self).__init__()
        # If normalized_shape is a single integer, wrap it in a tuple
        if isinstance(normalized_shape, numbers.Integral):
            normalized_shape = (normalized_shape,)
        normalized_shape = torch.Size(normalized_shape)

        assert len(normalized_shape) == 1

        self.weight = nn.Parameter(torch.ones(normalized_shape))
        # Compared with the bias-free layer, this one also defines a bias
        self.bias = nn.Parameter(torch.zeros(normalized_shape))
        self.normalized_shape = normalized_shape

    def forward(self, x):
        mu = x.mean(-1, keepdim=True)
        # sigma holds the (biased) variance over the last dimension
        sigma = x.var(-1, keepdim=True, unbiased=False)
        # Center, rescale, then apply the learnable weight and bias
        return (x - mu) / torch.sqrt(sigma+1e-5) * self.weight + self.bias
```

Wrap the classes above into a single, unified layer-normalization module.

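```python
class LayerNorm(nn.Module):
    def __init__(self, dim, LayerNorm_type):
        super(LayerNorm, self).__init__()
        # 'BiasFree' selects the variant without a bias; anything else uses the bias
        if LayerNorm_type == 'BiasFree':
            self.body = BiasFree_LayerNorm(dim)
        else:
            self.body = WithBias_LayerNorm(dim)

    def forward(self, x):
        # x: (batch, channels, height, width); normalize over channels at each pixel
        h, w = x.shape[-2:]
        return to_4d(self.body(to_3d(x)), h, w)
```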
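
To see what the wrapper computes end to end, here is a self-contained sketch that reproduces the same arithmetic with plain PyTorch reshapes instead of einops and compares the with-bias path against `torch.nn.functional.layer_norm`. The function name `channel_layer_norm`, the tensor sizes, and the random seed are assumptions for illustration; this is not the Restormer code itself.

```python
import torch
import torch.nn.functional as F

def channel_layer_norm(x, weight, bias=None, eps=1e-5):
    """Normalize a (b, c, h, w) tensor over the channel axis at every pixel."""
    b, c, h, w = x.shape
    tokens = x.flatten(2).transpose(1, 2)              # (b, h*w, c), like to_3d
    var = tokens.var(-1, keepdim=True, unbiased=False)
    if bias is None:
        out = tokens / torch.sqrt(var + eps) * weight  # bias-free: no centering
    else:
        mu = tokens.mean(-1, keepdim=True)
        out = (tokens - mu) / torch.sqrt(var + eps) * weight + bias
    return out.transpose(1, 2).reshape(b, c, h, w)     # back to 4D, like to_4d

torch.manual_seed(0)
x = torch.randn(2, 8, 4, 5)                 # hypothetical feature map
weight, bias = torch.ones(8), torch.zeros(8)

y = channel_layer_norm(x, weight, bias)     # with-bias path
y_bf = channel_layer_norm(x, weight)        # bias-free path

# Reference: PyTorch's built-in layer norm over the last (channel) axis
ref = F.layer_norm(x.permute(0, 2, 3, 1), (8,), weight, bias, eps=1e-5)
ref = ref.permute(0, 3, 1, 2)

print(y.shape, torch.allclose(y, ref, atol=1e-5))
```

The comparison suggests that the with-bias variant amounts to a standard layer norm over the channel dimension, applied independently at each spatial position, while the bias-free variant differs only in skipping the centering step.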