BERT Model Architecture

Author: 大多_C

Last modified: 2024-08-25 17:57:18

License: CC 4.0 BY-SA

Tags: bert, artificial intelligence, deep learning

First published: 2024-08-25 17:56:50

Article link: https://blog.csdn.net/weixin_46933702/article/details/141532178

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
  )
  (pooler): BertPooler(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (activation): Tanh()
  )
)
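
The layer shapes in the printout are enough to work out the parameter count by hand. The sketch below does that arithmetic in plain Python, using only the numbers shown above. The helper names `linear` and `layer_norm` are made up for this illustration and are not library functions.

```python
# Parameter count for the printed BertModel, derived only from the layer shapes.

def linear(in_features, out_features):
    """Weight matrix plus bias vector of a Linear layer with bias=True."""
    return in_features * out_features + out_features

def layer_norm(size):
    """Scale and shift vectors of a LayerNorm with elementwise_affine=True."""
    return 2 * size

HIDDEN, FFN = 768, 3072
VOCAB, POSITIONS, SEGMENTS, LAYERS = 30522, 512, 2, 12

# Three embedding tables plus the LayerNorm that follows them.
embeddings = (VOCAB + POSITIONS + SEGMENTS) * HIDDEN + layer_norm(HIDDEN)

per_layer = (
    3 * linear(HIDDEN, HIDDEN)   # query, key, value
    + linear(HIDDEN, HIDDEN)     # BertSelfOutput.dense
    + layer_norm(HIDDEN)         # BertSelfOutput.LayerNorm
    + linear(HIDDEN, FFN)        # BertIntermediate.dense
    + linear(FFN, HIDDEN)        # BertOutput.dense
    + layer_norm(HIDDEN)         # BertOutput.LayerNorm
)

pooler = linear(HIDDEN, HIDDEN)
total = embeddings + LAYERS * per_layer + pooler

print(f"embeddings: {embeddings:,}")
print(f"per layer:  {per_layer:,}")
print(f"pooler:     {pooler:,}")
print(f"total:      {total:,}")
```

The total comes to 109,482,240, which is where the usual "about 110M parameters" figure for BERT-base comes from. Roughly a fifth of that sits in the word embedding table alone.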

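BERT (Bidirectional Encoder Representations from Transformers) is a Transformer-based model designed to produce context-dependent token embeddings. Its architecture has three main components: an embedding layer, a stack of encoder layers, and a pooling layer. Each is described below, with concrete sizes taken from the BERT-base printout above.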

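1. Embedding layer (BertEmbeddings)

Word Embeddings: An embedding matrix maps each token in the vocabulary to a fixed-length vector. The vocabulary has 30522 entries and the embedding dimension is 768.
Position Embeddings: The Transformer architecture has no built-in notion of token order, so BERT adds a learned position embedding for each position. The table has 512 entries, which caps the input length at 512 tokens.
Token Type Embeddings: Token type embeddings (also called segment embeddings) distinguish sentence A from sentence B in sentence-pair tasks such as natural language inference and question answering. The table has only 2 entries, one per segment, each a 768-dimensional vector.
Layer Normalization and Dropout: The three embeddings are summed element-wise, LayerNorm normalizes the sum, and Dropout (p=0.1) is applied to reduce overfitting.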
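
The embedding step can be sketched in a few lines of PyTorch. This is a minimal illustration built from the shapes in the printout, not the library's BertEmbeddings class. The class name `EmbeddingsSketch` and the token IDs in the usage lines are arbitrary placeholders.

```python
import torch
from torch import nn

class EmbeddingsSketch(nn.Module):
    """Illustrative stand-in for BertEmbeddings (hypothetical name)."""

    def __init__(self, vocab_size=30522, hidden=768, max_positions=512, type_vocab=2):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, hidden, padding_idx=0)
        self.position_embeddings = nn.Embedding(max_positions, hidden)
        self.token_type_embeddings = nn.Embedding(type_vocab, hidden)
        self.LayerNorm = nn.LayerNorm(hidden, eps=1e-12)
        self.dropout = nn.Dropout(0.1)

    def forward(self, input_ids, token_type_ids):
        # Positions are simply 0, 1, ..., seq_len - 1 for every sequence.
        seq_len = input_ids.size(1)
        position_ids = torch.arange(seq_len, device=input_ids.device).unsqueeze(0)
        # The three lookups have the same shape and are added element-wise.
        summed = (
            self.word_embeddings(input_ids)
            + self.position_embeddings(position_ids)
            + self.token_type_embeddings(token_type_ids)
        )
        return self.dropout(self.LayerNorm(summed))

emb = EmbeddingsSketch().eval()  # eval() turns dropout off
# Arbitrary example IDs: 8 tokens, first half segment A, second half segment B.
input_ids = torch.tensor([[101, 7592, 2088, 102, 2129, 2024, 2017, 102]])
token_type_ids = torch.tensor([[0, 0, 0, 0, 1, 1, 1, 1]])
out = emb(input_ids, token_type_ids)
print(out.shape)
```

Each of the 8 tokens ends up as one 768-dimensional vector, so the output has shape (batch, sequence length, 768).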

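2. Encoder (BertEncoder)

The encoder is a stack of identical BertLayer modules: 12 in BERT-base and 24 in BERT-large. Each BertLayer contains:

Self-Attention Mechanism: Multi-head self-attention lets every token attend to all tokens in the sequence at once. The attention computation takes three linear projections of the input as its operands: query, key, and value.
BertSelfOutput: The attention output passes through a fully connected (linear) layer and Dropout, is added back to the layer input through a residual connection, and is then normalized with LayerNorm.
Intermediate Layer: The first half of the feed-forward network. A linear layer expands the 768-dimensional vectors to 3072 dimensions, followed by the GELU activation.
BertOutput: A linear layer projects the feed-forward output back to 768 dimensions, followed by Dropout, a residual connection, and LayerNorm.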
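
To show how the four sub-blocks are wired together, here is a rough PyTorch sketch of one layer and a 12-layer stack. It is an approximation under stated assumptions: `nn.MultiheadAttention` is swapped in for BertSdpaSelfAttention plus BertSelfOutput.dense, no attention mask is applied, and the class name `LayerSketch` is invented for this example.

```python
import torch
from torch import nn

class LayerSketch(nn.Module):
    """Illustrative wiring of one encoder layer (hypothetical name)."""

    def __init__(self, hidden=768, heads=12, ffn=3072, p=0.1):
        super().__init__()
        # Assumption: PyTorch's built-in attention stands in for the
        # query/key/value projections and the attention output projection.
        self.attn = nn.MultiheadAttention(hidden, heads, dropout=p, batch_first=True)
        self.attn_norm = nn.LayerNorm(hidden, eps=1e-12)
        self.intermediate = nn.Linear(hidden, ffn)   # 768 -> 3072
        self.act = nn.GELU()
        self.output = nn.Linear(ffn, hidden)         # 3072 -> 768
        self.out_norm = nn.LayerNorm(hidden, eps=1e-12)
        self.dropout = nn.Dropout(p)

    def forward(self, x):
        # Attention sub-block: dropout, residual add, then LayerNorm.
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = self.attn_norm(x + self.dropout(attn_out))
        # Feed-forward sub-block: expand, GELU, project back, residual, LayerNorm.
        ffn_out = self.output(self.act(self.intermediate(x)))
        return self.out_norm(x + self.dropout(ffn_out))

# BERT-base stacks 12 such layers; every layer keeps the (batch, seq, 768) shape.
encoder = nn.Sequential(*[LayerSketch() for _ in range(12)]).eval()
hidden_states = torch.randn(2, 16, 768)
with torch.no_grad():
    out = encoder(hidden_states)
print(out.shape)
```

Because every sub-block maps 768 dimensions back to 768 dimensions, layers can be stacked without any reshaping between them.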

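3. Pooling layer (BertPooler)

Dense Layer and Activation: The pooler applies a linear layer followed by a Tanh activation to the encoder output. It operates only on the final hidden state of the first token, [CLS], and the resulting 768-dimensional vector is typically used as the sequence representation for classification tasks.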
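
A minimal sketch of the pooling step, assuming the encoder output has shape (batch, sequence length, 768) and that [CLS] sits at position 0. The class name `PoolerSketch` is made up for this example.

```python
import torch
from torch import nn

class PoolerSketch(nn.Module):
    """Illustrative stand-in for BertPooler (hypothetical name)."""

    def __init__(self, hidden=768):
        super().__init__()
        self.dense = nn.Linear(hidden, hidden)
        self.activation = nn.Tanh()

    def forward(self, hidden_states):
        # Keep only the vector at position 0, where [CLS] is placed.
        cls_vector = hidden_states[:, 0]
        return self.activation(self.dense(cls_vector))

pooler = PoolerSketch()
hidden_states = torch.randn(4, 16, 768)  # random stand-in for encoder output
pooled = pooler(hidden_states)
print(pooled.shape)
```

The sequence dimension disappears: each input sequence is reduced to a single 768-dimensional vector, with every component bounded to the range of Tanh.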

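Because its encoder is bidirectional, BERT conditions each token's representation on the context to both its left and its right. Combined with a pre-train-then-fine-tune strategy, this has given BERT strong results across many natural language processing tasks. The components described above work together to let BERT learn and represent complex language patterns.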

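In the BERT model, "intermediate" refers to the intermediate layer (BertIntermediate) inside each BertLayer. It is the first half of the Transformer feed-forward network and is responsible for expanding the input feature vectors and applying a non-linear transformation. The details are as follows: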

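Function and structure of the Intermediate Layer:

Input and output:
- Input: the feature vectors produced by the attention block (BertAttention), 768-dimensional in BERT-base.
- Output: higher-dimensional feature vectors, 3072-dimensional in BERT-base (4 times the hidden size).
Linear transformation (Dense Layer):
- The first step is a linear (fully connected) layer that expands each vector from 768 to 3072 dimensions. The wider layer gives the model more capacity to capture complex features and patterns.
Activation function (GELU):
- After the linear transformation, BERT applies GELU (Gaussian Error Linear Unit). GELU is a smooth non-linearity that weights each input by the standard normal CDF, GELU(x) = x·Φ(x), instead of cutting off hard at zero as ReLU (Rectified Linear Unit) does. It often performs better than ReLU in Transformer models.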
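
The difference between GELU and ReLU is easiest to see on a few sample values. The sketch below uses the exact erf-based definition of GELU with only the Python standard library; the function names are chosen for this example.

```python
import math

def gelu(x: float) -> float:
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def relu(x: float) -> float:
    """ReLU: zero for negative inputs, identity otherwise."""
    return max(0.0, x)

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  gelu={gelu(x):+.4f}  relu={relu(x):+.4f}")
```

For large positive inputs the two functions nearly coincide. For small negative inputs GELU returns a small negative value rather than exactly zero, so the function stays smooth around the origin.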

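The intermediate layer is a key part of each BertLayer. By combining linear expansion with a non-linear transformation, it adds flexibility and expressive power, which helps BERT capture deeper features of the input sequence.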

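This part covers the structure of a single BertLayer, which consists of the self-attention mechanism, the intermediate layer, and the output layer. Each component is described below: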

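Self-Attention

1. BertSdpaSelfAttention

Query, Key, Value linear projections:
- Query: a linear layer maps each input feature vector (768-dimensional) into a new 768-dimensional space.
- Key: a separate linear layer of the same shape as Query, with its own weights.
- Value: a third linear layer, also mapping the input to 768 dimensions.
Dropout: after the attention weights are computed, Dropout (p=0.1) is applied to them to reduce overfitting.

The core idea of self-attention is to compute attention weights from Query and Key (scaled dot products followed by softmax) and then use those weights to take a weighted sum of Value. In this way every input token can "attend to" the other tokens in the sequence and pick up contextual information. In BERT-base the 768 dimensions are split across 12 heads of 64 dimensions each, and every head computes its own attention weights.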
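
The computation described above can be written out explicitly. The following is a simplified sketch, assuming 12 heads of 64 dimensions as in BERT-base and omitting the attention mask. It spells out the matrix products instead of calling a fused kernel, and the class name `SelfAttentionSketch` is invented for this example.

```python
import math
import torch
from torch import nn

class SelfAttentionSketch(nn.Module):
    """Illustrative multi-head self-attention (hypothetical name)."""

    def __init__(self, hidden=768, heads=12, p=0.1):
        super().__init__()
        assert hidden % heads == 0
        self.heads, self.head_dim = heads, hidden // heads
        self.query = nn.Linear(hidden, hidden)
        self.key = nn.Linear(hidden, hidden)
        self.value = nn.Linear(hidden, hidden)
        self.dropout = nn.Dropout(p)

    def split_heads(self, x):
        # (batch, seq, hidden) -> (batch, heads, seq, head_dim)
        b, t, _ = x.shape
        return x.view(b, t, self.heads, self.head_dim).transpose(1, 2)

    def forward(self, x):
        q = self.split_heads(self.query(x))
        k = self.split_heads(self.key(x))
        v = self.split_heads(self.value(x))
        # Scaled dot products between every pair of tokens, per head.
        scores = q @ k.transpose(-1, -2) / math.sqrt(self.head_dim)
        # Softmax turns scores into weights; dropout is applied to the weights.
        weights = self.dropout(scores.softmax(dim=-1))
        context = weights @ v  # weighted sum of the value vectors
        # Merge the heads back into one 768-dimensional vector per token.
        b, _, t, _ = context.shape
        merged = context.transpose(1, 2).reshape(b, t, self.heads * self.head_dim)
        return merged, weights

attn = SelfAttentionSketch().eval()  # eval() turns dropout off
x = torch.randn(2, 10, 768)
context, weights = attn(x)
print(context.shape, weights.shape)
```

With dropout disabled, each row of `weights` sums to 1, so every output vector is a convex combination of the value vectors in its head.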