Close Reading of "Attention Is All You Need" (15)

CS创新实验室

Last modified: 2025-06-16 14:45:21


First published: 2025-06-16 14:45:05

Article link: https://blog.csdn.net/qiwsir/article/details/148691251


Original text 40

6.2 Model Variations

To evaluate the importance of different components of the Transformer, we varied our base model in different ways, measuring the change in performance on English-to-German translation on the development set, newstest2013. We used beam search as described in the previous section, but no checkpoint averaging. We present these results in Table 3.

Table 3: Variations on the Transformer architecture. Unlisted values are identical to those of the base model. All metrics are on the English-to-German translation development set, newstest2013. Listed perplexities are per-wordpiece, according to our byte-pair encoding, and should not be compared to per-word perplexities.


Chinese translation

6.2 模型变体分析

为评估Transformer不同组件的重要性，我们以不同方式调整基础模型，并在英德翻译开发集newstest2013上测量性能变化。我们使用了上一节所述的集束搜索，但未采用检查点平均。相关结果见表3。

表3：Transformer架构的变体。未列出的值与基础模型相同。所有指标均基于英德翻译开发集 newstest2013。所列出的困惑度（perplexities）是按照我们的字节对编码（byte-pair encoding）计算的每个词片段（wordpiece）的困惑度，不应将其与每个完整词（per-word）的困惑度进行比较。
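The caption's warning about per-wordpiece versus per-word perplexity can be made concrete with a small calculation. The sketch below assumes the usual definition of perplexity as the exponential of the average negative log-likelihood per unit. The function name and all numbers are hypothetical and chosen only for illustration; none of them come from the paper.

```python
import math

def perplexity(total_nll: float, num_units: int) -> float:
    """Perplexity = exp(total negative log-likelihood / number of units)."""
    return math.exp(total_nll / num_units)

# Hypothetical numbers for illustration only (not from the paper):
# one sentence with a fixed total negative log-likelihood (in nats),
# counted once as 10 whole words and once as 14 wordpieces.
total_nll = 28.0
num_words = 10
num_wordpieces = 14

ppl_per_word = perplexity(total_nll, num_words)
ppl_per_wordpiece = perplexity(total_nll, num_wordpieces)

print(f"per-word perplexity:      {ppl_per_word:.2f}")
print(f"per-wordpiece perplexity: {ppl_per_wordpiece:.2f}")
```

Because a sentence is split into at least as many wordpieces as words, the same total loss is averaged over more units, so the per-wordpiece figure comes out lower. That is why the two kinds of perplexity are not comparable.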


Key sentence analysis

To evaluate the importance of different components of the Transformer, we varied our base model in different ways, measuring the change in performance on English-to-German translation on the development set, newstest2013.

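[Analysis]

The sentence has three parts: an infinitive phrase, the main clause, and a participial phrase. The sentence-initial infinitive To evaluate the importance… is an adverbial of purpose, and the two of-phrases that follow it (of different components, of the Transformer) are both postmodifiers. A nested chain of postmodifiers like this can be reduced to the pattern A of B of C, which is rendered in Chinese as “C的B的A”.

The main clause is we varied our base model in different ways, with the structure: subject (we) + predicate (varied) + object (our base model) + adverbial of manner (in different ways).

The present participial phrase measuring the change… is an adverbial of attendant circumstance: measuring takes place together with the main action (varied). (in performance) (on English-to-German translation) (on the development set, newstest2013) is a chain of three nested prepositional phrases acting as postmodifiers. in performance modifies the change; on English-to-German translation modifies performance; on the development set, newstest2013 modifies English-to-German translation.

[Reference translation]

为评估Transformer不同组件的重要性，我们以不同方式调整基础模型，并测量模型在newstest2013开发集的英德翻译任务上的性能变化。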

Original text 41

In Table 3 rows (A), we vary the number of attention heads and the attention key and value dimensions, keeping the amount of computation constant, as described in Section 3.2.2. While single-head attention is 0.9 BLEU worse than the best setting, quality also drops off with too many heads.

Chinese translation

在表3的(A)行中，我们按照第3.2.2节所述的方法，在保持计算量不变的情况下，对注意力头的数量以及注意力键和值的维度进行调整。实验表明：虽然单头注意力机制比最佳设置的BLEU值低0.9，但注意力头过多时质量也会下降。

Key sentence analysis

In Table 3 rows (A), we vary the number of attention heads and the attention key and value dimensions, keeping the amount of computation constant, as described in Section 3.2.2.

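[Analysis]

The sentence falls into four parts: a prepositional phrase, the main clause, a present participial phrase, and an elliptical adverbial clause of manner.

Part one, the prepositional phrase In Table 3 rows (A), is an adverbial of place.

Part two is the main clause. To simplify it, treat the number of attention heads as A and the attention key and value dimensions as B, which gives: we vary A and B (“我们改变A和B” or “我们对A和B进行调整”). The head noun of A is the number and the head noun of B is dimensions; everything else is a modifier.

Part three, keeping … constant, is a present participial phrase acting as an adverbial of attendant circumstance: the action it expresses (keeping…) takes place together with the main action (vary). Within it, the amount of computation (计算量) is the object of keeping, and the adjective constant (不变的，恒定的) is the object complement.

Part four, as described in Section 3.2.2, is an adverbial clause of manner. The full clause would be: as it is described in Section 3.2.2. It can be read as “正如第3.2.2节所描述的那样” or “按照第3.2.2节所述的方法”.

[Reference translation]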

在表3的(A)行中，我们按照第3.2.2节所述的方法，在保持计算量不变的情况下，对注意力头的数量以及注意力键和值的维度进行调整。
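To see what "keeping the amount of computation constant" means in practice, the sketch below follows the rule from Section 3.2.2 of the paper, d_k = d_v = d_model / h, with the base model's d_model = 512. The head counts in the loop and the helper names `per_head_dim` and `projection_params` are assumptions made for illustration. The parameter count covers only the four projection matrices of one multi-head attention layer, without biases.

```python
D_MODEL = 512  # model width of the base Transformer

def per_head_dim(d_model: int, num_heads: int) -> int:
    """d_k = d_v = d_model / h, as in Section 3.2.2 of the paper."""
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by the number of heads")
    return d_model // num_heads

def projection_params(d_model: int, num_heads: int, d_k: int, d_v: int) -> int:
    """Weights in the Q, K, V and output projections of one attention layer."""
    q_and_k = 2 * num_heads * d_model * d_k  # W_Q and W_K for every head
    v = num_heads * d_model * d_v            # W_V for every head
    out = num_heads * d_v * d_model          # W_O applied to the concatenated heads
    return q_and_k + v + out

# Illustrative head counts (an assumption of this sketch).
rows = []
for h in (1, 4, 8, 16, 32):
    d = per_head_dim(D_MODEL, h)
    rows.append((h, d, projection_params(D_MODEL, h, d, d)))

for h, d, n in rows:
    print(f"h={h:2d}  d_k=d_v={d:3d}  projection weights={n}")
```

Under this rule the product h × d_k stays equal to d_model, so the projection weights come to 4 × 512 × 512 for every head count. Changing the number of heads therefore redistributes the same budget across heads instead of enlarging the layer.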

Original text 42

In Table 3 rows (B), we observe that reducing the attention key size d_k hurts model quality. This suggests that determining compatibility is not easy and that a more sophisticated compatibility function than dot product may be beneficial. We further observe in rows (C) and (D) that, as expected, bigger models are better, and dropout is very helpful in avoiding over-fitting. In row (E) we replace our sinusoidal positional encoding with learned positional embeddings [9], and observe nearly identical results to the base model.

Chinese translation

在表3的(B)行中，我们观察到减小注意力键的大小（d_k）会降低模型质量。这表明确定兼容性并非易事，且相较于点积运算，更复杂的兼容性函数可能更具优势。在(C)和(D)行中我们进一步观察到：正如预期的那样，更大的模型表现更优，而dropout能有效防止过拟合。在(E)行中，我们将正弦位置编码替换为可学习的位置嵌入[9]，发现其结果与基础模型几乎完全一致。
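For readers who want to see what row (E) swaps out, here is a minimal pure-Python sketch of the sinusoidal positional encoding defined in Section 3.5 of the paper: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). The function name and the small sizes in the usage lines are assumptions for illustration. A learned positional embedding would replace this fixed table with a trainable table of the same shape.

```python
import math
from typing import List

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> List[List[float]]:
    """Return a max_len x d_model table of fixed sinusoidal position encodings."""
    table = []
    for pos in range(max_len):
        row = [0.0] * d_model
        # i runs over the even dimensions, so i plays the role of 2i in the formula.
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            row[i] = math.sin(angle)          # even dimension: sine
            if i + 1 < d_model:
                row[i + 1] = math.cos(angle)  # odd dimension: cosine
        table.append(row)
    return table

# Small illustrative sizes; the base model in the paper uses d_model = 512.
pe = sinusoidal_positional_encoding(max_len=4, d_model=8)
for pos, row in enumerate(pe):
    print(pos, [round(x, 3) for x in row])
```

The table has no trainable parameters and is fully determined by `max_len` and `d_model`, whereas a learned embedding has to fit one vector per position during training.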