Original Text 43
6.3 English Constituency Parsing
To evaluate if the Transformer can generalize to other tasks we performed experiments on English constituency parsing. This task presents specific challenges: the output is subject to strong structural constraints and is significantly longer than the input. Furthermore, RNN sequence-to-sequence models have not been able to attain state-of-the-art results in small-data regimes [37].
Translation
6.3 English Constituency Parsing
To evaluate whether the Transformer can generalize to other tasks, we performed experiments on English constituency parsing. This task poses specific challenges: the output is subject to strong structural constraints and is significantly longer than the input. Furthermore, in small-data regimes, RNN sequence-to-sequence models have not been able to reach state-of-the-art performance [37].
Key Sentence Analysis
- Furthermore, RNN sequence-to-sequence models have not been able to attain state-of-the-art results in small-data regimes.
[Analysis]
This is a simple sentence whose core is: models have not been able to attain results. The predicate structure have not been able to attain is the negative present-perfect form of be able to do (sth.). It develops as follows: simple-present affirmative is/are able to do (sth.) → present-perfect affirmative has/have been able to do (sth.) → present-perfect negative has/have not been able to do (sth.). RNN sequence-to-sequence before the subject noun models is an attributive, and the adjective state-of-the-art before the object noun results is likewise an attributive. The prepositional phrase in small-data regimes at the end of the sentence is an adverbial indicating the setting in which the predicate action takes place, and the sentence-initial adverb furthermore is also an adverbial, meaning "in addition, moreover".
[Reference Translation]
Furthermore, in small-data regimes, RNN sequence-to-sequence models have not been able to reach state-of-the-art performance.
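To see concretely why the output is both longer than the input and structurally constrained, consider how a parse tree is usually linearized for sequence-to-sequence parsing (as in the setup of [37]): the target is a bracketed tree whose brackets must balance and whose leaves reproduce the input words. The short Python check below uses a hand-made example, not one taken from the paper.

```python
# Hand-made example (not from the paper): the source sentence has 3 tokens,
# while the linearized Penn-Treebank-style parse has far more output tokens.
sentence = "John loves Mary"
parse = "(S (NP (NNP John)) (VP (VBZ loves) (NP (NNP Mary))))"

source_len = len(sentence.split())
target_len = len(parse.replace("(", " ( ").replace(")", " ) ").split())
print(source_len, target_len)  # 3 vs. 24

# The output is also structurally constrained, e.g. brackets must balance:
depth = 0
for ch in parse:
    if ch == "(":
        depth += 1
    elif ch == ")":
        depth -= 1
        assert depth >= 0, "closing bracket without a matching opening bracket"
assert depth == 0, "unbalanced parse"
```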
Original Text 44
We trained a 4-layer transformer with d_model = 1024 on the Wall Street Journal (WSJ) portion of the Penn Treebank [25], about 40K training sentences. We also trained it in a semi-supervised setting, using the larger high-confidence and BerkleyParser corpora from with approximately 17M sentences [37]. We used a vocabulary of 16K tokens for the WSJ only setting and a vocabulary of 32K tokens for the semi-supervised setting.
Translation
We trained a 4-layer Transformer with hidden dimension d_model = 1024 on the Wall Street Journal (WSJ) portion of the Penn Treebank (about 40K training sentences). We also trained it in a semi-supervised setting, using the larger high-confidence and BerkleyParser corpora, which together contain approximately 17 million sentences [37]. We used a vocabulary of 16,000 tokens for the WSJ-only setting and a vocabulary of 32,000 tokens for the semi-supervised setting.
Key Sentence Analysis
- We also trained it in a semi-supervised setting, using the larger high-confidence and BerkleyParser corpora from with approximately 17M sentences.
[Analysis]
The core of the sentence is: We trained it. The object it refers to the Transformer model mentioned earlier, and the adverb also modifies the predicate trained as an adverbial. The present-participle phrase using ... and ... is an adverbial of manner modifying trained. The prepositional phrase "from with approximately 17M sentences" is probably a slip for "with approximately 17M sentences"; it acts as a post-modifier of corpora, indicating the size of the corpora, with the preposition with meaning "having, containing".
[Reference Translation]
We also trained it in a semi-supervised setting, using the larger high-confidence and BerkleyParser corpora, which together contain approximately 17 million sentences.
- We used a vocabulary of 16K tokens for the WSJ only setting and a vocabulary of 32K tokens for the semi-supervised setting.
[Analysis]
English often uses ellipsis to avoid repetition and keep sentences concise. To make this sentence easier to follow, we can restore it in full as: We used a vocabulary of 16K tokens for the WSJ only setting and (we used) a vocabulary of 32K tokens for the semi-supervised setting. It is then easy to see that and joins two coordinate clauses with the same structure. If we treat a vocabulary of 16K tokens as A, the WSJ only setting as B, a vocabulary of 32K tokens as C, and the semi-supervised setting as D, the whole sentence reduces to: We used A for B and (we used) C for D, where A and C are the two vocabulary sizes and B and D are the two settings.
[Reference Translation]
For the WSJ-only setting we used a vocabulary of 16,000 tokens, and for the semi-supervised setting a vocabulary of 32,000 tokens.
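To make the two training configurations described above easier to compare at a glance, here is a minimal Python sketch that collects the stated hyperparameters into configuration objects. The ParsingConfig class and its field names are illustrative assumptions, not part of the paper or of any particular library.

```python
from dataclasses import dataclass

@dataclass
class ParsingConfig:
    """Hypothetical container for the constituency-parsing setups described above."""
    num_layers: int          # Transformer layers (4 in both settings)
    d_model: int             # model/hidden dimension (1024 in both settings)
    vocab_size: int          # token vocabulary size
    training_sentences: int  # approximate number of training sentences

# WSJ-only setting: Penn Treebank WSJ portion, ~40K sentences, 16K-token vocabulary.
wsj_only = ParsingConfig(num_layers=4, d_model=1024,
                         vocab_size=16_000, training_sentences=40_000)

# Semi-supervised setting: high-confidence and BerkleyParser corpora,
# ~17M sentences, 32K-token vocabulary.
semi_supervised = ParsingConfig(num_layers=4, d_model=1024,
                                vocab_size=32_000, training_sentences=17_000_000)
```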
Original Text 45
We performed only a small number of experiments to select the dropout, both attention and residual (section 5.4), learning rates and beam size on the Section 22 development set, all other parameters remained unchanged from the English-to-German base translation model. During inference, we increased the maximum output length to input length + 300. We used a beam size of 21 and α = 0.3 for both WSJ only and the semi-supervised setting.
Translation
We performed only a small number of experiments on the Section 22 development set to select the dropout rates (for both attention and residual connections, Section 5.4), the learning rates, and the beam size; all other parameters remained unchanged from the English-to-German base translation model. During inference, we increased the maximum output length to the input length + 300. For both the WSJ-only and the semi-supervised settings we used a beam size of 21 and a length penalty of α = 0.3.
Key Sentence Analysis
- We performed only a small number of experiments to select the dropout, both attention and residual (section 5.4), learning rates and beam size on the Section 22 development set, all other parameters remained unchanged from the English-to-German base translation model.
[Analysis]
This sentence consists of two clauses. The first clause is: We performed ... experiments to select ... on the Section 22 development set. The second clause is: all other parameters remained unchanged from ... model. The modifier "all other" before parameters in the second clause suggests that the first clause is also talking about parameters; that is, the object of the infinitive to select is really a set of parameters, so the infinitive can be read in full as "to select (the parameters of) the dropout, ... learning rates and beam size ...".
The core of the first clause is: We performed experiments. The adverb of degree only serves as an adverbial, and a small number of ("a few, a small quantity of") modifies experiments as an attributive. The infinitive to select the dropout, ... learning rates and beam size ... is an adverbial of purpose. The parenthetical both attention and residual (section 5.4) is an appositive explaining the dropout, and on the Section 22 development set is a post-modifier of learning rates and beam size.
The core of the second clause is: all other parameters remained unchanged. The prepositional phrase from the English-to-German base translation model is a post-modifier of parameters, indicating where the parameters come from. In fact, the clause could also be written as: all other parameters from the English-to-German base translation model remained unchanged.
[Reference Translation]
We performed only a small number of experiments on the Section 22 development set to select the dropout rates (for both attention and residual connections, Section 5.4), the learning rates, and the beam size; all other parameters remained unchanged from the English-to-German base translation model.
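As a concrete illustration of how these decoding hyperparameters fit together, the sketch below scores beam-search hypotheses with a GNMT-style length penalty; the paper cites Wu et al. (2016) for its translation decoding setup, so that formula is assumed here for α, and the function names are purely illustrative.

```python
def length_penalty(length: int, alpha: float = 0.3) -> float:
    """Length penalty lp(Y) = ((5 + |Y|) / 6) ** alpha, following Wu et al. (2016);
    assumed here as the form of the paper's alpha."""
    return ((5.0 + length) / 6.0) ** alpha

def hypothesis_score(log_prob: float, length: int, alpha: float = 0.3) -> float:
    """Length-normalized score used to rank the hypotheses kept on the beam."""
    return log_prob / length_penalty(length, alpha)

def max_output_length(input_length: int) -> int:
    """Decoding cap used for parsing: input length + 300, as stated above."""
    return input_length + 300

# Illustration: with a small alpha such as 0.3 the normalization is mild, so a
# longer hypothesis must still have a clearly better raw log-probability to win.
print(hypothesis_score(-10.0, length=20))  # shorter hypothesis
print(hypothesis_score(-10.5, length=25))  # longer hypothesis
print(max_output_length(40))               # -> 340
```

A beam size of 21 would then simply mean keeping the 21 highest-scoring partial hypotheses at each decoding step.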
Original Text 46
Our results in Table 4 show that despite the lack of task-specific tuning our model performs surprisingly well, yielding better results than all previously reported models with the exception of the Recurrent Neural Network Grammar [8].
Table 4: The Transformer generalizes well to English constituency parsing (Results are on Section 23 of WSJ)
Translation
The results in Table 4 show that despite the lack of task-specific tuning, our model performs surprisingly well, yielding better results than all previously reported models with the exception of the Recurrent Neural Network Grammar (RNNG) [8].
Table 4: The Transformer generalizes well to English constituency parsing (results are on Section 23 of the WSJ).
Key Sentence Analysis
- Our results in Table 4 show that despite the lack of task-specific tuning our model performs surprisingly well, yielding better results than all previously reported models with the exception of the Recurrent Neural Network Grammar.
[Analysis]
The overall structure of the sentence is: main clause + object clause.
The main clause is: Our results in Table 4 show. The prepositional phrase in Table 4 after the subject Our results is a post-modifier.
Everything after the predicate verb show, up to the end of the sentence, is an object clause introduced by that.
The object clause is structured as: prepositional phrase (despite ...) + core + present-participle phrase (yielding ...) + prepositional phrase (with the exception of ...). The core of the clause is: our model performs surprisingly well. The prepositional phrase despite the lack of task-specific tuning is a concessive adverbial, and the present-participle phrase yielding better results than all previously reported models is an adverbial of result, in which better ... than ... forms a comparison and all previously reported together modifies models. The prepositional phrase with the exception of ... is a fixed expression meaning "except for".
[Reference Translation]
The results in Table 4 show that despite the lack of task-specific tuning, our model performs surprisingly well, yielding better results than all previously reported models with the exception of the Recurrent Neural Network Grammar (RNNG).
Original Text 47
In contrast to RNN sequence-to-sequence models [37], the Transformer outperforms the Berkeley-Parser [29] even when training only on the WSJ training set of 40K sentences.
Translation
In contrast to RNN sequence-to-sequence models [37], the Transformer outperforms the Berkeley-Parser [29] even when trained only on the 40K-sentence WSJ training set.