Studying the Paper "Attention Is All You Need" (9)

Article link: https://blog.csdn.net/qiwsir/article/details/148129815

Original text 18

3.2.2 Multi-Head Attention

Instead of performing a single attention function with $d_{model}$ -dimensional keys, values and queries, we found it beneficial to linearly project the queries, keys and values $h$ times with different, learned linear projections to $d_k$ , $d_k$ and $d_v$ dimensions, respectively. On each of these projected versions of queries, keys and values we then perform the attention function in parallel, yielding $d_v$ -dimensional output values. These are concatenated and once again projected, resulting in the final values, as depicted in Figure 2.

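Translation

3.2.2 Multi-Head Attention

Rather than performing a single attention function with keys, values and queries of dimension $d_{model}$, we found that it works better to project the queries, keys and values $h$ times, each time with a different, learned linear projection, to dimensions $d_k$, $d_k$ and $d_v$ respectively. (The original gives $d_k$ for both queries and keys. This is intentional: attention takes dot products between queries and keys, so the two must share one dimension.) We then perform the attention function in parallel on each projected version of the queries, keys and values, producing $d_v$-dimensional output values. These outputs are concatenated and projected once more to give the final values, as shown in Figure 2.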
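A shape walk-through may make the four steps (project, attend, concatenate, project again) easier to follow. The symbols $n$ (number of queries) and $m$ (number of keys and values) are assumptions introduced here for illustration. $W_i^Q$, $W_i^K$, $W_i^V$ and $W^O$ are the projection matrices that the paper names right after this passage. Whatever the query dimension is called, it has to equal $d_k$, because the product in the third row is only defined when projected queries and keys have the same width.

$\begin{split} Q\in\mathbb{R}^{n\times d_{model}},\quad K\in\mathbb{R}^{m\times d_{model}},\quad V\in\mathbb{R}^{m\times d_{model}} \\ QW_i^Q\in\mathbb{R}^{n\times d_k},\quad KW_i^K\in\mathbb{R}^{m\times d_k},\quad VW_i^V\in\mathbb{R}^{m\times d_v} \\ (QW_i^Q)(KW_i^K)^T\in\mathbb{R}^{n\times m},\quad head_i\in\mathbb{R}^{n\times d_v} \\ \text{Concat}(head_1,\cdots,head_h)\in\mathbb{R}^{n\times hd_v},\quad \text{Concat}(head_1,\cdots,head_h)W^O\in\mathbb{R}^{n\times d_{model}} \end{split}$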

Key sentence analysis

Instead of performing a single attention function with $d_{model}$-dimensional keys, values and queries, we found it beneficial to linearly project the queries, keys and values $h$ times with different, learned linear projections to $d_k$, $d_k$ and $d_v$ dimensions, respectively.

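[Analysis]

The sentence structure is: adverbial (instead of ...) + main clause. The sentence could also be rewritten as: We didn't perform ... Instead, we found it beneficial to ... In the original, "Instead of ..." is an adverbial that contrasts the gerund "performing ..." with the main-clause verb "found". "with $d_{model}$-dimensional keys, values and queries" is an adverbial modifying the gerund "performing", where "with" means "using". The main clause "we found it beneficial to ..." follows the pattern subject (we) + verb (found) + formal object (it) + object complement (beneficial) + real object (to ...). Here "it" is the formal object, "beneficial" modifies "it" as its complement, and the real object is the infinitive phrase. Because the infinitive is long, we can treat "the queries, keys and values" as A, "different, learned linear projections" as B, and "$d_k$, $d_k$ and $d_v$ dimensions" as C, which reduces the infinitive to: to linearly project A (h times) (with B) to C (respectively). It means: project A to C, respectively, h times by means of B.

[Reference translation]

Rather than performing a single attention function with keys, values and queries of dimension $d_{model}$, we found that it works better to project the queries, keys and values $h$ times, each time with a different, learned linear projection, to dimensions $d_k$, $d_k$ and $d_v$ respectively.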

On each of these projected versions of queries, keys and values we then perform the attention function in parallel, yielding $d_v$ -dimensional output values.

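[Analysis]

The sentence structure is: adverbial of place + main clause + adverbial of result. The main clause is: we (then) perform the attention function (in parallel). "perform" is the predicate verb, with the subject before it and the object after it. The bracketed adverb "then" marks the order in which the actions happen and acts as an adverbial; the prepositional phrase "in parallel" is an adverbial of manner modifying "perform". The sentence-initial prepositional phrase is an adverbial of place; "each of these projected versions" is equivalent to "each projected version", and the prepositional phrase "of queries, keys and values" is a postmodifier of "versions". The sentence-final present-participle phrase is an adverbial of result, expressing the outcome of the main action "perform".

[Reference translation]

We then perform the attention function in parallel on each projected version of the queries, keys and values, producing $d_v$-dimensional output values.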

These are concatenated and once again projected, resulting in the final values, as depicted in Figure 2.

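[Analysis]

The sentence structure is: main clause + adverbial of result + reduced attributive clause. The main clause is "These are concatenated and once again projected", in which "and" joins two coordinate past participles, "concatenated" and "projected", and "once again" is an adverbial modifying "projected". The present-participle phrase "resulting in the final values" is an adverbial of result; "result in" is a fixed phrase meaning "to lead to, to produce". "as depicted in Figure 2" can be read as a reduced non-restrictive attributive clause, that is, "as is depicted in Figure 2", where "as" means "just as".

[Reference translation]

These outputs are concatenated and projected once more to give the final values, as shown in Figure 2.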

Original text 19

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.
$\begin{split} \text{MultiHead}(Q,K,V)=\text{Concat}(head_1,\cdots,head_h)W^O \\\text{where}~head_i=\text{Attention}(QW_i^Q,KW_i^K,VW^V_i) \end{split}$
Where the projections are parameter matrices $W^Q_i\in\mathbb{R}^{d_{model}\times d_k}$ , $W_i^K\in\mathbb{R}^{d_{model}\times d_k}$ , $W_i^V\in\mathbb{R}^{d_{model}\times d_v}$ and $W^O\in\mathbb{R}^{hd_{v}\times d_{model}}$ .

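Translation

Multi-head attention lets the model attend simultaneously to information from different representation subspaces at different positions. With only a single attention head, averaging suppresses this ability.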
$\begin{split} \text{MultiHead}(Q,K,V)=\text{Concat}(head_1,\cdots,head_h)W^O \\\text{where}~head_i=\text{Attention}(QW_i^Q,KW_i^K,VW^V_i) \end{split}$
Here the projections are the parameter matrices $W^Q_i\in\mathbb{R}^{d_{model}\times d_k}$ , $W_i^K\in\mathbb{R}^{d_{model}\times d_k}$ , $W_i^V\in\mathbb{R}^{d_{model}\times d_v}$ and $W^O\in\mathbb{R}^{hd_{v}\times d_{model}}$ .
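The formula can be traced in a few lines of plain Python. This is only a minimal sketch under stated assumptions, not the authors' implementation: `Attention` is taken to be the scaled dot-product attention defined in the paper's preceding section, the toy sizes (`n = 5`, `d_model = 16`, `h = 4`) and the random weights are invented for illustration, and the helper names (`matmul`, `softmax`, `attention`, `multi_head`, `random_matrix`) are hypothetical. Calling it with the same input for queries, keys and values (self-attention) is likewise just one convenient choice.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d_k)) v."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)


def multi_head(q, k, v, w_q, w_k, w_v, w_o):
    """Concat(head_1, ..., head_h) W^O, with head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)."""
    heads = [
        attention(matmul(q, wq_i), matmul(k, wk_i), matmul(v, wv_i))
        for wq_i, wk_i, wv_i in zip(w_q, w_k, w_v)
    ]
    # Concatenate the heads row by row along the feature axis: n x (h * d_v).
    concat = [sum((head[r] for head in heads), []) for r in range(len(q))]
    return matmul(concat, w_o)


def random_matrix(rows, cols, rng):
    """Small random weights, standing in for learned parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n, d_model, h = 5, 16, 4          # toy sizes (assumed), not the paper's values
d_k = d_v = d_model // h

x = random_matrix(n, d_model, rng)
w_q = [random_matrix(d_model, d_k, rng) for _ in range(h)]
w_k = [random_matrix(d_model, d_k, rng) for _ in range(h)]
w_v = [random_matrix(d_model, d_v, rng) for _ in range(h)]
w_o = random_matrix(h * d_v, d_model, rng)

out = multi_head(x, x, x, w_q, w_k, w_v, w_o)
print(len(out), len(out[0]))      # 5 16
```

The output has the same shape as the input, $n\times d_{model}$, which is what allows such a layer to be stacked.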

Key sentence analysis

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.

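[Analysis]

The sentence structure is: subject + verb + object + object complement. "allows" is the predicate; "Multi-head attention" before it and "the model" after it are the subject and the object respectively. The infinitive "to jointly attend to information ..." describes the object "the model" and serves as the object complement. "allow sb/sth to do sth" means "to permit ... to do something". The adverb "jointly" literally means "together"; here it is rendered freely as "simultaneously". "attend to information" means "to process (pay attention to) information". The prepositional phrase "from different representation subspaces" is a postmodifier of "information"; "at different positions" is likewise a postmodifier, modifying "subspaces".

[Reference translation]

Multi-head attention lets the model attend simultaneously to information from different representation subspaces at different positions.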

Original text 20

In this work we employ $h = 8$ parallel attention layers, or heads. For each of these we use $d_k = d_v = d_{model}/h=64$ . Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.

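Translation

In this work, we use $h = 8$ parallel attention layers (or attention heads). Each head has dimension $d_k = d_v = d_{model}/h = 64$. Because each head has a reduced dimension, the total computational cost is comparable to that of single-head attention with full dimensionality.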
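The cost claim can be checked with simple arithmetic. The sketch below counts only projection parameters and the multiplications needed to form the query-key scores. It takes $d_{model} = 512$ (which follows from $d_{model}/h = 64$ with $h = 8$), assumes that the full-dimensional single head uses $d_k = d_v = d_{model}$ with the same four projection matrices, and uses a hypothetical sequence length `n = 100`. The function name `score_mults` is also invented for illustration.

```python
h = 8
d_model = 512                     # follows from d_model / h = 64 with h = 8
d_k = d_v = d_model // h

# Multi-head: h copies of W_i^Q, W_i^K, W_i^V, plus one output projection W^O.
multi_head_params = h * (d_model * d_k + d_model * d_k + d_model * d_v) + (h * d_v) * d_model

# Single head with full dimensionality (assumed: d_k = d_v = d_model, same four matrices).
single_head_params = 3 * d_model * d_model + d_model * d_model


def score_mults(n, heads, dim):
    """Multiplications needed to form Q K^T for n queries and n keys across all heads."""
    return heads * n * n * dim


n = 100                           # hypothetical sequence length
print(d_k, multi_head_params, single_head_params)            # 64 1048576 1048576
print(score_mults(n, h, d_k), score_mults(n, 1, d_model))    # 5120000 5120000
```

Under these assumptions the two counts match exactly. The tally leaves out the softmax, which multi-head attention applies to $h$ score matrices rather than one, so "similar" rather than "identical" seems the right word for the total cost.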

Key sentence analysis

Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.

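[Analysis]

The sentence-initial "due to ..." is an adverbial of cause, equivalent to "because of ..."; the prepositional phrase "of each head" is a postmodifier of "the reduced dimension". The core of what follows is: the total computational cost is similar to that. Here "be similar to" is a fixed phrase meaning "to resemble"; "that" stands for "the total computational cost"; the following "of single-head attention" is a prepositional phrase postmodifying "that"; and "with full dimensionality" is also a prepositional phrase, postmodifying "single-head attention". The preposition "with" here means "using".

[Reference translation]

Because each head has a reduced dimension, the total computational cost is comparable to the total computational cost of single-head attention with full dimensionality.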