The Attention Mechanism: The Magic That Gives AI a "Golden Seven-Second Memory" (Scaled Dot-Product Attention)

y江江江江

License: CC 4.0 BY-SA

Article link: https://blog.csdn.net/u011146203/article/details/146283768

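1. Scaled Dot-Product Attention

The main difference between scaled dot-product attention and plain dot-product attention is that, before the attention weights are computed, the dot-product results (the raw weights) are divided by a scaling factor to give scaled raw weights. The scaling factor is usually the square root of the feature dimension of the vectors being compared.

Why use a scaling factor? In deep learning models the dot products can become very large, especially when the feature dimension is large. When the scores are that large, softmax is pushed into its saturated region: the output is close to one-hot and the gradients become very small, which can lead to vanishing gradients during training. Dividing by the scaling factor keeps softmax in a smoother region, which mitigates the vanishing-gradient problem and makes training more stable.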
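To make the effect of the scaling factor concrete, here is a minimal sketch using only the Python standard library. The dimension (512), the number of keys (5), and the helper names `softmax` and `dot` are illustrative assumptions, not part of the original program. It compares the largest softmax weight with and without scaling for random vectors.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

random.seed(0)  # fixed seed so the run is repeatable
d = 512         # feature dimension (illustrative value)
n_keys = 5      # number of vectors the query is compared against

query = [random.gauss(0, 1) for _ in range(d)]
keys = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_keys)]

raw_scores = [dot(query, k) for k in keys]              # grow with d
scaled_scores = [s / math.sqrt(d) for s in raw_scores]  # divided by sqrt(d)

raw_weights = softmax(raw_scores)
scaled_weights = softmax(scaled_scores)

print("max weight without scaling:", max(raw_weights))
print("max weight with scaling:   ", max(scaled_weights))
```

Without scaling, one weight tends to dominate; with scaling, the weights are spread more evenly, which is the regime where softmax still passes useful gradients.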

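As the figure above shows, compared with dot-product attention there is one extra step, step 3: dividing the raw weights (scores) by the scaling factor.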

import torch  # import torch
import torch.nn.functional as F  # import nn.functional
# 1. Create two tensors, x1 and x2
x1 = torch.randn(2, 3, 4)  # shape (batch_size, seq_len1, feature_dim)
x2 = torch.randn(2, 5, 4)  # shape (batch_size, seq_len2, feature_dim)
# 2. Compute the batched dot product to get the raw weights
raw_weights = torch.bmm(x1, x2.transpose(1, 2))  # shape (batch_size, seq_len1, seq_len2)
# 3. Divide the raw weights by the scaling factor
scaling_factor = x1.size(-1) ** 0.5
scaled_weights = raw_weights / scaling_factor  # shape (batch_size, seq_len1, seq_len2)
# 4. Normalize the scaled weights with softmax (the scaled weights, not the raw ones)
attn_weights = F.softmax(scaled_weights, dim=2)  # shape (batch_size, seq_len1, seq_len2)
# 5. Use the attention weights to take a weighted sum of x2
attn_output = torch.bmm(attn_weights, x2)  # shape (batch_size, seq_len1, feature_dim)

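2. Encoder-Decoder Attention

Earlier, to keep the explanation simple and to keep your full attention on the attention mechanism itself, we did not say what x1 and x2 stand for in practice. In encoder-decoder attention, x1 corresponds to the decoder's hidden states and x2 to the encoder's hidden states.

When we apply dot-product attention, at every time step the decoder computes attention weights against the encoder's hidden states and then applies those weights to the encoder hidden states to produce a context vector (the output of encoder-decoder attention). This context vector carries useful information about the encoder's input sequence, and the decoder can use it to generate a more accurate output sequence, as shown in the figure below.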
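The computation for a single decoder time step can be sketched in plain Python. All numbers below are hypothetical toy values chosen for readability (three encoder hidden states of dimension 2), and `softmax` is a small helper written for this sketch.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy values: one encoder hidden state per source token
encoder_states = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]
decoder_state = [2.0, 0.0]  # decoder hidden state at the current time step

# 1. Score each encoder state by its dot product with the decoder state
scores = [sum(d * e for d, e in zip(decoder_state, enc)) for enc in encoder_states]

# 2. Normalize the scores into attention weights
weights = softmax(scores)

# 3. Context vector = attention-weighted sum of the encoder states
context = [
    sum(w * enc[j] for w, enc in zip(weights, encoder_states))
    for j in range(len(decoder_state))
]

print("scores: ", scores)
print("weights:", weights)
print("context:", context)
```

The first and third encoder states receive the same, larger weight because they have the same dot product with the decoder state; the second state is nearly ignored.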

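To add encoder-decoder attention to the Seq2Seq program, we make the following changes.

(1) Define an Attention class, which computes the attention weights and the attention context vector.

(2) Rework the Decoder class. Update its initialization and forward pass so that it contains an attention layer and uses the attention weights during decoding.

(3) Rework the Seq2Seq class. Update its forward pass so that the encoder output is passed to the decoder.

(4) Visualize the attention weights.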

2.1 Building the corpus

# Build the corpus; each row holds three sentences: the Chinese source,
# the English decoder input, and the English target output
sentences = [
    ['咖哥 喜欢 小冰', '<sos> KaGe likes XiaoBing', 'KaGe likes XiaoBing <eos>'],
    ['我 爱 学习 人工智能', '<sos> I love studying AI', 'I love studying AI <eos>'],
    ['深度学习 改变 世界', '<sos> DL changed the world', 'DL changed the world <eos>'],
    ['自然 语言 处理 很 强大', '<sos> NLP is so powerful', 'NLP is so powerful <eos>'],
    ['神经网络 非常 复杂', '<sos> Neural-Nets are complex', 'Neural-Nets are complex <eos>']]
word_list_cn, word_list_en = [], []  # initialize the Chinese and English word lists
# Walk through every sentence and add its words to the word lists
for s in sentences:
    word_list_cn.extend(s[0].split())
    word_list_en.extend(s[1].split())
    word_list_en.extend(s[2].split())
# Deduplicate to get vocabularies with no repeated words
word_list_cn = list(set(word_list_cn))
word_list_en = list(set(word_list_en))
# Build the word-to-index mappings
word2idx_cn = {w: i for i, w in enumerate(word_list_cn)}
word2idx_en = {w: i for i, w in enumerate(word_list_en)}
# Build the index-to-word mappings
idx2word_cn = {i: w for i, w in enumerate(word_list_cn)}
idx2word_en = {i: w for i, w in enumerate(word_list_en)}
# Compute the vocabulary sizes
voc_size_cn = len(word_list_cn)
voc_size_en = len(word_list_en)
print("Number of sentences:", len(sentences))  # print the number of sentences
print("Chinese vocabulary size:", voc_size_cn)  # print the Chinese vocabulary size
print("English vocabulary size:", voc_size_en)  # print the English vocabulary size
print("Chinese word-to-index dictionary:", word2idx_cn)  # print the Chinese mapping
print("English word-to-index dictionary:", word2idx_en)  # print the English mapping
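Each corpus row stores the decoder input and the target separately. The small sketch below (one row copied from the corpus, standard library only) shows how the two relate: the decoder input is the target shifted right by one position, so at every step the decoder sees the previous correct word and is asked to predict the next one. This training setup is commonly called teacher forcing; the original article does not use that name.

```python
# One row copied from the corpus above: [source, decoder input, target]
row = ['咖哥 喜欢 小冰', '<sos> KaGe likes XiaoBing', 'KaGe likes XiaoBing <eos>']

decoder_input = row[1].split()
target = row[2].split()

# At each step the decoder is fed one token and should predict the aligned target token
for step, (inp, tgt) in enumerate(zip(decoder_input, target)):
    print(f"step {step}: decoder sees {inp!r:12} -> should predict {tgt!r}")
```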

2.2 Building the encoder and decoder tensors

import numpy as np  # import numpy
import torch  # import torch
import random  # import the random library
# Define a function that picks a random sentence and uses the vocabularies
# to build the encoder input, decoder input, and target
def make_data(sentences):
    # Pick a random sentence for training
    random_sentence = random.choice(sentences)
    # Convert the words of the source sentence to indices
    encoder_input = np.array([[word2idx_cn[n] for n in random_sentence[0].split()]])
    # Convert the words of the decoder input sentence to indices
    decoder_input = np.array([[word2idx_en[n] for n in random_sentence[1].split()]])
    # Convert the words of the target sentence to indices
    target = np.array([[word2idx_en[n] for n in random_sentence[2].split()]])
    # Convert the encoder input, decoder input, and target to LongTensor
    encoder_input = torch.LongTensor(encoder_input)
    decoder_input = torch.LongTensor(decoder_input)
    target = torch.LongTensor(target)
    return encoder_input, decoder_input, target
# Use make_data to generate the encoder input, decoder input, and target tensors
encoder_input, decoder_input, target = make_data(sentences)
for s in sentences:  # recover the original sentence
    if all([word2idx_cn[w] in encoder_input[0] for w in s[0].split()]):
        original_sentence = s
        break
print("Original sentence:", original_sentence)  # print the original sentence
print("Encoder input tensor shape:", encoder_input.shape)  # print the encoder input shape
print("Decoder input tensor shape:", decoder_input.shape)  # print the decoder input shape
print("Target tensor shape:", target.shape)  # print the target shape
print("Encoder input tensor:", encoder_input)  # print the encoder input
print("Decoder input tensor:", decoder_input)  # print the decoder input
print("Target tensor:", target)  # print the target

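2.3 Defining the Attention class

# Define the Attention class
import torch.nn as nn  # import the torch.nn library
class Attention(nn.Module):
    def __init__(self):
        super(Attention, self).__init__()
    def forward(self, decoder_context, encoder_context):
        # Dot product of decoder_context and encoder_context gives the attention scores
        scores = torch.matmul(decoder_context, encoder_context.transpose(-2, -1))
        # Normalize the scores
        attn_weights = nn.functional.softmax(scores, dim=-1)
        # Multiply the attention weights by encoder_context to get the weighted context vector
        context = torch.matmul(attn_weights, encoder_context)
        return context, attn_weights

The dot product of the decoder and encoder contents gives the attention scores, which are normalized into weights; the weights are then multiplied by the encoder contents to give the weighted context vector. Note that this class uses plain dot-product attention: the scores are not divided by a scaling factor.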
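As a sketch of how the scaling step from section 1 could be added here, the variant below divides the scores by the square root of the feature dimension before softmax. The class name `ScaledAttention` and the tensor sizes are hypothetical and chosen for illustration; this is one possible way to write it, not the article's own code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledAttention(nn.Module):  # hypothetical name, not from the original program
    def forward(self, decoder_context, encoder_context):
        d_k = encoder_context.size(-1)  # feature dimension used for scaling
        # Dot-product scores, divided by sqrt(d_k)
        scores = torch.matmul(decoder_context, encoder_context.transpose(-2, -1)) / d_k ** 0.5
        # Normalize over the encoder positions
        attn_weights = F.softmax(scores, dim=-1)
        # Weighted sum of the encoder states
        context = torch.matmul(attn_weights, encoder_context)
        return context, attn_weights

# Illustrative shapes: batch of 1, 4 decoder steps, 5 encoder steps, hidden size 128
decoder_context = torch.randn(1, 4, 128)
encoder_context = torch.randn(1, 5, 128)
context, attn_weights = ScaledAttention()(decoder_context, encoder_context)
print(context.shape)       # torch.Size([1, 4, 128])
print(attn_weights.shape)  # torch.Size([1, 4, 5])
```

Each row of `attn_weights` sums to 1, so every decoder step distributes one unit of attention across the encoder positions.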

2.4 Reworking the Decoder class

import torch.nn as nn  # import the torch.nn library
# Define the encoder class
class Encoder(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(Encoder, self).__init__()
        self.hidden_size = hidden_size  # hidden state size
        self.embedding = nn.Embedding(input_size, hidden_size)  # word embedding layer
        self.rnn = nn.RNN(hidden_size, hidden_size, batch_first=True)  # RNN layer
    def forward(self, inputs, hidden):  # forward pass
        embedded = self.embedding(inputs)  # convert the input indices to embedding vectors
        output, hidden = self.rnn(embedded, hidden)  # run the embeddings through the RNN
        return output, hidden
# Define the decoder class
class DecoderWithAttention(nn.Module):
    def __init__(self, hidden_size, output_size):
        super(DecoderWithAttention, self).__init__()
        self.hidden_size = hidden_size  # hidden state size
        self.embedding = nn.Embedding(output_size, hidden_size)  # word embedding layer
        self.rnn = nn.RNN(hidden_size, hidden_size, batch_first=True)  # RNN layer
        self.attention = Attention()  # attention layer
        self.out = nn.Linear(2 * hidden_size, output_size)  # output layer takes the RNN output plus the context vector
    def forward(self, dec_input, hidden, enc_output):
        embedded = self.embedding(dec_input)  # convert the input indices to embedding vectors
        rnn_output, hidden = self.rnn(embedded, hidden)  # run the embeddings through the RNN
        context, attn_weights = self.attention(rnn_output, enc_output)  # compute the attention context vector
        dec_output = torch.cat((rnn_output, context), dim=-1)  # concatenate the RNN output and the context vector
        dec_output = self.out(dec_output)  # linear layer produces the final output
        return dec_output, hidden, attn_weights
n_hidden = 128  # hidden state dimension
# Create the encoder and the decoder
encoder = Encoder(voc_size_cn, n_hidden)
decoder = DecoderWithAttention(n_hidden, voc_size_en)
print('Encoder structure:', encoder)  # print the encoder structure
print('Decoder structure:', decoder)  # print the decoder structure

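Embedding layer:

Converts the decoder input (the word indices of the target sentence) into vectors.

RNN layer:

Processes the embedded vectors to produce the output at each time step and the updated hidden state, much like a chef preparing the ingredients before cooking.

Attention mechanism:

Uses the Attention layer to compute attention scores from the decoder's RNN output (rnn_output) and the encoder output (enc_output), and produces the weighted context vector context.

Concatenation and output:

Concatenates the RNN output and the context vector along the last dimension, which is why the linear layer takes 2 * hidden_size input features, then passes the result through a fully connected (linear) layer to produce the final prediction.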
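The four stages can be traced shape by shape. The sketch below rebuilds the same layers outside the class so that each intermediate tensor can be inspected. The hidden size (128) and English vocabulary size (20) follow the model in this article, while the sequence lengths and the random stand-in for the encoder output are assumptions made for illustration.

```python
import torch
import torch.nn as nn

hidden_size, vocab_size = 128, 20   # sizes used by the decoder in this article
batch, src_len, tgt_len = 1, 5, 4   # illustrative lengths (assumed)

embedding = nn.Embedding(vocab_size, hidden_size)
rnn = nn.RNN(hidden_size, hidden_size, batch_first=True)
out = nn.Linear(2 * hidden_size, vocab_size)

dec_input = torch.randint(0, vocab_size, (batch, tgt_len))  # random word indices
hidden = torch.zeros(1, batch, hidden_size)                 # initial hidden state
enc_output = torch.randn(batch, src_len, hidden_size)       # stand-in for the encoder output

embedded = embedding(dec_input)                                  # (1, 4, 128)
rnn_output, hidden = rnn(embedded, hidden)                       # (1, 4, 128) and (1, 1, 128)
scores = torch.matmul(rnn_output, enc_output.transpose(-2, -1))  # (1, 4, 5)
attn_weights = torch.softmax(scores, dim=-1)                     # (1, 4, 5)
context = torch.matmul(attn_weights, enc_output)                 # (1, 4, 128)
combined = torch.cat((rnn_output, context), dim=-1)              # (1, 4, 256)
logits = out(combined)                                           # (1, 4, 20)

for name, t in [("embedded", embedded), ("rnn_output", rnn_output),
                ("attn_weights", attn_weights), ("context", context),
                ("combined", combined), ("logits", logits)]:
    print(f"{name:13} {tuple(t.shape)}")
```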

Encoder structure: Encoder(
  (embedding): Embedding(18, 128)
  (rnn): RNN(128, 128, batch_first=True)
)
Decoder structure: DecoderWithAttention(
  (embedding): Embedding(20, 128)
  (rnn): RNN(128, 128, batch_first=True)
  (attention): Attention()
  (out): Linear(in_features=256, out_features=20, bias=True)
)

2.5 Reworking the Seq2Seq class

# Define the Seq2Seq class
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super(Seq2Seq, self).__init__()
        # Store the encoder and the decoder
        self.encoder = encoder
        self.decoder = decoder
    def forward(self, encoder_input, hidden, decoder_input):
        # Run the input sequence through the encoder to get its output and hidden state
        encoder_output, encoder_hidden = self.encoder(encoder_input, hidden)
        # Pass the encoder's hidden state to the decoder as its initial hidden state
        decoder_hidden = encoder_hidden
        # Run the target sequence through the decoder; the call now also passes the encoder output
        decoder_output, _, attn_weights = self.decoder(decoder_input, decoder_hidden, encoder_output)
        return decoder_output, attn_weights
# Create the Seq2Seq model
model = Seq2Seq(encoder, decoder)
print('S2S model structure:', model)  # print the model structure

S2S model structure: Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(18, 128)
    (rnn): RNN(128, 128, batch_first=True)
  )
  (decoder): DecoderWithAttention(
    (embedding): Embedding(20, 128)
    (rnn): RNN(128, 128, batch_first=True)
    (attention): Attention()
    (out): Linear(in_features=256, out_features=20, bias=True)
  )
)

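Step-by-step walkthrough:

Encoder:
- encoder_input and the initial hidden state hidden are fed into the encoder, which returns two results.
- encoder_output: the encoder's output at every time step.
- encoder_hidden: the final hidden state after the encoder has processed the whole input.
Passing the hidden state:
- The encoder's hidden state encoder_hidden is used directly as the decoder's initial hidden state, so the decoder "picks up" the encoder's memory and no information is lost in the handover.
Decoder:
- decoder_input (the input side of the target sequence), the initial hidden state decoder_hidden, and the encoder output encoder_output are all passed to the decoder.
- Inside the decoder, the attention mechanism measures how relevant each encoder output is to the current decoding state, builds the weighted context vector, and combines it with the decoder's RNN output to produce the final prediction.
- The decoder returns three values. Here we only need decoder_output and attn_weights (the attention weights); the middle one is discarded and written as _.
Return values:
- The function returns decoder_output and attn_weights. decoder_output holds the scores over the English vocabulary at each target position, from which the output sequence is read off; attn_weights records which source positions each decoding step attended to.

Picture the whole flow like this: the encoder (a detective) first gathers all the clues and hands the important ones to the decoder (a master translator), who uses a "magnifying glass" (the attention mechanism) to check the details and finally produces an accurate translation.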

2.6 Visualizing the attention weights

Training function:

# Define the training function
def train_seq2seq(model, criterion, optimizer, epochs):
    for epoch in range(epochs):
        encoder_input, decoder_input, target = make_data(sentences)  # create the training data
        hidden = torch.zeros(1, encoder_input.size(0), n_hidden)  # initialize the hidden state
        optimizer.zero_grad()  # clear the gradients
        output, _ = model(encoder_input, hidden, decoder_input)  # get the model output
        loss = criterion(output.view(-1, voc_size_en), target.view(-1))  # compute the loss
        if (epoch + 1) % 40 == 0:  # print the loss every 40 epochs
            print(f"Epoch: {epoch + 1:04d} cost = {loss.item():.6f}")
        loss.backward()  # backpropagation
        optimizer.step()  # update the parameters
# Train the model
epochs = 400  # number of training epochs
criterion = nn.CrossEntropyLoss()  # loss function
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # optimizer
train_seq2seq(model, criterion, optimizer, epochs)  # call the function to train the model

Define a function that visualizes the attention weights:

import matplotlib.pyplot as plt  # import matplotlib
import seaborn as sns  # import seaborn
plt.rcParams["font.family"] = ['SimHei']  # font family (SimHei can render the Chinese tick labels)
plt.rcParams['font.sans-serif'] = ['SimHei']  # sans-serif font
plt.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly
def visualize_attention(source_sentence, predicted_sentence, attn_weights):
    plt.figure(figsize=(10, 10))  # canvas
    ax = sns.heatmap(attn_weights, annot=True, cbar=False,
                     xticklabels=source_sentence.split(),
                     yticklabels=predicted_sentence, cmap="Greens")  # heatmap
    plt.xlabel("Source sequence")
    plt.ylabel("Target sequence")
    plt.show()  # show the figure

Visualize the attention weights while testing the model:

# Define the test function
def test_seq2seq(model, source_sentence):
    # Convert the input sentence to indices
    encoder_input = np.array([[word2idx_cn[n] for n in source_sentence.split()]])
    # Build the decoder input indices: '<sos>' followed by '<eos>' placeholders, same length as the input sentence
    decoder_input = np.array([word2idx_en['<sos>']] + [word2idx_en['<eos>']]*(len(encoder_input[0])-1))
    # Convert to LongTensor
    encoder_input = torch.LongTensor(encoder_input)
    decoder_input = torch.LongTensor(decoder_input).unsqueeze(0)  # add a batch dimension
    hidden = torch.zeros(1, encoder_input.size(0), n_hidden)  # initialize the hidden state
    # Get the model output and the attention weights
    predict, attn_weights = model(encoder_input, hidden, decoder_input)
    predict = predict.data.max(2, keepdim=True)[1]  # index of the highest-scoring word at each position
    predicted_words = [idx2word_en[n.item()] for n in predict.squeeze()]
    # Print the input sentence and the predicted sentence
    print(source_sentence, '->', predicted_words)
    # Visualize the attention weights
    attn_weights = attn_weights.squeeze(0).cpu().detach().numpy()
    visualize_attention(source_sentence, predicted_words, attn_weights)
# Test the model
test_seq2seq(model, '自然 语言 处理 很 强大')
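The test function above feeds the decoder '<sos>' followed by '<eos>' placeholders in a single pass. A common alternative at inference time is to decode step by step, feeding each predicted word back in as the next input (greedy decoding). The sketch below shows only the loop structure: `greedy_decode`, `next_token`, and the lookup table are hypothetical stand-ins for a trained model, not part of the original program.

```python
def greedy_decode(next_token, sos, eos, max_len):
    """Generate tokens one at a time, feeding each prediction back in.

    next_token is a hypothetical callable: it receives the tokens generated
    so far and returns the single most likely next token.
    """
    generated = [sos]
    for _ in range(max_len):
        token = next_token(generated)
        if token == eos:  # stop as soon as the end marker is predicted
            break
        generated.append(token)
    return generated[1:]  # drop the <sos> marker

# Toy stand-in for a trained model: a fixed lookup from the last token to the next one
toy_table = {'<sos>': 'NLP', 'NLP': 'is', 'is': 'so', 'so': 'powerful', 'powerful': '<eos>'}

def toy_next_token(tokens_so_far):
    return toy_table[tokens_so_far[-1]]

result = greedy_decode(toy_next_token, '<sos>', '<eos>', max_len=10)
print(result)  # ['NLP', 'is', 'so', 'powerful']
```

With a real model, `next_token` would run the decoder on the tokens generated so far and take the highest-scoring word at the last position; `max_len` guards against a model that never predicts the end marker.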