Hands-on: Implementing seq2seq + Attention Step by Step

Latest recommended article published on 2024-08-22 09:10:15

西檬饭

License: CC 4.0 BY-SA

Category: RNN. Tags: deep learning, machine translation

Article link: https://blog.csdn.net/qq_23869697/article/details/105791691

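This article describes in detail how to implement machine translation with a Seq2Seq model and an attention mechanism, covering the key steps of data preprocessing, model construction, training, and evaluation.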

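This walkthrough comes from the imooc course 《Google工程师亲授 Tensorflow2.0－入门到进阶》; these are my implementation notes and excerpts.
It implements machine translation with a seq2seq model plus an attention mechanism, and focuses on how each step is implemented and on the details behind it.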

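Data processing

The walkthrough uses an English-Spanish parallel corpus. As the loading code further down shows, Spanish is the input language and English is the target.

Preprocessing

Converting Unicode to ASCII

Spanish contains accented characters outside the ASCII range. Leaving them in place makes the vocabulary larger, since an accented and an unaccented spelling count as different tokens, so the accents are stripped to map the letters onto plain ASCII. unicodedata.normalize('NFD', s) performs canonical decomposition: a character such as é is split into its base letter e followed by a separate combining accent. The test unicodedata.category(c) != 'Mn' then drops the combining marks, because 'Mn' (Mark, Nonspacing) is the Unicode category that covers them. Characters that are not combining marks, such as ¿, pass through unchanged.
Implementation: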

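import os
import re
import time
import unicodedata

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from tensorflow import keras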
def unicode_to_ascii(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')
en_sentence = u"May I borrow this book?"
sp_sentence = u"¿Puedo tomar prestado este libro?"

print(unicode_to_ascii(en_sentence))
print(unicode_to_ascii(sp_sentence))

Output:

May I borrow this book?
¿Puedo tomar prestado este libro?
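To make the NFD step more concrete, the sketch below decomposes a single accented character and prints each resulting code point with its Unicode category. The helper name `strip_accents` and the sample strings are illustrative and not part of the course code.

```python
import unicodedata

def strip_accents(text):
    """Drop combining marks (category 'Mn') after NFD decomposition."""
    decomposed_text = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed_text if unicodedata.category(ch) != 'Mn')

# U+00E9 (e with acute accent) decomposes into 'e' plus the combining accent U+0301.
decomposed = unicodedata.normalize('NFD', '\u00e9')
for ch in decomposed:
    print(hex(ord(ch)), unicodedata.category(ch))

# The inverted question mark is punctuation, not a combining mark, so it survives.
print(strip_accents('\u00bfC\u00f3mo est\u00e1s?'))
```

Only the accents disappear; the letters themselves and punctuation such as ¿ are kept, which is why the regular expressions in the next step still have to handle ¿ explicitly.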

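Handling special characters with regular expressions

Regular expressions are used to normalize the text: put a space on both sides of each punctuation mark, collapse repeated spaces, and replace every character other than letters and the kept punctuation with a space.
Finally, add start and end markers to each sentence.
Implementation: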

def preprocess_sentence(w):
    w = unicode_to_ascii(w.lower().strip())   
    # put a space on both sides of each punctuation mark
    w = re.sub(r"([?.!,¿])", r" \1 ", w)
    # collapse runs of spaces (this character class also matches double quotes)
    w = re.sub(r'[" "]+', " ", w)
    # replacing everything with space except (a-z, A-Z, ".", "?", "!", ",", "¿")
    w = re.sub(r"[^a-zA-Z?.!,¿]+", " ", w)

    w = w.strip()
    # adding a start and an end token to the sentence
    # so that the model know when to start and stop predicting.
    w = '<start> ' + w + ' <end>'
    return w

print(preprocess_sentence(en_sentence))
print(preprocess_sentence(sp_sentence).encode('utf-8'))

Output:

<start> may i borrow this book ? <end>
b'<start> \xc2\xbf puedo tomar prestado este libro ? <end>'
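One consequence of the last substitution is easy to miss: digits and apostrophes are not in the allowed set, so they are replaced by spaces. The sketch below reruns the same three substitutions on a sentence of my own choosing; `clean_text` is a hypothetical name, it omits the accent stripping and the start/end markers, and it uses `\s+` for whitespace instead of the original character class.

```python
import re

def clean_text(text):
    """Lowercase, space out punctuation, and drop everything except letters and ?.!,¿"""
    text = text.lower().strip()
    text = re.sub(r"([?.!,¿])", r" \1 ", text)    # isolate punctuation as separate tokens
    text = re.sub(r"\s+", " ", text)               # collapse repeated whitespace
    text = re.sub(r"[^a-zA-Z?.!,¿]+", " ", text)   # anything else becomes a single space
    return text.strip()

print(clean_text("I'm 25 years old."))
```

The contraction is split into two tokens and the number is lost entirely, which is acceptable for this small translation corpus but worth knowing before reusing the function elsewhere.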

Putting it together and reading the data

data_path = './data_10_1/spa-eng/spa.txt'
# 1. Remove the accents
# 2. Clean the sentences
# 3. Return word pairs in the format: [ENGLISH, SPANISH]

def create_dataset(path, num_examples):
    # read the whole file; each line holds one tab-separated sentence pair
    with open(path, encoding='UTF-8') as f:
        lines = f.read().strip().split('\n')

    word_pairs = [[preprocess_sentence(w) for w in l.split('\t')]  for l in lines[:num_examples]]

    return zip(*word_pairs)

en, sp = create_dataset(data_path, None)
print(en[-1])
print(sp[-1])

Output:

<start> if you want to sound like a native speaker , you must be willing to practice saying the same sentence over and over in the same way that banjo players practice the same phrase over and over until they can play it correctly and at the desired tempo . <end>
<start> si quieres sonar como un hablante nativo , debes estar dispuesto a practicar diciendo la misma frase una y otra vez de la misma manera en que un musico de banjo practica el mismo fraseo una y otra vez hasta que lo puedan tocar correctamente y en el tiempo esperado . <end>

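Converting text to ids and building the dataset

Text vectorization

keras.preprocessing.text.Tokenizer builds the vocabulary.
The tokenizer's texts_to_sequences() method converts text to id sequences, and tf.keras.preprocessing.sequence.pad_sequences pads every sentence to the same length.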

def max_length(tensor):
    return max(len(t) for t in tensor)

def tokenize(lang):
    lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')
    lang_tokenizer.fit_on_texts(lang)
    tensor = lang_tokenizer.texts_to_sequences(lang)
    tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor, padding='post')
    return tensor, lang_tokenizer

def load_dataset(path, num_examples=None):
    # creating cleaned input, output pairs
    targ_lang, inp_lang = create_dataset(path, num_examples)

    input_tensor, inp_lang_tokenizer = tokenize(inp_lang)
    target_tensor, targ_lang_tokenizer = tokenize(targ_lang)

    return input_tensor, target_tensor, inp_lang_tokenizer, targ_lang_tokenizer

# Try experimenting with the size of that dataset
num_examples = 30000
input_tensor, target_tensor, inp_lang, targ_lang = load_dataset(data_path, num_examples)

# Calculate max_length of the target tensors
max_length_targ, max_length_inp = max_length(target_tensor), max_length(input_tensor)
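The following pure-Python sketch imitates what the tokenizer and the padding step do, under simplifying assumptions: words are split on spaces, ids start at 1 in order of decreasing frequency, and 0 is reserved for padding at the end of each sequence. `build_vocab` and `texts_to_padded_ids` are hypothetical helpers, not Keras functions.

```python
from collections import Counter

def build_vocab(sentences):
    """Assign ids from 1 upward, most frequent word first (0 is kept for padding)."""
    counts = Counter(word for s in sentences for word in s.split())
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

def texts_to_padded_ids(sentences, vocab):
    """Convert sentences to id lists and right-pad them with 0 to a common length."""
    ids = [[vocab[w] for w in s.split()] for s in sentences]
    max_len = max(len(seq) for seq in ids)
    return [seq + [0] * (max_len - len(seq)) for seq in ids]

corpus = ["<start> hola <end>", "<start> buenos dias <end>"]
vocab = build_vocab(corpus)
padded = texts_to_padded_ids(corpus, vocab)
print(vocab)
print(padded)
```

Because 0 never appears as a real word id, the loss function later in the article can recognize padded positions simply by comparing against 0.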

Splitting into training and validation sets

from sklearn.model_selection import train_test_split
# Creating training and validation sets using an 80-20 split
input_tensor_train, input_tensor_val, target_tensor_train, target_tensor_val = train_test_split(input_tensor, target_tensor, test_size=0.2)

# Show length
print(len(input_tensor_train), len(target_tensor_train), len(input_tensor_val), len(target_tensor_val))

Output:
24000 24000 6000 6000

Checking the data

def convert(lang, tensor):
    for t in tensor:
        if t != 0:
            print ("%d ----> %s" % (t, lang.index_word[t]))
            
print("Input Language; index to word mapping")
convert(inp_lang, input_tensor_train[0])
print()
print("Target Language; index to word mapping")
convert(targ_lang, target_tensor_train[0])

Building a tf.data.Dataset

Set the batch size so that the data is fed to the model in batches:

BUFFER_SIZE = len(input_tensor_train)
BATCH_SIZE = 64
steps_per_epoch = len(input_tensor_train)//BATCH_SIZE
embedding_dim = 256 # dimension of each word embedding vector
units = 1024
vocab_inp_size = len(inp_lang.word_index)+1
vocab_tar_size = len(targ_lang.word_index)+1

dataset = tf.data.Dataset.from_tensor_slices((input_tensor_train, target_tensor_train)).shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)

example_input_batch, example_target_batch = next(iter(dataset))
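As a rough picture of what shuffle followed by batch with drop_remainder=True produces, here is a standard-library sketch. `shuffled_batches` is a hypothetical helper; it shuffles the whole list at once, which corresponds to a shuffle buffer as large as the dataset, as in the code above.

```python
import random

def shuffled_batches(examples, batch_size, seed=0):
    """Shuffle, then cut into full batches; an incomplete final batch is dropped."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    full = len(items) // batch_size  # number of complete batches
    return [items[i * batch_size:(i + 1) * batch_size] for i in range(full)]

batches = shuffled_batches(range(10), batch_size=4)
print(len(batches), [len(b) for b in batches])

# With 24000 training pairs and a batch size of 64, every batch is full.
print(24000 // 64)
```

Dropping the remainder keeps every batch at exactly BATCH_SIZE rows, which matters here because the encoder's initial hidden state is created with a fixed batch dimension.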

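Building the model

Encoder

The encoder is implemented with the subclassing API and takes the batch size and the number of recurrent units (a GRU here; an LSTM could be used instead). It has a single recurrent layer. The input token ids must first be embedded, which is done with keras.layers.Embedding.
Because attention needs the output at every time step as well as the final state, the GRU is created with return_sequences=True and return_state=True. The recurrent weights are initialized with 'glorot_uniform'.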

class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, encoding_units, batch_size):
        super(Encoder, self).__init__()
        self.batch_size = batch_size
        self.encoding_units = encoding_units
        self.embedding = keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = keras.layers.GRU(self.encoding_units,
                                    return_sequences=True,
                                    return_state=True,
                                    recurrent_initializer='glorot_uniform')

    def call(self, x, hidden):
        x = self.embedding(x)
        output, state = self.gru(x, initial_state = hidden)
        return output, state

    def initialize_hidden_state(self):
        return tf.zeros((self.batch_size, self.encoding_units))
    
encoder = Encoder(vocab_inp_size, embedding_dim, units, BATCH_SIZE)
sample_hidden = encoder.initialize_hidden_state()
sample_output, sample_hidden = encoder(example_input_batch, sample_hidden)

print('Encoder output shape: (batch size, sequence length, units) {}'.format(sample_output.shape))
print('Encoder Hidden state shape: (batch size, units) {}'.format(sample_hidden.shape))

Output:

Encoder output shape: (batch size, sequence length, units) (64, 16, 1024)
Encoder Hidden state shape: (batch size, units) (64, 1024)

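Attention mechanism

Attention has to look at the encoder output at every time step together with the decoder hidden state at the current step.

EO denotes the encoder outputs over all time steps, and H denotes the decoder hidden state at one step. From the encoder above, EO has shape (batch_size, max_length, units), while H has shape (batch_size, num_decoder_units). Each passes through its own fully connected layer and the two results are added, so their ranks have to match: H needs an extra time axis. The time axis of EO is axis=1, so tf.expand_dims is applied at axis=1 and broadcasting takes care of the rest.
Applying tf.nn.tanh() to the sum gives a tensor of shape (batch_size, max_length, units), where units is the width of the attention layers.
A fully connected layer with a single unit then produces the score, with shape (batch_size, max_length, 1).
A softmax over the max_length axis turns the scores into values between 0 and 1 that sum to 1; these are the attention_weights, with shape (batch_size, max_length, 1).
attention_weights is then multiplied element-wise by the encoder output. attention_weights has shape (batch_size, max_length, 1) and the encoder output has shape (batch_size, max_length, units), so the last axes differ; TensorFlow broadcasts (batch_size, max_length, 1) to (batch_size, max_length, units), and the product has shape (batch_size, max_length, units). tf.reduce_sum over the length axis gives the final context_vector with shape (batch_size, units), where units is the encoder output size, that is, the number of encoder hidden units.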
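Written as equations, the steps above take the following form, where the index s runs over encoder time steps and EO_s is the encoder output at step s. The symbols W_1, W_2 and V stand for the three fully connected layers; their bias terms are left out for readability, which is a simplification on my part.

```latex
\begin{aligned}
\mathrm{score}_s &= V^{\top} \tanh\left(W_1\, EO_s + W_2\, H\right) \\
\alpha_s &= \frac{\exp(\mathrm{score}_s)}{\sum_{s'} \exp(\mathrm{score}_{s'})} \\
\mathrm{context} &= \sum_{s} \alpha_s\, EO_s
\end{aligned}
```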

class BahdanauAttention(tf.keras.Model):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, decoder_hidden, encoder_output):
        # hidden shape == (batch_size, hidden size)
        # hidden_with_time_axis shape == (batch_size, 1, hidden size)
        # we are doing this to perform addition to calculate the score
        hidden_with_time_axis = tf.expand_dims(decoder_hidden, 1)

        # score shape == (batch_size, max_length, 1)
        # we get 1 at the last axis because we are applying score to self.V
        # the shape of the tensor before applying self.V is (batch_size, max_length, units)
        score = self.V(tf.nn.tanh(self.W1(encoder_output) + self.W2(hidden_with_time_axis)))

        # attention_weights shape == (batch_size, max_length, 1)
        attention_weights = tf.nn.softmax(score, axis=1)
        # encoder_output shape == (batch_size, max_length, units)
        context_vector = attention_weights * encoder_output
        # context_vector shape after sum == (batch_size, hidden_size)
        context_vector = tf.reduce_sum(context_vector, axis=1)

        return context_vector, attention_weights
    
attention_layer = BahdanauAttention(10)
attention_result, attention_weights = attention_layer(sample_hidden, sample_output)

print("Attention result shape: (batch size, units) {}".format(attention_result.shape))
print("Attention weights shape: (batch_size, sequence_length, 1) {}".format(attention_weights.shape))

Attention result shape: (batch size, units) (64, 1024)
Attention weights shape: (batch_size, sequence_length, 1) (64, 16, 1)
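The shapes can be checked without TensorFlow. The NumPy sketch below repeats the same computation with random matrices standing in for the three Dense layers (biases omitted); all sizes are small made-up values chosen only to keep the example readable.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, max_len, enc_units, att_units = 2, 5, 8, 4

encoder_output = rng.normal(size=(batch, max_len, enc_units))
decoder_hidden = rng.normal(size=(batch, enc_units))
W1 = rng.normal(size=(enc_units, att_units))   # stands in for self.W1
W2 = rng.normal(size=(enc_units, att_units))   # stands in for self.W2
V = rng.normal(size=(att_units, 1))            # stands in for self.V

# Add a time axis so the hidden state broadcasts against every encoder step.
hidden_with_time_axis = decoder_hidden[:, None, :]                      # (batch, 1, enc_units)
score = np.tanh(encoder_output @ W1 + hidden_with_time_axis @ W2) @ V  # (batch, max_len, 1)

# Softmax over the time axis (axis=1), shifted by the max for numerical stability.
exp = np.exp(score - score.max(axis=1, keepdims=True))
attention_weights = exp / exp.sum(axis=1, keepdims=True)               # (batch, max_len, 1)

# Weighted sum of encoder outputs over time.
context_vector = (attention_weights * encoder_output).sum(axis=1)      # (batch, enc_units)

print(score.shape, attention_weights.shape, context_vector.shape)
```

The last dimension of the context vector follows the encoder width, not the attention width. That is why the result above is (64, 1024) even though the attention layer was built with only 10 units.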

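Decoder

The decoder has essentially the same structure as the encoder; what differs is its input. The encoder provides its outputs and its final hidden state. At each decoding step, the attention layer combines them into a context vector, which is concatenated with the embedding of the current input token to form the input to the GRU. The decoder is fed one time step at a time and produces one output per step, so the token ids passed to it have shape (batch_size, 1), and the GRU input after embedding and concatenation has shape (batch_size, 1, embedding_dim + units). The final fully connected layer maps each GRU output to logits over the target vocabulary. The output sentence can have any length because decoding simply repeats step by step until <end> is predicted.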

class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, decoding_units, batch_size):
        super(Decoder, self).__init__()
        self.batch_size = batch_size
        self.decoding_units = decoding_units
        self.embedding = keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = keras.layers.GRU(self.decoding_units,
                                    return_sequences=True,
                                    return_state=True,
                                    recurrent_initializer='glorot_uniform')
        self.fc = keras.layers.Dense(vocab_size)

        # used for attention
        self.attention = BahdanauAttention(self.decoding_units)

    def call(self, x, hidden, encoding_output):
        # enc_output shape == (batch_size, max_length, hidden_size)
        context_vector, attention_weights = self.attention(hidden, encoding_output)

        # x shape after passing through embedding == (batch_size, 1, embedding_dim)
        x = self.embedding(x)

        # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

        # passing the concatenated vector to the GRU
        # note: no initial_state is passed here, so `hidden` only feeds the attention layer
        output, state = self.gru(x)
        # state shape == (batch_size, hidden_size)
        # output shape == (batch_size, 1, hidden_size)
        output = tf.reshape(output, (-1, output.shape[2]))

        # output shape == (batch_size, vocab)
        x = self.fc(output)

        return x, state, attention_weights

decoder = Decoder(vocab_tar_size, embedding_dim, units, BATCH_SIZE)
sample_decoder_output, decoder_hidden, decoder_aw = decoder(tf.random.uniform((BATCH_SIZE, 1)),
                                      sample_hidden, sample_output)
print ('Decoder output shape: (batch_size, vocab size) {}'.format(sample_decoder_output.shape))
print ('Decoder hidden shape: (batch_size, units) {}'.format(decoder_hidden.shape))
print ('Decoder attention weights shape: (batch_size, length, 1) {}'.format(decoder_aw.shape))

Decoder output shape: (batch_size, vocab size) (64, 4935)
Decoder hidden shape: (batch_size, units) (64, 1024)
Decoder attention weights shape: (batch_size, length, 1) (64, 16, 1)
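Since the decoder emits one token per call, the length of the output is decided by the loop that drives it. The sketch below shows such a greedy loop with a stand-in step function that replays a fixed script instead of running a trained model; `greedy_decode`, `make_scripted_step` and the token ids are all hypothetical.

```python
def greedy_decode(step_fn, start_id, end_id, max_steps):
    """Call step_fn repeatedly, feeding back its own prediction, until end_id appears."""
    output, token, state = [], start_id, None
    for _ in range(max_steps):
        token, state = step_fn(token, state)
        output.append(token)
        if token == end_id:
            break
    return output

def make_scripted_step(script):
    """Stand-in for a trained decoder: returns the next id from a fixed list."""
    def step_fn(token, state):
        position = 0 if state is None else state
        return script[position], position + 1
    return step_fn

START, END = 1, 2
print(greedy_decode(make_scripted_step([7, 9, END, 5]), START, END, max_steps=10))
print(greedy_decode(make_scripted_step([4, END]), START, END, max_steps=10))
```

The two calls return sequences of different lengths from the same loop, and max_steps plays the role that max_length_targ plays in the evaluation code later on.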

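Training the model

Per-step loss

The targets are word ids, i.e. integers, so SparseCategoricalCrossentropy is used. The decoder ends in a fully connected layer with no activation, so from_logits=True. reduction is set to 'none' because the loss on padded positions should not count: loss_function masks out the padding and only then aggregates with reduce_mean.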

optimizer = keras.optimizers.Adam()

loss_object = keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')

# ignore padding when computing the loss by applying a mask
def loss_function(real, pred):
    # padded positions have id 0
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)

    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask

    return tf.reduce_mean(loss_)

checkpoint_dir = './10-1_checkpoints'
if not os.path.exists(checkpoint_dir):
    os.mkdir(checkpoint_dir)
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
checkpoint = tf.train.Checkpoint(optimizer=optimizer,
                                 encoder=encoder,
                                 decoder=decoder)
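To see what the mask does numerically, the sketch below computes the same masked loss for one time step in plain Python. The two helper functions and the toy logits are illustrative; the cross-entropy is written out by hand rather than taken from Keras.

```python
import math

def sparse_ce_from_logits(label, logits):
    """Cross-entropy of one example: -log softmax(logits)[label]."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[label]

def masked_mean_loss(labels, batch_logits, pad_id=0):
    """Zero the loss where the label is padding, then average over the whole batch."""
    losses = [0.0 if label == pad_id else sparse_ce_from_logits(label, logits)
              for label, logits in zip(labels, batch_logits)]
    return sum(losses) / len(losses)  # divides by batch size, padded rows included

labels = [2, 0]                                     # the second row is padding at this step
batch_logits = [[0.0, 0.0, 0.0], [5.0, 1.0, 1.0]]
print(masked_mean_loss(labels, batch_logits))
```

The padded row contributes nothing to the sum but still counts in the denominator, just as with tf.reduce_mean in loss_function, so the reported value is an average over all rows rather than over real tokens only.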

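Loss over multiple steps

The gradient computation is written out by hand here, which makes it possible to see how the model is evaluated step by step.
The tricky part is the decoder input. Teacher forcing is used: at every step the decoder receives the ground-truth token from the previous step as its input, rather than its own prediction. Each sentence starts with <start> and ends when <end> is reached.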

@tf.function  # compile into a graph with tf.function for speed
def train_step(inp, targ, encoding_hidden):
    loss = 0

    with tf.GradientTape() as tape:
        encoding_output, encoding_hidden = encoder(inp, encoding_hidden)
        # the encoder's final hidden state initializes the decoder state
        decoding_hidden = encoding_hidden
        # the first decoder input is the <start> token
        decoding_input = tf.expand_dims([targ_lang.word_index['<start>']] * BATCH_SIZE, 1)

        # Teacher forcing - feeding the target as the next input
        for t in range(1, targ.shape[1]):
            # passing enc_output to the decoder
            predictions, decoding_hidden, _ = decoder(decoding_input, decoding_hidden, encoding_output)
            # accumulate the loss over time steps
            loss += loss_function(targ[:, t], predictions)

            # using teacher forcing: the ground-truth token at step t is the next decoder input
            decoding_input = tf.expand_dims(targ[:, t], 1)

    batch_loss = (loss / int(targ.shape[1]))
    # collect all trainable variables
    variables = encoder.trainable_variables + decoder.trainable_variables

    # compute gradients; batch_loss and loss differ only by a constant factor,
    # so their gradients point in the same direction
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))
    return batch_loss
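The alignment used by teacher forcing can be shown on a single target sequence. In the sketch below, `teacher_forcing_pairs` is a hypothetical helper and the ids are made up; it lists, for every step t >= 1, which token the decoder is given and which token it is expected to predict.

```python
def teacher_forcing_pairs(target_ids):
    """Return (decoder_input, expected_output) for each step t >= 1."""
    return [(target_ids[t - 1], target_ids[t]) for t in range(1, len(target_ids))]

# Hypothetical ids: 1 = <start>, 2 = <end>, 0 = padding.
target = [1, 15, 8, 2, 0]
pairs = teacher_forcing_pairs(target)
for decoder_input, expected in pairs:
    print(decoder_input, '->', expected)
```

The first input is always <start>, and the final pair predicts a padding id, which is exactly the kind of step the mask in loss_function removes from the loss.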

Training

EPOCHS = 10

for epoch in range(EPOCHS):
    start = time.time()

    encoding_hidden = encoder.initialize_hidden_state()
    total_loss = 0

    for (batch, (inp, targ)) in enumerate(dataset.take(steps_per_epoch)):
        batch_loss = train_step(inp, targ, encoding_hidden)
        total_loss += batch_loss

        if batch % 100 == 0:
            print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1, batch, batch_loss.numpy()))
    # saving (checkpoint) the model every 2 epochs
    if (epoch + 1) % 2 == 0:
        checkpoint.save(file_prefix = checkpoint_prefix)

    print('Epoch {} Loss {:.4f}'.format(epoch + 1, total_loss / steps_per_epoch))
    print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))

Epoch 1 Batch 0 Loss 4.5903
Epoch 1 Batch 100 Loss 2.1396
Epoch 1 Batch 200 Loss 1.8821
Epoch 1 Batch 300 Loss 1.7342
Epoch 1 Loss 2.0275
Time taken for 1 epoch 33.51040720939636 sec

Epoch 2 Batch 0 Loss 1.4921
Epoch 2 Batch 100 Loss 1.4532
Epoch 2 Batch 200 Loss 1.3182
Epoch 2 Batch 300 Loss 1.2971
Epoch 2 Loss 1.3858
Time taken for 1 epoch 17.667675018310547 sec

Evaluation

def evaluate(sentence):
    # used for plotting; shows how each output word attends to the input words
    attention_plot = np.zeros((max_length_targ, max_length_inp))
    # preprocess the input sentence
    sentence = preprocess_sentence(sentence)
    inputs = [inp_lang.word_index[i] for i in sentence.split(' ')]
    # padding
    inputs = keras.preprocessing.sequence.pad_sequences([inputs], maxlen=max_length_inp, padding='post')
    inputs = tf.convert_to_tensor(inputs)

    result = ''

    hidden = [tf.zeros((1, units))]
    encoding_out, encoding_hidden = encoder(inputs, hidden)

    decoding_hidden = encoding_hidden
    # at evaluation time there is no ground-truth previous token,
    # so the previous prediction is fed back as the next input
    # decoder input shape = (1, 1)
    decoding_input = tf.expand_dims([targ_lang.word_index['<start>']], 0)

    for t in range(max_length_targ):
        predictions, decoding_hidden, attention_weights = decoder(
            decoding_input, decoding_hidden, encoding_out)

        # storing the attention weights to plot later on
        # shape = (batch_size, input_length, 1)
        attention_weights = tf.reshape(attention_weights, (-1, ))
        attention_plot[t] = attention_weights.numpy()
        # take the id with the highest score
        predicted_id = tf.argmax(predictions[0]).numpy()

        result += targ_lang.index_word[predicted_id] + ' '

        if targ_lang.index_word[predicted_id] == '<end>':
            return result, sentence, attention_plot

        # the predicted ID is fed back into the model
        decoding_input = tf.expand_dims([predicted_id], 0)

    return result, sentence, attention_plot

# function for plotting the attention weights
def plot_attention(attention, sentence, predicted_sentence):
    fig = plt.figure(figsize=(10,10))
    ax = fig.add_subplot(1, 1, 1)
    ax.matshow(attention, cmap='viridis')

    fontdict = {'fontsize': 14}

    ax.set_xticklabels([''] + sentence, fontdict=fontdict, rotation=90)
    ax.set_yticklabels([''] + predicted_sentence, fontdict=fontdict)

    plt.show()
    
def translate(sentence):
    result, sentence, attention_plot = evaluate(sentence)

    print('Input: %s' % (sentence))
    print('Predicted translation: {}'.format(result))

    attention_plot = attention_plot[:len(result.split(' ')), :len(sentence.split(' '))]
    plot_attention(attention_plot, sentence.split(' '), result.split(' '))
    

checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))
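One practical caveat when calling translate on new sentences: evaluate looks words up with inp_lang.word_index[i], so any word that never appeared in the training data raises a KeyError. A possible guard, sketched here with a toy dictionary in place of the real tokenizer, is to separate known from unknown words before building the input. `words_to_ids` is a hypothetical helper and not part of the course code.

```python
def words_to_ids(sentence, word_index):
    """Map known words to ids and collect unknown ones instead of raising KeyError."""
    ids, unknown = [], []
    for word in sentence.split(' '):
        if word in word_index:
            ids.append(word_index[word])
        else:
            unknown.append(word)
    return ids, unknown

toy_index = {'<start>': 1, '<end>': 2, 'hace': 3, 'frio': 4}
print(words_to_ids('<start> hace mucho frio <end>', toy_index))
```

Whether to skip unknown words, as this sketch does, or to reject the sentence is a design choice; skipping keeps the demo running but changes the sentence the model actually sees.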