Paper: ACL 2017
Paper link: http://tongtianta.site/paper/11096
Text matching is a key step in intelligent question answering (community QA): deciding whether two sentences are semantically similar. In FAQ-style machine question answering, a new input text (e.g. transcribed speech) is matched against the sentences already stored in the dialogue base, and once a match is found, the answer of the corresponding question is returned. The problem studied here is how to compute the semantic similarity between two sentences.
I. Model Principle
Enhanced LSTM for Natural Language Inference (ESIM) is a recently proposed model for text similarity that works very well in practice and has performed strongly in many text-matching competitions. This post introduces the basic principle of the model and a PyTorch implementation.

As the full name suggests, ESIM is an enhanced LSTM designed specifically for natural language inference. The original paper shows that this carefully designed sequential inference model based on chained LSTMs can outperform many previous, more complicated models.

The architecture of ESIM is shown in Figure 1. Its main components are input encoding, local inference modeling, and inference composition. The vertical direction of Figure 1 depicts these three major components; in the horizontal direction, the left side is the sequential inference model, named ESIM, and the right side is the variant that incorporates syntactic parsing information through tree LSTMs.

Here $a = (a_1, \dots, a_{\ell_a})$ and $b = (b_1, \dots, b_{\ell_b})$ denote the two sentences, where $\ell_a$ is the length of sentence $a$ and $\ell_b$ is the length of sentence $b$. In the original paper, $a$ is also called the premise and $b$ the hypothesis.
1. Input Encoding
A BiLSTM (bidirectional LSTM) is used as the input encoder; the two input sentences are encoded separately, as shown in (1) and (2):

$$\bar{a}_i = \mathrm{BiLSTM}(a, i), \quad \forall i \in [1, \dots, \ell_a] \tag{1}$$

$$\bar{b}_j = \mathrm{BiLSTM}(b, j), \quad \forall j \in [1, \dots, \ell_b] \tag{2}$$

We write $\bar{a}_i$ for the hidden (output) state generated by the BiLSTM over the input sequence $a$ at time step $i$; the same applies to $\bar{b}_j$.
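As a quick illustration of (1) and (2), the sketch below encodes two toy batches with a shared BiLSTM. The tensor sizes and variable names are made up for illustration only; padding and masking are handled properly in the full implementation in Part II.

```python
import torch
import torch.nn as nn

# Minimal sketch of equations (1)-(2): encode both sentences with one shared BiLSTM.
embedding_dim, hidden_size = 300, 300
bilstm = nn.LSTM(embedding_dim, hidden_size, batch_first=True, bidirectional=True)

a_embed = torch.randn(2, 7, embedding_dim)   # embedded premise a,    (batch, len_a, dim)
b_embed = torch.randn(2, 5, embedding_dim)   # embedded hypothesis b, (batch, len_b, dim)

a_bar, _ = bilstm(a_embed)                   # (batch, len_a, 2 * hidden_size)
b_bar, _ = bilstm(b_embed)                   # (batch, len_b, 2 * hidden_size)
```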
2. Local Inference Modeling
The attention weights between $\bar{a}_i$ and $\bar{b}_j$ are computed with a simple dot product, as shown in (3). The authors mention that they also tried more complex functions for the attention weights, but these did not yield better results.

$$e_{ij} = \bar{a}_i^{\top} \bar{b}_j \tag{3}$$

The interaction (aligned) representations between the two sentences are then computed:

$$\tilde{a}_i = \sum_{j=1}^{\ell_b} \frac{\exp(e_{ij})}{\sum_{k=1}^{\ell_b} \exp(e_{ik})} \bar{b}_j, \quad \forall i \in [1, \dots, \ell_a] \tag{4}$$

$$\tilde{b}_j = \sum_{i=1}^{\ell_a} \frac{\exp(e_{ij})}{\sum_{k=1}^{\ell_a} \exp(e_{kj})} \bar{a}_i, \quad \forall j \in [1, \dots, \ell_b] \tag{5}$$

Here $\tilde{a}_i$ is a weighted sum of the $\bar{b}_j$ (and, symmetrically, $\tilde{b}_j$ is a weighted sum of the $\bar{a}_i$).

The local inference information is then enhanced by computing the difference and the element-wise product of the tuples $\langle \bar{a}, \tilde{a} \rangle$ and $\langle \bar{b}, \tilde{b} \rangle$. The enhanced local inference information is:

$$m_a = [\bar{a};\, \tilde{a};\, \bar{a} - \tilde{a};\, \bar{a} \odot \tilde{a}] \tag{6}$$

$$m_b = [\bar{b};\, \tilde{b};\, \bar{b} - \tilde{b};\, \bar{b} \odot \tilde{b}] \tag{7}$$
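The following sketch shows how (3)-(7) map onto tensor operations for already-encoded batches. Masking of padded positions is omitted here for brevity, and all sizes are made up; the full implementation below handles masking with masked_softmax.

```python
import torch

# Minimal sketch of equations (3)-(7) on unpadded toy batches.
batch, len_a, len_b, d = 2, 7, 5, 600
a_bar = torch.randn(batch, len_a, d)   # encoded premise
b_bar = torch.randn(batch, len_b, d)   # encoded hypothesis

e = torch.bmm(a_bar, b_bar.transpose(1, 2))                            # (3): (batch, len_a, len_b)
a_tilde = torch.bmm(torch.softmax(e, dim=2), b_bar)                    # (4): weighted sum over b
b_tilde = torch.bmm(torch.softmax(e, dim=1).transpose(1, 2), a_bar)    # (5): weighted sum over a

m_a = torch.cat([a_bar, a_tilde, a_bar - a_tilde, a_bar * a_tilde], dim=-1)  # (6)
m_b = torch.cat([b_bar, b_tilde, b_bar - b_tilde, b_bar * b_tilde], dim=-1)  # (7)
```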
3. Inference Composition
Another BiLSTM is used to compose the local inference information in $m_a$ and $m_b$, producing sequences $v_{a,i}$ and $v_{b,j}$. For the information extracted by this BiLSTM, the authors argue that plain summation is sensitive to sequence length and not robust enough, so they use both max pooling and average pooling and concatenate the pooled results into a fixed-length vector:

$$v_{a,\mathrm{avg}} = \sum_{i=1}^{\ell_a} \frac{v_{a,i}}{\ell_a}, \quad v_{a,\max} = \max_{i=1}^{\ell_a} v_{a,i} \tag{8}$$

$$v_{b,\mathrm{avg}} = \sum_{j=1}^{\ell_b} \frac{v_{b,j}}{\ell_b}, \quad v_{b,\max} = \max_{j=1}^{\ell_b} v_{b,j} \tag{9}$$

$$v = [v_{a,\mathrm{avg}};\, v_{a,\max};\, v_{b,\mathrm{avg}};\, v_{b,\max}] \tag{10}$$

The vector $v$ is then fed into a fully connected classifier (a hidden layer with tanh activation followed by a softmax output layer) for the final prediction.
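A minimal sketch of (8)-(10) with made-up sizes follows; the real implementation below masks padded positions before pooling.

```python
import torch
import torch.nn as nn

# Minimal sketch of equations (8)-(10): average/max pooling plus an MLP classifier.
batch, len_a, len_b, d, hidden, num_classes = 2, 7, 5, 600, 300, 2
v_a = torch.randn(batch, len_a, d)   # BiLSTM composition of m_a
v_b = torch.randn(batch, len_b, d)   # BiLSTM composition of m_b

v_a_avg, v_a_max = v_a.mean(dim=1), v_a.max(dim=1).values    # (8)
v_b_avg, v_b_max = v_b.mean(dim=1), v_b.max(dim=1).values    # (9)
v = torch.cat([v_a_avg, v_a_max, v_b_avg, v_b_max], dim=1)   # (10): (batch, 4 * d)

classifier = nn.Sequential(nn.Linear(4 * d, hidden), nn.Tanh(),
                           nn.Linear(hidden, num_classes))
logits = classifier(v)                                        # (batch, num_classes)
```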
II. PyTorch Implementation
There are many implementations of ESIM on GitHub, but after trying them out I found that quite a few have problems. The most obvious one is that some of them run the BiLSTM over a padded batch without the pack/pad operations, so the features the BiLSTM extracts at padded positions are wrong. This bug troubled me for a long time when I was working on NER, so be careful with how BiLSTM is used in PyTorch. The complete ESIM implementation is given below; please leave a comment if you have questions.
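For reference, this is the general pack/pad pattern (a generic sketch with made-up sizes, not taken from any particular repository); the Seq2SeqEncoder class below wraps exactly this logic together with length sorting:

```python
import torch
import torch.nn as nn

# Run a BiLSTM over a padded batch correctly: pack so the LSTM stops at each
# sequence's true length, then unpack the output back to a padded tensor.
lstm = nn.LSTM(8, 16, batch_first=True, bidirectional=True)
batch = torch.randn(3, 10, 8)          # padded batch, max length 10
lengths = torch.tensor([10, 7, 4])     # true lengths, sorted in descending order

packed = nn.utils.rnn.pack_padded_sequence(batch, lengths, batch_first=True)
packed_out, _ = lstm(packed)
out, _ = nn.utils.rnn.pad_packed_sequence(packed_out, batch_first=True)  # (3, 10, 32)
```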
The full model code:

```python
import torch
import torch.nn as nn
import numpy as np

from utils.utils import sort_by_seq_lens, masked_softmax, weighted_sum, get_mask, replace_masked


# Class widely inspired from:
# https://github.com/allenai/allennlp/blob/master/allennlp/modules/input_variational_dropout.py
class RNNDropout(nn.Dropout):
    """
    Dropout layer for the inputs of RNNs.

    Apply the same dropout mask to all the elements of the same sequence in
    a batch of sequences of size (batch, sequences_length, embedding_dim).
    """

    def forward(self, sequences_batch):
        """
        Apply dropout to the input batch of sequences.

        Args:
            sequences_batch: A batch of sequences of vectors that will serve
                as input to an RNN.
                Tensor of size (batch, sequences_length, embedding_dim).

        Returns:
            A new tensor on which dropout has been applied.
        """
        ones = sequences_batch.data.new_ones(sequences_batch.shape[0],
                                             sequences_batch.shape[-1])
        dropout_mask = nn.functional.dropout(ones, self.p, self.training,
                                             inplace=False)
        return dropout_mask.unsqueeze(1) * sequences_batch


class Seq2SeqEncoder(nn.Module):
    """
    RNN taking variable length padded sequences of vectors as input and
    encoding them into padded sequences of vectors of the same length.

    This module is useful to handle batches of padded sequences of vectors
    that have different lengths and that need to be passed through a RNN.
    The sequences are sorted in descending order of their lengths, packed,
    passed through the RNN, and the resulting sequences are then padded and
    permuted back to the original order of the input sequences.
    """

    def __init__(self,
                 rnn_type,
                 input_size,
                 hidden_size,
                 num_layers=1,
                 bias=True,
                 dropout=0.0,
                 bidirectional=False):
        """
        Args:
            rnn_type: The type of RNN to use as encoder in the module.
                Must be a class inheriting from torch.nn.RNNBase
                (such as torch.nn.LSTM for example).
            input_size: The number of expected features in the input of the
                module.
            hidden_size: The number of features in the hidden state of the RNN
                used as encoder by the module.
            num_layers: The number of recurrent layers in the encoder of the
                module. Defaults to 1.
            bias: If False, the encoder does not use bias weights b_ih and
                b_hh. Defaults to True.
            dropout: If non-zero, introduces a dropout layer on the outputs
                of each layer of the encoder except the last one, with dropout
                probability equal to 'dropout'. Defaults to 0.0.
            bidirectional: If True, the encoder of the module is bidirectional.
                Defaults to False.
        """
        assert issubclass(rnn_type, nn.RNNBase), \
            "rnn_type must be a class inheriting from torch.nn.RNNBase"

        super(Seq2SeqEncoder, self).__init__()

        self.rnn_type = rnn_type
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.bias = bias
        self.dropout = dropout
        self.bidirectional = bidirectional

        self._encoder = rnn_type(input_size,
                                 hidden_size,
                                 num_layers=num_layers,
                                 bias=bias,
                                 batch_first=True,
                                 dropout=dropout,
                                 bidirectional=bidirectional)

    def forward(self, sequences_batch, sequences_lengths):
        """
        Args:
            sequences_batch: A batch of variable length sequences of vectors.
                The batch is assumed to be of size
                (batch, sequence, vector_dim).
            sequences_lengths: A 1D tensor containing the sizes of the
                sequences in the input batch.

        Returns:
            reordered_outputs: The outputs (hidden states) of the encoder for
                the sequences in the input batch, in the same order.
        """
        # Sort by length, pack, run the RNN, unpack, and finally restore the
        # original order of the batch.
        sorted_batch, sorted_lengths, _, restoration_idx = \
            sort_by_seq_lens(sequences_batch, sequences_lengths)
        packed_batch = nn.utils.rnn.pack_padded_sequence(sorted_batch,
                                                         sorted_lengths,
                                                         batch_first=True)

        outputs, _ = self._encoder(packed_batch, None)

        outputs, _ = nn.utils.rnn.pad_packed_sequence(outputs,
                                                      batch_first=True)
        reordered_outputs = outputs.index_select(0, restoration_idx)

        return reordered_outputs


class SoftmaxAttention(nn.Module):
    """
    Attention layer taking premises and hypotheses encoded by an RNN as input
    and computing the soft attention between their elements.

    The dot product of the encoded vectors in the premises and hypotheses is
    first computed. The softmax of the result is then used in a weighted sum
    of the vectors of the premises for each element of the hypotheses, and
    conversely for the elements of the premises.
    """

    def forward(self,
                premise_batch,
                premise_mask,
                hypothesis_batch,
                hypothesis_mask):
        """
        Args:
            premise_batch: A batch of sequences of vectors representing the
                premises in some NLI task. The batch is assumed to have the
                size (batch, sequences, vector_dim).
            premise_mask: A mask for the sequences in the premise batch, to
                ignore padding data in the sequences during the computation of
                the attention.
            hypothesis_batch: A batch of sequences of vectors representing the
                hypotheses in some NLI task. The batch is assumed to have the
                size (batch, sequences, vector_dim).
            hypothesis_mask: A mask for the sequences in the hypotheses batch,
                to ignore padding data in the sequences during the computation
                of the attention.

        Returns:
            attended_premises: The sequences of attention vectors for the
                premises in the input batch.
            attended_hypotheses: The sequences of attention vectors for the
                hypotheses in the input batch.
        """
        # Dot product between premises and hypotheses in each sequence of
        # the batch (equation (3)).
        similarity_matrix = premise_batch.bmm(hypothesis_batch.transpose(2, 1)
                                                              .contiguous())

        # Softmax attention weights.
        prem_hyp_attn = masked_softmax(similarity_matrix, hypothesis_mask)
        hyp_prem_attn = masked_softmax(similarity_matrix.transpose(1, 2)
                                                        .contiguous(),
                                       premise_mask)

        # Weighted sums of the hypotheses for the premises attention,
        # and vice-versa for the attention of the hypotheses
        # (equations (4) and (5)).
        attended_premises = weighted_sum(hypothesis_batch,
                                         prem_hyp_attn,
                                         premise_mask)
        attended_hypotheses = weighted_sum(premise_batch,
                                           hyp_prem_attn,
                                           hypothesis_mask)

        return attended_premises, attended_hypotheses


class ESIM(nn.Module):
    """
    Implementation of the ESIM model presented in the paper "Enhanced LSTM for
    Natural Language Inference" by Chen et al.
    """

    def __init__(self,
                 vocab_size,
                 embedding_dim,
                 hidden_size,
                 embeddings=None,
                 padding_idx=0,
                 dropout=0.1,
                 num_classes=2):
        """
        Args:
            vocab_size: The size of the vocabulary of embeddings in the model.
            embedding_dim: The dimension of the word embeddings.
            hidden_size: The size of all the hidden layers in the network.
            embeddings: A tuple (vectors, words) of pretrained word embeddings,
                where 'vectors' has size (vocab_size, embedding_dim). The
                vectors are copied into the embedding layer and frozen.
                If None, word embeddings are initialised randomly.
                Defaults to None.
            padding_idx: The index of the padding token in the premises and
                hypotheses passed as input to the model. Defaults to 0.
            dropout: The dropout rate to use between the layers of the network.
                A dropout rate of 0 corresponds to using no dropout at all.
                Defaults to 0.1.
            num_classes: The number of classes in the output of the network.
                Defaults to 2.
        """
        super(ESIM, self).__init__()

        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.hidden_size = hidden_size
        self.num_classes = num_classes
        self.dropout = dropout

        self._word_embedding = nn.Embedding(self.vocab_size,
                                            self.embedding_dim,
                                            padding_idx=padding_idx)
        if embeddings:
            embvecs, embwords = embeddings
            self._word_embedding.weight.data.copy_(
                torch.from_numpy(np.asarray(embvecs)))
            self._word_embedding.weight.requires_grad = False

        if self.dropout:
            self._rnn_dropout = RNNDropout(p=self.dropout)
            # self._rnn_dropout = nn.Dropout(p=self.dropout)

        self._encoding = Seq2SeqEncoder(nn.LSTM,
                                        self.embedding_dim,
                                        self.hidden_size,
                                        bidirectional=True)

        self._attention = SoftmaxAttention()

        self._projection = nn.Sequential(nn.Linear(4 * 2 * self.hidden_size,
                                                   self.hidden_size),
                                         nn.ReLU())

        self._composition = Seq2SeqEncoder(nn.LSTM,
                                           self.hidden_size,
                                           self.hidden_size,
                                           bidirectional=True)

        self._classification = nn.Sequential(nn.Dropout(p=self.dropout),
                                             nn.Linear(2 * 4 * self.hidden_size,
                                                       self.hidden_size),
                                             nn.Tanh(),
                                             nn.Dropout(p=self.dropout),
                                             nn.Linear(self.hidden_size,
                                                       self.num_classes))

        # Initialize all weights and biases in the model.
        self.apply(_init_esim_weights)

    def forward(self,
                text_a_ids,
                text_b_ids,
                q1_char_inputs=None,
                q2_char_inputs=None,
                q1_lens=None,
                q2_lens=None,
                device='cpu'):
        """
        Args:
            text_a_ids: A batch of variable length sequences of word indices
                representing premises. The batch is assumed to be of size
                (batch, premises_length).
            text_b_ids: A batch of variable length sequences of word indices
                representing hypotheses. The batch is assumed to be of size
                (batch, hypotheses_length).
            q1_char_inputs, q2_char_inputs: Unused here; kept for interface
                compatibility.
            q1_lens: A 1D tensor containing the lengths of the premises.
            q2_lens: A 1D tensor containing the lengths of the hypotheses.
            device: The name of the device on which the model is being
                executed. Defaults to 'cpu'.

        Returns:
            logits: A tensor of size (batch, num_classes) containing the
                logits for each output class of the model.
        """
        premises = text_a_ids
        premises_lengths = q1_lens
        hypotheses = text_b_ids
        hypotheses_lengths = q2_lens

        premises_mask = get_mask(premises, premises_lengths).to(device)
        hypotheses_mask = get_mask(hypotheses, hypotheses_lengths).to(device)

        embedded_premises = self._word_embedding(premises)
        embedded_hypotheses = self._word_embedding(hypotheses)

        if self.dropout:
            embedded_premises = self._rnn_dropout(embedded_premises)
            embedded_hypotheses = self._rnn_dropout(embedded_hypotheses)

        # Input encoding (equations (1) and (2)).
        encoded_premises = self._encoding(embedded_premises,
                                          premises_lengths)
        encoded_hypotheses = self._encoding(embedded_hypotheses,
                                            hypotheses_lengths)

        # Local inference modeling (equations (3)-(5)).
        attended_premises, attended_hypotheses = \
            self._attention(encoded_premises, premises_mask,
                            encoded_hypotheses, hypotheses_mask)

        # Enhancement of local inference information (equations (6) and (7)).
        enhanced_premises = torch.cat([encoded_premises,
                                       attended_premises,
                                       encoded_premises - attended_premises,
                                       encoded_premises * attended_premises],
                                      dim=-1)
        enhanced_hypotheses = torch.cat([encoded_hypotheses,
                                         attended_hypotheses,
                                         encoded_hypotheses - attended_hypotheses,
                                         encoded_hypotheses * attended_hypotheses],
                                        dim=-1)

        projected_premises = self._projection(enhanced_premises)
        projected_hypotheses = self._projection(enhanced_hypotheses)

        if self.dropout:
            projected_premises = self._rnn_dropout(projected_premises)
            projected_hypotheses = self._rnn_dropout(projected_hypotheses)

        # Inference composition and pooling (equations (8)-(10)).
        v_ai = self._composition(projected_premises, premises_lengths)
        v_bj = self._composition(projected_hypotheses, hypotheses_lengths)

        v_a_avg = torch.sum(v_ai * premises_mask.unsqueeze(1)
                                                .transpose(2, 1), dim=1) \
            / torch.sum(premises_mask, dim=1, keepdim=True)
        v_b_avg = torch.sum(v_bj * hypotheses_mask.unsqueeze(1)
                                                  .transpose(2, 1), dim=1) \
            / torch.sum(hypotheses_mask, dim=1, keepdim=True)

        v_a_max, _ = replace_masked(v_ai, premises_mask, -1e7).max(dim=1)
        v_b_max, _ = replace_masked(v_bj, hypotheses_mask, -1e7).max(dim=1)

        v = torch.cat([v_a_avg, v_a_max, v_b_avg, v_b_max], dim=1)

        logits = self._classification(v)
        # probabilities = nn.functional.softmax(logits, dim=-1)

        return logits


def _init_esim_weights(module):
    """
    Initialise the weights of the ESIM model.
    """
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight.data)
        nn.init.constant_(module.bias.data, 0.0)

    elif isinstance(module, nn.LSTM):
        nn.init.kaiming_normal_(module.weight_ih_l0.data)
        nn.init.orthogonal_(module.weight_hh_l0.data)
        nn.init.constant_(module.bias_ih_l0.data, 0.0)
        nn.init.constant_(module.bias_hh_l0.data, 0.0)
        # Set the forget-gate bias to 1.
        hidden_size = module.bias_hh_l0.data.shape[0] // 4
        module.bias_hh_l0.data[hidden_size:(2 * hidden_size)] = 1.0

        if module.bidirectional:
            nn.init.kaiming_normal_(module.weight_ih_l0_reverse.data)
            nn.init.orthogonal_(module.weight_hh_l0_reverse.data)
            nn.init.constant_(module.bias_ih_l0_reverse.data, 0.0)
            nn.init.constant_(module.bias_hh_l0_reverse.data, 0.0)
            module.bias_hh_l0_reverse.data[hidden_size:(2 * hidden_size)] = 1.0
```
The utils code:
```python
import torch
import torch.nn as nn


# Code widely inspired from:
# https://github.com/allenai/allennlp/blob/master/allennlp/nn/util.py.
def sort_by_seq_lens(batch, sequences_lengths, descending=True):
    """
    Sort a batch of padded variable length sequences by length.

    Args:
        batch: A batch of padded variable length sequences. The batch should
            have the dimensions (batch_size x max_sequence_length x *).
        sequences_lengths: A tensor containing the lengths of the sequences in the
            input batch. The tensor should be of size (batch_size).
        descending: A boolean value indicating whether to sort the sequences
            by their lengths in descending order. Defaults to True.

    Returns:
        sorted_batch: A tensor containing the input batch reordered by
            sequences lengths.
        sorted_seq_lens: A tensor containing the sorted lengths of the
            sequences in the input batch.
        sorting_idx: A tensor containing the indices used to permute the input
            batch in order to get 'sorted_batch'.
        restoration_idx: A tensor containing the indices that can be used to
            restore the order of the sequences in 'sorted_batch' so that it
            matches the input batch.
    """
    sorted_seq_lens, sorting_index = \
        sequences_lengths.sort(0, descending=descending)

    sorted_batch = batch.index_select(0, sorting_index)

    idx_range = \
        sequences_lengths.new_tensor(torch.arange(0, len(sequences_lengths)))
    _, reverse_mapping = sorting_index.sort(0, descending=False)
    restoration_index = idx_range.index_select(0, reverse_mapping)

    return sorted_batch, sorted_seq_lens, sorting_index, restoration_index


def get_mask(sequences_batch, sequences_lengths):
    """
    Get the mask for a batch of padded variable length sequences.

    Args:
        sequences_batch: A batch of padded variable length sequences
            containing word indices. Must be a 2-dimensional tensor of size
            (batch, sequence).
        sequences_lengths: A tensor containing the lengths of the sequences in
            'sequences_batch'. Must be of size (batch).

    Returns:
        A mask of size (batch, max_sequence_length), where max_sequence_length
        is the length of the longest sequence in the batch. Positions holding
        the padding index 0 are set to 0, all others to 1.
    """
    batch_size = sequences_batch.size()[0]
    max_length = torch.max(sequences_lengths)
    mask = torch.ones(batch_size, max_length, dtype=torch.float)
    mask[sequences_batch[:, :max_length] == 0] = 0.0
    return mask


def get_atten_mask(sequences_batch, sequences_lengths):
    """
    Get the attention mask (the inverse of get_mask: 1 at padded positions,
    0 elsewhere).

    Args:
        sequences_batch: A batch of padded variable length sequences
            containing word indices. Must be a 2-dimensional tensor of size
            (batch, sequence).
        sequences_lengths: A tensor containing the lengths of the sequences in
            'sequences_batch'. Must be of size (batch).

    Returns:
        A mask of size (batch, max_sequence_length), where max_sequence_length
        is the length of the longest sequence in the batch.
    """
    batch_size = sequences_batch.size()[0]
    max_length = torch.max(sequences_lengths)
    mask = torch.zeros(batch_size, max_length, dtype=torch.float)
    mask[sequences_batch[:, :max_length] == 0] = 1.0
    return mask


# Code widely inspired from:
# https://github.com/allenai/allennlp/blob/master/allennlp/nn/util.py.
def masked_softmax(tensor, mask):
    """
    Apply a masked softmax on the last dimension of a tensor.
    The input tensor and mask should be of size (batch, *, sequence_length).

    Args:
        tensor: The tensor on which the softmax function must be applied along
            the last dimension.
        mask: A mask of the same size as the tensor with 0s in the positions of
            the values that must be masked and 1s everywhere else.

    Returns:
        A tensor of the same size as the inputs containing the result of the
        softmax.
    """
    tensor_shape = tensor.size()
    reshaped_tensor = tensor.view(-1, tensor_shape[-1])

    # Reshape the mask so it matches the size of the input tensor.
    while mask.dim() < tensor.dim():
        mask = mask.unsqueeze(1)
    mask = mask.expand_as(tensor).contiguous().float()
    reshaped_mask = mask.view(-1, mask.size()[-1])

    result = nn.functional.softmax(reshaped_tensor * reshaped_mask, dim=-1)
    result = result * reshaped_mask
    # 1e-13 is added to avoid divisions by zero.
    result = result / (result.sum(dim=-1, keepdim=True) + 1e-13)

    return result.view(*tensor_shape)


# Code widely inspired from:
# https://github.com/allenai/allennlp/blob/master/allennlp/nn/util.py.
def weighted_sum(tensor, weights, mask):
    """
    Apply a weighted sum on the vectors along the last dimension of 'tensor',
    and mask the vectors in the result with 'mask'.

    Args:
        tensor: A tensor of vectors on which a weighted sum must be applied.
        weights: The weights to use in the weighted sum.
        mask: A mask to apply on the result of the weighted sum.

    Returns:
        A new tensor containing the result of the weighted sum after the mask
        has been applied on it.
    """
    weighted_sum = weights.bmm(tensor)

    while mask.dim() < weighted_sum.dim():
        mask = mask.unsqueeze(1)
    mask = mask.transpose(-1, -2)
    mask = mask.expand_as(weighted_sum).contiguous().float()

    return weighted_sum * mask


# Code inspired from:
# https://github.com/allenai/allennlp/blob/master/allennlp/nn/util.py.
def replace_masked(tensor, mask, value):
    """
    Replace all the values of vectors in 'tensor' that are masked in 'mask'
    by 'value'.

    Args:
        tensor: The tensor in which the masked vectors must have their values
            replaced.
        mask: A mask indicating the vectors which must have their values
            replaced.
        value: The value to place in the masked vectors of 'tensor'.

    Returns:
        A new tensor of the same size as 'tensor' where the values of the
        vectors masked in 'mask' were replaced by 'value'.
    """
    mask = mask.unsqueeze(1).transpose(2, 1)
    reverse_mask = 1.0 - mask
    values_to_add = value * reverse_mask
    return tensor * mask + values_to_add
```
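Finally, a small usage sketch, assuming the model code above is saved as esim.py and the helper functions as utils/utils.py (with an __init__.py), so that the `from utils.utils import ...` line at the top of the model resolves. The word ids, lengths, and sizes below are made up for illustration.

```python
import torch
from esim import ESIM   # assumes the model code above is saved as esim.py

# Toy batch of 2 sentence pairs, padded with id 0.
model = ESIM(vocab_size=1000, embedding_dim=128, hidden_size=128, num_classes=2)

q1 = torch.tensor([[5, 8, 2, 9, 0], [3, 7, 4, 0, 0]])
q2 = torch.tensor([[6, 2, 4, 0], [9, 1, 0, 0]])
q1_lens = torch.tensor([4, 3])
q2_lens = torch.tensor([3, 2])

logits = model(q1, q2, q1_lens=q1_lens, q2_lens=q2_lens)
print(logits.shape)   # torch.Size([2, 2])
```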