Paper: ACL 2017
Paper link: http://tongtianta.site/paper/11096
Text matching is a key step in intelligent question answering (community QA): deciding whether two sentences are semantically similar. In FAQ-style machine question answering, a new input text (e.g. transcribed speech) is matched against the sentences already stored in the dialogue base, and once a match is found, the answer of the corresponding question is returned. The problem studied here is how to compute the semantic similarity between two sentences.
I. Model Principle
Enhanced LSTM for Natural Language Inference (ESIM) is a recently proposed model for text similarity that works very well in practice and has performed strongly in many text-matching competitions. This post introduces the basic principle of the model and a PyTorch implementation.

As the full name suggests, ESIM is an enhanced LSTM designed specifically for natural language inference. The original paper shows that this carefully designed sequential inference model based on chained LSTMs can outperform many previous, more complicated models.

The architecture of ESIM is shown in Figure 1. Its main components are input encoding, local inference modeling, and inference composition. The vertical direction of Figure 1 depicts these three major components; in the horizontal direction, the left side is the sequential inference model, named ESIM, and the right side is the variant that incorporates syntactic parsing information through tree LSTMs.

Here $a = (a_1, \dots, a_{\ell_a})$ and $b = (b_1, \dots, b_{\ell_b})$ denote the two sentences, where $\ell_a$ is the length of sentence $a$ and $\ell_b$ is the length of sentence $b$. In the original paper, $a$ is also called the premise and $b$ the hypothesis.
1. Input Encoding
A BiLSTM (bidirectional LSTM) is used as the input encoder; the two input sentences are encoded separately, as shown in (1) and (2):

$$\bar{a}_i = \mathrm{BiLSTM}(a, i), \quad \forall i \in [1, \dots, \ell_a] \tag{1}$$

$$\bar{b}_j = \mathrm{BiLSTM}(b, j), \quad \forall j \in [1, \dots, \ell_b] \tag{2}$$

We write $\bar{a}_i$ for the hidden (output) state generated by the BiLSTM over the input sequence $a$ at time step $i$; the same applies to $\bar{b}_j$.
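As a quick illustration of (1) and (2), the sketch below encodes two toy batches with a shared BiLSTM. The tensor sizes and variable names are made up for illustration only; padding and masking are handled properly in the full implementation in Part II.

```python
import torch
import torch.nn as nn

# Minimal sketch of equations (1)-(2): encode both sentences with one shared BiLSTM.
embedding_dim, hidden_size = 300, 300
bilstm = nn.LSTM(embedding_dim, hidden_size, batch_first=True, bidirectional=True)

a_embed = torch.randn(2, 7, embedding_dim)   # embedded premise a,    (batch, len_a, dim)
b_embed = torch.randn(2, 5, embedding_dim)   # embedded hypothesis b, (batch, len_b, dim)

a_bar, _ = bilstm(a_embed)                   # (batch, len_a, 2 * hidden_size)
b_bar, _ = bilstm(b_embed)                   # (batch, len_b, 2 * hidden_size)
```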
2. Local Inference Modeling
The attention weights between $\bar{a}_i$ and $\bar{b}_j$ are computed with a simple dot product, as shown in (3). The authors mention that they also tried more complex functions for the attention weights, but these did not yield better results.

$$e_{ij} = \bar{a}_i^{\top} \bar{b}_j \tag{3}$$

The interaction (aligned) representations between the two sentences are then computed:

$$\tilde{a}_i = \sum_{j=1}^{\ell_b} \frac{\exp(e_{ij})}{\sum_{k=1}^{\ell_b} \exp(e_{ik})} \bar{b}_j, \quad \forall i \in [1, \dots, \ell_a] \tag{4}$$

$$\tilde{b}_j = \sum_{i=1}^{\ell_a} \frac{\exp(e_{ij})}{\sum_{k=1}^{\ell_a} \exp(e_{kj})} \bar{a}_i, \quad \forall j \in [1, \dots, \ell_b] \tag{5}$$

Here $\tilde{a}_i$ is a weighted sum of the $\bar{b}_j$ (and, symmetrically, $\tilde{b}_j$ is a weighted sum of the $\bar{a}_i$).

The local inference information is then enhanced by computing the difference and the element-wise product of the tuples $\langle \bar{a}, \tilde{a} \rangle$ and $\langle \bar{b}, \tilde{b} \rangle$. The enhanced local inference information is:

$$m_a = [\bar{a};\, \tilde{a};\, \bar{a} - \tilde{a};\, \bar{a} \odot \tilde{a}] \tag{6}$$

$$m_b = [\bar{b};\, \tilde{b};\, \bar{b} - \tilde{b};\, \bar{b} \odot \tilde{b}] \tag{7}$$
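The following sketch shows how (3)-(7) map onto tensor operations for already-encoded batches. Masking of padded positions is omitted here for brevity, and all sizes are made up; the full implementation below handles masking with masked_softmax.

```python
import torch

# Minimal sketch of equations (3)-(7) on unpadded toy batches.
batch, len_a, len_b, d = 2, 7, 5, 600
a_bar = torch.randn(batch, len_a, d)   # encoded premise
b_bar = torch.randn(batch, len_b, d)   # encoded hypothesis

e = torch.bmm(a_bar, b_bar.transpose(1, 2))                            # (3): (batch, len_a, len_b)
a_tilde = torch.bmm(torch.softmax(e, dim=2), b_bar)                    # (4): weighted sum over b
b_tilde = torch.bmm(torch.softmax(e, dim=1).transpose(1, 2), a_bar)    # (5): weighted sum over a

m_a = torch.cat([a_bar, a_tilde, a_bar - a_tilde, a_bar * a_tilde], dim=-1)  # (6)
m_b = torch.cat([b_bar, b_tilde, b_bar - b_tilde, b_bar * b_tilde], dim=-1)  # (7)
```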
3. Inference Composition
Another BiLSTM is used to compose the local inference information in $m_a$ and $m_b$, producing sequences $v_{a,i}$ and $v_{b,j}$. For the information extracted by this BiLSTM, the authors argue that plain summation is sensitive to sequence length and not robust enough, so they use both max pooling and average pooling and concatenate the pooled results into a fixed-length vector:

$$v_{a,\mathrm{avg}} = \sum_{i=1}^{\ell_a} \frac{v_{a,i}}{\ell_a}, \quad v_{a,\max} = \max_{i=1}^{\ell_a} v_{a,i} \tag{8}$$

$$v_{b,\mathrm{avg}} = \sum_{j=1}^{\ell_b} \frac{v_{b,j}}{\ell_b}, \quad v_{b,\max} = \max_{j=1}^{\ell_b} v_{b,j} \tag{9}$$

$$v = [v_{a,\mathrm{avg}};\, v_{a,\max};\, v_{b,\mathrm{avg}};\, v_{b,\max}] \tag{10}$$

The vector $v$ is then fed into a fully connected classifier (a hidden layer with tanh activation followed by a softmax output layer) for the final prediction.
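A minimal sketch of (8)-(10) with made-up sizes follows; the real implementation below masks padded positions before pooling.

```python
import torch
import torch.nn as nn

# Minimal sketch of equations (8)-(10): average/max pooling plus an MLP classifier.
batch, len_a, len_b, d, hidden, num_classes = 2, 7, 5, 600, 300, 2
v_a = torch.randn(batch, len_a, d)   # BiLSTM composition of m_a
v_b = torch.randn(batch, len_b, d)   # BiLSTM composition of m_b

v_a_avg, v_a_max = v_a.mean(dim=1), v_a.max(dim=1).values    # (8)
v_b_avg, v_b_max = v_b.mean(dim=1), v_b.max(dim=1).values    # (9)
v = torch.cat([v_a_avg, v_a_max, v_b_avg, v_b_max], dim=1)   # (10): (batch, 4 * d)

classifier = nn.Sequential(nn.Linear(4 * d, hidden), nn.Tanh(),
                           nn.Linear(hidden, num_classes))
logits = classifier(v)                                        # (batch, num_classes)
```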
II. PyTorch Implementation
There are many implementations of ESIM on GitHub, but after trying them out I found that quite a few have problems. The most obvious one is that some of them run the BiLSTM over a padded batch without the pack/pad operations, so the features the BiLSTM extracts at padded positions are wrong. This bug troubled me for a long time when I was working on NER, so be careful with how BiLSTM is used in PyTorch. The complete ESIM implementation is given below; please leave a comment if you have questions.
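For reference, this is the general pack/pad pattern (a generic sketch with made-up sizes, not taken from any particular repository); the Seq2SeqEncoder class below wraps exactly this logic together with length sorting:

```python
import torch
import torch.nn as nn

# Run a BiLSTM over a padded batch correctly: pack so the LSTM stops at each
# sequence's true length, then unpack the output back to a padded tensor.
lstm = nn.LSTM(8, 16, batch_first=True, bidirectional=True)
batch = torch.randn(3, 10, 8)          # padded batch, max length 10
lengths = torch.tensor([10, 7, 4])     # true lengths, sorted in descending order

packed = nn.utils.rnn.pack_padded_sequence(batch, lengths, batch_first=True)
packed_out, _ = lstm(packed)
out, _ = nn.utils.rnn.pad_packed_sequence(packed_out, batch_first=True)  # (3, 10, 32)
```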
The full model code:

```python
import torch
import torch.nn as nn
import numpy as np

from utils.utils import sort_by_seq_lens, masked_softmax, weighted_sum, get_mask, replace_masked


# Class widely inspired from:
# https://github.com/allenai/allennlp/blob/master/allennlp/modules/input_variational_dropout.py
class RNNDropout(nn.Dropout):
    """
    Dropout layer for the inputs of RNNs.

    Apply the same dropout mask to all the elements of the same sequence in
    a batch of sequences of size (batch, sequences_length, embedding_dim).
    """

    def forward(self, sequences_batch):
        """
        Apply dropout to the input batch of sequences.

        Args:
            sequences_batch: A batch of sequences of vectors that will serve
                as input to an RNN.
                Tensor of size (batch, sequences_length, embedding_dim).

        Returns:
            A new tensor on which dropout has been applied.
        """
        ones = sequences_batch.data.new_ones(sequences_batch.shape[0],
                                             sequences_batch.shape[-1])
        dropout_mask = nn.functional.dropout(ones, self.p, self.training,
                                             inplace=False)
        return dropout_mask.unsqueeze(1) * sequences_batch


class Seq2SeqEncoder(nn.Module):
    """
    RNN taking variable length padded sequences of vectors as input and
    encoding them into padded sequences of vectors of the same length.

    This module is useful to handle batches of padded sequences of vectors
    that have different lengths and that need to be passed through a RNN.
    The sequences are sorted in descending order of their lengths, packed,
    passed through the RNN, and the resulting sequences are then padded and
    permuted back to the original order of the input sequences.
    """

    def __init__(self,
                 rnn_type,
                 input_size,
                 hidden_size,
                 num_layers=1,
                 bias=True,
                 dropout=0.0,
                 bidirectional=False):
        """
        Args:
            rnn_type: The type of RNN to use as encoder in the module.
                Must be a class inheriting from torch.nn.RNNBase
                (such as torch.nn.LSTM for example).
            input_size: The number of expected features in the input of the
                module.
            hidden_size: The number of features in the hidden state of the RNN
                used as encoder by the module.
            num_layers: The number of recurrent layers in the encoder of the
                module. Defaults to 1.
            bias: If False, the encoder does not use bias weights b_ih and
                b_hh. Defaults to True.
            dropout: If non-zero, introduces a dropout layer on the outputs
                of each layer of the encoder except the last one, with dropout
                probability equal to 'dropout'. Defaults to 0.0.
            bidirectional: If True, the encoder of the module is bidirectional.
                Defaults to False.
        """
        assert issubclass(rnn_type, nn.RNNBase), \
            "rnn_type must be a class inheriting from torch.nn.RNNBase"

        super(Seq2SeqEncoder, self).__init__()

        self.rnn_type = rnn_type
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.bias = bias
        self.dropout = dropout
        self.bidirectional = bidirectional

        self._encoder = rnn_type(input_size,
                                 hidden_size,
                                 num_layers=num_layers,
                                 bias=bias,
                                 batch_first=True,
                                 dropout=dropout,
                                 bidirectional=bidirectional)

    def forward(self, sequences_batch, sequences_lengths):
        """
        Args:
            sequences_batch: A batch of variable length sequences of vectors.
                The batch is assumed to be of size
                (batch, sequence, vector_dim).
            sequences_lengths: A 1D tensor containing the sizes of the
                sequences in the input batch.

        Returns:
            reordered_outputs: The outputs (hidden states) of the encoder for
                the sequences in the input batch, in the same order.
        """
        # Sort by length, pack, run the RNN, unpack, and finally restore the
        # original order of the batch.
        sorted_batch, sorted_lengths, _, restoration_idx = \
            sort_by_seq_lens(sequences_batch, sequences_lengths)
        packed_batch = nn.utils.rnn.pack_padded_sequence(sorted_batch,
                                                         sorted_lengths,
                                                         batch_first=True)

        outputs, _ = self._encoder(packed_batch, None)

        outputs, _ = nn.utils.rnn.pad_packed_sequence(outputs,
                                                      batch_first=True)
        reordered_outputs = outputs.index_select(0, restoration_idx)

        return reordered_outputs


class SoftmaxAttention(nn.Module):
    """
    Attention layer taking premises and hypotheses encoded by an RNN as input
    and computing the soft attention between their elements.

    The dot product of the encoded vectors in the premises and hypotheses is
    first computed. The softmax of the result is then used in a weighted sum
    of the vectors of the premises for each element of the hypotheses, and
    conversely for the elements of the premises.
    """

    def forward(self,
                premise_batch,
                premise_mask,
                hypothesis_batch,
                hypothesis_mask):
        """
        Args:
            premise_batch: A batch of sequences of vectors representing the
                premises in some NLI task. The batch is assumed to have the
                size (batch, sequences, vector_dim).
            premise_mask: A mask for the sequences in the premise batch, to
                ignore padding data in the sequences during the computation of
                the attention.
            hypothesis_batch: A batch of sequences of vectors representing the
                hypotheses in some NLI task. The batch is assumed to have the
                size (batch, sequences, vector_dim).
            hypothesis_mask: A mask for the sequences in the hypotheses batch,
                to ignore padding data in the sequences during the computation
                of the attention.

        Returns:
            attended_premises: The sequences of attention vectors for the
                premises in the input batch.
            attended_hypotheses: The sequences of attention vectors for the
                hypotheses in the input batch.
        """
        # Dot product between premises and hypotheses in each sequence of
        # the batch (equation (3)).
        similarity_matrix = premise_batch.bmm(hypothesis_batch.transpose(2, 1)
                                                              .contiguous())

        # Softmax attention weights.
        prem_hyp_attn = masked_softmax(similarity_matrix, hypothesis_mask)
        hyp_prem_attn = masked_softmax(similarity_matrix.transpose(1, 2)
                                                        .contiguous(),
                                       premise_mask)

        # Weighted sums of the hypotheses for the premises attention,
        # and vice-versa for the attention of the hypotheses
        # (equations (4) and (5)).
        attended_premises = weighted_sum(hypothesis_batch,
                                         prem_hyp_attn,
                                         premise_mask)
        attended_hypotheses = weighted_sum(premise_batch,
                                           hyp_prem_attn,
                                           hypothesis_mask)

        return attended_premises, attended_hypotheses


class ESIM(nn.Module):
    """
    Implementation of the ESIM model presented in the paper "Enhanced LSTM for
    Natural Language Inference" by Chen et al.
    """

    def __init__(self,
                 vocab_size,
                 embedding_dim,
                 hidden_size,
                 embeddings=None,
                 padding_idx=0,
                 dropout=0.1,
                 num_classes=2):
        """
        Args:
            vocab_size: The size of the vocabulary of embeddings in the model.
            embedding_dim: The dimension of the word embeddings.
            hidden_size: The size of all the hidden layers in the network.
            embeddings: A tuple (vectors, words) of pretrained word embeddings,
                where 'vectors' has size (vocab_size, embedding_dim). The
                vectors are copied into the embedding layer and frozen.
                If None, word embeddings are initialised randomly.
                Defaults to None.
            padding_idx: The index of the padding token in the premises and
                hypotheses passed as input to the model. Defaults to 0.
            dropout: The dropout rate to use between the layers of the network.
                A dropout rate of 0 corresponds to using no dropout at all.
                Defaults to 0.1.
            num_classes: The number of classes in the output of the network.
                Defaults to 2.
        """
        super(ESIM, self).__init__()

        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.hidden_size = hidden_size
        self.num_classes = num_classes
        self.dropout = dropout

        self._word_embedding = nn.Embedding(self.vocab_size,
                                            self.embedding_dim,
                                            padding_idx=padding_idx)
        if embeddings:
            embvecs, embwords = embeddings
            self._word_embedding.weight.data.copy_(
                torch.from_numpy(np.asarray(embvecs)))
            self._word_embedding.weight.requires_grad = False

        if self.dropout:
            self._rnn_dropout = RNNDropout(p=self.dropout)
            # self._rnn_dropout = nn.Dropout(p=self.dropout)

        self._encoding = Seq2SeqEncoder(nn.LSTM,
                                        self.embedding_dim,
                                        self.hidden_size,
                                        bidirectional=True)

        self._attention = SoftmaxAttention()

        self._projection = nn.Sequential(nn.Linear(4 * 2 * self.hidden_size,
                                                   self.hidden_size),
                                         nn.ReLU())

        self._composition = Seq2SeqEncoder(nn.LSTM,
                                           self.hidden_size,
                                           self.hidden_size,
                                           bidirectional=True)

        self._classification = nn.Sequential(nn.Dropout(p=self.dropout),
                                             nn.Linear(2 * 4 * self.hidden_size,
                                                       self.hidden_size),
                                             nn.Tanh(),
                                             nn.Dropout(p=self.dropout),
                                             nn.Linear(self.hidden_size,
                                                       self.num_classes))

        # Initialize all weights and biases in the model.
        self.apply(_init_esim_weights)

    def forward(self,
                text_a_ids,
                text_b_ids,
                q1_char_inputs=None,
                q2_char_inputs=None,
                q1_lens=None,
                q2_lens=None,
                device='cpu'):
        """
        Args:
            text_a_ids: A batch of variable length sequences of word indices
                representing premises. The batch is assumed to be of size
                (batch, premises_length).
            text_b_ids: A batch of variable length sequences of word indices
                representing hypotheses. The batch is assumed to be of size
                (batch, hypotheses_length).
            q1_char_inputs, q2_char_inputs: Unused here; kept for interface
                compatibility.
            q1_lens: A 1D tensor containing the lengths of the premises.
            q2_lens: A 1D tensor containing the lengths of the hypotheses.
            device: The name of the device on which the model is being
                executed. Defaults to 'cpu'.

        Returns:
            logits: A tensor of size (batch, num_classes) containing the
                logits for each output class of the model.
        """
        premises = text_a_ids
        premises_lengths = q1_lens
        hypotheses = text_b_ids
        hypotheses_lengths = q2_lens

        premises_mask = get_mask(premises, premises_lengths).to(device)
        hypotheses_mask = get_mask(hypotheses, hypotheses_lengths).to(device)

        embedded_premises = self._word_embedding(premises)
        embedded_hypotheses = self._word_embedding(hypotheses)

        if self.dropout:
            embedded_premises = self._rnn_dropout(embedded_premises)
            embedded_hypotheses = self._rnn_dropout(embedded_hypotheses)

        # Input encoding (equations (1) and (2)).
        encoded_premises = self._encoding(embedded_premises,
                                          premises_lengths)
        encoded_hypotheses = self._encoding(embedded_hypotheses,
                                            hypotheses_lengths)

        # Local inference modeling (equations (3)-(5)).
        attended_premises, attended_hypotheses = \
            self._attention(encoded_premises, premises_mask,
                            encoded_hypotheses, hypotheses_mask)

        # Enhancement of local inference information (equations (6) and (7)).
        enhanced_premises = torch.cat([encoded_premises,
                                       attended_premises,
                                       encoded_premises - attended_premises,
                                       encoded_premises * attended_premises],
                                      dim=-1)
        enhanced_hypotheses = torch.cat([encoded_hypotheses,
                                         attended_hypotheses,
                                         encoded_hypotheses - attended_hypotheses,
                                         encoded_hypotheses * attended_hypotheses],
                                        dim=-1)

        projected_premises = self._projection(enhanced_premises)
        projected_hypotheses = self._projection(enhanced_hypotheses)

        if self.dropout:
            projected_premises = self._rnn_dropout(projected_premises)
            projected_hypotheses = self._rnn_dropout(projected_hypotheses)

        # Inference composition and pooling (equations (8)-(10)).
        v_ai = self._composition(projected_premises, premises_lengths)
        v_bj = self._composition(projected_hypotheses, hypotheses_lengths)

        v_a_avg = torch.sum(v_ai * premises_mask.unsqueeze(1)
                                                .transpose(2, 1), dim=1) \
            / torch.sum(premises_mask, dim=1, keepdim=True)
        v_b_avg = torch.sum(v_bj * hypotheses_mask.unsqueeze(1)
                                                  .transpose(2, 1), dim=1) \
            / torch.sum(hypotheses_mask, dim=1, keepdim=True)

        v_a_max, _ = replace_masked(v_ai, premises_mask, -1e7).max(dim=1)
        v_b_max, _ = replace_masked(v_bj, hypotheses_mask, -1e7).max(dim=1)

        v = torch.cat([v_a_avg, v_a_max, v_b_avg, v_b_max], dim=1)

        logits = self._classification(v)
        # probabilities = nn.functional.softmax(logits, dim=-1)

        return logits


def _init_esim_weights(module):
    """
    Initialise the weights of the ESIM model.
    """
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight.data)
        nn.init.constant_(module.bias.data, 0.0)

    elif isinstance(module, nn.LSTM):
        nn.init.kaiming_normal_(module.weight_ih_l0.data)
        nn.init.orthogonal_(module.weight_hh_l0.data)
        nn.init.constant_(module.bias_ih_l0.data, 0.0)
        nn.init.constant_(module.bias_hh_l0.data, 0.0)
        # Set the forget-gate bias to 1.
        hidden_size = module.bias_hh_l0.data.shape[0] // 4
        module.bias_hh_l0.data[hidden_size:(2 * hidden_size)] = 1.0

        if module.bidirectional:
            nn.init.kaiming_normal_(module.weight_ih_l0_reverse.data)
            nn.init.orthogonal_(module.weight_hh_l0_reverse.data)
            nn.init.constant_(module.bias_ih_l0_reverse.data, 0.0)
            nn.init.constant_(module.bias_hh_l0_reverse.data, 0.0)
            module.bias_hh_l0_reverse.data[hidden_size:(2 * hidden_size)] = 1.0
```
The utils code:
```python
import torch
import torch.nn as nn


# Code widely inspired from:
# https://github.com/allenai/allennlp/blob/master/allennlp/nn/util.py.
def sort_by_seq_lens(batch, sequences_lengths, descending=True):
    """
    Sort a batch of padded variable length sequences by length.

    Args:
        batch: A batch of padded variable length sequences. The batch should
            have the dimensions (batch_size x max_sequence_length x *).
        sequences_lengths: A tensor containing the lengths of the sequences in the
            input batch. The tensor should be of size (batch_size).
        descending: A boolean value indicating whether to sort the sequences
            by their lengths in descending order. Defaults to True.

    Returns:
        sorted_batch: A tensor containing the input batch reordered by
            sequences lengths.
        sorted_seq_lens: A tensor containing the sorted lengths of the
            sequences in the input batch.
        sorting_idx: A tensor containing the indices used to permute the input
            batch in order to get 'sorted_batch'.
        restoration_idx: A tensor containing the indices that can be used to
            restore the order of the sequences in 'sorted_batch' so that it
            matches the input batch.
    """
    sorted_seq_lens, sorting_index = \
        sequences_lengths.sort(0, descending=descending)

    sorted_batch = batch.index_select(0, sorting_index)

    idx_range = \
        sequences_lengths.new_tensor(torch.arange(0, len(sequences_lengths)))
    _, reverse_mapping = sorting_index.sort(0, descending=False)
    restoration_index = idx_range.index_select(0, reverse_mapping)

    return sorted_batch, sorted_seq_lens, sorting_index, restoration_index


def get_mask(sequences_batch, sequences_lengths):
    """
    Get the mask for a batch of padded variable length sequences.

    Args:
        sequences_batch: A batch of padded variable length sequences
            containing word indices. Must be a 2-dimensional tensor of size
            (batch, sequence).
        sequences_lengths: A tensor containing the lengths of the sequences in
            'sequences_batch'. Must be of size (batch).

    Returns:
        A mask of size (batch, max_sequence_length), where max_sequence_length
        is the length of the longest sequence in the batch. Positions holding
        the padding index 0 are set to 0, all others to 1.
    """
    batch_size = sequences_batch.size()[0]
    max_length = torch.max(sequences_lengths)
    mask = torch.ones(batch_size, max_length, dtype=torch.float)
    mask[sequences_batch[:, :max_length] == 0] = 0.0
    return mask


def get_atten_mask(sequences_batch, sequences_lengths):
    """
    Get the attention mask (the inverse of get_mask: 1 at padded positions,
    0 elsewhere).

    Args:
        sequences_batch: A batch of padded variable length sequences
            containing word indices. Must be a 2-dimensional tensor of size
            (batch, sequence).
        sequences_lengths: A tensor containing the lengths of the sequences in
            'sequences_batch'. Must be of size (batch).

    Returns:
        A mask of size (batch, max_sequence_length), where max_sequence_length
        is the length of the longest sequence in the batch.
    """
    batch_size = sequences_batch.size()[0]
    max_length = torch.max(sequences_lengths)
    mask = torch.zeros(batch_size, max_length, dtype=torch.float)
    mask[sequences_batch[:, :max_length] == 0] = 1.0
    return mask


# Code widely inspired from:
# https://github.com/allenai/allennlp/blob/master/allennlp/nn/util.py.
def masked_softmax(tensor, mask):
    """
    Apply a masked softmax on the last dimension of a tensor.
    The input tensor and mask should be of size (batch, *, sequence_length).

    Args:
        tensor: The tensor on which the softmax function must be applied along
            the last dimension.
        mask: A mask of the same size as the tensor with 0s in the positions of
            the values that must be masked and 1s everywhere else.

    Returns:
        A tensor of the same size as the inputs containing the result of the
        softmax.
    """
    tensor_shape = tensor.size()
    reshaped_tensor = tensor.view(-1, tensor_shape[-1])

    # Reshape the mask so it matches the size of the input tensor.
    while mask.dim() < tensor.dim():
        mask = mask.unsqueeze(1)
    mask = mask.expand_as(tensor).contiguous().float()
    reshaped_mask = mask.view(-1, mask.size()[-1])

    result = nn.functional.softmax(reshaped_tensor * reshaped_mask, dim=-1)
    result = result * reshaped_mask
    # 1e-13 is added to avoid divisions by zero.
    result = result / (result.sum(dim=-1, keepdim=True) + 1e-13)

    return result.view(*tensor_shape)


# Code widely inspired from:
# https://github.com/allenai/allennlp/blob/master/allennlp/nn/util.py.
def weighted_sum(tensor, weights, mask):
    """
    Apply a weighted sum on the vectors along the last dimension of 'tensor',
    and mask the vectors in the result with 'mask'.

    Args:
        tensor: A tensor of vectors on which a weighted sum must be applied.
        weights: The weights to use in the weighted sum.
        mask: A mask to apply on the result of the weighted sum.

    Returns:
        A new tensor containing the result of the weighted sum after the mask
        has been applied on it.
    """
    weighted_sum = weights.bmm(tensor)

    while mask.dim() < weighted_sum.dim():
        mask = mask.unsqueeze(1)
    mask = mask.transpose(-1, -2)
    mask = mask.expand_as(weighted_sum).contiguous().float()

    return weighted_sum * mask


# Code inspired from:
# https://github.com/allenai/allennlp/blob/master/allennlp/nn/util.py.
def replace_masked(tensor, mask, value):
    """
    Replace all the values of vectors in 'tensor' that are masked in 'mask'
    by 'value'.

    Args:
        tensor: The tensor in which the masked vectors must have their values
            replaced.
        mask: A mask indicating the vectors which must have their values
            replaced.
        value: The value to place in the masked vectors of 'tensor'.

    Returns:
        A new tensor of the same size as 'tensor' where the values of the
        vectors masked in 'mask' were replaced by 'value'.
    """
    mask = mask.unsqueeze(1).transpose(2, 1)
    reverse_mask = 1.0 - mask
    values_to_add = value * reverse_mask
    return tensor * mask + values_to_add
```
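Finally, a small usage sketch, assuming the model code above is saved as esim.py and the helper functions as utils/utils.py (with an __init__.py), so that the `from utils.utils import ...` line at the top of the model resolves. The word ids, lengths, and sizes below are made up for illustration.

```python
import torch
from esim import ESIM   # assumes the model code above is saved as esim.py

# Toy batch of 2 sentence pairs, padded with id 0.
model = ESIM(vocab_size=1000, embedding_dim=128, hidden_size=128, num_classes=2)

q1 = torch.tensor([[5, 8, 2, 9, 0], [3, 7, 4, 0, 0]])
q2 = torch.tensor([[6, 2, 4, 0], [9, 1, 0, 0]])
q1_lens = torch.tensor([4, 3])
q2_lens = torch.tensor([3, 2])

logits = model(q1, q2, q1_lens=q1_lens, q2_lens=q2_lens)
print(logits.shape)   # torch.Size([2, 2])
```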