
mathematics

Article

Neural Machine Translation with CARU-Embedding Layer and CARU-Gated Attention Layer

Sio-Kei Im 1,2 and Ka-Hou Chan 1,2,*

1 Faculty of Applied Sciences, Macao Polytechnic University, Macau, China; [email protected]
2 Engineering Research Centre of Applied Technology on Machine Translation and Artificial Intelligence, Macao Polytechnic University, Macau, China
* Correspondence: [email protected]

Abstract: The attention mechanism performs well for the Neural Machine Translation (NMT) task,
but heavily depends on the context vectors generated by the attention network to predict target
words. This reliance raises the issue of long-term dependencies. Indeed, it is very common to combine
predicates with postpositions in sentences, and the same predicate may have different meanings
when combined with different postpositions. This usually poses an additional challenge to the NMT
study. In this work, we observe that the embedding vectors of different target tokens can be classified
by part-of-speech, thus we analyze the Natural Language Processing (NLP) related Content-Adaptive
Recurrent Unit (CARU) and apply it to our attention model (CAAtt) and embedding layer
(CAEmbed). By encoding the source sentence with the current decoded feature through the CARU,
CAAtt is capable of achieving translation content-adaptive representations, whose attention weights
are refined and enhanced by our proposed L1 (exp( Nx )) normalization. Furthermore, CAEmbed
aims to alleviate long-term dependencies in the target language through a partial recurrent design,
performing the feature extraction from a local perspective. Experiments on the WMT14, WMT17, and
Multi30k translation tasks show that the proposed model achieves improvements in BLEU scores
and enhancement of convergence over the attention-based plain NMT model. We also investigate
the attention weights generated by the proposed approaches, which indicate that refinement over
the different combinations of adposition can lead to different interpretations. Specifically, this work
provides local attention to some specific phrases translated in our experiment. The results demonstrate
that our approach is effective in improving performance and achieving a more reasonable attention
distribution compared to the state-of-the-art models.

Keywords: neural network; Neural Machine Translation (NMT); Natural Language Processing (NLP);
attention mechanism; Content-Adaptive Recurrent Unit (CARU)

MSC: 68T07; 68T50

Citation: Im, S.-K.; Chan, K.-H. Neural Machine Translation with CARU-Embedding Layer and
CARU-Gated Attention Layer. Mathematics 2024, 12, 997. https://doi.org/10.3390/math12070997

Academic Editor: Danilo Costarelli

Received: 15 February 2024; Revised: 11 March 2024; Accepted: 18 March 2024; Published: 27 March 2024

Copyright: © 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution (CC BY)
license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

Neural Machine Translation (NMT) has garnered considerable attention in recent years
as it allows for a large, single, end-to-end trainable neural network for translation [1,2].
The majority of proposed NMT models are part of the encoder-decoder family, which
incorporates an encoder and a decoder for each language, or utilizes language-specific
encoders that process sentences and present comparable outputs [3,4]. The neural network
for encoding reads the source sentence and transforms it into a sentence-level vector.
The decoder is then responsible for generating the translation based on the encoded vector.
The whole system, which involves the encoder and decoder of a language pair, is mutually
trained to boost the likelihood of accurate translation for a given source sentence. An advanced
technology in neural machine translation is the attention mechanism [5]. This mechanism,
including Transformer [6] and Vanilla Attention [7], acts as an information bridge between
the encoder and decoder. By dynamically detecting the pertinent source word to predict the
forthcoming target word, it generates a context vector. Intuitively, different target words
will align to different source words which results in varied context vectors during decoding.
These context vectors must be discriminatory enough to predict the target words accurately,
or the same target words may be repeatedly generated. However, this is typically not the
case in practice, even when the attended source words are relevant. We note that the context
vectors are highly comparable to each other, with minor variations in each dimension across
decoding steps. This indicates that the Vanilla Attention mechanism does not accurately
differentiate between various translation predictions. We believe that the explanation for
this is linked to the configuration of the attention mechanism, which provides a weighted
sum of source representations (hidden features from the encoder) that are invariant across
the decoding step.
Figure 1 illustrates the enhancement in the quality of translation over the past few
years, as measured by the BLEU score [8] (the most commonly used metric for evaluating machine
translation systems is BLEU-4), along with the corresponding NMT model. The performance is
greatly improved by applying the attention approach under high-resource conditions (one of the
most popular benchmarks for machine translation systems is the WMT family of datasets, most
commonly WMT14 and WMT16).
Therefore, we aim to explore the practicality of extending attention models for enhanc-
ing the capacity of these vectors. This promising approach can significantly enhance the
translation performance of the model by boosting its discriminative capability.

[Figure 1: a scatter plot of BLEU-4 SCORE (y-axis, 24 to 34) against year (x-axis, 2018 to 2021), marking Transformer Big, Transformer + Relative Position, Evolved Transformer Big, T5-11B + Attention, Attention + Rep(Uni), and Attention + Transformer Cycle (Rev); models with the highest BLEU-4 score are distinguished from other models.]

Figure 1. BLEU-4 scores for various NMT models. The “Attention” architectures have contributed
to major improvements in machine translation. The quality of the attention-related models outper-
formed the other models, with only the attention approach scoring greater than 30.0 (as measured by
case-insensitive BLEU-4).

In this work, we introduce a new CARU-gated attention layer (CAAtt) and a CARU-
Embedding (CAEmbed) layer for decoded word embedding in NMT. The overall framework
of our model, as illustrated in Figure 2, presents the structural and feature connections
between the two CAAtt and CAEmbed layers. In particular, CAAtt expands the Vanilla
Attention network by inserting a gating layer based on the concept of Content-Adaptive
Recurrent Unit (CARU) [2]. CARU utilizes the original source representation as its history
and the corresponding previous decoder state as its current input. In this way, CAAtt can
focus on creating source representations that take into account translation effects. This
helps enhance the discriminative power of the context vector, making it more effective in
predicting the subsequent target term. By considering the impact of translation, the model
can better understand the context and make more accurate predictions. Afterward, CAEmbed
enhances word embedding by combining it with part-CARU. This integration involves
processing only the short-term hidden state(s) in a partial loop. This technique optimizes
the key information present in the embedded vector. Specifically, it aims to reduce the
reliance on punctuation and increase the adaptation of relevant keywords. This approach
proves to be particularly advantageous for non-English sentences, where the structure and
grammar can differ significantly from English. It is also beneficial for languages that utilize
postpositions extensively. Through the combination of CAAtt and CAEmbed, the translation
process can be improved, leading to more accurate and effective results.

[Figure 2: source tokens x0–x3 are encoded into states ⃗h0–⃗h3 and passed through the CARU-Gated Network (CAAtt layer) to produce gated states ⃗hg0–⃗hg3; the attention output ATT(CAAtt(h, ⃗st−1 ), ⃗st−1 ) × CAAtt(h, ⃗st−1 ) then feeds the part-CARU units of the CAEmbed layer, yielding decoder states ⃗s0–⃗s3 and target tokens y0–y3.]

Figure 2. An overview of the Vanilla Attention network with proposed layers of CAAtt and CAEmbed.
The source and target sides are denoted by blue and yellow, and the green and orange colors represent
the (embedded) information flow for target word prediction and attention, respectively. The red color
indicates the CARU-gated layer. “ATT” represents the procedure for calculating attention weights.
⃗st−1 refers to the previous decoder state, corresponding to the current step t. More details about the
equations can be found in Sections 4 and 5.
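To make the data flow in Figure 2 concrete, the following minimal sketch outlines one decoding step as we read the figure; it is written in PyTorch-style Python, and the callables caatt, att, part_caru and decoder_cell are placeholders standing in for the components defined in Sections 4 and 5, not a released implementation.

    import torch

    def decode_step(h, s_prev, y_prev_emb, caatt, att, part_caru, decoder_cell):
        # h: encoder states (src_len, hidden); s_prev: previous decoder state (hidden,)
        # y_prev_emb: embedding of the previously generated target token (hidden,)
        g = caatt(h, s_prev)                        # CAAtt refines the encoder states, Eq. (4)
        alpha = att(g, s_prev)                      # attention weights over the gated states, (src_len,)
        c_t = (alpha.unsqueeze(1) * g).sum(dim=0)   # context vector c_t = ATT(g, s_{t-1}) x g
        h_embed = part_caru(y_prev_emb, c_t)        # CAEmbed: partial recurrence with p = 1 (Algorithm 1)
        s_t = decoder_cell(h_embed, s_prev)         # next decoder state used to predict y_t
        return s_t, c_t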

Considering that CARU has the ability to control/adjust the feature flow between
weight and current hidden state using its content-adaptive gate and update gate, we pro-
pose an adaptation of CAAtt that takes into account the previous decoder’s ⃗s<t state as
short-term history and the original source representation as the input feature. Furthermore,
to enhance word prediction accuracy, a context-search of part-CARU has been incorporated.
Both proposed layers are straightforward and efficient for the training and decoding of
NMT. The validation is conducted on both the Multi30k and WMT14 datasets to assess
their performance in English–German translation tasks. The experimental results report
that the proposed models outperform the attention-based plain NMT significantly. The gen-
erated attention weights and context vectors were also scrutinized, demonstrating that the
attention weights are more precise and context vectors are more discriminative.
The proposed model enhances the widely used attention networks and also improves
its convergence speed and performance. Our contributions are summarised as follows:
• We investigate the features of the context vectors produced by the Vanilla Attention
model to fully analyze their ability to discriminate between discourse and part-of-
speech, and find that the underlying case is that decoding invariant source representa-
tions such as the weight of punctuation usually tends to dilute the information in the
entire sentence. We develop a CARU-gated attention (CAAtt) layer that dynamically
adjusts and refines the source representation based on partial translations.
• In order to increase the convergence speed, we introduce a normalization method and
involve it in the calculation of the attention weights. We also analyze its performance
and give a complete derivation procedure. Compared to Softmax, it provides stronger
gradients when there are numerous categories being predicted.
• Besides, the inference of the current predicate is also highly correlated with the next
word. In particular, various combinations of adposition can lead to different interpre-
tations. We introduce an alternative layer (CAEmbed) consisting of embedding and the
proposed part-CARU. It aims at weighting the current embedded vector in order to
enhance the consistency of target sentences.
• Several experiments on English–German translation tasks have been conducted. The re-
sults show that our model outperforms the baseline in dealing with phrases, generating
accurate attention weights, expanding the variability of the context vector, and im-
proving translation quality.

2. Related Work
In the early years, Sequence-to-Sequence (Seq2Seq) learning models were originally
proposed without an attention mechanism, relying mainly on an encoder
to project all features of the semantic details from the source-side into a fixed-length
vector [3,9,10]. Refs. [5,11] indicate that using a fixed-length vector is insufficient for
representing natural phrases. To address this, they introduced the attention mechanism
for neural machine translation, which enables automatic search for the part of the source
sentence, relevant to the next target word to be predicted. In practice, attention-based
models have demonstrated substantial improvements, outperforming other models in
machine translation tasks [12]. Also, ref. [13] investigates several effective approaches to
attentional weighting functions, applying local and global attention models in one-shot
modality, and then [14,15] enhance the attention-based NMT model by incorporating
information from multiple modalities. Refs. [16,17] introduce a coverage vector in order
to keep track of the history of attentional features, which allows the attention network
to represent the source sentences more intelligently, and also a coverage model
to track the coverage status using a full-coverage embedded vector [18,19]. In addition,
refs. [20,21] introduces the self-attention approach, which reduces the number of sequential
computations by short paths between distant words in entire paragraphs. It concludes that
these short paths are particularly useful for learning strong semantic feature extractors.
To supplement its long-term issue between sentences, ref. [22] brings it into recursion along
the context vector to help adjust future attention. Ref. [23] extends the (cross-)attention
mechanism by a recurrent connection to allow direct access to previous alignment decisions,
which incorporates several structural biases to improve the attention-based model by
involving Markov conditions, fertility, and consistency in the direction of translation [24,25].
Refs. [26,27] propose deep attention based on the low-level attentional information that
can automatically determine the refinement of attentional weights in a layer-wise manner.
Currently, refs. [28,29] proposes a self-attention deep NMT model with a text mining
approach to identify the vulnerability category from paragraphs. All these models tend to
investigate the variability of context vectors by generating more accurate attention weights.
Instead, our approach enables dynamic adjustment and resorting to source representations
based on partial translations. This is distinct from existing literature and can be integrated to
enhance context vectors’ discriminative power. Additionally, the autoregressive model [30]
is another closely related approach, which introduces a better way to train the NMT model
by using the regression of features on themselves and directly supervising the attention
weights of the NMT using well-trained word alignment. Also, ref. [31] treats the source
representation as a memory/feature and models it when dealing with different time series
over a long-term time, providing significant flexibility in dealing with the decoder and
the various different time-series patterns of that memory during translation. Moreover,
an advanced approach to Deep Reinforcement Learning (DRL) algorithms can also be
applied to NLP [32]; for example, Q-Learning has demonstrated significant potential for
solving NMT [33,34].
Inspired by the above approach, our proposed model treats the source representation
as a memory/feature and simulates the interaction between the decoder and this memory
through read and write operations during the translation process. This work aims to
highlight that employing the proposed composition of CARU (CAAtt and CAEmbed) in our
model can improve the efficiency of training and decoding, rather than merely addressing
the interactive attention content from the read operation. CAAtt can be seen as an extension
of Vanilla Attention, offering the advantage of alleviating gradient vanishing or exploding
problems during training [35,36]. It also simplifies LSTM [37,38] and presents a variant
design for GRU [39,40], enabling efficient computation. Additionally, the gate mechanism
in CAAtt is built on CARU. Consistency in language use is maintained throughout the
document. CARU is a recurrent unit that utilizes a content-adaptive gate and an update
gate to manage the importance of information from the hidden states and the current input
feature, respectively [2]. Within NMT, CARU has been proposed as both an encoder and
decoder function for the Seq2Seq model [2], and the option of using a gate design as an
alternative to the attention mechanism has also been presented [41,42]. To our knowledge,
the use of CARU as a gate mechanism has not been investigated previously.

3. Background
Similar to the basic Seq2Seq structure, the Vanilla Attention mechanism introduces
a context vector/module between the encoder and decoder. This context vector aims to
collect the output of all units as input features to compute the probability distribution of
the source language (embedded) words for each feature word that the decoder wants to
predict. By applying this mechanism, the aim is to discover the relationship between the
encoder and decoder. This connection enables the decoder to capture global information
to a certain extent, instead of inferring only from the current hidden state. The attention
module identifies the source words pertinent to the succeeding target word and assigns
them with high attention weights while calculating the context vector ⃗ct by:

⃗ct = ATT(h,⃗st−1 ) × h (1)

where ATT denotes the alignment function that uses a feedforward neural network to
calculate the attention weights ⃗αt,i of the encoder states h = {⃗h0 ,⃗h1 , · · · ,⃗ht } with the previous
decoder state ⃗st−1 . For a target word instance, it first reviews over all encoder states to
compare the target and source word with the aim of computing a score for each state of
the encoder. Next, a Softmax function normalizes all scores and produces a probability
distribution conditional on the target word for determination, as follows:

⃗αt,i = exp(⃗et,i ) / ∑k exp(⃗et,k )      (2)

where the relevance score ⃗et,i is estimated via an alignment equation proposed by [5]:
⃗et,i = ϕ(Wα [⃗st−1 | ⃗hi ]), where ϕ is the activation function of hyperbolic tangent. Intu-
itively, the higher the attention weights, the more important the index (determined by
arg maxi (⃗αt,i )) is for predicting the next word. Therefore, according to the Softmax function
that ensures the sum of the elements of vector ⃗αt,i is equal to 1, the attention model produces
the final context vector ⃗ct by weighting the encoder states h directly by:

⃗ct = ∑i ⃗αt,i ⃗hi      (3)
Although this model performs better than the others, we find that (2) employs the
Softmax function, meaning that the Vanilla Attention mechanism also inherits the shortcom-
ings of Softmax during the training process [43,44]: The convergence in the later stages
becomes very slow and the resulting context vectors are very similar to each other, making
them insufficiently discriminative. We aim to improve the prediction and enhance the
convergence in the following sections.
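As a point of reference for the following sections, a minimal PyTorch sketch of the Vanilla Attention step in (1)–(3) is given below; the alignment matrix Wα and the tensor sizes in the usage example are illustrative assumptions rather than the configuration used in our experiments.

    import torch
    import torch.nn.functional as F

    def vanilla_attention(h, s_prev, W_alpha):
        # h: encoder states (src_len, hidden); s_prev: previous decoder state (hidden,)
        # W_alpha: alignment matrix (1, 2 * hidden)
        src_len = h.size(0)
        s_rep = s_prev.unsqueeze(0).expand(src_len, -1)              # repeat s_{t-1} for every h_i
        e = torch.tanh(torch.cat([s_rep, h], dim=1) @ W_alpha.t())   # e_{t,i} = phi(W_alpha [s_{t-1} | h_i])
        alpha = F.softmax(e.squeeze(1), dim=0)                       # Eq. (2): Softmax over the scores
        c_t = (alpha.unsqueeze(1) * h).sum(dim=0)                    # Eq. (3): weighted sum of encoder states
        return c_t, alpha

    # Usage with random tensors (six source positions, hidden size 8).
    c_t, alpha = vanilla_attention(torch.randn(6, 8), torch.randn(8), torch.randn(1, 16))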

4. CARU-Gated Attention
The Vanilla Attention mechanism is used to overcome the NMT problem by allow-
ing the network to return to the input sequence instead of encoding all the information
into a fixed-length vector, but it still has some shortcomings we mentioned before. This
mechanism allows the embedded vector to access its internal memory/feature, which is
the hidden states generated by the encoder. In this interpretation, the network chooses to
retrieve something from the current feature of this encoder state, rather than considering
whether and how much it should be attended. With this in mind, we attempt to investigate
the reasons behind this by analyzing the regression of the context vector ⃗ct in (3). Achieving
a one-to-one mapping between two different languages in a translation process is difficult,
not just for NMT, but also for SMT. Source words often align with multiple target words,
resulting in different sentence structures and sequences. This dilutes attention weights due
to the decoding steps between the source word and its corresponding target word position.
According to (1), the context vector ⃗ct is derived from the decoder state ⃗st−1 , which is
temporarily activated in the previous recurrent step, and the attention weights are completely
dominated by the encoder states h through (2). In practice, the attention mechanism in NMT learns
in an unsupervised manner without explicit prior knowledge about alignment, and there is
always a lot of redundant information dragging the alignment, such as punctuation and
conjunctions that often dilute the information of the encoder states h. Theoretically, this
redundant information can be identified from their part-of-speech (also determined by
pre-processing). We can thus make use of these attempts to reduce the proportion/weight
of redundant information in the encoder, thus making the major words/features more
obvious for subsequent consideration.
In order to achieve this improvement, CARU’s approach is well worth referring to in
our design. In each decoding step, these redundant features are mitigated by the proposed
CAAtt layer before the encoder states are input to the Vanilla Attention model. There are
two objectives that this proposed gate layer must achieve:
1. Being able to dynamically adjust the weight of each embedded word in a recurrent step,
the adjusted encoder state should be more meaningful and represent the translation
context clearly, so that the decoder can extract useful context vectors for attention.
2. We aim to maintain the intention of (1) and therefore require that the feature
size and dimension produced by this layer be equal to those of the encoder state
(but the length is allowed to change); otherwise, it will cause data sparsity problems,
making training unstable and hard to converge.
In view of these considerations, the CARU-Gated layer is introduced, which dynamically
adapts h according to the previous decoder state ⃗st−1 employing a gate-controlled network,
expressed as:

g = CAAtt(h,⃗st−1 )      (4)

and the proposed layer then evolves by (1) to:

⃗ct = ATT( g,⃗st−1 ) × g = ATT(CAAtt(h,⃗st−1 ),⃗st−1 ) × CAAtt(h,⃗st−1 )

For the language/context vectors, humans/learning models can immediately capture the
meaning or pattern if they have learned it before, respectively. Corresponding to NLP, for a
standard sentence, its subject, verb, and object should receive more attention before other
Mathematics 2024, 12, 997 7 of 19

auxiliary information is considered. This process can be seen as a weighted assignment of


words in a sentence. Thus, by employing CAAtt, the previous decoder state ⃗st−1 is coupled
to the encoder states h in order to determine its weight within g, which also can be seen as
a tagging task in the proposed layer.

4.1. Gating Architecture


The features of CARU aim to alleviate the long-term dependence problem through
its content-adaptive gate, which weights the received states according to their categories
(such as part-of-speech). We study the gate control and investigate the data flow of CARU,
attempting to treat the previous decoder state ⃗st−1 as an input feature and refine it by
recurring over all encoder states h.
As presented in Figure 3, for t = 0, CAAtt initially returns ⃗g = W [⃗st−1 ] + B directly;
next, for t > 1, the complete architecture is given by:

⃗xt,i = W [⃗st−1 ] + B      (5a)
⃗nt,i = ϕ(⃗xt,i + Wn [⃗hi ] + Bn )      (5b)
⃗zt,i = σ(Wz [⃗st−1 |⃗hi ] + Bz )      (5c)
⃗lt,i = σ(⃗xt,i ) ⊙ ⃗zt,i      (5d)
⃗gt,i = (1̄ − ⃗lt,i ) ⊙ ⃗hi + ⃗lt,i ⊙ ⃗nt,i      (5e)

where ⊙ denotes the Hadamard product, and σ and ϕ denote the activation function of
sigmoid and hyperbolic tangent, respectively. It can be found that there are two data flows
(⃗nt,i and ⃗zt,i ) aimed to perform the linear combination of the current encoder state ⃗hi and
previous decoder state ⃗st−1 parallelly. (5b) is referred to as the design of an update gate,
which results in ⃗nt,i having been connected to the end of the combination of weights with
encoder states. Besides, (5c) determines to what extent the original source information
can be used to combine partial translations. In particular, (5d) is the CARU feature of the
context-adaptive gate that defines how much of the original source information can be
retained. The proposed CAAtt also incorporates the advantage of CARU that the σ (⃗xt,i )
in (5d) can be considered as a tagging task that connects the relation between the weight
and part-of-speech. For instance, the word-weight σ (⃗xt,i ) should be close to zero if the ⃗st−1
representing a punctuation (such as “full stop”), implied as:

σ (⃗xt,i ) → 0̄
=⇒ ⃗lt,i = σ(⃗xt,i ) ⊙ ⃗zt,i → 0̄
=⇒ ⃗gt,i = (1̄ − ⃗lt,i ) ⊙ ⃗hi + ⃗lt,i ⊙ ⃗nt,i → ⃗hi

where the result of the content-adaptive gate⃗lt,i will be close to zero regardless of the content
weight ⃗zt,i , which means that the produced state ⃗gt,i will also converge to the encoder state
⃗hi . It shows that low-weight words have less influence on the output. Finally, the use
of linear interpolation between ⃗hi and ⃗nt,i ensures that the refined source representations
satisfy the requirement above-mentioned: It can alleviate the complex interactions between
source sentences and partial translations, allowing CAAtt to efficiently control the matching
and data flow between them.
[Figure 3: the hidden state ht and the input st are combined through linear layers, a σ branch and a ϕ branch; the content-adaptive gate (with its 1− complement and Hadamard products ⊙) interpolates them to produce ht+1.]
Figure 3. An overview of gating architecture in CARU.
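For illustration, a minimal PyTorch sketch of the gating computation in (5a)–(5e) is given below; the module layout and layer sizes are our own assumptions for exposition and are not the implementation used in this work.

    import torch
    import torch.nn as nn

    class CAAttGate(nn.Module):
        # Sketch of the CARU-gated layer, Eqs. (5a)-(5e); nn.Linear supplies the bias terms B, Bn, Bz.
        def __init__(self, hidden):
            super().__init__()
            self.W = nn.Linear(hidden, hidden)        # Eq. (5a)
            self.Wn = nn.Linear(hidden, hidden)       # Eq. (5b)
            self.Wz = nn.Linear(2 * hidden, hidden)   # Eq. (5c), acting on [s_{t-1} | h_i]

        def forward(self, h, s_prev):
            # h: encoder states (src_len, hidden); s_prev: previous decoder state (hidden,)
            s_rep = s_prev.unsqueeze(0).expand(h.size(0), -1)
            x = self.W(s_rep)                                          # (5a)
            n = torch.tanh(x + self.Wn(h))                             # (5b) update branch
            z = torch.sigmoid(self.Wz(torch.cat([s_rep, h], dim=1)))   # (5c) content weight
            l = torch.sigmoid(x) * z                                   # (5d) content-adaptive gate
            return (1.0 - l) * h + l * n                               # (5e) refined states g

    # Usage: gate six encoder states of size 8 with a decoder state of size 8.
    g = CAAttGate(hidden=8)(torch.randn(6, 8), torch.randn(8))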

4.2. Attention Weights Normalisation


Once the ATT architecture is confirmed, we further aim to improve the performance
of attention weights normalization. As mentioned at the end of Section 3, the attention
weights are generally obtained by a Softmax function, but the convergence is slow in
practice. Therefore, we set out to discover a better normalization function in order to
enhance the convergence of this component:
Our derivation begins by investigating the form of the L1 -normalisation; for any f ( x ), we have:

L1 ( f ( xi )) = f ( xi ) / ∑k f ( xk )

where k runs from 1 to N, and N denotes the number of categories. With respect to the partial derivative,

∂/∂x j L1 ( f ( xi )) = (1/∑k f ( xk )) (1 − f ( xi )/∑k f ( xk )) ∂ f ( xi )/∂x j ,   i = j
∂/∂x j L1 ( f ( xi )) = (1/∑k f ( xk )) (− f ( xi )/∑k f ( xk )) ∂ f ( xi )/∂x j ,   i ≠ j      (6)

Note that L1 ( f 1 ( xi )) becomes the Softmax function if f 1 ( x ) = exp( x ). In contrast,
the normalisation method we propose here is L1 ( f 2 ( xi )) with f 2 ( x ) = exp( Nx ). Since
our goal is to enhance the convergence, we must find a condition that satisfies
∂/∂x j L1 ( f 2 ( xi )) ≥ ∂/∂x j L1 ( f 1 ( xi )), as follows:

(1/∑k f 2 ( xk )) (1 − f 2 ( xi )/∑k f 2 ( xk )) ∂ f 2 ( xi )/∂x j ≥ (1/∑k f 1 ( xk )) (1 − f 1 ( xi )/∑k f 1 ( xk )) ∂ f 1 ( xi )/∂x j ,   i = j
(1/∑k f 2 ( xk )) (− f 2 ( xi )/∑k f 2 ( xk )) ∂ f 2 ( xi )/∂x j ≥ (1/∑k f 1 ( xk )) (− f 1 ( xi )/∑k f 1 ( xk )) ∂ f 1 ( xi )/∂x j ,   i ≠ j

By definition, f 1 ( x ) = exp( x ) ensures that its range belongs to R+ , and the range of the
proposed f 2 ( x ) = exp( Nx ) also belongs to R+ . We simplify these inequalities and obtain:

(∑k f 1 ( xk )/∑k f 2 ( xk )) (∑k f 1 ( xk )/∑k f 2 ( xk )) ∂ f 2 ( xi )/∂x j ≥ ((∑k f 1 ( xk ) − f 1 ( xi ))/(∑k f 2 ( xk ) − f 2 ( xi ))) ∂ f 1 ( xi )/∂x j ,   i = j
(∑k f 1 ( xk )/∑k f 2 ( xk )) (∑k f 1 ( xk )/∑k f 2 ( xk )) ∂ f 2 ( xi )/∂x j ≤ ( f 1 ( xi )/ f 2 ( xi )) ∂ f 1 ( xi )/∂x j ,   i ≠ j

Rewrite as:

( f 1 ( xi )/ f 2 ( xi )) ∂ f 1 ( xi )/∂x j ≥ (∑k f 1 ( xk )/∑k f 2 ( xk )) (∑k f 1 ( xk )/∑k f 2 ( xk )) ∂ f 2 ( xi )/∂x j ≥ ((∑k f 1 ( xk ) − f 1 ( xi ))/(∑k f 2 ( xk ) − f 2 ( xi ))) ∂ f 1 ( xi )/∂x j .      (7)
It thus can be found that (7) leads to the necessary condition:

( f 1 ( xi )/ f 2 ( xi )) ∂ f 1 ( xi )/∂x j ≥ ((∑k f 1 ( xk ) − f 1 ( xi ))/(∑k f 2 ( xk ) − f 2 ( xi ))) ∂ f 1 ( xi )/∂x j   =⇒   ∑k f 2 ( xk )/ f 2 ( xi ) ≥ ∑k f 1 ( xk )/ f 1 ( xi ).

By substituting f 1 ( x ) = exp( x ) and f 2 ( x ) = exp( Nx ), the final inequality becomes:

∑k exp( N ( xk − xi )) ≥ ∑k exp( xk − xi ).

Since exp( x ) is a monotonically increasing function, it requires that N ≥ 1 in order
to satisfy the above inequalities. In fact, it is worth mentioning that N denotes the number of
categories, which means N is always greater than 1. Therefore, by using our normalisation
method L1 (exp( Nx )) with the necessary condition N ≥ 1, the enhancement of convergence
becomes more and more obvious with the increasing gradient.
Theoretically, Softmax normalization tends to project features into a probability domain,
which is sufficient for extracting target features for the classification problem. However,
for the computing of attention weights, the purpose of weighting is not only to determine
the maximum value; the other vectors also need to be assigned an appropriate
weight in order to merge them into the context vector. Intuitively, corresponding to (6),
the classification problem aims at increasing the gradient as long as i = j, while in attention
weights, it must be taken into account that the gradient is also increased in the case i ̸= j.
In order to verify this, we can substitute f 2 ( x ) = exp( Nx ) into (6), as follows:

∂/∂x j L1 ( f 2 ( xi )) = N × L1 ( f 2 ( xi )) (1 − L1 ( f 2 ( xi ))),   i = j
∂/∂x j L1 ( f 2 ( xi )) = N × L1 ( f 2 ( xi )) (− L1 ( f 2 ( xi ))),   i ≠ j

It is obvious that N always increases their gradient as long as N > 1 whether i is equal
to j or not. Based on the above justification, the proposed normalization method has the
ability to be integrated into the attention weights ⃗αt,i , and outperforms the Softmax function in
(2). Moreover, with respect to (6), it is worth mentioning that the result of ∑ j ∂/∂x j L1 ( f ( x )) is
always equal to zero, which means that the normalization procedure (whether proposed,
Softmax or others) only contributes to the convergence within the attention weights, allow-
ing for faster training, but providing no additional gradient to the final accuracy of the
overall network.
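As a small, self-contained illustration, note that L1 (exp( Nx )) coincides with a Softmax applied to scores scaled by N, so the proposed normalisation can be sketched in a few lines of PyTorch; the scores and the values of N below are toy examples only.

    import torch
    import torch.nn.functional as F

    def l1_exp_normalise(e, N):
        # L1(exp(N x)): exp(N e_i) / sum_k exp(N e_k), i.e. a Softmax over the scaled scores N * e.
        return F.softmax(N * e, dim=-1)

    e = torch.tensor([0.2, 1.0, -0.5, 0.3])
    for N in (1, 4):
        w = l1_exp_normalise(e, N)
        print(N, w, w.sum().item())   # the weights always sum to 1; a larger N sharpens them

With N = 1 this reduces to the ordinary Softmax in (2), which is consistent with the analysis above.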

5. Partial CARU for Embedding


In natural language, it is very common to combine predicates with postpositions in
sentences, and the same predicate may have different meanings when combined with
various postpositions, such as “look after” and “look for”. In addition, some specific
phrases have other meanings like “get axed” and “break a leg” meaning “retrenched” and
“good luck”, respectively, but their literal meaning is irrelevant to the expressed meaning.
There are also grammatically unsound phrases such as “state of the art” and “long time no
see”, but people still use them from time to time. In addition, for the NMT pre-processing,
word tokenization always decomposes phrases into separate words for regularising the
input data. Therefore, when the phrase "look for" is translated, NMT usually replaces
it with words of similar meaning like "find", "search", "discover", etc. The representation of
sentence meaning is acceptable in terms of comprehension but unfavorable in the metric
of BLEU score. It is very challenging for NMT to fully grasp the phrase. Given this, we
introduce the CAEmbed, a novel additive layer for context-embedding, stacked in the middle
of the attention network and the decoder states (corresponding to Figure 2). It aims to
coordinate the inference of the current predicate in order to improve accuracy by analyzing
the previous decoder state(s).
Mathematics 2024, 12, 997 10 of 19

Conditional Probability Expressions


Briefly, the fundamental work of the partial recurrent design is to predict a given
sequence of previously finite words in a probabilistic approach, where the length of the
sequence to be recurred is controlled by p. The CAEmbed probability of each word on the
target side can be expressed as below, starting from P(⃗s1 |⃗c1 ):

P(⃗s1:t |⃗c1:t ) = ∏_{i=t−p}^{t} P(⃗si |⃗s1:i−1 , ⃗c1:t ).      (8)

It can be discovered that the conditional probability of the current ⃗st is only affected by
p + 1 vectors, unlike the benchmark NMT approach that always takes into account all
previous ⃗s<t , which often leads to the long-term issue. Therefore, the probabilities obtained
from (8) can achieve more accurate results than others. Besides, the context vector ⃗ct must
be entered as the last input factor because it is encoded by the attention model and it also
contains quite important information about the hidden state. We arrange it as the last input
in order to prevent it from being diluted by other input features that follow. Intuitively,
the proposed layer of CAEmbed consists of one CARU that partially encodes the previous p
embedded vectors s = {⃗st− p , · · · ,⃗st−1 } and the context vector ⃗ct generated by the attention
model. A complete algorithm (Algorithm 1) of CAEmbed is as follows:

Algorithm 1: Pseudo code of CAEmbed layer architecture, with regard to Figure 2.


def CAEmbed:
    s ← {⃗st− p , · · · ,⃗st−1 }
    p ← |s|
    foreach ⃗s ∈ s do
        ⃗ht− p ← CARU(⃗st− p ,⃗ht− p−1 )
        p ← p − 1                      ▷ p ≥ 1
    end
    ⃗ht ← CARU(⃗ct ,⃗ht−1 )              ▷ p = 0
    return ⃗ht
end def

As presented in Algorithm 1, the CAEmbed layer receives the previous p features,
which perform a partial recurrence aiming to achieve local feature extraction and also
alleviate the long-term dependency problem. Since there are p recurrent steps in CAEmbed,
the time complexity can be roughly approximated by O( p) during encoding. Besides,
there is one CAEmbed layer stacked in our approach with respect to Figure 2. The memory
is mainly required by three linear layers within the CARU, that is (W + B), (Wn + Bn )
and (Wz + Bz ), the space complexity thus can be considered as O(W + B). In practice,
performing partial sentence recurrent can alleviate the pressure on the unit compared to
entire sentence recurrent, and part-CARU has good compatibility with phrases. In practice,
the number of words in a phrase is usually 2, thus we prefer a value of 1 for p, which means
the CAEmbed only performs analysis of the current ⃗st−1 and ⃗ct . Recall that CARU has the
function of part-of-speech analysis, which is also useful in postposition analysis. This is
because, in any language, postpositions are always followed by/connected to nouns or
verbs. According to this rule, CARU can ignore another word through the word-weight
σ (⃗xt,i ) generated by the content-adaptive gate (corresponding to Section 4.1). Furthermore,
RNN-based models always face the problem of data transfer and storage, and are also
challenged by the problem of long-term dependency during training. Since CAEmbed performs
only short loops in the middle of the sentence instead of receiving the whole sequence, this
approach enables the processing of predicates with local attention, releasing the processes
of (long-term) data transfer between recursive layers, which effectively alleviates data
transfer and storage overhead.
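To make Algorithm 1 concrete, a minimal PyTorch sketch of the partial recurrence with p = 1 is given below; a generic nn.GRUCell is used only as a stand-in for CARU (which is defined in [2] and not reproduced here), and the tensor sizes are illustrative assumptions.

    import torch
    import torch.nn as nn

    class CAEmbedSketch(nn.Module):
        # Sketch of Algorithm 1: recur over the previous p embedded vectors,
        # then feed the context vector last so that it is not diluted.
        def __init__(self, hidden):
            super().__init__()
            self.cell = nn.GRUCell(hidden, hidden)   # stand-in for CARU

        def forward(self, prev_embeds, c_t, h_prev):
            # prev_embeds: list of the p previous embedded vectors, each (1, hidden)
            # c_t: context vector from the attention model (1, hidden); h_prev: (1, hidden)
            h = h_prev
            for s in prev_embeds:        # the partial loop, p >= 1
                h = self.cell(s, h)
            return self.cell(c_t, h)     # the context vector is entered last (p = 0)

    # Usage with p = 1 and hidden size 8.
    layer = CAEmbedSketch(hidden=8)
    h_t = layer([torch.randn(1, 8)], torch.randn(1, 8), torch.zeros(1, 8))

With p = 1 this matches the setting used in our experiments, where only ⃗st−1 and ⃗ct are analysed.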

6. Experimental Results and Discussion


The experiments are conducted on the WMT14 and WMT17 English-German datasets
for our translation task simulations. WMT14 (https://www.statmt.org/wmt14/translation-
task.html, accessed on 15 February 2024) provides 1.5M sentence pairs, which include 116M
English words and 110M German words for training. According to the configuration
introduced in [13], we select newstest2013 and newstest2017 as the validation set and test
set, respectively. Next, WMT17 (https://www.statmt.org/wmt17/translation-task.html,
accessed on 15 February 2024) provides 5.58M sentence pairs, which include 141M English
words and 135M German words for training, and we use the concatenation of newstest2014
and newstest2015 as the validation set and newstest2017 as the test set. In practice, our
pre-processing performs word tokenization, Truecasing optimization [45], Byte-Pair
Encoding (BPE) segmentation [46] and sequence length restriction for data preparation.
Our model can accept sentences up to 100 words in length as input sequences and discard
tokens with frequencies lower than 4 when building the NMT vocabulary in order to limit
the vocabulary size of the source and target languages to 50M tokens.
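The snippet below sketches how these length and frequency constraints can be applied before building the vocabulary; the thresholds are the ones stated above, while the toy corpus and tokenisation are placeholders for the actual pre-processing pipeline.

    from collections import Counter

    MAX_LEN = 100    # maximum accepted sentence length (in tokens)
    MIN_FREQ = 4     # tokens rarer than this are discarded from the vocabulary

    def filter_and_build_vocab(tokenised_sentences):
        kept = [s for s in tokenised_sentences if len(s) <= MAX_LEN]
        counts = Counter(tok for s in kept for tok in s)
        vocab = {tok for tok, c in counts.items() if c >= MIN_FREQ}
        return kept, vocab

    # Toy usage; in practice the sentences come from the BPE-segmented WMT corpora.
    sentences, vocab = filter_and_build_vocab([["the", "man", "is", "looking"], ["a", "woman", "is", "looking"]])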
In order to implement our models, all NMT models are executed by multiple GPUs
(there are four Nvidia Quadro RTX 5000 with a total 4 × 16.0 = 64.0 GB of video memory
in one machine) with a distributed training strategy. We implement all models based on
the scientific deep learning framework PyTorch [47] with the NLP-related library TorchText
(https://github.com/pytorch/text, accessed on 15 February 2024), which provides the
toolkit for these pre-processing stages mentioned before, so they can reach our requirements
for preparing the experiments. Besides, we allocate a hidden state of 2^10 feature size to both
the recurrent encoder and decoder. All source and target words are represented by a 2^8-
dimensional embedding vector. All trainable parameters are randomly initialized according
to a normal distribution with mean and standard deviation defaulted to µ = 0 and σ = 0.01,
and employed the advanced gradient-related optimizer Adam [48] method with a learning
rate of 1 × 10−3 . Besides, we also devise a weight scheduler callback to ensure balanced
performance across multiple-layer translation. These callbacks dynamically adjust the loss
weight for the gradient produced during backpropagation at each training epoch. Such a
strategy helps the model prioritize categories that are currently underperforming, thereby
enhancing overall translation performance. All experiments use the same dataset in each
test, with a batch size of 100 per iteration set and the same configuration with the same
number of neural nodes. In addition, there is a scheduler for adjusting the learning rate,
which reduces the learning rate when the loss becomes stagnant and then stops training
when the learning rate is reduced to 1 × 10−8 . We set our best model parameters according
to the maximum BLEU-4 scores on the validation set.
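A hedged sketch of this optimiser and learning-rate schedule is shown below; model stands for any of the networks above, and the factor and patience values are illustrative assumptions, since only the initial rate of 1 × 10−3 and the stopping floor of 1 × 10−8 are fixed by the description.

    import torch
    import torch.nn as nn

    model = nn.Linear(8, 8)   # placeholder for the NMT model described above

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # Adam with the stated learning rate
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.1, patience=5, min_lr=1e-8)  # reduce when the loss stagnates

    for epoch in range(3):                        # toy loop; real training runs many epochs
        loss = model(torch.randn(100, 8)).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step(loss.item())               # the scheduler monitors the stagnating loss
    # Training is stopped once the learning rate has been reduced to the 1e-8 floor (not shown here).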
In addition to the benchmark attention model Transformer [6] and RNNSearch [5],
we also compare our proposed model with other state-of-the-art models: GRU-Inv [41] is
an advanced GRU-gated attention model; Transformer Cycle [49] consists of three main
strategies and shares parameters in a universal transformer model; Attention+Rep [50]
introduces a lightweight computational model by using a random sample of tokens as
input to the perturbation; T5-11B [51] proposes a unified framework for converting all
text-based language problems into a text-to-text format with numerous parameters in
the attention model. Q-Learning [52] makes use of the reinforcement learning approach
to achieve NMT. Note that we have also re-implemented these discovered models and
fine-tuned their settings to give them the best possible scores for comparison purposes.
The experimental results are indicated in Table 1. The results trained using the CAAtt
network are better than the benchmark results of the Transformer and RNNSearch models,
and the combination of CAAtt and CAEmbed further outperforms the state-of-the-art model
in terms of BLEU-4 scores. It can be seen that the improvement of L1 (exp( Nx )) is not
significant; instead, its error range is smaller than that of the others. Additionally, since the T5-11B
and our work are also based on the attention mechanism, we have achieved good results
in general tasks. In practice, since the proposed CAEmbed layer performs only a short
recurrence in the middle of a sentence, instead of receiving the whole sequence, it can
perform well in predicate translation. In contrast, T5-11B is trained on a large dataset
and can cover more predicates with postpositions in the dataset, thus achieving the recent
results obtained in this work. Overall, the proposed composite model of CAAtt and CAEmbed
with L1 (exp( Nx )) approach yields an average BLEU-4 score of 32.09 and 34.31 in WMT14
and WMT17, respectively (Note that BLEU-4 scores are validated in case insensitive because
we have Truecasing pre-processing). This is reasonable because CAAtt aims to improve
the encoding process of the attention mechanism, CAEmbed serves to refine the
content adaptation of word embedding and decoding, and L1 (exp( Nx )) enhances
their convergence, making the hidden states more discriminative.

Table 1. BLEU-4 scores with error ranges obtained from various translation models.

Group            | Models                           | WMT14 (En to De) | WMT14 (De to En) | WMT17 (En to De) | WMT17 (De to En)
Benchmark        | Transformer [6]                  | 31.28 ± 0.50     | 29.22 ± 0.50     | 32.78 ± 0.43     | 32.62 ± 0.42
Benchmark        | RNNSearch [5]                    | 31.45 ± 0.52     | 30.43 ± 0.30     | 32.83 ± 0.42     | 32.04 ± 0.36
State-Of-The-Art | GRU-Inv [41]                     | 31.84 ± 0.41     | 29.81 ± 0.42     | 32.61 ± 0.52     | 31.48 ± 0.28
State-Of-The-Art | T5-11B [51]                      | 30.75 ± 0.55     | 30.15 ± 0.35     | 32.92 ± 0.55     | 32.40 ± 0.23
State-Of-The-Art | Attention+Rep [50]               | 32.15 ± 0.23     | 31.39 ± 0.43     | 33.96 ± 0.40     | 33.61 ± 0.41
State-Of-The-Art | Transformer Cycle [49]           | 32.26 ± 0.49     | 31.88 ± 0.20     | 34.30 ± 0.44     | 33.79 ± 0.27
State-Of-The-Art | Q-Learning [52]                  | 31.38 ± 0.66     | 31.51 ± 0.31     | 32.85 ± 0.59     | 31.97 ± 0.24
Proposed         | CAAtt                            | 31.56 ± 0.28     | 30.51 ± 0.39     | 32.92 ± 0.24     | 32.59 ± 0.29
Proposed         | CAAtt + CAEmbed                  | 32.31 ± 0.19     | 31.94 ± 0.27     | 34.51 ± 0.21     | 34.16 ± 0.26
Proposed         | CAAtt + CAEmbed + L1 (exp( Nx )) | 32.48 ± 0.11     | 31.62 ± 0.14     | 34.27 ± 0.12     | 34.29 ± 0.08

6.1. Discriminative of Phrases


We present a qualitative analysis of how CAEmbed highlights the results of the proposed
CAAtt attention mechanism through visualization of the partial recurrence. The translated results
from the original attention are compared with the optimized results obtained by CAEmbed
layer. We study the correct translation ratio of phrases during post-processing to see
how our approach improves the translation quality. We select sentences from the WMT14
(De to En) validation set that are more difficult to translate correctly based on the length
and structure of the sentences, the use of phrases, and the number of punctuation marks for
verification. We translate them using the proposed network with and without CAEmbed
layer and various state-of-the-art models, and highlight the phrases in the target sentences
produced by these well-trained models.
As shown in Table 2, we observe that only our proposed model and Transformer
Cycle correctly translate the phrase (looked up), and the other results provide alternatives
(looked for, checked) with the same meaning, but none of them has the same tense as the
reference phrase. In fact, the contexts of the references all use the past tense, so to keep the
tense consistent, we believe that the tense of this translated sentence should also be in the
past tense, rather than using the present tense of the reference. Furthermore, as sentence
length increases, the errors are mainly in the latter part of the sentence. The attention
mechanism becomes more difficult to align with the source, which leads to more translation
errors and eventually damages the translation quality: It is acceptable to replace “condition”
and "association" with "situation" and "society", respectively, but the lack of "association"
is unacceptable, and “disease” is even more of a mistranslation. The incorrect attention
weights assigned by the original attention module are appropriately alleviated by CAAtt,
because it connects the decoded features to the attention layer thus having the ability to
compensate and provide sufficient feedback to the recurrent network corresponding to the
red arrow in Figure 2.

Table 2. Selected example of translations generated by different models. We underline the phrases
and bold the interesting parts for investigation. The BLEU-4 scores reflect the better performance of
the proposed model.

Source:
“Ich fühlte mich schuldig, weil ich neugierig gewesen war, also habe ich das niemandem erzählt,
aber ich ging nach Hause, suchte nach der Krankheit in der Bibliothek und schrieb an die
FA-Gesellschaft,” sagte sie.

Reference (BLEU-4: 1.000):
“I felt guilty because I’d been nosy and so I didn’t tell anybody, but I did come home,
look up the condition at the library and wrote to the FA Association,” she said.

Proposed (BLEU-4: 0.403):
“I felt guilty because I was nosy, so I didn’t tell anyone, but I went home,
looked up the situation at the library and wrote to the FA society,” she said.

Transformer Cycle [49] (BLEU-4: 0.331):
“I felt guilty because I was curious, so I didn’t tell anyone, but I went home,
looked up the disease in the library and wrote to the FA,” she said.

Attention+Rep [50] (BLEU-4: 0.321):
“I felt guilty for being curious, so I didn’t tell anyone, but I went home and
looked for the disease in the library and wrote to the FA society,” she said.

T5-11B [51] (BLEU-4: 0.381):
“I felt guilty because I was nosy so I didn’t tell anyone, but I did go home,
checked the situation at the library and wrote to the FA,” she said.

GRU-Inv [41] (BLEU-4: 0.377):
“I felt guilty because I was nosy, so I didn’t tell anyone, but I did go home,
checked the situation in the library and wrote to the FA,” she said.

This can be identified in the example of attention weights for attentional reinforcement
and correction of phrase in Table 3. It is clear that the weights of punctuation are quite
small, always less than 0.02, and the next smallest is the conjunction, between 0.02 and 0.04.
The lower weight means that the translated result can still roughly express the original
meaning of the sentence, even if they are lost during the translation process. The next
notable ones are adjectives and nouns, with about and over 0.2. This means that they
must not be lost in the translation process, otherwise the meaning of the sentence will be
incomplete. The above is the attention weights obtained from the proposed CAAtt network.
Besides, the weights of phrase “looked up” have been enhanced by the CAEmbed layer.
As a result, the number and order of words between the source and target sentences are
allowed to differ due to the attention mechanism. In the proposed network, the current
translated word is connected to CAAtt as a feedback feature and acts on the CARU gate
of CAAtt through (5a) to (5c), thus improving the match between the original and target
word, and allowing a more concise and accurate allocation of attention. In practice, we find
that the transition from the initial weights to the final well-trained weights is stable and
continuous, with the major words in the training process obtaining more attention weights
as the iterations increase.

Table 3. Calculation of attention weights for the proposed CAAtt network based on the example sentence
for the attentional reinforcement and correction of phrase.

“       I       felt    guilty  because I       was     nosy    ,
0.0087  0.1973  0.2507  0.0919  0.0332  0.1964  0.0196  0.0935  0.0036

so      I       didn’t  tell    anyone  ,       but     I       went    home    ,
0.0255  0.1928  0.0053  0.2220  0.0986  0.0019  0.0200  0.1880  0.2284  0.1970  0.0022

looked  up      the     situation  at   the     library and     wrote   to      the     FA      society ,
0.3514  0.2337  0.0161  0.1610  0.0138  0.0148  0.2188  0.0084  0.2298  0.0158  0.0121  0.2286  0.2049  0.0086

”       she     said    .
0.0037  0.1711  0.2275  0.0090

6.2. Convergence Performance


Next, we provide the performance analysis of the proposed normalisation method
described in Section 4.2 within the framework of the CAAtt network. Since the purpose of
Softmax classifier is to normalise the features to minimise the overall loss, increasing the
number of categories N generates more gradients as feedback for training, motivating it to
approach the target quickly. Therefore, better performance can be obtained as the number
of categories N increases. In order to validate the performance of proposed normalisation
approach, we apply it to the NMT experiments on the Multi30k [53] dataset, which can
highlight faster convergence due to the smaller size of its dataset.
As indicated in Figures 4 and 5, the proposed model achieves a better convergence
speed and BLEU-4 score. Compared with the selected Transformer model, our network
provides a stable and higher score. In general, the problem of missing phrases is not
well addressed because the attention mechanism tends to induce n-gram problems in the
translation process. With the enhancement of CAEmbed layer, it can alleviate the absence of
translated phrase-token such that the attention weights can tend to produce more phrase
content in the translation. Note that in Figure 5, the blue and green curves obtain very
close scores, which are trained with and without our proposed L1 (exp( Nx )) normalization,
respectively. Besides, all the experiments can be well-trained within 200 epochs, and our
proposed network can reach the best scores within 100 epochs and keep on this score until
the end of training. Also, as can be seen in Figure 4, applying L1 (exp( Nx )) can significantly
enhance the convergence speed and can generate more gradients for the feedback of
discriminative context vectors, making NMT suffer less from the n-gram problems and
thus completing the training quickly.

[Figure 4: training loss (y-axis, 0.00 to 0.40) against epoch (x-axis, 0 to 200) for Transformer, CAAtt, CAAtt+CAEmbed and CAAtt+CAEmbed+L1 (exp( Nx )).]

Figure 4. Convergence tests are performed on the Multi30k using various methods.

[Figure 5: BLEU-4 scores (y-axis, 0.20 to 0.34) against epoch (x-axis, 0 to 200) for Transformer, CAAtt, CAAtt+CAEmbed and CAAtt+CAEmbed+L1 (exp( Nx )).]

Figure 5. BLEU-4 scores obtained on Multi30k using various methods.

6.3. Content-Adaptive Gate Weighting


To find the underlying reason, we investigate the weights generated by the content-
adaptive gates of CARU, with respect to (5d) within the CAAtt and CAEmbed, aiming to
understand what type of part-of-speech this gate prefers to penalize or increase. We
collect the top 20 tokens that appeared most frequently during the pre-processing of
vocabulary constructions, and the attended tokens should have higher weight scores after
the inner product. Also, we excluded specific tokens such as "<unk>" or "<pad>",
which are used for the compatibility of model training but are not meaningful in
our evaluation.
As indicated in Figure 6, the CAAtt can be seen as evolving from a task based on NLP
labeling, and the results of CAEmbed tend towards a phrase detection function. Observing the
results of CAAtt, as mentioned in Section 6.1, the lower-weight tokens are stop words,
such as "a/an", "the" and punctuation, which can be uninformative for the NLP task.
In contrast, noun, verb, and adjective tokens can always be given a higher weight and their
features contain the major information that must be translated. Besides, CAEmbed adjusts the
weight distribution of frequently highlighted tokens to a more reasonable proportion. As a
result, phrase-related tokens, such as “looking”, “to”, and “with”, are further highlighted.

[Figure 6: bar chart of the content-adaptive gate weights of CAAtt and CAEmbed for the top-20 tokens: a, ., in, the, on, man, is, and, of, with, woman, “,”, two, are, to, people, at, an, looking, it.]
Figure 6. The weight of top-20 tokens generated by CARU’s content adaptive gate in CAAtt and
CAEmbed, respectively.

6.4. Attention Weights Interpretability


Finally, we try to investigate the interpretability of the weights of proposed layers,
which can be considered as follows: If the context is more relevant, then the learned
attention weights should be consistent with the natural measure of feature importance,
and the proposed attention distribution will change the output distribution of the model
intensively. Given such an assumption, we explore the relative importance when the
distribution of attention weights is enhanced. We use i to denote the token with the highest
attention weight in a sentence and draw another token j uniformly from the same sentence,
comparing the importance of i with that of the randomly attended token to see how
i influences the output distribution of the proposed model. Concretely, we measure the
difference of two Jensen–Shannon (JS) divergences [54]:

∆JS = JS( p, qi ) − JS( p, q j )      (9)

where JS( p, q) is the JS divergence of the network’s reference output distribution p and the
output distribution q. Intuitively, the concentration in the output distribution from removing
q j should be obvious if qi is truly the most important token for p, and the obtained ∆JS should
be positive. With respect to Figures 4 and 5, based on the proposed attention
network with the Multi30k dataset, we plot ∆JS against ∆α = αi − α j .
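The measurement in (9) can be sketched as follows; p, q_i and q_j below are toy output distributions standing in for the reference output and the outputs obtained when the most-attended and a randomly drawn token are perturbed, and the JS divergence is computed directly from its definition.

    import torch

    def js_divergence(p, q):
        # Jensen-Shannon divergence between two probability vectors.
        m = 0.5 * (p + q)
        kl_pm = (p * (p / m).log()).sum()
        kl_qm = (q * (q / m).log()).sum()
        return 0.5 * (kl_pm + kl_qm)

    p   = torch.tensor([0.70, 0.20, 0.10])   # reference output distribution
    q_i = torch.tensor([0.30, 0.40, 0.30])   # output after perturbing the most-attended token i
    q_j = torch.tensor([0.65, 0.22, 0.13])   # output after perturbing a randomly drawn token j

    delta_js = js_divergence(p, q_i) - js_divergence(p, q_j)   # Eq. (9); expected to be positive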
As illustrated in Figure 7, it can be found that these results are near the diagonal
line because the ideal case is p = qi . For this reason, we define the loss form as
∑ |∆α − ∆JS| / √2, which is the sum of the distances from each point to the diagonal. For the Trans-
former attention result in Figure 7a, a fragmented distribution is presented, where only ∆α
close to 0.5 can obtain an acceptable ∆JS for training, but the loss is considerable, which
means that improvement is still needed. For the improved attention of the proposed
method, the results show that the refined network can help to correctly align the most
intuitive token i. This can be found in Figure 7b, where CAAtt results tend to project on the
diagonal, while also exhibiting a normal distribution along the diagonal. Next, Figure 7c,d
present the distribution results of applying the CAEmbed layer, which significantly
enhances the concentration of the distribution results on the diagonal, indicating phrase-
related attention weights. The above analysis shows that they bring interpretable
attention in a more intuitive way. Finally, the proposed CAAtt + CAEmbed + L1 (exp( Nx )) frame-
work is more powerful in optimizing the attention mechanism in NMT tasks in terms of training
time and BLEU scores.



Figure 7. The visualization of ∆JS divergence of the attention distribution against ∆α between the most
important token i and the randomly drawn token r. The results from left to right correspond to the
networks of (a) Transformer, (b) CAAtt, (c) CAAtt + CAEmbed and (d) CAAtt + CAEmbed + L1 (exp( Nx )),
respectively.

7. Conclusions
In this work, we introduce an advanced network architecture called CARU-based Con-
tent Adaptive Attention (CAAtt) and CARU-based Embedding (CAEmbed) layer for Neural
Machine Translation (NMT) and word embedding tasks. In CAAtt, we propose a novel
approach that connects decoded embedding features and generates new source represen-
tations based on content-adaptive gates contributed by CARU. This enables the attention
mechanism to focus on the current context of target sentences, enhancing translation quality.
To further improve training convergence, we introduce new normalization methods that
enhance the gradient of attention weights during processing. These methods, particularly
L1 (exp( Nx )) normalization, significantly speed up training convergence. Additionally,
CAEmbed uses CARU to refine embeddings through a partial recurrent layer, which performs
a short recurrence within the middle of sentences instead of receiving the whole
sequence, allowing it to handle the predicate with local attention and avoid long-term data
transfer between recurrent layers. This enhances the source representation associated with
previously decoded states, especially for phrases. Our in-depth analysis reveals that our
model effectively addresses long-term dependencies between sentences, further improving
translation quality. Experimental results demonstrate that CAAtt and CAEmbed outperform
state-of-the-art models, leading to improved BLEU scores. The utilization of the proposed
normalization methods also contributes to faster convergence during training. These findings
indicate that attention-based architectures can be greatly enhanced by our proposed methods,
resulting in a noticeable increase in accuracy.
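As a hedged illustration of the normalization summarized above (the exact definition of L1(exp(Nx)) is given earlier in the paper and may differ in detail), the following sketch assumes that the raw attention scores x are scaled by a factor N, exponentiated, and divided by their L1 norm, which sharpens the resulting weight distribution, and hence its gradient, relative to a plain softmax when N > 1.

```python
import torch

def l1_exp_normalize(scores: torch.Tensor, n: float = 2.0, dim: int = -1) -> torch.Tensor:
    """Normalize attention scores as exp(N * x) / ||exp(N * x)||_1 along `dim`.
    The scaling factor `n` and this exact formulation are illustrative assumptions."""
    # Subtract the per-row maximum for numerical stability before exponentiating.
    scaled = torch.exp(n * (scores - scores.max(dim=dim, keepdim=True).values))
    return scaled / scaled.sum(dim=dim, keepdim=True)
```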
In our forthcoming study, we will refine the tone of NMT outputs to suit various
contextual expressions. The goal is to provide translations that are more natural, fluent,
contextual, and objective. Such adjustments help to enhance the quality of the translation,
making it more akin to human expression. For example, a more formal and objective tone
is preferred in business documents, while a more casual and relaxed tone is suitable for
informal chat dialogues.
Author Contributions: Conceptualization, K.-H.C. and S.-K.I.; methodology, K.-H.C. and S.-K.I.;
software, K.-H.C.; validation, S.-K.I.; formal analysis, K.-H.C. and S.-K.I.; investigation, K.-H.C.;
resources, K.-H.C. and S.-K.I.; data curation, K.-H.C. and S.-K.I.; writing—original draft preparation,
K.-H.C.; writing—review and editing, S.-K.I.; visualization, K.-H.C.; supervision, S.-K.I.; project
administration, S.-K.I.; funding acquisition, S.-K.I. All authors have read and agreed to the published
version of the manuscript.
Funding: This work is supported by the Macao Polytechnic University (Research Project RP/FCA-
06/2023).
Data Availability Statement: Data are contained within the article.
Acknowledgments: We would like to thank Laurie Cuthbert for English language editing.
Conflicts of Interest: The authors declare no conflicts of interest.

References
1. Wang, X.; Lu, Z.; Tu, Z.; Li, H.; Xiong, D.; Zhang, M. Neural Machine Translation Advised by Statistical Machine Translation. In
Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017. [CrossRef]
2. Chan, K.H.; Ke, W.; Im, S.K. CARU: A Content-Adaptive Recurrent Unit for the Transition of Hidden State in NLP. In Neural
Information Processing; Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 693–703. [CrossRef]
3. Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to Sequence Learning with Neural Networks. arXiv 2014, arXiv:1409.3215. [CrossRef]
4. Li, J.; Xiong, D.; Tu, Z.; Zhu, M.; Zhang, M.; Zhou, G. Modeling Source Syntax for Neural Machine Translation. In Proceedings
of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017;
pp. 688–697. [CrossRef]
5. Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv 2014,
arXiv:1409.0473. [CrossRef]
6. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need.
arXiv 2017, arXiv:1706.03762. [CrossRef]
7. Liu, J.; Zhang, Y. Attention Modeling for Targeted Sentiment. In Proceedings of the 15th Conference of the European Chapter of
the Association for Computational Linguistics, Valencia, Spain, 3–7 April 2017. [CrossRef]
8. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A method for automatic evaluation of machine translation. In Proceedings
of the 40th Annual Meeting on Association for Computational Linguistics—ACL '02, Philadelphia, PA, USA, 7–12 July 2002.
[CrossRef]
9. Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations
using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in
Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014. [CrossRef]
10. Wang, X.X.; Zhu, C.H.; Li, S.; Zhao, T.J.; Zheng, D.Q. Neural machine translation research based on the semantic vector of the
tri-lingual parallel corpus. In Proceedings of the 2016 International Conference on Machine Learning and Cybernetics (ICMLC),
Jeju, Republic of Korea, 10–13 July 2016. [CrossRef]
11. Garg, S.; Peitz, S.; Nallasamy, U.; Paulik, M. Jointly Learning to Align and Translate with Transformer Models. In Proceedings
of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on
Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–9 November 2019. [CrossRef]
12. Niu, Z.; Zhong, G.; Yu, H. A review on the attention mechanism of deep learning. Neurocomputing 2021, 452, 48–62. [CrossRef]
13. Luong, T.; Pham, H.; Manning, C.D. Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of the
2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; pp. 1412–1421.
[CrossRef]
14. Fan, H.; Zhang, X.; Xu, Y.; Fang, J.; Zhang, S.; Zhao, X.; Yu, J. Transformer-based multimodal feature enhancement networks for
multimodal depression detection integrating video, audio and remote photoplethysmograph signals. Inf. Fusion 2024, 104, 102161.
[CrossRef]
15. Huang, P.Y.; Liu, F.; Shiang, S.R.; Oh, J.; Dyer, C. Attention-based Multimodal Neural Machine Translation. In Proceedings of
the First Conference on Machine Translation: Volume 2, Shared Task Papers, Berlin, Germany, 11–12 August 2016; pp. 639–645.
[CrossRef]
16. Tu, Z.; Lu, Z.; Liu, Y.; Liu, X.; Li, H. Modeling Coverage for Neural Machine Translation. In Proceedings of the 54th Annual
Meeting of the Association for Computational Linguistics, Berlin, Germany, 7–12 August 2016; pp. 76–85. [CrossRef]
17. Kazimi, M.B.; Costa-jussà, M.R. Coverage for Character Based Neural Machine Translation. Proces. del Leng. Nat. 2017, 59, 99–106.
18. Cheng, R.; Chen, D.; Ma, X.; Cheng, Y.; Cheng, H. Intelligent Quantitative Safety Monitoring Approach for ATP Using LSSVM
and Probabilistic Model Checking Considering Imperfect Fault Coverage. IEEE Trans. Intell. Transp. Syst. 2023, Early Access.
19. Mi, H.; Sankaran, B.; Wang, Z.; Ittycheriah, A. Coverage Embedding Models for Neural Machine Translation. In Proceedings of
the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016; pp. 955–960.
[CrossRef]
20. Douzon, T.; Duffner, S.; Garcia, C.; Espinas, J. Long-Range Transformer Architectures for Document Understanding. In Document
Analysis and Recognition—ICDAR 2023 Workshops; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2023;
pp. 47–64. [CrossRef]
21. Tang, G.; Müller, M.; Rios, A.; Sennrich, R. Why Self-Attention? A Targeted Evaluation of Neural Machine Translation
Architectures. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium,
31 October–4 November 2018; pp. 4263–4272. [CrossRef]
22. Yang, Z.; Hu, Z.; Deng, Y.; Dyer, C.; Smola, A. Neural Machine Translation with Recurrent Attention Modeling. In Proceedings of
the 15th Conference of the European Chapter of the Association for Computational Linguistics, Valencia, Spain, 3–7 April 2017;
pp. 383–387. [CrossRef]
23. Mondal, S.K.; Zhang, H.; Kabir, H.D.; Ni, K.; Dai, H.N. Machine translation and its evaluation: A study. Artif. Intell. Rev. 2023, 56,
10137–10226. [CrossRef]
24. Cohn, T.; Hoang, C.D.V.; Vymolova, E.; Yao, K.; Dyer, C.; Haffari, G. Incorporating Structural Alignment Biases into an Attentional
Neural Translation Model. In Proceedings of the 2016 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 12–17 June 2016. [CrossRef]
25. Rosendahl, J.; Herold, C.; Petrick, F.; Ney, H. Recurrent Attention for the Transformer. In Proceedings of the Second Workshop on
Insights from Negative Results in NLP, Online and Punta Cana, Dominican Republic, 1 November 2021; pp. 62–66. [CrossRef]
26. Yazar, B.K.; Şahın, D.Ö.; Kiliç, E. Low-Resource Neural Machine Translation: A Systematic Literature Review. IEEE Access 2023,
11, 131775–131813. [CrossRef]
27. Zhang, B.; Xiong, D.; Su, J. Neural Machine Translation with Deep Attention. IEEE Trans. Pattern Anal. Mach. Intell. 2020,
42, 154–163. [CrossRef] [PubMed]
28. Vishnu, P.R.; Vinod, P.; Yerima, S.Y. A Deep Learning Approach for Classifying Vulnerability Descriptions Using Self Attention
Based Neural Network. J. Netw. Syst. Manag. 2021, 30, 9. [CrossRef]
29. Sethi, N.; Dev, A.; Bansal, P.; Sharma, D.K.; Gupta, D. Enhancing Low-Resource Sanskrit-Hindi Translation through Deep
Learning with Ayurvedic Text. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2023. [CrossRef]
30. Shan, Y.; Feng, Y.; Shao, C. Modeling Coverage for Non-Autoregressive Neural Machine Translation. In Proceedings of the 2021
International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021. [CrossRef]
31. Zhou, L.; Zhang, J.; Zong, C. Improving Autoregressive NMT with Non-Autoregressive Model. In Proceedings of the First
Workshop on Automatic Simultaneous Translation, Seattle, WA, USA, 9–10 July 2020. [CrossRef]
32. Wu, L.; Tian, F.; Qin, T.; Lai, J.; Liu, T.Y. A study of reinforcement learning for neural machine translation. arXiv 2018,
arXiv:1808.08866.
33. Aurand, J.; Cutlip, S.; Lei, H.; Lang, K.; Phillips, S. Deep Q-Learning for Decentralized Multi-Agent Inspection of a Tumbling
Target. J. Spacecr. Rocket. 2024, 1–14. [CrossRef]
34. Kumari, D.; Ekbal, A.; Haque, R.; Bhattacharyya, P.; Way, A. Reinforced nmt for sentiment and content preservation in
low-resource scenario. Trans. Asian Low-Resour. Lang. Inf. Process. 2021, 20, 1–27. [CrossRef]
35. Bengio, Y.; Simard, P.; Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw.
1994, 5, 157–166. [CrossRef]
36. Trinh, T.H.; Dai, A.M.; Luong, M.T.; Le, Q.V. Learning Longer-term Dependencies in RNNs with Auxiliary Losses. arXiv 2018,
arXiv:1803.00144. [CrossRef]
37. Houdt, G.V.; Mosquera, C.; Nápoles, G. A review on the long short-term memory model. Artif. Intell. Rev. 2020, 53, 5929–5955.
[CrossRef]
38. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [CrossRef] [PubMed]
39. Cho, K.; van Merrienboer, B.; Bahdanau, D.; Bengio, Y. On the Properties of Neural Machine Translation: Encoder–Decoder
Approaches. In Proceedings of the SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, Doha,
Qatar, 25 October 2014. [CrossRef]
40. Dey, R.; Salem, F.M. Gate-variants of Gated Recurrent Unit (GRU) neural networks. In Proceedings of the 2017 IEEE 60th
International Midwest Symposium on Circuits and Systems (MWSCAS), Boston, MA, USA, 6–9 August 2017. [CrossRef]
41. Zhang, B.; Xiong, D.; Xie, J.; Su, J. Neural Machine Translation With GRU-Gated Attention Model. IEEE Trans. Neural Netw. Learn.
Syst. 2020, 31, 4688–4698. [CrossRef] [PubMed]
42. Cao, Q.; Xiong, D. Encoding Gated Translation Memory into Neural Machine Translation. In Proceedings of the 2018 Conference
on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018. [CrossRef]
43. Chan, K.H.; Im, S.K.; Ke, W. Multiple classifier for concatenate-designed neural network. Neural Comput. Appl. 2021, 34, 1359–1372.
[CrossRef]
44. Ranjan, R.; Castillo, C.D.; Chellappa, R. L2-constrained Softmax Loss for Discriminative Face Verification. arXiv 2017,
arXiv:1703.09507. [CrossRef]
45. Lita, L.V.; Ittycheriah, A.; Roukos, S.; Kambhatla, N. tRuEcasIng. In Proceedings of the 41st Annual Meeting on Association for
Computational Linguistics—ACL’03, Sapporo, Japan, 7–12 July 2003. [CrossRef]
46. Sennrich, R.; Haddow, B.; Birch, A. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th
Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 7–12 August 2016. [CrossRef]
47. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch:
An Imperative Style, High-Performance Deep Learning Library. In Proceedings of the Advances in Neural Information Processing
Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, Vancouver, BC, Canada, 8–14
December 2019; Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.:
New York, NY, USA, 2019; Volume 32.
48. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. [CrossRef]
49. Takase, S.; Kiyono, S. Lessons on Parameter Sharing across Layers in Transformers. arXiv 2021, arXiv:2104.06022. [CrossRef]
50. Takase, S.; Kiyono, S. Rethinking Perturbations in Encoder-Decoders for Fast Training. In Proceedings of the 2021 Conference of
the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT
2021, Online, 6–11 June 2021. [CrossRef]
51. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer
Learning with a Unified Text-to-Text Transformer. arXiv 2019, arXiv:1910.10683. [CrossRef]
52. Kumar, G.; Foster, G.; Cherry, C.; Krikun, M. Reinforcement learning based curriculum optimization for neural machine translation.
arXiv 2019, arXiv:1903.00041.
53. Elliott, D.; Frank, S.; Sima’an, K.; Specia, L. Multi30K: Multilingual English-German Image Descriptions. In Proceedings of the
5th Workshop on Vision and Language, Berlin, Germany, 12 August 2016. [CrossRef]
54. Fuglede, B.; Topsoe, F. Jensen-Shannon divergence and Hilbert space embedding. In Proceedings of the International Symposium
on Information Theory, ISIT 2004, Chicago, IL, USA, 27 June–2 July 2004. [CrossRef]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.