Article
Neural Machine Translation with CARU-Embedding Layer and
CARU-Gated Attention Layer
Sio-Kei Im 1,2 and Ka-Hou Chan 1,2, *
Abstract: The attention mechanism performs well for the Neural Machine Translation (NMT) task,
but heavily depends on the context vectors generated by the attention network to predict target
words. This reliance raises the issue of long-term dependencies. Indeed, it is very common to combine
predicates with postpositions in sentences, and the same predicate may have different meanings
when combined with different postpositions. This usually poses an additional challenge to the NMT
study. In this work, we observe that the embedding vectors of different target tokens can be classified
by part-of-speech; thus, we analyze the Natural Language Processing (NLP)-related Content-Adaptive
Recurrent Unit (CARU) and apply it to our attention model (CAAtt) and embedding layer
(CAEmbed). By encoding the source sentence with the current decoded feature through CARU,
CAAtt is capable of achieving translation-content-adaptive representations, whose attention weights
are refined and enhanced by our proposed L1(exp(Nx)) normalization. Furthermore, CAEmbed
aims to alleviate long-term dependencies in the target language through a partially recurrent design,
performing feature extraction from a local perspective. Experiments on the WMT14, WMT17, and
Multi30k translation tasks show that the proposed model achieves improvements in BLEU scores
and enhancement of convergence over the attention-based plain NMT model. We also investigate
the attention weights generated by the proposed approaches, which indicate that refinement over
the different combinations of adposition can lead to different interpretations. Specifically, this work
provides local attention to some specific phrases translated in our experiment. The results demonstrate
that our approach is effective in improving performance and achieving a more reasonable attention
distribution compared to the state-of-the-art models.

Keywords: neural network; Neural Machine Translation (NMT); Natural Language Processing (NLP); attention mechanism; Content-Adaptive Recurrent Unit (CARU)

MSC: 68T07; 68T50
the encoder and decoder. By dynamically detecting the pertinent source word to predict the
forthcoming target word, it generates a context vector. Intuitively, different target words
will align to different source words, resulting in varied context vectors during decoding.
These context vectors must be discriminative enough to predict the target words accurately,
or the same target words may be repeatedly generated. However, this is typically not the
case in practice, even when the attended source words are relevant. We note that the context
vectors are highly similar to each other, with minor variations in each dimension across
decoding steps. This indicates that the Vanilla Attention mechanism does not accurately
differentiate between various translation predictions. We believe that the explanation for
this is linked to the configuration of the attention mechanism, which provides a weighted
sum of source representations (hidden features from the encoder) that are invariant across
decoding steps.
Figure 1 illustrates the enhancement in the quality of translation over the past few
years, as measured by the BLEU score [8] (the most commonly used metric for evaluating
machine translation systems is BLEU-4), along with the corresponding NMT models. The
performance is greatly improved by applying the attention approach under high-resource
conditions (one of the most popular benchmark suites for machine translation is the WMT
family of datasets, most commonly WMT14 and WMT16).
Therefore, we aim to explore the practicality of extending attention models for enhanc-
ing the capacity of these vectors. This promising approach can significantly enhance the
translation performance of the model by boosting its discriminative capability.
[Figure 1 plot: BLEU-4 score (24–34) versus year (2018–2021) for Transformer Big, T5-11B + Attention, Attention + Rep(Uni), and Attention + Transformer Cycle (Rev).]
Figure 1. BLEU-4 scores for various NMT models. The “Attention” architectures have contributed
to major improvements in machine translation. The attention-related models outperformed the
other models, with only the attention-based approaches scoring greater than 30.0 (as measured by
case-insensitive BLEU-4).
In this work, we introduce a new CARU-gated attention layer (CAAtt) and a CARU-
Embedding (CAEmbed) layer for decoded word embedding in NMT. The overall framework
of our model, as illustrated in Figure 2, presents the structural and feature connections
between the CAAtt and CAEmbed layers. In particular, CAAtt expands the Vanilla
Attention network by inserting a gating layer based on the concept of Content-Adaptive
Recurrent Unit (CARU) [2]. CARU utilizes the original source representation as its history
and the corresponding previous decoder state as its current input. In this way, CAAtt can
focus on creating source representations that take into account translation effects. This
helps enhance the discriminative power of the context vector, making it more effective in
predicting the subsequent target term. By considering the impact of translation, the model
can better understand the context and make more accurate predictions. Afterward, CAEmbed
enhances word embedding by combining it with part-CARU. This integration involves
processing only the short-term hidden state(s) in a partial loop. This technique optimizes
the key information present in the embedded vector. Specifically, it aims to reduce the
reliance on punctuation and increase the adaptation of relevant keywords. This approach
proves to be particularly advantageous for non-English sentences, where the structure and
grammar can differ significantly from English. It is also beneficial for languages that utilize
postpositions extensively. Through the combination of CAAtt and CAEmbed, the translation
process can be improved, leading to more accurate and effective results.
Figure 2. An overview of the Vanilla Attention network with proposed layers of CAAtt and CAEmbed.
The source and target sides are denoted by blue and yellow, and the green and orange colors represent
the (embedded) information flow for target word prediction and attention, respectively. The red color
indicates the CARU-gated layer. “ATT” represents the procedure for calculating attention weights.
⃗st−1 refers to the previous decoder state, corresponding to the current step t. More details about the
equations can be found in Sections 4 and 5.
Considering that CARU has the ability to control/adjust the feature flow between the
weight and the current hidden state using its content-adaptive gate and update gate, we
propose CAAtt, which takes the previous decoder states $\vec{s}_{<t}$ as short-term history
and the original source representation as the input feature. Furthermore,
to enhance word prediction accuracy, a context-search of part-CARU has been incorporated.
Both proposed layers are straightforward and efficient for the training and decoding of
NMT. The validation is conducted on both the Multi30k and WMT14 datasets to assess
their performance in English–German translation tasks. The experimental results report
that the proposed models outperform the attention-based plain NMT significantly. The gen-
erated attention weights and context vectors were also scrutinized, demonstrating that the
attention weights are more precise and context vectors are more discriminative.
The proposed model enhances the widely used attention networks and also improves
its convergence speed and performance. Our contributions are summarised as follows:
• We investigate the features of the context vectors produced by the Vanilla Attention
model to fully analyze their ability to discriminate between discourse and part-of-
speech, and find that the underlying cause is that decoding-invariant source representations,
such as the weight of punctuation, usually tend to dilute the information in the
entire sentence. We develop a CARU-gated attention (CAAtt) layer that dynamically
adjusts and refines the source representation based on partial translations.
• In order to increase the convergence speed, we introduce a normalization method and
incorporate it into the calculation of the attention weights. We also analyze its performance
and give a complete derivation procedure. Compared to Softmax, it provides stronger
gradients when there are numerous categories being predicted.
• In addition, the inference of the current predicate is also highly correlated with the next
word. In particular, various combinations of adpositions can lead to different interpretations.
We introduce an alternative layer (CAEmbed) consisting of the embedding and the
proposed part-CARU. It weights the current embedded vector in order to
enhance the consistency of target sentences.
• Several experiments on English–German translation tasks have been conducted. The re-
sults show that our model outperforms the baseline in dealing with phrases, generating
accurate attention weights, expanding the variability of the context vector, and im-
proving translation quality.
2. Related Work
In the early years, Sequence-to-Sequence (Seq2Seq) learning models were originally
proposed without an attention mechanism, relying mainly on an encoder
to project all semantic features of the source side into a fixed-length
vector [3,9,10]. Refs. [5,11] indicate that using a fixed-length vector is insufficient for
representing natural phrases. To address this, they introduced the attention mechanism
for neural machine translation, which enables an automatic search for the parts of the source
sentence relevant to the next target word to be predicted. In practice, attention-based
models have demonstrated substantial improvements, outperforming other models in
machine translation tasks [12]. Also, ref. [13] investigates several effective approaches to
attentional weighting functions, applying local and global attention models in one-shot
modality, and then [14,15] enhance the attention-based NMT model by incorporating
information from multiple modalities. Refs. [16,17] introduce a coverage vector in order
to keep track of the history of attentional features, which allows the attention network
to represent the source sentences more intelligently, together with a coverage model
that tracks the coverage status using a full-coverage embedded vector [18,19]. In addition,
refs. [20,21] introduce the self-attention approach, which reduces the number of sequential
computations through short paths between distant words in entire paragraphs, and conclude that
these short paths are particularly useful for learning strong semantic feature extractors.
To address its long-term issue between sentences, ref. [22] introduces recursion along
the context vector to help adjust future attention. Ref. [23] extends the (cross-)attention
mechanism by a recurrent connection to allow direct access to previous alignment decisions,
which incorporates several structural biases to improve the attention-based model by
involving Markov conditions, fertility, and consistency in the direction of translation [24,25].
Refs. [26,27] propose deep attention based on low-level attentional information that
can automatically determine the refinement of attention weights in a layer-wise manner.
More recently, refs. [28,29] propose a self-attention deep NMT model with a text mining
approach to identify the vulnerability category from paragraphs. All these models tend to
investigate the variability of context vectors by generating more accurate attention weights.
Instead, our approach enables dynamic adjustment and re-sorting of source representations
based on partial translations. This is distinct from the existing literature and can be integrated to
enhance context vectors’ discriminative power. Additionally, the autoregressive model [30]
is another closely related approach, which introduces a better way to train the NMT model
by using the regression of features on themselves and directly supervising the attention
weights of the NMT using well-trained word alignment. Also, ref. [31] treats the source
representation as a memory/feature and models it when dealing with different time series
over the long term, providing significant flexibility in handling the decoder and
the various time-series patterns of that memory during translation. Moreover,
advanced Deep Reinforcement Learning (DRL) algorithms can also be
applied to NLP [32]; for example, Q-Learning has demonstrated significant potential for
NMT [33,34].
Inspired by the above approach, our proposed model treats the source representation
as a memory/feature and simulates the interaction between the decoder and this memory
through read and write operations during the translation process. This work aims to
highlight that employing the proposed composition of CARU (CAAtt and CAEmbed) in our
model can improve the efficiency of training and decoding, rather than merely addressing
the interactive attention content from the read operation. CAAtt can be seen as an extension
of Vanilla Attention, offering the advantage of alleviating gradient vanishing or exploding
problems during training [35,36]. It also simplifies LSTM [37,38] and presents a variant
design for GRU [39,40], enabling efficient computation. Additionally, the gate mechanism
in CAAtt is built on CARU. CARU is a recurrent unit that utilizes a content-adaptive gate and an update
gate to manage the importance of information from the hidden states and the current input
feature, respectively [2]. Within NMT, CARU has been proposed as both an encoder and
decoder function for the Seq2Seq model [2], and the option of using a gate design as an
alternative to the attention mechanism has also been presented [41,42]. To our knowledge,
the use of CARU as a gate mechanism has not been investigated previously.
3. Background
Similar to the basic Seq2Seq structure, the Vanilla Attention mechanism introduces
a context vector/module between the encoder and decoder. This context vector aims to
collect the output of all units as input features to compute the probability distribution of
the source language (embedded) words for each feature word that the decoder wants to
predict. By applying this mechanism, the aim is to discover the relationship between the
encoder and decoder. This connection enables the decoder to capture global information
to a certain extent, instead of inferring only from the current hidden state. The attention
module identifies the source words pertinent to the succeeding target word and assigns
them high attention weights when calculating the context vector $\vec{c}_t$ by
$$\vec{c}_t = \mathrm{ATT}\left(h, \vec{s}_{t-1}\right) = \sum_i \vec{\alpha}_{t,i}\,\vec{h}_i, \tag{1}$$
where ATT denotes the alignment function that uses a feedforward neural network to
calculate the attention weights $\vec{\alpha}_{t,i}$ of the encoder states $h = \left(\vec{h}_0, \vec{h}_1, \cdots, \vec{h}_t\right)$ with the previous
decoder state $\vec{s}_{t-1}$. For a target word instance, it first reviews all encoder states to
compare the target and source words, with the aim of computing a score for each state of
the encoder. Next, a Softmax function normalizes all scores and produces a probability
distribution conditioned on the target word, as follows:
$$\vec{\alpha}_{t,i} = \frac{\exp(\vec{e}_{t,i})}{\sum_k \exp(\vec{e}_{t,k})}. \tag{2}$$
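For reference, the sketch below illustrates this Vanilla Attention step in PyTorch (additive scoring followed by Softmax); the layer names and the two-layer scoring network are illustrative assumptions rather than the exact configuration used in this work.

```python
import torch
import torch.nn as nn


class VanillaAttention(nn.Module):
    """Minimal additive attention: scores e_{t,i} -> Softmax weights (2) -> context vector c_t (1)."""

    def __init__(self, hidden_size: int):
        super().__init__()
        # Feedforward alignment function ATT (illustrative two-layer form).
        self.score = nn.Sequential(
            nn.Linear(2 * hidden_size, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, h: torch.Tensor, s_prev: torch.Tensor):
        # h:      (batch, src_len, hidden) encoder states h_0..h_t
        # s_prev: (batch, hidden)          previous decoder state s_{t-1}
        s_exp = s_prev.unsqueeze(1).expand(-1, h.size(1), -1)
        e = self.score(torch.cat([h, s_exp], dim=-1)).squeeze(-1)  # e_{t,i}: (batch, src_len)
        alpha = torch.softmax(e, dim=-1)                           # Equation (2)
        c = torch.bmm(alpha.unsqueeze(1), h).squeeze(1)            # c_t = sum_i alpha_{t,i} h_i
        return c, alpha
```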
Although this model outperforms the others, we find that (2) employs the
Softmax function, meaning that the Vanilla Attention mechanism also inherits the shortcomings
of Softmax during the training process [43,44]: convergence in the later stages
becomes very slow, and the resulting context vectors are very similar to each other, making
them insufficiently discriminative. We aim to improve the prediction and enhance the
convergence in the following sections.
4. CARU-Gated Attention
The Vanilla Attention mechanism is used to overcome the NMT problem by allowing
the network to refer back to the input sequence instead of encoding all the information
into a fixed-length vector, but it still has some shortcomings, as mentioned before. This
mechanism allows the embedded vector to access its internal memory/feature, which is
the hidden states generated by the encoder. In this interpretation, the network chooses to
retrieve something from the current feature of this encoder state, rather than considering
whether and how much it should be attended to. With this in mind, we attempt to investigate
the reasons behind this by analyzing the regression of the context vector ⃗ct in (3). Achieving
a one-to-one mapping between two different languages in a translation process is difficult,
not just for NMT, but also for SMT. Source words often align with multiple target words,
resulting in different sentence structures and sequences. This dilutes attention weights due
to the decoding steps between the source word and its corresponding target word position.
According to (1), the context vector $\vec{c}_t$ depends on the decoder state $\vec{s}_{t-1}$, which is
temporarily activated in the $(t-1)$-th recurrent step, while the attention weights are completely
dominated by the encoder states $h$ through (2). In practice, the attention mechanism in NMT learns
in an unsupervised manner without explicit prior knowledge about alignment, and there is
always a lot of redundant information dragging the alignment, such as punctuation and
conjunctions that often dilute the information of the encoder states h. Theoretically, this
redundant information can be identified from their part-of-speech (also determined by
pre-processing). We can thus attempt to reduce the proportion/weight
of redundant information in the encoder, making the major words/features more
obvious for subsequent consideration.
In order to achieve this improvement, CARU’s approach is well worth referring to in
our design. In each decoding step, these redundant features are mitigated by the proposed
CAAtt layer before the encoder states are input to the Vanilla Attention model. There are
two objectives that this proposed gate layer must achieve:
1. It must be able to dynamically adjust the weight of each embedded word in a recurrent step;
the adjusted encoder state should be more meaningful and represent the translation
context clearly, so that the decoder can extract useful context vectors for attention.
2. We aim to maintain the intention of (1) and therefore require that the feature
size and dimension produced by this layer be equal to those of the encoder state
(although the length is allowed to change); otherwise, it will cause data sparsity problems,
making training unstable and hard to converge.
In view of these considerations, the CARU-gated layer is introduced, which dynamically
adapts $h$ according to the previous decoder state $\vec{s}_{t-1}$ by employing a gate-controlled network,
expressed as:
$$g = \mathrm{CAAtt}\left(h, \vec{s}_{t-1}\right), \tag{4}$$
and the proposed layer then evolves (1) into
$$\vec{c}_t = \mathrm{ATT}\left(g, \vec{s}_{t-1}\right),$$
where the gated encoder states $\vec{g}_{t,i}$ are produced by the CARU-style gating operations in (5a) to (5d), in which $\odot$ denotes the Hadamard product, and $\sigma$ and $\phi$ denote the sigmoid and hyperbolic tangent activation functions, respectively.

For language/context vectors, humans and learning models can immediately capture the
meaning or pattern once they have learned it. Corresponding to NLP, in a
standard sentence, the subject, verb, and object should receive more attention before other
words are considered.

It can be found that there are two data flows
($\vec{n}_{t,i}$ and $\vec{z}_{t,i}$) that perform linear combinations of the current encoder state $\vec{h}_i$ and the
previous decoder state $\vec{s}_{t-1}$ in parallel. (5b) is referred to as the design of an update gate,
which results in ⃗nt,i having been connected to the end of the combination of weights with
encoder states. Besides, (5c) determines to what extent the original source information
can be used to combine partial translations. In particular, (5d) is the CARU feature of the
context-adaptive gate that defines how much of the original source information can be
retained. The proposed CAAtt also incorporates the advantage of CARU that the $\sigma(\vec{x}_{t,i})$
in (5d) can be considered as a tagging task that connects the weight with the
part-of-speech. For instance, the word-weight $\sigma(\vec{x}_{t,i})$ should be close to zero if $\vec{s}_{t-1}$
represents a punctuation mark (such as a full stop), implying:
$$\sigma(\vec{x}_{t,i}) \to \bar{0} \;\Longrightarrow\; \vec{l}_{t,i} = \sigma(\vec{x}_{t,i}) \odot \vec{z}_{t,i} \to \bar{0} \;\Longrightarrow\; \vec{g}_{t,i} = \left(\bar{1} - \vec{l}_{t,i}\right) \odot \vec{h}_i + \vec{l}_{t,i} \odot \vec{n}_{t,i} \to \vec{h}_i,$$
where the result of the content-adaptive gate $\vec{l}_{t,i}$ will be close to zero regardless of the content
weight $\vec{z}_{t,i}$, which means that the produced state $\vec{g}_{t,i}$ will converge to the encoder state
$\vec{h}_i$. It shows that low-weight words have less influence on the output. Finally, the use
of linear interpolation between $\vec{h}_i$ and $\vec{n}_{t,i}$ ensures that the refined source representations
satisfy the above-mentioned requirements: it alleviates the complex interactions between
source sentences and partial translations, allowing CAAtt to efficiently control the matching
and data flow between them.
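To make the gating concrete, a minimal PyTorch sketch of such a CARU-style gate is given below. The specific linear projections, their shapes, and the module name are assumptions for illustration based on the description above, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class CARUGate(nn.Module):
    """Sketch of the CAAtt gating (illustrative): refine encoder states h_i with s_{t-1}."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj_s = nn.Linear(hidden_size, hidden_size)              # x_{t,i}: projection of s_{t-1}
        self.proj_h = nn.Linear(hidden_size, hidden_size, bias=False)  # used in candidate n_{t,i}
        self.gate_h = nn.Linear(hidden_size, hidden_size, bias=False)  # update gate, encoder part
        self.gate_s = nn.Linear(hidden_size, hidden_size)              # update gate, decoder part

    def forward(self, h: torch.Tensor, s_prev: torch.Tensor) -> torch.Tensor:
        # h:      (batch, src_len, hidden) encoder states
        # s_prev: (batch, hidden)          previous decoder state s_{t-1}
        x = self.proj_s(s_prev).unsqueeze(1)                 # broadcast over the source length
        n = torch.tanh(self.proj_h(h) + x)                   # candidate combining source and translation
        z = torch.sigmoid(self.gate_h(h) + self.gate_s(s_prev).unsqueeze(1))  # update gate z_{t,i}
        l = torch.sigmoid(x) * z                             # content-adaptive gate l_{t,i}
        return (1.0 - l) * h + l * n                         # g_{t,i}: tends to h_i when l -> 0
```

When $\sigma(\vec{x}_{t,i})$ approaches zero (e.g., when $\vec{s}_{t-1}$ represents punctuation), the gate output falls back to the original $\vec{h}_i$, matching the behaviour derived above.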
[Figure (diagram): structure of the content-adaptive gate, combining the state $\vec{h}_t$ and the current input $\vec{s}_t$ through $\sigma$ and $\phi$ activations (with $\odot$, $\oplus$, and $1-$ operations) to produce $\vec{h}_{t+1}$.]
$$\frac{1}{\sum_k f_2(x_k)}\left(1 - \frac{f_2(x_i)}{\sum_k f_2(x_k)}\right)\frac{\partial f_2(x_i)}{\partial x_j} \geq \frac{1}{\sum_k f_1(x_k)}\left(1 - \frac{f_1(x_i)}{\sum_k f_1(x_k)}\right)\frac{\partial f_1(x_i)}{\partial x_j},$$

$$-\frac{1}{\sum_k f_2(x_k)}\cdot\frac{f_2(x_i)}{\sum_k f_2(x_k)}\cdot\frac{\partial f_2(x_i)}{\partial x_j} \geq -\frac{1}{\sum_k f_1(x_k)}\cdot\frac{f_1(x_i)}{\sum_k f_1(x_k)}\cdot\frac{\partial f_1(x_i)}{\partial x_j}.$$

By definition, $f_1(x) = \exp(x)$ ensures that its range belongs to $\mathbb{R}^+$, and the range of the
proposed $f_2(x) = \exp(Nx)$ also belongs to $\mathbb{R}^+$. We simplify these inequalities and obtain

$$\frac{\sum_k f_1(x_k)}{\sum_k f_2(x_k)}\cdot\frac{\sum_k f_1(x_k)}{\sum_k f_2(x_k)}\cdot\frac{\partial f_2(x_i)}{\partial x_j} \geq \frac{\sum_k f_1(x_k) - f_1(x_i)}{\sum_k f_2(x_k) - f_2(x_i)}\cdot\frac{\partial f_1(x_i)}{\partial x_j},$$

$$\frac{f_1(x_i)}{f_2(x_i)}\cdot\frac{\partial f_1(x_i)}{\partial x_j} \geq \frac{\sum_k f_1(x_k)}{\sum_k f_2(x_k)}\cdot\frac{\sum_k f_1(x_k)}{\sum_k f_2(x_k)}\cdot\frac{\partial f_2(x_i)}{\partial x_j}.$$

Rewrite as:
$$\frac{f_1(x_i)}{f_2(x_i)}\cdot\frac{\partial f_1(x_i)}{\partial x_j} \geq \frac{\sum_k f_1(x_k)}{\sum_k f_2(x_k)}\cdot\frac{\sum_k f_1(x_k)}{\sum_k f_2(x_k)}\cdot\frac{\partial f_2(x_i)}{\partial x_j} \geq \frac{\sum_k f_1(x_k) - f_1(x_i)}{\sum_k f_2(x_k) - f_2(x_i)}\cdot\frac{\partial f_1(x_i)}{\partial x_j}. \tag{7}$$

It thus can be found that (7) leads to the necessary condition:
$$\frac{f_1(x_i)}{f_2(x_i)}\cdot\frac{\partial f_1(x_i)}{\partial x_j} \geq \frac{\sum_k f_1(x_k) - f_1(x_i)}{\sum_k f_2(x_k) - f_2(x_i)}\cdot\frac{\partial f_1(x_i)}{\partial x_j} \;\Longrightarrow\; \frac{\sum_k f_2(x_k)}{f_2(x_i)} \geq \frac{\sum_k f_1(x_k)}{f_1(x_i)}.$$
It is obvious that $N$ always increases the gradient as long as $N > 1$, whether $i$ is equal
to $j$ or not. Based on the above justification, the proposed normalization method can be
integrated into the attention weights $\vec{\alpha}_{t,i}$, and it outperforms the Softmax function in (2).
Moreover, with respect to (6), it is worth mentioning that $\sum_j \frac{\partial}{\partial x_j} L_1(f(x))$ is
always equal to zero, which means that the normalization procedure (whether the proposed one,
Softmax, or others) only contributes to the convergence within the attention weights, allowing
for faster training, but providing no additional gradient to the final accuracy of the
overall network.
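As a small numerical sketch of the proposed normalization (the function name and the value of $N$ below are illustrative), note that L1-normalizing $\exp(Nx)$ is equivalent to applying Softmax to the scores scaled by $N$:

```python
import torch


def l1_exp_norm(scores: torch.Tensor, n: float = 2.0, dim: int = -1) -> torch.Tensor:
    """L1(exp(Nx)) normalization of attention scores (sketch, assuming N > 1)."""
    scaled = n * scores
    scaled = scaled - scaled.max(dim=dim, keepdim=True).values  # subtract max for numerical stability
    weights = torch.exp(scaled)
    return weights / weights.sum(dim=dim, keepdim=True)         # weights still sum to one


# Example: compared with Softmax (N = 1), the proposed normalization concentrates
# more mass on the highest score, which is the sharper behaviour argued for above.
scores = torch.tensor([[2.0, 1.0, 0.5, -1.0]])
print(torch.softmax(scores, dim=-1))
print(l1_exp_norm(scores, n=2.0))
```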
$$P\left(\vec{s}_{1:t} \mid \vec{c}_{1:t}\right) = \prod_{i=t-p}^{t} P\left(\vec{s}_i \mid \vec{s}_{1:i-1}, \vec{c}_{1:t}\right). \tag{8}$$
It can be seen that the conditional probability of the current $\vec{s}_t$ is only affected by
$p + 1$ vectors, unlike the benchmark NMT approach, which always takes into account all
previous $\vec{s}_{<t}$ and thus often leads to the long-term issue. Therefore, the probabilities obtained
from (8) can achieve more accurate results than others. Besides, the context vector $\vec{c}_t$ must
be entered as the last input factor because it is encoded by the attention model and
contains quite important information about the hidden state. We arrange it as the last input
in order to prevent it from being diluted by the other input features that follow. Intuitively,
the proposed CAEmbed layer consists of one CARU that partially encodes the previous $p$
embedded vectors $\vec{s} \in \left\{\vec{s}_{t-p}, \cdots, \vec{s}_{t-1}\right\}$ and the context vector $\vec{c}_t$ generated by the attention
model. The complete procedure of CAEmbed is summarized in Algorithm 1.
of (long-term) data transfer between recursive layers, which effectively alleviates data
transfer and storage overhead.
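To illustrate this partial recurrence, the sketch below encodes only the last $p$ embedded tokens and appends the context vector $\vec{c}_t$ as the final input. It is not Algorithm 1 itself: a GRU cell stands in for the CARU cell, and the names and shapes are assumptions based on the description above.

```python
import torch
import torch.nn as nn


class CAEmbedSketch(nn.Module):
    """Partial (local) recurrence over the last p embeddings plus the context vector c_t."""

    def __init__(self, vocab_size: int, hidden_size: int, p: int = 3):
        super().__init__()
        self.p = p
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.cell = nn.GRUCell(hidden_size, hidden_size)  # stand-in for a CARU cell

    def forward(self, prev_tokens: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # prev_tokens: (batch, t-1) previously decoded token ids; context: (batch, hidden) c_t
        window = prev_tokens[:, -self.p:]                  # only the last p tokens (local view)
        state = context.new_zeros(context.size(0), context.size(1))
        for step in range(window.size(1)):                 # short recurrence, not the whole prefix
            state = self.cell(self.embed(window[:, step]), state)
        return self.cell(context, state)                   # c_t enters last so it is not diluted
```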
and our work are also based on the attention mechanism, we have achieved good results
in general tasks. In practice, since the proposed CAEmbed layer performs only a short
recurrence in the middle of a sentence, instead of receiving the whole sequence, it can
perform well in predicate translation. In contrast, T5-11B is trained on a large dataset
and can cover more predicates with postpositions in the dataset, thus achieving the recent
results obtained in this work. Overall, the proposed composite model of CAAtt and CAEmbed
with the L1(exp(Nx)) approach yields average BLEU-4 scores of 32.09 and 34.31 on WMT14
and WMT17, respectively (note that the BLEU-4 scores are evaluated case-insensitively because
we apply Truecasing pre-processing). This is reasonable because CAAtt aims to improve
the encoding process of the attention mechanism, CAEmbed refines the
content adaptation of word embedding and decoding, and L1(exp(Nx)) enhances
their convergence, making the hidden states more discriminative.
Table 1. BLEU-4 scores with error ranges obtained from various translation models.
compensate and provide sufficient feedback to the recurrent network corresponding to the
red arrow in Figure 2.
Table 2. Selected example of translations generated by different models. We underline the phrases
and bold the interesting parts for investigation. The BLEU-4 scores reflect the better performance of
the proposed model.
This can be identified in the example of attention weights for attentional reinforcement
and phrase correction in Table 3. It is clear that the weights of punctuation are quite
small, always less than 0.02, and the next smallest are the conjunctions, between 0.02 and 0.04.
Such low weights mean that the translated result can still roughly express the original
meaning of the sentence, even if these words are lost during the translation process. The next
notable ones are adjectives and nouns, with weights around or above 0.2. This means that they
must not be lost in the translation process; otherwise, the meaning of the sentence will be
incomplete. The above are the attention weights obtained from the proposed CAAtt network.
Besides, the weights of phrase “looked up” have been enhanced by the CAEmbed layer.
As a result, the number and order of words between the source and target sentences are
allowed to differ due to the attention mechanism. In the proposed network, the current
translated word is connected to CAAtt as a feedback feature and acts on the CARU gate
of CAAtt through (5a) to (5c), thus improving the match between the original and target
word, and allowing a more concise and accurate allocation of attention. In practice, we find
that the transition from the initial weights to the final well-trained weights is stable and
continuous, with the major words in the training process obtaining more attention weights
as the iterations increase.
Table 3. Calculation of attention weights for the proposed CAAtt network based on the example
sentence for attentional reinforcement and phrase correction.
[Figure 4 plot: training loss (0.00–0.40) versus epoch (0–200) for Transformer, CAAtt, CAAtt+CAEmbed, and CAAtt+CAEmbed+L1(exp(Nx)).]
Figure 4. Convergence tests are performed on the Multi30k using various methods.
[Figure 5 plot: BLEU-4 scores (0.20–0.34) versus epoch (0–200) for Transformer, CAAtt, CAAtt+CAEmbed, and CAAtt+CAEmbed+L1(exp(Nx)).]
[Figure 6 plot: per-token gate weights under CAAtt and CAEmbed for the top-20 tokens: "a", ".", "in", "the", "on", "man", "is", "and", "of", "with", "woman", ",", "two", "are", "to", "people", "at", "an", "looking", "it".]
Figure 6. The weights of the top-20 tokens generated by CARU's content-adaptive gate in CAAtt and
CAEmbed, respectively.
$$\Delta \mathrm{JS} = \mathrm{JS}\left(p, q_i\right) - \mathrm{JS}\left(p, q_j\right), \tag{9}$$
where $\mathrm{JS}(p, q)$ is the JS divergence between the network's reference output distribution $p$ and the
output distribution $q$. Intuitively, the concentration of the output distribution after removing
$q_j$ should be obvious if $q_i$ is truly the most important token for $p$, and the obtained $\Delta\mathrm{JS}$ should
be positive. With respect to Figures 4 and 5, based on the proposed attention
network with the Multi30k dataset, we plot $\Delta\mathrm{JS}$ against $\Delta\alpha = \alpha_i - \alpha_j$.
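A short sketch of this measure is given below; the function names are illustrative, and the distributions $p$, $q_i$, $q_j$ are assumed to be the model's output distributions with and without the corresponding tokens.

```python
import torch


def js_divergence(p: torch.Tensor, q: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """Jensen-Shannon divergence between probability distributions along the last dim."""
    m = 0.5 * (p + q)
    kl_pm = (p * ((p + eps) / (m + eps)).log()).sum(dim=-1)
    kl_qm = (q * ((q + eps) / (m + eps)).log()).sum(dim=-1)
    return 0.5 * (kl_pm + kl_qm)


def delta_js(p: torch.Tensor, q_i: torch.Tensor, q_j: torch.Tensor) -> torch.Tensor:
    """Equation (9): Delta JS = JS(p, q_i) - JS(p, q_j), positive when token i matters more."""
    return js_divergence(p, q_i) - js_divergence(p, q_j)
```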
As illustrated in Figure 7, it can be found that these results are near the diagonal
line because the ideal case is $p = q_i$. For this reason, we define the loss as
$\sum \frac{|\Delta\alpha - \Delta\mathrm{JS}|}{\sqrt{2}}$, which is the sum of the distances from each point to the diagonal. For the Transformer
attention result in Figure 7a, a fragmented distribution is presented: only points with $\Delta\alpha$
close to 0.5 obtain an acceptable $\Delta\mathrm{JS}$ for training, but the loss is considerable, which
means that improvement is still needed. For the improved attention of the proposed
method, the results show that the refined network can help to correctly align the most
intuitive token $i$. This can be found in Figure 7b, where the CAAtt results tend to project onto the
diagonal, while also exhibiting a normal-like distribution along the diagonal. Next, Figure 7c,d
present the distribution results of applying the CAEmbed layer, which can significantly
enhance the concentration of the results on the diagonal, indicating phrases
related to the attention weights. The above analysis shows that they can provide interpretable
attention in a more intuitive way. Finally, the proposed CAAtt + CAEmbed + L1(exp(Nx)) framework
is more effective at optimizing the attention mechanism in NMT tasks in terms of training
time and BLEU scores.
7. Conclusions
In this work, we introduce an advanced network architecture called CARU-based Con-
tent Adaptive Attention (CAAtt) and CARU-based Embedding (CAEmbed) layer for Neural
Machine Translation (NMT) and word embedding tasks. In CAAtt, we propose a novel
approach that connects decoded embedding features and generates new source represen-
tations based on content-adaptive gates contributed by CARU. This enables the attention
mechanism to focus on the current context of target sentences, enhancing translation quality.
To further improve training convergence, we introduce new normalization methods that
enhance the gradient of attention weights during processing. These methods, particularly
L1 (exp( Nx )) normalization, significantly speed up training convergence. Additionally,
CAEmbed uses CARU to refine embeddings through a partial recurrent layer, which performs
a short recurrence within the middle of sentences instead of receiving the whole
sequence, allowing it to handle predicates with local attention and avoid long-term data
transfer between recurrent layers. This enhances the source representation associated with
previously decoded states, especially for phrases. Our in-depth analysis reveals that our
model effectively addresses long-term dependencies between sentences, further improving
translation quality. Experimental results demonstrate that CAAtt and CAEmbed outperform
state-of-the-art models, leading to improved BLEU scores. The utilization of the proposed
normalization methods also contributes to faster convergence during training. These find-
ings indicate that the attention mechanism architectures can be greatly enhanced using our
proposed methods, resulting in a noticeable increase in accuracy.
In our forthcoming study, we will refine the tone of NMT outputs to suit various
contextual expressions. The goal is to provide translations that are more natural, fluent,
contextual, and objective. Such adjustments help to enhance the quality of the translation,
making it more akin to human expression. For example, a more formal and objective tone
is preferred in business documents, while a more casual and relaxed tone is suitable for
informal chat dialogues.
Author Contributions: Conceptualization, K.-H.C. and S.-K.I.; methodology, K.-H.C. and S.-K.I.;
software, K.-H.C.; validation, S.-K.I.; formal analysis, K.-H.C. and S.-K.I.; investigation, K.-H.C.;
resources, K.-H.C. and S.-K.I.; data curation, K.-H.C. and S.-K.I.; writing—original draft preparation,
K.-H.C.; writing—review and editing, S.-K.I.; visualization, K.-H.C.; supervision, S.-K.I.; project
administration, S.-K.I.; funding acquisition, S.-K.I. All authors have read and agreed to the published
version of the manuscript.
Funding: This work is supported by the Macao Polytechnic University (Research Project RP/FCA-
06/2023).
Data Availability Statement: Data are contained within the article.
Acknowledgments: We would like to thank Laurie Cuthbert for English language editing.
Conflicts of Interest: The authors declare no conflicts of interest.
References
1. Wang, X.; Lu, Z.; Tu, Z.; Li, H.; Xiong, D.; Zhang, M. Neural Machine Translation Advised by Statistical Machine Translation. In
Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017. [CrossRef]
2. Chan, K.H.; Ke, W.; Im, S.K. CARU: A Content-Adaptive Recurrent Unit for the Transition of Hidden State in NLP. In Neural
Information Processing; Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 693–703. [CrossRef]
3. Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to Sequence Learning with Neural Networks. arXiv 2014, arXiv:1409.3215. [CrossRef]
4. Li, J.; Xiong, D.; Tu, Z.; Zhu, M.; Zhang, M.; Zhou, G. Modeling Source Syntax for Neural Machine Translation. In Proceedings
of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017 ;
pp. 688–697. [CrossRef]
5. Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv 2014,
arXiv:1409.0473. [CrossRef]
6. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need.
arXiv 2017, arXiv:1706.03762. [CrossRef]
7. Liu, J.; Zhang, Y. Attention Modeling for Targeted Sentiment. In Proceedings of the 15th Conference of the European Chapter of
the Association for Computational Linguistics, Valencia, Spain, 3–7 April 2017. [CrossRef]
8. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A method for automatic evaluation of machine translation. In Proceedings
of the 40th Annual Meeting on Association for Computational Linguistics—ACL '02, Philadelphia, PA, USA, 7–12 July 2002.
[CrossRef]
9. Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations
using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in
Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014. [CrossRef]
10. Wang, X.X.; Zhu, C.H.; Li, S.; Zhao, T.J.; Zheng, D.Q. Neural machine translation research based on the semantic vector of the
tri-lingual parallel corpus. In Proceedings of the 2016 International Conference on Machine Learning and Cybernetics (ICMLC),
Jeju, Republic of Korea, 10–13 July 2016. [CrossRef]
11. Garg, S.; Peitz, S.; Nallasamy, U.; Paulik, M. Jointly Learning to Align and Translate with Transformer Models. In Proceedings
of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on
Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–9 November 2019. [CrossRef]
12. Niu, Z.; Zhong, G.; Yu, H. A review on the attention mechanism of deep learning. Neurocomputing 2021, 452, 48–62. [CrossRef]
13. Luong, T.; Pham, H.; Manning, C.D. Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of the
2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; pp. 1412–1421.
[CrossRef]
14. Fan, H.; Zhang, X.; Xu, Y.; Fang, J.; Zhang, S.; Zhao, X.; Yu, J. Transformer-based multimodal feature enhancement networks for
multimodal depression detection integrating video, audio and remote photoplethysmograph signals. Inf. Fusion 2024, 104, 102161.
[CrossRef]
15. Huang, P.Y.; Liu, F.; Shiang, S.R.; Oh, J.; Dyer, C. Attention-based Multimodal Neural Machine Translation. In Proceedings of
the First Conference on Machine Translation: Volume 2, Shared Task Papers, Berlin, Germany, 11–12 August 2016; pp. 639–645.
[CrossRef]
16. Tu, Z.; Lu, Z.; Liu, Y.; Liu, X.; Li, H. Modeling Coverage for Neural Machine Translation. In Proceedings of the 54th Annual
Meeting of the Association for Computational Linguistics, Berlin, Germany, 7–12 August 2016; pp. 76–85. [CrossRef]
17. Kazimi, M.B.; Costa-jussà, M.R. Coverage for Character Based Neural Machine Translation. Proces. Del Leng. Nat. 2017, 59, 99–106.
18. Cheng, R.; Chen, D.; Ma, X.; Cheng, Y.; Cheng, H. Intelligent Quantitative Safety Monitoring Approach for ATP Using LSSVM
and Probabilistic Model Checking Considering Imperfect Fault Coverage. IEEE Trans. Intell. Transp. Syst. 2023, Early Access.
19. Mi, H.; Sankaran, B.; Wang, Z.; Ittycheriah, A. Coverage Embedding Models for Neural Machine Translation. In Proceedings of
the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016; pp. 955–960.
[CrossRef]
20. Douzon, T.; Duffner, S.; Garcia, C.; Espinas, J. Long-Range Transformer Architectures for Document Understanding. In Document
Analysis and Recognition—ICDAR 2023 Workshops; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2023;
pp. 47–64. [CrossRef]
21. Tang, G.; Müller, M.; Rios, A.; Sennrich, R. Why Self-Attention? A Targeted Evaluation of Neural Machine Translation
Architectures. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium,
31 October–4 November 2018; pp. 4263–4272. [CrossRef]
22. Yang, Z.; Hu, Z.; Deng, Y.; Dyer, C.; Smola, A. Neural Machine Translation with Recurrent Attention Modeling. In Proceedings of
the 15th Conference of the European Chapter of the Association for Computational Linguistics, Valencia, Spain, 3–7 April 2017;
pp. 383–387. [CrossRef]
23. Mondal, S.K.; Zhang, H.; Kabir, H.D.; Ni, K.; Dai, H.N. Machine translation and its evaluation: A study. Artif. Intell. Rev. 2023, 56,
10137–10226. [CrossRef]
24. Cohn, T.; Hoang, C.D.V.; Vymolova, E.; Yao, K.; Dyer, C.; Haffari, G. Incorporating Structural Alignment Biases into an Attentional
Neural Translation Model. In Proceedings of the 2016 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 12–17 June 2016. [CrossRef]
25. Rosendahl, J.; Herold, C.; Petrick, F.; Ney, H. Recurrent Attention for the Transformer. In Proceedings of the Second Workshop on
Insights from Negative Results in NLP, Online and Punta Cana, Dominican Republic, 1 November 2021; pp. 62–66. [CrossRef]
26. Yazar, B.K.; Şahın, D.Ö.; Kiliç, E. Low-Resource Neural Machine Translation: A Systematic Literature Review. IEEE Access 2023,
11, 131775–131813. [CrossRef]
27. Zhang, B.; Xiong, D.; Su, J. Neural Machine Translation with Deep Attention. IEEE Trans. Pattern Anal. Mach. Intell. 2020,
42, 154–163. [CrossRef] [PubMed]
28. Vishnu, P.R.; Vinod, P.; Yerima, S.Y. A Deep Learning Approach for Classifying Vulnerability Descriptions Using Self Attention
Based Neural Network. J. Netw. Syst. Manag. 2021, 30, 9. [CrossRef]
29. Sethi, N.; Dev, A.; Bansal, P.; Sharma, D.K.; Gupta, D. Enhancing Low-Resource Sanskrit-Hindi Translation through Deep
Learning with Ayurvedic Text. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2023. [CrossRef]
30. Shan, Y.; Feng, Y.; Shao, C. Modeling Coverage for Non-Autoregressive Neural Machine Translation. In Proceedings of the 2021
International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021. [CrossRef]
31. Zhou, L.; Zhang, J.; Zong, C. Improving Autoregressive NMT with Non-Autoregressive Model. In Proceedings of the First
Workshop on Automatic Simultaneous Translation, Seattle, WA, USA, 9–10 July 2020. [CrossRef]
32. Wu, L.; Tian, F.; Qin, T.; Lai, J.; Liu, T.Y. A study of reinforcement learning for neural machine translation. arXiv 2018,
arXiv:1808.08866.
33. Aurand, J.; Cutlip, S.; Lei, H.; Lang, K.; Phillips, S. Deep Q-Learning for Decentralized Multi-Agent Inspection of a Tumbling
Target. J. Spacecr. Rocket. 2024, 1–14. [CrossRef]
34. Kumari, D.; Ekbal, A.; Haque, R.; Bhattacharyya, P.; Way, A. Reinforced nmt for sentiment and content preservation in
low-resource scenario. Trans. Asian Low-Resour. Lang. Inf. Process. 2021, 20, 1–27. [CrossRef]
35. Bengio, Y.; Simard, P.; Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw.
1994, 5, 157–166. [CrossRef]
36. Trinh, T.H.; Dai, A.M.; Luong, M.T.; Le, Q.V. Learning Longer-term Dependencies in RNNs with Auxiliary Losses. arXiv 2018,
arXiv:1803.00144. [CrossRef]
37. Houdt, G.V.; Mosquera, C.; Nápoles, G. A review on the long short-term memory model. Artif. Intell. Rev. 2020, 53, 5929–5955.
[CrossRef]
38. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [CrossRef] [PubMed]
39. Cho, K.; van Merrienboer, B.; Bahdanau, D.; Bengio, Y. On the Properties of Neural Machine Translation: Encoder–Decoder
Approaches. In Proceedings of the SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, Doha,
Qatar, 25 October 2014. [CrossRef]
40. Dey, R.; Salem, F.M. Gate-variants of Gated Recurrent Unit (GRU) neural networks. In Proceedings of the 2017 IEEE 60th
International Midwest Symposium on Circuits and Systems (MWSCAS), Boston, MA, USA, 6–9 August 2017. [CrossRef]
41. Zhang, B.; Xiong, D.; Xie, J.; Su, J. Neural Machine Translation With GRU-Gated Attention Model. IEEE Trans. Neural Netw. Learn.
Syst. 2020, 31, 4688–4698. [CrossRef] [PubMed]
42. Cao, Q.; Xiong, D. Encoding Gated Translation Memory into Neural Machine Translation. In Proceedings of the 2018 Conference
on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018. [CrossRef]
43. Chan, K.H.; Im, S.K.; Ke, W. Multiple classifier for concatenate-designed neural network. Neural Comput. Appl. 2021, 34, 1359–1372.
[CrossRef]
44. Ranjan, R.; Castillo, C.D.; Chellappa, R. L2-constrained Softmax Loss for Discriminative Face Verification. arXiv 2017,
arXiv:1703.09507. [CrossRef]
45. Lita, L.V.; Ittycheriah, A.; Roukos, S.; Kambhatla, N. tRuEcasIng. In Proceedings of the 41st Annual Meeting on Association for
Computational Linguistics—ACL’03, Sapporo, Japan, 7–12 July 2003. [CrossRef]
46. Sennrich, R.; Haddow, B.; Birch, A. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th
Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 7–12 August 2016. [CrossRef]
47. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch:
An Imperative Style, High-Performance Deep Learning Library. In Proceedings of the Advances in Neural Information Processing
Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, Vancouver, BC, Canada, 8–14
December 2019 ; Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.:
New York, NY, USA, 2019; Volume 32.
48. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. [CrossRef]
49. Takase, S.; Kiyono, S. Lessons on Parameter Sharing across Layers in Transformers. arXiv 2021, arXiv:2104.06022. [CrossRef]
50. Takase, S.; Kiyono, S. Rethinking Perturbations in Encoder-Decoders for Fast Training. In Proceedings of the 2021 Conference of
the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT
2021, Online, 6–11 June 2021. [CrossRef]
51. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer
Learning with a Unified Text-to-Text Transformer. arXiv 2019, arXiv:1910.10683. [CrossRef]
52. Kumar, G.; Foster, G.; Cherry, C.; Krikun, M. Reinforcement learning based curriculum optimization for neural machine translation.
arXiv 2019, arXiv:1903.00041.
53. Elliott, D.; Frank, S.; Sima’an, K.; Specia, L. Multi30K: Multilingual English-German Image Descriptions. In Proceedings of the
5th Workshop on Vision and Language, Berlin, Germany, 12 August 2016. [CrossRef]
54. Fuglede, B.; Topsoe, F. Jensen-Shannon divergence and Hilbert space embedding. In Proceedings of the International Symposium
on Information Theory, ISIT 2004, Chicago, IL, USA, 27 June–2 July 2004. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.