
Continual Sequence Generation with Adaptive Compositional Modules

Yanzhe Zhang (Georgia Institute of Technology)    Xuezhi Wang (Google)    Diyi Yang (Georgia Institute of Technology)

Abstract

Continual learning is essential for real-world deployment when there is a need to quickly adapt the model to new tasks without forgetting knowledge of old tasks. Existing work on continual sequence generation either always reuses existing parameters to learn new tasks, which is vulnerable to catastrophic forgetting on dissimilar tasks, or blindly adds new parameters for every new task, which could prevent knowledge sharing between similar tasks. To get the best of both worlds, in this work we propose continual sequence generation with adaptive compositional modules, which adaptively adds modules in transformer architectures and composes both old and new modules for new tasks. We also incorporate pseudo experience replay to facilitate knowledge transfer in those shared modules. Experiment results on various sequences of generation tasks show that our framework can adaptively add modules or reuse modules based on task similarity, outperforming state-of-the-art baselines in terms of both performance and parameter efficiency. We make our code public at https://github.com/GT-SALT/Adaptive-Compositional-Modules.

Figure 1: Comparison between previous methods (a and b) and our proposed method (c), from a multi-layer transformer model perspective. The blue blocks refer to learnable modules and the yellow blocks refer to frozen pretrained modules. a: retrain the whole model every time new tasks arrive. b: insert task-specific modules for each task, while keeping the pretrained model frozen. c: detect reusable old modules and add new modules adaptively.

1 Introduction

Current state-of-the-art language generation models can achieve great performance on a wide range of sequence generation tasks (Radford et al., 2019; Lewis et al., 2020) with a static data distribution. However, real-world scenarios are often changing, which requires the model to learn with dynamic data distributions. In such cases of data distribution shift, current generation models often suffer from catastrophic forgetting (Sun et al., 2019): models completely and abruptly forget previously learned information upon learning new information. Continual learning (CL) (Ring, 1998; Thrun, 1998) has been introduced to improve a model's ability to learn tasks in a stream by mitigating forgetting and facilitating knowledge transfer (Lopez-Paz and Ranzato, 2017); however, continual sequence generation is relatively under-investigated.

Compared to continual learning on text classification and question answering (Wang et al., 2020; Holla et al., 2020; Huang et al., 2021), continual sequence generation is more challenging, since the output is no longer discrete labels but sequential text data in different styles/domains. Based on how they retain old knowledge while learning new tasks, current continual sequence generation methods can be categorized into two types. The first type continually learns new tasks on old parameters (Fig 1 a), with approaches like experience replay (Sun et al., 2019; Chuang et al., 2020) and regularization (Mi et al., 2020) to maintain old knowledge. However, since all tasks share the same parameters, some degree of interference between tasks is unavoidable.
Another line of work continually inserts new task-specific modules (adapters proposed by Houlsby et al., 2019) into every transformer layer for every new task while freezing the pretrained model and the modules used by old tasks (Fig 1 b, Madotto et al., 2021), which might prevent knowledge transfer between tasks and introduce possible parameter redundancy. In this work, we aim to get the best of both worlds: how to encourage the model to reuse modules from previous tasks as much as possible and to only add new modules if needed?

To this end, we propose continual sequence generation with adaptive compositional modules, as shown in Fig 1 c. Specifically, we introduce a two-stage process for every new coming task: a decision stage and a training stage. During the decision stage, we decide which modules to reuse and whether we need to add a new module. During the training stage, the model architecture is determined and fixed. We augment the new task's training process with pseudo experience replay (Sun et al., 2019) to further mitigate forgetting and facilitate knowledge transfer in those shared layers. Our model architecture is adaptive, as it can automatically add new modules for dissimilar tasks and reuse modules for similar tasks, thus making it robust to different scenarios of continual learning. Furthermore, it is compositional because, for every new task, our new architecture is composed of reused modules from old tasks and newly added modules, which allows knowledge reuse and transfer.

To evaluate the above adaptive compositional framework, we experiment with four representative sequence generation tasks following prior work (Sun et al., 2019; Chuang et al., 2020): natural language generation, SQL query generation, summarization and task-oriented dialogue arriving in a stream. Different from prior work that only tests their methods on very short task sequences or long task sequences with similar tasks only, we validate our approach on longer sequences containing diverse tasks with different levels of similarity. We believe this is a suitable scenario to validate both the model's ability to mitigate forgetting and its ability to facilitate knowledge transfer. In summary, this work makes two key contributions: (1) We propose continual sequence generation with adaptive compositional modules, to maximize knowledge transfer via module reuse while adaptively adding new modules to mitigate task interference and catastrophic forgetting. (2) Experiments on longer and more diverse task sequences show that our approach outperforms baselines with higher parameter efficiency.

2 Related Work

Continual Learning Without allocating new parameters for new tasks, prior work mainly leverages experience replay (Wang et al., 2019; Sun et al., 2019) and regularization to mitigate catastrophic forgetting. In experience replay, models are retrained on old examples from previous tasks while learning new tasks. Those old examples are usually stored in a fixed-size (Mi et al., 2020) or expanding (Huang et al., 2021) memory buffer. Besides replaying old examples, regularization on the hidden states (Wang et al., 2019; Han et al., 2020; Huang et al., 2021) or parameters (Mi et al., 2020) can be further added to prevent severe distortion. Another line of work creates new parameters for new tasks while freezing parameters used by old tasks. In computer vision, progressive neural networks (Rusu et al., 2016) continually add new branches of parameters for new image classification tasks, with lateral connections to facilitate forward knowledge transfer. Dynamically expandable networks (Yoon et al., 2017) expand neural networks at the neuron level by using regularization to restrict the number of added neurons. While allocating a big network in advance, PackNet (Mallya and Lazebnik, 2018) continually assigns a parameter subset to each task by network pruning. Li et al. (2019) employ neural architecture search (Liu et al., 2018) to optimize the new task's structure before learning new tasks. In the language domain, prior work often utilizes adapters (Houlsby et al., 2019; Madotto et al., 2021; Ermis et al., 2022), which can be considered task-specific MLPs inserted into frozen transformer layers. However, since all adapter modules are designed for only one specific task, no knowledge transfer is directly allowed in this case. Extra modules like attention modules (Pfeiffer et al., 2021), capsule networks (Ke et al., 2021), and hypernetworks (Jin et al., 2021) have been demonstrated to benefit knowledge transfer, but they need to introduce extra parameters and fail to consider any reusable or compositional modules.

Avoiding privacy concerns, this work also follows a line of work that does not store real examples for experience replay, such as generating examples by GAN (Atkinson et al., 2018), synthesizing examples (Xu et al., 2022) by model inversion (Smith et al., 2021b), and using unlabeled data in the learning environment (Smith et al., 2021a). In the language domain, LAMOL (Sun et al., 2019) trains the language model to solve current tasks and generate current training examples simultaneously, so that the model can generate "pseudo" old examples for replay before any new tasks. We adopt this pseudo experience replay to alleviate the forgetting in the shared modules of our approach.
Continual Learning for Sequence Generation Building on an auto-regressive language model, LAMOL (Sun et al., 2019) makes an initial exploration of continual sequence generation. On the basis of LAMOL, knowledge distillation (Chuang et al., 2020; Sun et al., 2020) is shown to be effective by improving knowledge transfer when changing tasks. ARPER (Mi et al., 2020) combines regularization on parameters (Kirkpatrick et al., 2017) with prioritized exemplar replay. Keeping the pretrained model frozen, Madotto et al. (2021) added task-specific modules for each task together with a perplexity-based classifier, without taking into account the potential for knowledge transfer between different tasks. Instead of blindly adding new modules for new tasks, our approach can detect reusable modules and strategically add new adapter modules in those layers in which reusing old modules would lead to severe forgetting. Without introducing extra knowledge transfer modules, our approach enables knowledge transfer via module sharing.

Task-specific Modules Traditional finetuning approaches (Peters et al., 2018; Devlin et al., 2019; Radford et al., 2019) usually modify all the parameters in large pretrained models while learning downstream tasks. Recently, a line of work has been proposed to improve the parameter efficiency of finetuning by inserting task-specific modules into frozen pretrained models. Adapter (Houlsby et al., 2019) inserts MLP layers into each transformer layer. PrefixTuning (Li and Liang, 2021) prepends key-value pairs to each transformer layer as activations. Prior work also shows that these task-specific modules might benefit from a more adaptive usage. For example, AdapterDrop (Rücklé et al., 2021) shows that removing adapters from lower transformer layers can almost maintain the original performance while reducing computational overhead. Guo et al. (2021) leveraged latent variables to decide whether to skip adapter modules in certain transformer layers to speed up decoding. However, our approach goes beyond the notion of "task-specific", recomposes reusable modules from different tasks, and learns compositional architectures for new coming tasks.

3 Background

Continual Generation Formulation Assuming multiple sequence generation tasks {T_1, ..., T_n} arrive in a stream, each task T_i has a set of training examples {P_1^i, P_2^i, ..., P_k^i}, where P_j^i denotes an (input, output) pair in task i. While learning on task T_i (i >= 2), we have no access to examples from previous tasks. The final goal is to optimize the model's average performance on all tasks after training on the whole sequence.

Finetuning In order to integrate different sequence generation tasks into a single framework, we use finetuning as a general strategy. On the basis of an autoregressive language model, the core idea is to feed the model the input and train the model to subsequently generate the corresponding output. To distinguish between tasks, we add an extra question following every input to describe the purpose of each task. For example, the question for natural language generation tasks is "What is the natural language form?" Formally, for each (input, question, output) triple, the model is optimized to generate the corresponding output given the input and question:

    L_finetune(x) = - \sum_{t=m+1}^{n} \log P(x_t | x_{<t})

where x = {x_1, ..., x_n} denotes the concatenation of input, question and output, and {x_1, ..., x_m} refers to the input and question.
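To make the objective above concrete, the following is a minimal PyTorch-style sketch of L_finetune, assuming a HuggingFace-style causal language model such as GPT-2 (the backbone used in this paper). The helper name finetune_loss and the single-example batching are illustrative assumptions, not the authors' released code; the essential point is that the cross-entropy is masked so it only covers the output tokens x_{m+1}, ..., x_n.

    # Illustrative sketch of L_finetune, not the authors' code.
    import torch

    def finetune_loss(model, tokenizer, input_text, question, output_text, device="cpu"):
        # x_1..x_m: input and question; x_{m+1}..x_n: output (plus an end-of-text token)
        prefix_ids = tokenizer.encode(input_text + " " + question)
        output_ids = tokenizer.encode(" " + output_text + tokenizer.eos_token)
        input_ids = torch.tensor([prefix_ids + output_ids], device=device)

        # -100 masks the (input, question) positions, so the built-in cross-entropy
        # equals -sum_{t=m+1..n} log P(x_t | x_<t) over the output tokens only.
        labels = torch.tensor([[-100] * len(prefix_ids) + output_ids], device=device)
        return model(input_ids=input_ids, labels=labels).loss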
Adapter The module used in our framework is the adapter (Houlsby et al., 2019), a task-specific module inserted into each frozen pretrained transformer layer (Vaswani et al., 2017). In addition to residual connections (He et al., 2016) and layer normalization (Ba et al., 2016), one transformer layer contains two primary sub-layers: an attention layer and a feed-forward layer. One adapter module consists of two multi-layer perceptrons (MLP), one (MLP_MH) following the multi-head attention layer and one (MLP_FF) following the feed-forward layer.

4 Two-Stage Methods

Motivated by prior continual sequence generation work (Madotto et al., 2021) that uses Adapter (Houlsby et al., 2019) to insert a new adapter module into every transformer layer for each new coming task, we propose to strategically decide whether we can reuse some adapter modules from old tasks before training on each new coming task, in a two-stage manner: a decision stage and a training stage, where the former determines the architecture for the new task and the latter trains the model.
4.1 Decision Stage

The decision stage aims to answer two questions: do we need to add a new module in this layer? If not, which old modules should we reuse? Inspired by interpolation-based data augmentation (Chen et al., 2020, 2021) and neural architecture search (Liu et al., 2018), we utilize Hidden State Mixing for module selection. Assuming there are several modules as potential candidates to be selected, after calculating their outputs separately, we compute their weighted average as the overall output, which is then passed to the next part of the model (see the left part of Figure 2). After training the entire model end-to-end, we assume that the module with the largest learned weight is the most useful one, and it will thus be selected for reuse.

Formally, assume that we have already inserted k modules into the l-th transformer layer, each consisting of two MLPs: (MLP_MH^{1,l}, MLP_FF^{1,l}), ..., (MLP_MH^{k,l}, MLP_FF^{k,l}). At the beginning of the decision stage, we add one more module (MLP_MH^{k+1,l}, MLP_FF^{k+1,l}). Given the learnable weight coefficients [λ_{1,l}, ..., λ_{k+1,l}], the multi-head attention layer output o_mh^l, and the feed-forward layer output o_ff^l, we mix the hidden states as follows:

    h_mh^l = \sum_{t=1}^{k+1} λ_{t,l} MLP_MH^{t,l}(o_mh^l)
    h_ff^l = \sum_{t=1}^{k+1} λ_{t,l} MLP_FF^{t,l}(o_ff^l)

where both h_mh^l and h_ff^l are then fed into their following Add & Norm layers. To ensure \sum_{t=1}^{k+1} λ_{t,l} = 1, we use the softmax function to produce λ_{1,l}, ..., λ_{k+1,l} from c_{1,l}, ..., c_{k+1,l}:

    λ_{i,l} = e^{c_{i,l}} / \sum_{t=1}^{k+1} e^{c_{t,l}},  i = 1, ..., k+1

Using this mixing approach in every transformer layer, we optimize our model using L_train (see Sec 4.2) for the new task and find the most suitable modules for each layer. Note that (i) in this process, the pretrained model and all old modules are frozen, and only the mixing coefficients and newly added modules are learned; (ii) calculating the weighted average is a convenient approximation of using one adapter at a time, which is the real setting during the training stage and inference; (iii) compared to the other baselines in Figure 1, the introduced decision stage does add extra computation to decide the architecture, while the computation of different MLPs at one position is parallelizable to speed it up.

To avoid the learned weight coefficients λ_{1,l}, ..., λ_{k+1,l} being too close to a uniform distribution in certain layers, we further add an additional regularization term to L_train, which is the sum of the entropy of every discrete probability distribution [λ_{1,l}, ..., λ_{k+1,l}]:

    L_entropy = γ \sum_{l} \sum_{i=1}^{k+1} -λ_{i,l} \log(λ_{i,l})

where γ is a coefficient tuned as a hyper-parameter.

In this stage, a trivial solution could be allocating a new module in every layer regardless of whether old modules are reusable. To avoid this trivial solution and reuse shareable modules as much as possible, we design a prior using the initialization of the coefficient weights. For every l, c_{1,l}, ..., c_{k,l} are initialized to c (c > 0), while c_{k+1,l} is initialized to -c. After softmax, the weight of each old module is e^{2c} times the weight of the new module, increasing the tendency to reuse old modules.

Figure 2: Our proposed model architecture with adaptive compositional modules for transformer layers. Assume that after learning three tasks (1, 2, 3) we have one module for task 1, and another for tasks 2 and 3, in this layer. Left: During the decision stage for task 4, we first insert a new module at this position; then all inserted modules are used for selection using hidden state mixing. Right: Assume that we finally decide to add one module at this position; then each task uses its own architecture during the training stage and inference.
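As a concrete illustration of the decision stage described above (and of the structure sketched in Figure 2), here is a self-contained PyTorch sketch of its pieces: bottleneck adapter MLPs, hidden state mixing with softmax-normalized coefficients (old modules initialized to +c, the candidate module to -c), the entropy regularizer, and argmax-based module selection. It is a re-implementation under stated assumptions (a (batch, seq, hidden) sub-layer output and the reduction factor of 16 from Appendix A), not the authors' released code.

    # Illustrative sketch of the decision stage (Section 4.1), not the authors' code.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AdapterMLP(nn.Module):
        """Bottleneck adapter MLP inserted after a transformer sub-layer
        (reduction factor 16, as in Appendix A)."""
        def __init__(self, hidden_size, reduction=16):
            super().__init__()
            self.down = nn.Linear(hidden_size, hidden_size // reduction)
            self.up = nn.Linear(hidden_size // reduction, hidden_size)

        def forward(self, h):
            # residual bottleneck: h + up(relu(down(h)))
            return h + self.up(F.relu(self.down(h)))

    class MixedAdapters(nn.Module):
        """k frozen old adapters plus one new candidate at a given position,
        mixed with softmax-normalized learnable coefficients."""
        def __init__(self, old_adapters, hidden_size, c=0.05):
            super().__init__()
            self.adapters = nn.ModuleList(list(old_adapters) + [AdapterMLP(hidden_size)])
            for old in self.adapters[:-1]:
                old.requires_grad_(False)      # old modules stay frozen
            # +c for old modules, -c for the new one: after softmax each old module
            # starts with e^{2c} times the weight of the new module
            self.coef = nn.Parameter(torch.tensor([c] * (len(self.adapters) - 1) + [-c]))

        def forward(self, sublayer_output):    # sublayer_output: (batch, seq, hidden)
            lam = F.softmax(self.coef, dim=0)  # lambda_{1,l} ... lambda_{k+1,l}
            outs = torch.stack([a(sublayer_output) for a in self.adapters])  # (k+1, B, S, H)
            return (lam.view(-1, 1, 1, 1) * outs).sum(dim=0), lam

    def entropy_loss(all_lambdas, gamma=0.01):
        """L_entropy = gamma * sum over layers of the entropy of each mixing distribution."""
        return gamma * sum((-lam * torch.log(lam + 1e-12)).sum() for lam in all_lambdas)

    def select_module(lam):
        """After the decision stage, keep the module with the largest learned weight;
        a new module is retained only if the freshly added candidate wins."""
        return int(torch.argmax(lam))

In this sketch the weighted average of module outputs is what gets trained end-to-end, which mirrors note (ii) above: at training and inference time only the single selected adapter is used, so the mixture is purely a device for choosing the architecture.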
4.2 Training Stage

We further incorporate pseudo experience replay (Sun et al., 2019) to mitigate forgetting and facilitate knowledge transfer in those shared modules. The main idea is to teach a generative model to solve the current task and to generate the current task's examples simultaneously. Then, before training on each new task, we can generate a set of pseudo old examples and replay them during training. Thus, in addition to the finetuning loss for solving each task, we introduce an extra loss L_gen for the model to generate the current task's examples. Formally, given the whole sequence x = {input, question, output}, we first add a special token [GEN] at the beginning of x to form a new sequence x', and then optimize the model as follows:

    L_gen(x') = - \sum_{t=1}^{n+1} \log P(x'_t | x'_{<t})

Note that we use different special tokens for different tasks, so we can later generate examples for a specified task. Combined with the finetuning loss, the overall training loss is:

    L_train = L_finetune + η L_gen

where η is the weight of the L_gen loss.

Once our model has the ability to generate "pseudo" examples from old tasks, another question is when to generate them. Since those "pseudo" examples are meant for the modules shared between old tasks and the current task, we only generate them when some old modules are reused for the current task. In that case, we train our model using L_train on the current dataset together with the generated examples. Otherwise, there is no need for pseudo experience replay and we just train our model using L_train on the current dataset.
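The sketch below illustrates the training-stage objective and pseudo experience replay, assuming the same HuggingFace-style causal LM as in the earlier sketch. The task-specific generation token, the top-k value (k = 20) and η = 0.25 follow the description in this section and Appendix A; the function names, prompt format and sampling length are illustrative assumptions.

    # Illustrative sketch of the training-stage losses and pseudo replay (Section 4.2).
    import torch

    def gen_loss(model, tokenizer, full_text, gen_token, device="cpu"):
        # L_gen: model the whole sequence x' = [GEN-style task token] + input + question + output
        ids = tokenizer.encode(gen_token + " " + full_text + tokenizer.eos_token)
        input_ids = torch.tensor([ids], device=device)
        return model(input_ids=input_ids, labels=input_ids).loss

    def train_loss(l_finetune, l_gen, eta=0.25):
        # L_train = L_finetune + eta * L_gen (eta = 0.25 in Appendix A)
        return l_finetune + eta * l_gen

    @torch.no_grad()
    def generate_pseudo_examples(model, tokenizer, gen_token, n_samples, device="cpu"):
        # Before training on a new task that reuses old modules, sample "pseudo" examples
        # of an old task by prompting with that task's generation token (top-k, k = 20).
        prompt = torch.tensor([tokenizer.encode(gen_token)], device=device)
        outputs = model.generate(prompt, do_sample=True, top_k=20, max_length=256,
                                 num_return_sequences=n_samples,
                                 pad_token_id=tokenizer.eos_token_id)
        return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]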
5 Experiments

5.1 Datasets

Following Sun et al. (2019) and Chuang et al. (2020), we evaluate our approach on four representative sequence generation tasks: natural language generation, SQL query generation, summarization and task-oriented dialogue modeling. Specifically, we test our proposed approach under two common scenarios. (1) CL on similar tasks: in this case, the new coming tasks often share the same task pattern with learned tasks, but come from different domains. We use E2ENLG (Novikova et al., 2017) and four different domains (restaurant, hotel, tv, laptop) from RNNLG (Wen et al., 2015) to form five similar tasks. We then use four different orders of these tasks as our testing task sequences. (2) CL on dissimilar tasks: in this case, the distribution shift between new tasks and old tasks can be relatively large, so the major challenge is to retain old knowledge as much as possible while learning new tasks. Here we further incorporate WikiSQL (SQL query generation, Zhong et al., 2017), CNN/DailyMail (news article summarization, See et al., 2017), and MultiWOZ (semantic state sequence generation, Budzianowski et al., 2018) into our task sequences [1]. We randomly pick four different orders as our testing task sequences. In total, we use eight different task sequences (Table 1) to evaluate our models. The statistics/metrics for each dataset and the finetuning results are in Appendix A.

[1] We use "e2e" for E2ENLG, "rest" for RNNLG (restaurant), "hotel" for RNNLG (hotel), "tv" for RNNLG (tv), "laptop" for RNNLG (laptop), "wiki" for WikiSQL, "cnn" for CNN/DailyMail, and "woz" for MultiWOZ.
Order   Task Sequence
1       e2e  rest  hotel  tv  laptop
2       laptop  tv  hotel  rest  e2e
3       rest  tv  e2e  laptop  hotel
4       hotel  e2e  rest  laptop  tv
5       woz  cnn  e2e  rest  hotel
6       e2e  wiki  hotel  woz  rest
7       hotel  e2e  woz  wiki  cnn
8       cnn  hotel  wiki  e2e  woz

Table 1: Eight random different task sequences. The first 4 include different orders of similar tasks; the last 4 include different orders including dissimilar tasks.

5.2 Baselines

We compare our proposed model with the following baselines: (i) Finetune (Yogatama et al., 2019): we finetuned the GPT-2 model on several tasks sequentially. (ii) EWC (Kirkpatrick et al., 2017) adds regularization on parameters according to their importance to old tasks. (iii) LAMOL (Sun et al., 2019) finetunes the whole GPT-2 model continually with the help of pseudo experience replay. (iv) Adapter+CL (Madotto et al., 2021) inserts adapter (Houlsby et al., 2019) modules into every GPT-2 layer for each task. (v) Adapter+Drop (Rücklé et al., 2021): we removed all adapter modules from the first three layers of GPT-2 on top of Adapter+CL. (vi) Adapter+LAMOL: we only insert adapter modules into every transformer layer for the first task, then use those adapter modules to learn the whole task sequence with pseudo experience replay. Note that ARPER (Mi et al., 2020) also tackles continual sequence generation, but it needs an extra memory buffer to store examples from old tasks, which is not comparable with ours.

Implementation Details We use GPT-2 (Radford et al., 2019) from HuggingFace Transformers (Wolf et al., 2020) as our backbone and the adapter implementation from AdapterHub (Pfeiffer et al., 2020). More details can be found in Appendix A.

6 Results and Analysis

To evaluate the overall performance on all tasks, we use the mean of all tasks' performance scores, following Sun et al. (2019); Mi et al. (2020); Madotto et al. (2021). For each scenario (similar tasks and dissimilar tasks), we report the average of the mean scores over all sequences as an overall metric. Beyond these, we also provide (i) evaluation results using the geometric mean and (ii) the final performance of each task in Appendix A. Table 2 summarizes the final performance on all eight task sequences. We observed that sequential finetuning suffered from very severe forgetting, no matter on similar or dissimilar tasks, highlighting the importance of continual learning work. Though EWC can significantly increase the performance of finetuning, its performance is still far behind LAMOL, highlighting the importance of experience replay.

For sequences containing similar tasks, the performance of Adapter+CL is inferior to Adapter+LAMOL even with more learnable parameters. This indicates that sharing parameters and experience replay can further facilitate knowledge transfer when tasks are similar. On the premise of pseudo experience replay, our method performs better than Adapter+LAMOL, demonstrating the effectiveness of our adaptive compositional architecture. Our approach also achieves much higher parameter efficiency than Adapter+CL and Adapter+Drop. For sequences containing dissimilar tasks, where the transferable knowledge is limited and parameter sharing might cause degradation, Adapter+CL and Adapter+Drop seem more robust than Adapter+LAMOL and LAMOL, since they avoid catastrophic forgetting by parameter isolation. Using a similar number of parameters to Adapter+Drop, our method outperforms Adapter+CL consistently on all task sequences, confirming that our method can prevent interference between dissimilar tasks while reducing parameter redundancy.
Methods                    Finetune   EWC       LAMOL     Adapter+CL  Adapter+Drop  Adapter+LAMOL  Ours
Pseudo Experience Replay   no         no        yes       no          no            yes            yes
Similar Tasks
  #1                       43.0       56.9      66.3      64.2        63.9          65.9           66.1
  #2                       37.0       47.9      67.0      64.2        63.9          66.2           66.5
  #3                       51.7       61.4      66.6      64.2        63.9          65.6           65.8
  #4                       45.0       58.3      66.6      64.2        63.9          65.2           65.7
  Avg Performance          44.2       56.2      66.6      64.2        63.9          65.7           66.0
  Avg Learnable Para.      124.45M    124.45M   124.45M   8.95M       6.71M         1.79M          2.44M
Dissimilar Tasks
  #5                       33.6       37.5      57.0      57.5        57.4          54.3           58.2
  #6                       32.6       37.9      62.5      64.9        64.5          62.2           65.9
  #7                       19.7       37.5      56.7      57.3        56.7          54.6           58.3
  #8                       26.3       38.8      56.8      57.3        56.7          53.8           58.2
  Avg Performance          28.1       37.9      58.3      59.3        58.8          56.2           60.1
  Avg Learnable Para.      124.45M    124.45M   124.45M   8.95M       6.71M         1.79M          6.60M

Table 2: The mean of the final performance scores on all tasks. We use two random seeds for each task sequence. Note that the final performance of Adapter+CL and Adapter+Drop is not affected by task ordering within the same group of tasks. For each sequence, we mark the best result in bold, where LAMOL is not compared due to the difference in the order of magnitude of the learnable parameters. For each scenario, the p-value of a paired t-test between the 8 numbers of our approach and the second highest comparable baseline is smaller than 0.05, demonstrating significant improvement.
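The significance statement in the caption refers to a paired t-test, per scenario, over eight numbers (4 sequences x 2 random seeds) comparing our approach with the second highest comparable baseline. Those per-seed scores are not listed in the table, so the arrays below are hypothetical placeholders purely to show the test mechanics with SciPy.

    # Placeholder illustration of the paired t-test behind the caption's p < 0.05 claim.
    from scipy.stats import ttest_rel

    ours_per_seed     = [66.0, 66.2, 66.4, 66.8, 65.7, 65.9, 65.6, 65.8]  # hypothetical values
    baseline_per_seed = [65.8, 66.0, 66.1, 66.3, 65.1, 65.3, 65.0, 65.4]  # hypothetical values

    t_stat, p_value = ttest_rel(ours_per_seed, baseline_per_seed)
    print(f"p = {p_value:.4f}, significant at 0.05: {p_value < 0.05}")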

6.1 Ablation Studies

We randomly selected task sequence #1 from the similar tasks and sequence #8 from the dissimilar tasks for our ablation studies.

Importance of Each Component To examine the importance of each component of our method, we experiment with different settings: not using the entropy loss (w/o Entropy Loss), initializing all weight coefficients with zero (w/o Weight Ini), and not replaying pseudo data (w/o Pseudo ER). As shown in Table 3, we found that (i) after removing the entropy loss, the performance on sequence #1 is almost maintained by using more parameters, while the performance on sequence #8 drops significantly using the same number of parameters. This observation suggests that the entropy loss is beneficial for achieving a better trade-off between adding parameters and maintaining good performance. (ii) When we initialize all weight coefficients with zero, there is no explicit tendency to reuse old modules. In this case, many redundant modules are created, preventing knowledge transfer, which leads to a performance drop on both sequences. The drop on sequence #1 is more severe because there is more transferable knowledge between similar tasks. We therefore conclude that the weight initialization is important for enabling knowledge transfer between similar tasks. (iii) Removing pseudo experience replay leads to the most severe performance drop on both sequences. Though our approach strategically detects which modules can be reused, directly training them on new tasks without protecting old knowledge leads to catastrophic forgetting.

Method             Sequence #1: Avg / Avg L.P.   Sequence #8: Avg / Avg L.P.
Ours               66.1 / 2.24M                  58.2 / 6.49M
w/o Entropy loss   66.1 / 2.54M                  57.6 / 6.49M
w/o Weight Ini     64.2 / 7.09M                  57.7 / 8.65M
w/o Pseudo ER      43.2 / 2.08M                  55.9 / 6.34M

Table 3: Ablation study on (i) entropy loss, (ii) weight initialization, (iii) pseudo experience replay. The left part includes results for sequence #1 and the right part includes results for sequence #8. Note that "Avg" refers to the mean of the performance scores on all tasks and "Avg L.P." refers to the mean of learnable parameters.

Impact of Task Sequence Length Prior work in continual learning (Madotto et al., 2021; Huang et al., 2021) suggests that differences in sequence length could influence the performance of continual learning. To this end, we further investigated the impact of sequence length in Table 4, where we report the average performance at every step and calculate Backward Transfer following Lopez-Paz and Ranzato (2017):

    BWT_k = \frac{1}{k-1} \sum_{i=1}^{k-1} (R_{k,i} - R_{i,i})

where R_{i,j} is the performance score on the j-th task after training on the i-th task.

Length         Adapter+CL    Adapter+LAMOL   Ours
2 Tasks (#1)   56.8 (+0.0)   57.5 (+0.8)     57.7 (+0.9)
3 Tasks (#1)   59.5 (+0.0)   60.3 (+0.6)     60.1 (+0.5)
4 Tasks (#1)   62.3 (+0.0)   63.5 (+1.3)     63.7 (+1.6)
5 Tasks (#1)   64.2 (+0.0)   65.9 (+2.0)     66.1 (+2.1)
2 Tasks (#8)   45.4 (+0.0)   46.2 (+1.3)     46.0 (+1.2)
3 Tasks (#8)   51.3 (+0.0)   51.9 (+0.8)     52.3 (+0.9)
4 Tasks (#8)   50.9 (+0.0)   49.7 (-1.7)     51.8 (+0.6)
5 Tasks (#8)   57.3 (+0.0)   53.8 (-4.6)     58.2 (+0.5)

Table 4: Impact of the task sequence length. "n Tasks (#i)" means that, after sequentially training on the first n tasks in sequence #i, we report the mean of the performance scores on those n tasks and the backward transfer in parentheses.
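The backward-transfer numbers in parentheses in Table 4 follow the definition above. For clarity, here is a small helper mirroring that formula; R is assumed to hold R[i][j], the score on task j after training on task i (0-indexed here). It reflects the definition only, not any specific evaluation script from the paper.

    # Helper mirroring the BWT definition (0-indexed R[i][j] = score on task j after task i).
    def backward_transfer(R):
        k = len(R)
        return sum(R[k - 1][i] - R[i][i] for i in range(k - 1)) / (k - 1)

    # Example with k = 3: task 0 loses 2 points and task 1 loses 1 point by the end,
    # so BWT_3 = ((58 - 60) + (69 - 70)) / 2 = -1.5.
    R = [[60.0,  0.0,  0.0],
         [59.0, 70.0,  0.0],
         [58.0, 69.0, 65.0]]
    print(backward_transfer(R))  # -1.5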
We found that, on sequence #1, Adapter+LAMOL and our method consistently outperform Adapter+CL at all stages, which can be explained by better knowledge transfer between multiple tasks. Beyond that, our method outperforms Adapter+LAMOL in most cases, demonstrating the benefit of adaptively adding modules. On sequence #8, Adapter+LAMOL struggles as the task sequence becomes longer. As more and more tasks arrive, the impact of task dissimilarity and distribution shift becomes too large for pseudo experience replay to cope with; in that case, there is limited backward transfer but severe forgetting. In contrast, Adapter+CL and our method demonstrate their robustness after learning more tasks in a stream. Our method also outperforms Adapter+CL throughout the learning process, demonstrating that we can enable knowledge transfer even when the similarity between tasks is limited.

Case Study We selected e2e in sequence #1 and wiki in sequence #8 as two representative tasks to illustrate the final output generated by different approaches in Table 5. After training on the whole sequence, Adapter+LAMOL cannot correctly convey the information provided in the input, suffering from grammar mistakes and missing key points. This can be attributed to interference from learning new coming tasks. While Adapter+CL successfully mitigates this problem by parameter isolation, our approach works similarly well using fewer parameters and generates better sequences without redundant information.

E2E NLG (#1): name[Strada], eatType[coffee shop], area[city centre]
  Reference:      There is a coffee shop in the city centre called the Strada.
  Adapter+CL:     Strada serves coffee, is a nice coffee shop, in city centre.
  Adapter+LAMOL:  Strada is a coffee shop serving city centre food
  Ours:           Strada is a coffee shop located in the city centre.
WikiSQL (#8): which team has pick 13 in round 2 ?
  Reference:      select team from table where round = 2 and pick = 13
  Adapter+CL:     select team from table where pick = 13 and round = round 2
  Adapter+LAMOL:  select team from table where round = 2 (missing: and pick = 13)
  Ours:           select team from table where pick = 13 and round = 2

Table 5: Output comparison after training on sequences #1 and #8. We visualize e2e and wiki as two representative tasks and color redundant information in red, missing information in blue and grammar mistakes in orange.

Figure 3: The growing process of our model on the sequence hotel, e2e, rest, laptop, tv. The 1st layer is shown at the bottom and the 12th layer at the top of each figure. Note that here we only depict the architecture growing process of our inserted modules: (i) each rectangle represents a module added in that specific transformer layer; (ii) each module is painted with the corresponding color if it is used by a task; (iii) modules with multiple colors are shared by multiple tasks.

6.2 The Growth of Compositional Modules

To illustrate the process of adding/reusing modules, we depict the model architecture at each stage in Fig 3 using sequence #4, which is the most challenging sequence containing similar tasks according to Table 2. Since the similarity between the second task (e2e) and the first task (hotel) is low (see Figure 4 in Appendix A), our framework automatically learns to add extra adapter modules in layers {6, 8, 9, 10, 11} before training on the second task. When the third task (rest) arrives, given its high similarity to the first task, our method correctly decides to reuse all modules used by the first task.
Interestingly, the architecture for the fourth task is composed of modules shared with the first three tasks in layers {1, 2, 3, 4, 5, 7, 12}, a module shared with the second task in layer 6, a module shared with the first and third tasks in layer 8, and newly added modules for the fourth task in layers {9, 10, 11}. For the fifth task, our method reuses all modules used by the fourth task due to their high similarity. This demonstrates that our method is adaptive to different incoming tasks and is able to compose modules from different old tasks for new tasks. We also provide a comparison in Appendix B to demonstrate the effect of reusing modules from different transformer layers.

7 Conclusion

This work examined continual sequence generation with adaptive compositional modules, where we proposed hidden state mixing to adaptively compose old and new modules for new tasks and utilized pseudo experience replay to facilitate knowledge transfer. Experiments conducted on various sequence generation tasks demonstrated that our method achieves better performance with higher parameter efficiency than previous state-of-the-art baselines, on both similar and dissimilar task sequences. Our work is also subject to a few limitations, such as the extra training time it introduces. In the future, we plan to investigate how to further speed up the decision stage and generalize the current framework to more diverse NLP tasks such as text classification and machine translation.

Acknowledgment

We would like to thank the anonymous reviewers for their helpful comments, and the members of the Georgia Tech SALT group for their feedback. This work is funded in part by Salesforce and Cisco.

References

Craig Atkinson, Brendan McCane, Lech Szymanski, and Anthony Robins. 2018. Pseudo-recursal: Solving the catastrophic forgetting problem in deep neural networks.

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization.

Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Inigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gašić. 2018. MultiWOZ - a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. arXiv preprint arXiv:1810.00278.

Jiaao Chen, Derek Tam, Colin Raffel, Mohit Bansal, and Diyi Yang. 2021. An empirical survey of data augmentation for limited data learning in NLP.

Jiaao Chen, Zichao Yang, and Diyi Yang. 2020. MixText: Linguistically-informed interpolation of hidden space for semi-supervised text classification. In ACL, pages 2147-2157.

Yung-Sung Chuang, Shang-Yu Su, and Yun-Nung Chen. 2020. Lifelong language knowledge distillation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2914-2924.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Beyza Ermis, Giovanni Zappella, Martin Wistuba, and Cedric Archambeau. 2022. Memory efficient continual learning for neural text classification.
Junliang Guo, Zhirui Zhang, Linli Xu, Boxing Chen, and Enhong Chen. 2021. Adaptive adapters: An efficient way to incorporate BERT into neural machine translation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:1740-1751.

Xu Han, Yi Dai, Tianyu Gao, Yankai Lin, Zhiyuan Liu, Peng Li, Maosong Sun, and Jie Zhou. 2020. Continual relation learning via episodic memory activation and reconsolidation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6429-6440, Online. Association for Computational Linguistics.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770-778.

Nithin Holla, Pushkar Mishra, Helen Yannakoudakis, and Ekaterina Shutova. 2020. Meta-learning with sparse experience replay for lifelong language learning. CoRR, abs/2009.04891.

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2790-2799. PMLR.

Yufan Huang, Yanzhe Zhang, Jiaao Chen, Xuezhi Wang, and Diyi Yang. 2021. Continual learning for text classification with information disentanglement based regularization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2736-2746.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics.

Xisen Jin, Bill Yuchen Lin, Mohammad Rostami, and Xiang Ren. 2021. Learn continually, generalize rapidly: Lifelong knowledge accumulation for few-shot learning. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 714-729, Punta Cana, Dominican Republic. Association for Computational Linguistics.

Zixuan Ke, Hu Xu, and Bing Liu. 2021. Adapting BERT for continual learning of a sequence of aspect sentiment classification tasks. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4746-4755, Online. Association for Computational Linguistics.

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521-3526.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871-7880, Online. Association for Computational Linguistics.

Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation.

Xilai Li, Yingbo Zhou, Tianfu Wu, Richard Socher, and Caiming Xiong. 2019. Learn to grow: A continual structure learning framework for overcoming catastrophic forgetting.

Hanxiao Liu, Karen Simonyan, and Yiming Yang. 2018. DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055.

David Lopez-Paz and Marc'Aurelio Ranzato. 2017. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In International Conference on Learning Representations.

Andrea Madotto, Zhaojiang Lin, Zhenpeng Zhou, Seungwhan Moon, Paul Crook, Bing Liu, Zhou Yu, Eunjoon Cho, Pascale Fung, and Zhiguang Wang. 2021. Continual learning in task-oriented dialogue systems. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7452-7467, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Arun Mallya and Svetlana Lazebnik. 2018. PackNet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7765-7773.

Fei Mi, Liangwei Chen, Mengjie Zhao, Minlie Huang, and Boi Faltings. 2020. Continual learning for natural language generation in task-oriented dialog systems. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 3461-3474.

Jekaterina Novikova, Ondřej Dušek, and Verena Rieser. 2017. The E2E dataset: New challenges for end-to-end generation. arXiv preprint arXiv:1706.09254.
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227-2237, New Orleans, Louisiana. Association for Computational Linguistics.

Jonas Pfeiffer, Aishwarya Kamath, Andreas Rücklé, Kyunghyun Cho, and Iryna Gurevych. 2021. AdapterFusion: Non-destructive task composition for transfer learning. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 487-503, Online. Association for Computational Linguistics.

Jonas Pfeiffer, Andreas Rücklé, Clifton Poth, Aishwarya Kamath, Ivan Vulić, Sebastian Ruder, Kyunghyun Cho, and Iryna Gurevych. 2020. AdapterHub: A framework for adapting transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 46-54, Online. Association for Computational Linguistics.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

Mark B. Ring. 1998. CHILD: A first step towards continual learning. In Learning to Learn, pages 261-292. Springer.

Andreas Rücklé, Gregor Geigle, Max Glockner, Tilman Beck, Jonas Pfeiffer, Nils Reimers, and Iryna Gurevych. 2021. AdapterDrop: On the efficiency of adapters in transformers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7930-7946, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. 2016. Progressive neural networks. arXiv preprint arXiv:1606.04671.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368.

James Smith, Jonathan Balloch, Yen-Chang Hsu, and Zsolt Kira. 2021a. Memory-efficient semi-supervised continual learning: The world is its own replay buffer. arXiv preprint arXiv:2101.09536. Accepted for publication at IJCNN 2021.

James Smith, Yen-Chang Hsu, Jonathan Balloch, Yilin Shen, Hongxia Jin, and Zsolt Kira. 2021b. Always be dreaming: A new approach for data-free class-incremental learning. 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

Fan-Keng Sun, Cheng-Hao Ho, and Hung-Yi Lee. 2019. LAMOL: Language modeling for lifelong language learning. In International Conference on Learning Representations.

Jingyuan Sun, Shaonan Wang, Jiajun Zhang, and Chengqing Zong. 2020. Distill and replay for continual language learning. In Proceedings of the 28th International Conference on Computational Linguistics, pages 3569-3579.

Sebastian Thrun. 1998. Lifelong learning algorithms. In Learning to Learn, pages 181-209. Springer.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Hong Wang, Wenhan Xiong, Mo Yu, Xiaoxiao Guo, Shiyu Chang, and William Yang Wang. 2019. Sentence embedding alignment for lifelong relation extraction. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 796-806, Minneapolis, Minnesota. Association for Computational Linguistics.

Zirui Wang, Sanket Vaibhav Mehta, Barnabas Poczos, and Jaime Carbonell. 2020. Efficient meta lifelong-learning with limited memory. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 535-548, Online. Association for Computational Linguistics.

Tsung-Hsien Wen, Milica Gašić, Nikola Mrkšić, Pei-Hao Su, David Vandyke, and Steve Young. 2015. Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1711-1721, Lisbon, Portugal. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38-45, Online. Association for Computational Linguistics.

Xiuwei Xu, Yifan Wang, Yu Zheng, Yongming Rao, Jie Zhou, and Jiwen Lu. 2022. Back to reality: Weakly-supervised 3D object detection with shape-guided label enhancement.
Dani Yogatama, Cyprien de Masson d'Autume, Jerome Connor, Tomas Kocisky, Mike Chrzanowski, Lingpeng Kong, Angeliki Lazaridou, Wang Ling, Lei Yu, Chris Dyer, et al. 2019. Learning and evaluating general linguistic intelligence. arXiv preprint arXiv:1901.11373.

Jaehong Yoon, Eunho Yang, Jeongtae Lee, and Sung Ju Hwang. 2017. Lifelong learning with dynamically expandable networks. arXiv preprint arXiv:1708.01547.

Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2SQL: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103.

A Supplementary Details and Results

Data and Metric Table 6 summarizes the datasets and metrics we used; all datasets use the public versions from prior work (Sun et al., 2019; Chuang et al., 2020) [2]. Note that some big datasets (WikiSQL, CNN/DailyMail, E2E NLG, RNNLG (laptop)) are reduced to a smaller size by random sampling due to data imbalance.

[2] Datasets available at: https://github.com/chho33/LAMOL and https://github.com/voidism/L2KD

Dataset          Metric   # Train   # Test
E2E NLG          ROUGE    6000      2000
RNNLG (rest.)    ROUGE    6228      1039
RNNLG (hotel)    ROUGE    6446      1075
RNNLG (tv)       ROUGE    8442      1407
RNNLG (laptop)   ROUGE    7944      2649
WikiSQL          lfEM     6525      15878
CNN/DailyMail    ROUGE    6604      2250
MultiWOZ         dsEM     2536      1646

Table 6: Dataset statistics and metrics. Note that ROUGE refers to the mean of ROUGE-1, ROUGE-2 and ROUGE-L, lfEM stands for exact match of logical forms, and dsEM represents turn-based dialogue state exact match.

Task Sequences In the scenario of CL on dissimilar tasks, each task sequence also contains two or three similar natural language generation tasks, so the model cannot cheat by always adding new modules without detecting reusable ones.

Implementation Details We use GPT-2 (Radford et al., 2019) from HuggingFace Transformers (Wolf et al., 2020) as our backbone. We use the adapter architecture from Houlsby et al. (2019) in AdapterHub (Pfeiffer et al., 2020) with its default setting, in which the reduction factor of the bottleneck architecture is 16. All experiments are conducted on an NVIDIA RTX 2080 Ti with 11GB memory, with a maximum batch size of 4. Training on one task sequence takes 5 to 9 hours.

We use AdamW (Loshchilov and Hutter, 2019) as our optimizer. We select the learning rate from {1e-4, 1.75e-4, 3e-4} and set lr = 1.75e-4 for all tasks except WikiSQL, and lr = 3e-4 for WikiSQL. For the decision stage, we train 6 epochs to make decisions. For the training stage, we select the best epoch number from {9, 12, 15}, and use 9 for the similar scenario and 12 for the dissimilar scenario. The weight initialization parameter c is selected from {0.03, 0.05, 0.07} for the similar scenario and {0.12, 0.15, 0.17} for the dissimilar scenario. The loss coefficient γ is selected from {0.01, 0.05}, and η is set to 0.25. Following Sun et al. (2019), we use top-k sampling with k = 20 and set the pseudo-data sample rate to 0.2. In our preliminary experiments, increasing the replay frequency can further alleviate forgetting. Thus, for those approaches using pseudo experience replay in this work, we set half of the training batches as pseudo-examples whenever learning a new task.

Note that the original design of Adapter+CL (Madotto et al., 2021) uses perplexity to distinguish which task each test example belongs to. In this work, we ignore that part and assume that the task id of each test example is given during inference, for all baselines and our approach, to ensure a fair comparison.

Finetuning Results We provide the results of finetuning GPT-2 (Radford et al., 2019) and finetuning adapters (Houlsby et al., 2019) on all eight datasets in Table 7. Since Chuang et al. (2020) show that the generation loss L_gen can slightly increase the performance of finetuning on certain tasks, we also include the finetuning results after adding the L_gen loss.
Our results confirm that finetuning adapters can almost maintain the performance of finetuning the whole model. We also demonstrate that the performance of finetuning adapters can be improved by simply integrating the L_gen loss. This suggests that the performance of Adapter+CL could be naively improved by adding L_gen to its training loss. In that case, the average of mean scores for Adapter+CL could be improved to 64.3 on similar task sequences and 59.6 on dissimilar task sequences, which are still significantly worse than our approach.

Method                  e2e    rest   hotel   tv     laptop
GPT-2 finetune †        48.8   64.0   65.4    70.8   73.0
GPT-2 finetune+gen †    48.8   64.2   65.5    71.0   72.8
Adapter finetune        49.8   64.0   64.9    70.6   71.7
Adapter finetune+gen    49.9   64.3   65.1    70.6   71.8

Method                  woz    cnn    wiki
GPT-2 finetune †        84.8   25.5   63.1
GPT-2 finetune+gen †    82.2   25.9   63.7
Adapter finetune        82.8   26.0   63.1
Adapter finetune+gen    83.5   26.0   63.8

Table 7: Finetuning results; † means we fetch the numbers from Chuang et al. (2020).

Figure 4: Task similarity calculated by the cosine similarity between each task's word frequency distribution.

Results using Geometric Mean While the mean of all tasks' performance scores is commonly used (Sun et al., 2019; Mi et al., 2020; Madotto et al., 2021) to represent the overall performance on several tasks, it can be largely influenced by the absolute change of a single number. In this work, we also leverage the geometric mean as a supplementary metric to measure the overall performance on different tasks, which provides another perspective that considers relative change during comparison.

Table 8 summarizes the final performance using the geometric mean. We observed the same trend as in Table 2, which demonstrates that our approach improves over the baselines comprehensively on all tasks, not just in favor of absolute value increments on some tasks.

Methods                    Finetune   EWC    LAMOL   Adapter+CL  Adapter+Drop  Adapter+LAMOL  Ours
Pseudo Experience Replay   no         no     yes     no          no            yes            yes
Similar Tasks
  #1                       40.2       56.0   65.7    63.7        63.4          65.4           65.6
  #2                       35.6       47.9   66.3    63.7        63.4          65.5           65.8
  #3                       50.9       60.8   66.0    63.7        63.4          64.9           65.2
  #4                       43.1       57.7   66.1    63.7        63.4          64.7           65.2
Dissimilar Tasks
  #5                       -          -      54.3    53.7        53.4          47.8           54.6
  #6                       -          24.0   61.6    64.1        63.6          61.2           65.0
  #7                       16.8       36.1   53.4    53.5        52.8          51.3           54.3
  #8                       6.62       34.9   53.2    53.5        52.8          47.5           54.8

Table 8: Summary of final performance using the geometric mean, where "-" denotes no valid geometric mean due to a zero score. We use two random seeds for each task sequence. Note that the final performance of Adapter+CL and Adapter+Drop is not affected by task ordering within the same group of tasks. For each sequence, we mark the best result in bold, where LAMOL is not compared due to the difference in the order of magnitude of the learnable parameters.
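The two aggregates can be contrasted with a small sketch: the arithmetic mean of per-task scores (used in the main results) and the geometric mean used here, which is undefined (reported as "-") when any score is zero. The example scores are the final per-task numbers of our approach on sequence #1 (see Table 9 below), whose arithmetic and geometric means match the 66.1 in Table 2 and the 65.6 in Table 8.

    # Arithmetic vs. geometric mean over the per-task scores of one sequence.
    import math

    def arithmetic_mean(scores):
        return sum(scores) / len(scores)

    def geometric_mean(scores):
        if any(s == 0 for s in scores):
            return None  # no valid geometric mean, reported as "-" in Table 8
        return math.exp(sum(math.log(s) for s in scores) / len(scores))

    scores = [51.7, 66.7, 67.7, 72.4, 71.9]   # e2e, rest, hotel, tv, laptop ("Ours", sequence #1)
    print(round(arithmetic_mean(scores), 1))  # 66.1, as in Table 2
    print(round(geometric_mean(scores), 1))   # 65.6, as in Table 8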
Ablation Study Table 9 summarizes the full details of the ablation study conducted on sequences #1 and #8.

Sequence #1        e2e    rest   hotel   tv     laptop   Avg    Avg L.P.
Ours               51.7   66.7   67.7    72.4   71.9     66.1   2.24M
w/o Entropy loss   52.1   67.1   67.6    72.3   71.5     66.1   2.54M
w/o Weight Ini     49.6   64.7   64.8    70.4   71.3     64.2   7.09M
w/o Pseudo ER      25.6   36.6   39.9    42.8   71.2     43.2   2.08M

Sequence #8        cnn    hotel   wiki   e2e    woz      Avg    Avg L.P.
Ours               27.8   65.3    62.9   51.7   83.3     58.2   6.49M
w/o Entropy loss   27.8   64.8    62.6   49.8   82.9     57.6   6.49M
w/o Weight Ini     26.7   64.7    64.6   49.9   82.4     57.7   8.65M
w/o Pseudo ER      23.5   60.2    61.1   50.7   83.9     55.9   6.34M

Table 9: Ablation study on (i) entropy loss, (ii) weight initialization, (iii) pseudo experience replay. The upper part includes per-task results for sequence #1 and the lower part includes results for sequence #8. Note that "Avg" refers to the mean of the performance scores on all tasks and "Avg L.P." refers to the mean of learnable parameters.

Detailed Final Performance Table 10 provides the final performance of each task on every sequence for our approach and Adapter+LAMOL. For Adapter+CL, the final results are in Table 7.

Task Similarity Figure 4 illustrates the task similarity between the five natural language generation tasks, which is calculated by the cosine similarity between each task's word frequency distribution.
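Both Figure 4 above and the module comparison in Appendix B below rely on the same measure: cosine similarity between word frequency distributions. A small sketch is given here; the paper does not specify the exact tokenization or preprocessing, so the whitespace tokenizer below is an assumption.

    # Cosine similarity between word frequency distributions (naive whitespace tokenization).
    import math
    from collections import Counter

    def word_freq(texts):
        counts = Counter(w for t in texts for w in t.lower().split())
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    def cosine_similarity(p, q):
        dot = sum(p[w] * q.get(w, 0.0) for w in p)
        norm_p = math.sqrt(sum(v * v for v in p.values()))
        norm_q = math.sqrt(sum(v * v for v in q.values()))
        return dot / (norm_p * norm_q)

    # e.g., similarity between task A's training data and the text generated for task B's inputs:
    # sim = cosine_similarity(word_freq(task_a_texts), word_freq(generated_texts))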
B Module Comparison

In order to demonstrate the compositional nature of our method, that is, that each module contains different knowledge required for solving each task, we also study the performance difference to quantify the effect of reusing different modules.

Method After training on task A, we specify a layer k, k = 1, 2, ..., 12, in which to add a new module for task B. We then train the model on task B together with pseudo experience replay. After training on task B, we replace the new module with the old module from task A in layer k, and compare the performance on solving task B between the modified architecture and the original architecture. On one hand, if the newly added module contains specific knowledge of task B, then replacing it will result in the absence of the corresponding features in the generated output. On the other hand, if the old module contains specific knowledge of task A, then using it will result in some features of task A being generated in the output.

Results Here we use laptop as task A and e2e as task B. We quantify the task knowledge contained in the generated output by calculating the cosine similarity of the word frequency distribution between a specific task's data and the generated output. In Table 11, we see that replacing the new module in layer 11 results in the most severe information loss for task B in the modified architecture, suggesting that the module in layer 11 contains the most important word-frequency information for task B. In the same way, we conclude that the module in layer 3 contains the least important word-frequency information for task B. This is consistent with previous findings (Jawahar et al., 2019) that bag-of-words information is mainly captured by higher transformer layers, while lower transformer layers capture surface and syntactic information.

Similarly, by analyzing the cosine similarity of the word frequency distribution to task A, we find that the old module in layer 9 contains the most important word-frequency information for task A and the old module in layer 5 contains the least. Taking a closer look, we also find that modules in different layers contain information about different high-frequency words in task A. For example, the modules in layers 9 and 10 contain the most information about the words "computing" and "laptop", respectively, and the module in layer 11 contains more information about the word "business" than any other module. This further demonstrates that different task-specific knowledge is contained in different modules from different layers, which results in different potential for reuse. By selectively reusing old modules to enable knowledge transfer and adding necessary modules to mitigate knowledge interference, our method derives a compositional architecture for every new task, as depicted in Figure 3.
Method - #1 e2e rest hotel tv laptop Avg
Adap+LAMOL 51.8 66.5 67.2 72.4 71.5 65.9
Ours 51.7 66.7 67.7 72.4 71.9 66.1
Method - #2 laptop tv hotel rest e2e Avg
Adap+LAMOL 74.8 75.2 65.9 66.0 49.3 66.2
Ours 64.7 74.5 51.5 73.5 49.7 66.5
Method - #3 rest tv e2e laptop hotel Avg
Adap+LAMOL 64.3 74.9 50.0 74.5 64.1 65.6
Ours 64.7 74.5 51.5 73.5 64.8 65.8
Method - #4 hotel e2e rest laptop tv Avg
Adap+LAMOL 66.4 50.9 65.8 73.0 70.0 65.2
Ours 66.4 51.3 66.2 74.2 70.6 65.7
Method - #5 woz cnn e2e rest hotel Avg
Adap+LAMOL 75.8 15.4 51.9 64.3 64.3 54.3
Ours 83.5 26.9 51.5 65.1 64.2 58.2
Method - #6 e2e wiki hotel woz rest Avg
Adap+LAMOL 53.4 47.9 64.6 80.4 64.7 62.2
Ours 50.9 64.3 65.1 84.1 64.8 65.9
Method - #7 hotel e2e woz wiki cnn Avg
Adap+LAMOL 66.0 48.5 77.5 55.4 25.8 54.6
Ours 67.0 50.9 83.5 64.1 25.9 58.3
Method - #8 cnn hotel wiki e2e woz Avg
Adap+LAMOL 16.5 65.2 52.5 51.4 83.4 53.8
Ours 27.8 65.3 62.9 51.7 83.3 58.2

Table 10: Final performance of each task on every sequence. Adap+LAMOL refers to Adapter+LAMOL.

Layer    Task A: O / M    Task B: O / M
1 59.6 72.5 95.1 92.5
2 60.2 72.3 95.0 93.3
3 60.1 71.3 95.1 93.6
4 60.0 70.2 95.1 93.4
5 60.0 68.9 95.2 91.3
6 59.8 72.6 95.1 88.3
7 60.0 71.2 95.0 86.2
8 59.9 72.6 95.0 81.9
9 59.6 76.7 95.0 83.8
10 59.9 74.1 95.2 81.2
11 59.9 74.5 95.0 80.3
12 59.7 75.5 94.9 82.0

Table 11: Module comparison: the effect of replacing the new module with the old module in different layers after sequentially learning tasks A and B. Numbers in this table are the cosine similarity of the word frequency distribution between the data of a specific task and the output generated from task B's input (by the original architecture, O, or the modified architecture, M). We highlight the most informative layers and the least informative layers differently.

