ACL - 2022 - Yanzhe Zhang - Continual Sequence Generation With Adaptive Compositional Modules
Note that we use different special tokens for different tasks, so we can generate examples for a specified task afterwards. Combining this with the finetuning loss, the overall training loss is

L_train = L_finetune + η L_gen,

where η is the weight of the L_gen loss.

Once our model has the ability to generate "pseudo" examples from old tasks, another question is when to generate them. Since those "pseudo" examples are meant for modules shared between old tasks and the current task, we only generate them when some old modules are reused for the current task. In that case, we train our model with L_train on the current dataset together with the generated examples. Otherwise, there is no need for pseudo experience replay and we simply train our model with L_train on the current dataset.
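To make this procedure concrete, the following is a minimal PyTorch/Transformers-style sketch of one way to implement it; this is our illustration, not the authors' released code. The special token strings, the arguments task_token_id and old_task_token_ids, and the reuses_old_modules flag are hypothetical stand-ins for the paper's task tokens and the output of its decision stage; top-k sampling with k = 20 follows the setting reported in Appendix A.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# One special token per task (hypothetical names), as described above.
tok.add_special_tokens({"additional_special_tokens": ["[E2E]", "[REST]"]})
model.resize_token_embeddings(len(tok))

def train_step(context_ids, target_ids, task_token_id, eta=0.25,
               reuses_old_modules=False, old_task_token_ids=()):
    """One illustrative training step on the current task (tensors of shape [1, L])."""
    # Model input: [TASK] context target
    ids = torch.cat([torch.tensor([[task_token_id]]), context_ids, target_ids], dim=1)

    # L_finetune: next-token loss on the target span only (context positions masked).
    ft_labels = ids.clone()
    ft_labels[:, : 1 + context_ids.size(1)] = -100
    l_finetune = model(input_ids=ids, labels=ft_labels).loss

    # L_gen: loss over the whole sequence, so the model also learns to generate
    # complete (context, target) examples conditioned on the task token.
    l_gen = model(input_ids=ids, labels=ids).loss

    loss = l_finetune + eta * l_gen  # L_train = L_finetune + eta * L_gen

    # Generate and replay "pseudo" examples only when old modules are reused.
    if reuses_old_modules:
        for old_id in old_task_token_ids:
            pseudo = model.generate(torch.tensor([[old_id]]), do_sample=True,
                                    top_k=20, max_length=128)
            loss = loss + model(input_ids=pseudo, labels=pseudo).loss
    return loss

In practice the pseudo examples would be mixed into the training stream as separate batches (Appendix A sets half of the training batches to pseudo-examples); the sketch folds them into a single step only for brevity.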
5 Experiments

5.1 Datasets
Following Sun et al. (2019) and Chuang et al. (2020), we evaluate our approach on four representative sequence generation tasks: natural language generation, SQL query generation, summarization, and task-oriented dialogue modeling. Specifically, we test our proposed approach under two common scenarios. (1) CL on similar tasks: the newly arriving tasks often share the same task pattern with learned tasks but come from different domains. We use E2ENLG (Novikova et al., 2017) and four different domains (restaurant, hotel, tv, laptop) from RNNLG (Wen et al., 2015) to form five similar tasks, and then use four different orders of these tasks as our testing task sequences. (2) CL on dissimilar tasks: the distribution shift between new and old tasks can be relatively large, so the major challenge is to retain old knowledge as much as possible while learning new tasks. Here we further incorporate WikiSQL (SQL query generation; Zhong et al., 2017), CNN/DailyMail (news article summarization; See et al., 2017) and MultiWOZ (semantic state sequence generation; Budzianowski et al., 2018) into our task sequences,¹ and randomly pick four different orders as our testing task sequences. In total, we use eight different task sequences (Table 1) to evaluate our models. The statistics/metrics for each dataset and the finetuning results are in Appendix A.

¹We use "e2e" for E2ENLG, "rest" for RNNLG (restaurant), "hotel" for RNNLG (hotel), "tv" for RNNLG (tv), "laptop" for RNNLG (laptop), "wiki" for WikiSQL, "cnn" for CNN/DailyMail, and "woz" for MultiWOZ.
Order   Task Sequence
1       e2e rest hotel tv laptop
2       laptop tv hotel rest e2e
3       rest tv e2e laptop hotel
4       hotel e2e rest laptop tv
5       woz cnn e2e rest hotel
6       e2e wiki hotel woz rest
7       hotel e2e woz wiki cnn
8       cnn hotel wiki e2e woz

Table 1: Eight randomly chosen task sequences. The first four are different orders of the similar tasks; the last four are different orders that include dissimilar tasks.
5.2 Baselines
We compare our proposed model with the following baselines. (i) Finetune (Yogatama et al., 2019): we finetune the GPT-2 model on the tasks sequentially. (ii) EWC (Kirkpatrick et al., 2017): adds regularization to parameters according to their importance to old tasks. (iii) LAMOL (Sun et al., 2019): finetunes the whole GPT-2 model continually with the help of pseudo experience replay. (iv) Adapter+CL (Madotto et al., 2021): inserts adapter modules (Houlsby et al., 2019) into every GPT-2 layer for each task. (v) Adapter+Drop (Rücklé et al., 2021): on top of Adapter+CL, we remove the adapter modules from the first three layers of GPT-2. (vi) Adapter+LAMOL: we insert adapter modules into every transformer layer for the first task only, then use those adapter modules to learn the whole task sequence with pseudo experience replay. Note that ARPER (Mi et al., 2020) also tackles continual sequence generation, but it needs an extra memory buffer to store examples from old tasks, so it is not directly comparable with ours.
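For reference, the EWC baseline (ii) augments the loss on a new task with a quadratic penalty that anchors parameters deemed important for earlier tasks. In its standard form (notation ours, following Kirkpatrick et al. (2017), not taken from this paper), the objective is L(θ) = L_new(θ) + (λ/2) Σ_i F_i (θ_i - θ*_i)², where θ* are the parameters learned on previous tasks, F_i is a diagonal Fisher-information estimate of each parameter's importance, and λ controls the regularization strength.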
Implementation Details We use GPT-2 (Radford et al., 2019) from HuggingFace Transformers (Wolf et al., 2020) as our backbone and the adapter implementation from AdapterHub (Pfeiffer et al., 2020). More details can be found in Appendix A.

6 Results and Analysis
To evaluate the overall performance on all tasks, we use the mean of all tasks' performance scores, following Sun et al. (2019); Mi et al. (2020); Madotto et al. (2021). For each scenario (similar tasks and dissimilar tasks), we report the average of the mean scores over all sequences as an overall metric. Beyond these, we also provide (i) evaluation results using the geometric mean and (ii) the final performance of each task in Appendix A. Table 2 summarizes the final performance on all eight task sequences. We observe that sequential finetuning suffers from very severe forgetting on both similar and dissimilar tasks, highlighting the importance of continual learning. Though EWC significantly improves over finetuning, its performance is still far behind LAMOL, highlighting the importance of experience replay.

For sequences containing similar tasks, the performance of Adapter+CL is inferior to Adapter+LAMOL even with more learnable parameters. This indicates that sharing parameters and experience replay can further facilitate knowledge transfer when tasks are similar. On the premise of pseudo experience replay, our method performs better than Adapter+LAMOL, demonstrating the effectiveness of our adaptive compositional architecture. Our approach also achieves much higher parameter efficiency than Adapter+CL and Adapter+Drop. For sequences containing dissimilar tasks, where transferable knowledge is limited and parameter sharing might cause degradation, Adapter+CL and Adapter+Drop are more robust than Adapter+LAMOL and LAMOL, since they avoid catastrophic forgetting by parameter isolation. Using a similar number of parameters to Adapter+Drop, our method outperforms Adapter+CL consistently on all task sequences, confirming that it can prevent interference between dissimilar tasks while reducing parameter redundancy.

6.1 Ablation Studies
We randomly selected task sequence #1 from the similar-task sequences and sequence #8 from the dissimilar-task sequences for our ablation studies.

Importance of Each Component To examine the importance of each component in our method, we experiment with different settings: not using the entropy loss (w/o Entropy Loss), initializing all weight coefficients with zero (w/o Weight Ini), and not replaying pseudo data (w/o Pseudo ER). As shown in Table 3, we found that (i) after removing the entropy loss, the performance on sequence #1 is almost maintained, at the cost of more parameters, while the performance on sequence #8 drops significantly with the same number of parameters. This observation suggests that the en-
Methods                    Finetune   EWC       LAMOL     Adapter+CL  Adapter+Drop  Adapter+LAMOL  Ours
Pseudo Experience Replay   ✗          ✗         ✓         ✗           ✗             ✓              ✓

Similar Tasks
  #1                       43.0       56.9      66.3      64.2        63.9          65.9           66.1
  #2                       37.0       47.9      67.0      64.2        63.9          66.2           66.5
  #3                       51.7       61.4      66.6      64.2        63.9          65.6           65.8
  #4                       45.0       58.3      66.6      64.2        63.9          65.2           65.7
  Avg Performance          44.2       56.2      66.6      64.2        63.9          65.7           66.0
  Avg Learnable Para.      124.45M    124.45M   124.45M   8.95M       6.71M         1.79M          2.44M

Dissimilar Tasks
  #5                       33.6       37.5      57.0      57.5        57.4          54.3           58.2
  #6                       32.6       37.9      62.5      64.9        64.5          62.2           65.9
  #7                       19.7       37.5      56.7      57.3        56.7          54.6           58.3
  #8                       26.3       38.8      56.8      57.3        56.7          53.8           58.2
  Avg Performance          28.1       37.9      58.3      59.3        58.8          56.2           60.1
  Avg Learnable Para.      124.45M    124.45M   124.45M   8.95M       6.71M         1.79M          6.60M

Table 2: The mean of final performance scores on all tasks. We use two random seeds for each task sequence. Note that the final performance of Adapter+CL and Adapter+Drop is not affected by task ordering within the same group of tasks. For each sequence, we mark the best result in bold; LAMOL is not compared because its number of learnable parameters differs by an order of magnitude. For each scenario, the p-value of a paired t-test between the 8 numbers of our approach and the second-highest comparable baseline is smaller than 0.05, demonstrating significant improvement.
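The significance claim in the caption can be checked with a standard paired t-test; the sketch below (ours) uses scipy.stats.ttest_rel on placeholder score lists, since the per-seed numbers behind the 8 paired values are not reproduced in this excerpt.

from scipy import stats

def significant_improvement(ours, baseline, alpha=0.05):
    """Paired t-test over matched per-sequence / per-seed scores."""
    t_stat, p_value = stats.ttest_rel(ours, baseline)
    return p_value < alpha

# Example call with made-up scores (NOT the paper's per-seed numbers):
print(significant_improvement([66.2, 66.4, 65.9, 65.6, 66.0, 66.6, 65.7, 65.8],
                              [65.8, 66.1, 65.5, 65.3, 65.7, 66.3, 65.4, 65.1]))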
We found that, on sequence #1, Adapter+LAMOL and our method consistently outperform Adapter+CL at all stages, which could be explained by better knowledge transfer between multiple tasks. Beyond that, our method outperforms Adapter+LAMOL in most cases, demonstrating the benefits of adaptively adding modules. On sequence #8, Adapter+LAMOL struggles as the task sequence becomes longer. As more and more tasks arrive, the impact of task dissimilarity and distribution shift grows beyond what pseudo experience replay can cope with, leading to limited backward transfer but severe forgetting. In contrast, Adapter+CL and our method remain robust after learning more tasks in a stream. Our method also outperforms Adapter+CL throughout the learning process, demonstrating that we can enable knowledge transfer even when the similarity between tasks is limited.

Case Study We selected e2e in sequence #1 and wiki in sequence #8 as two representative tasks to illustrate the final outputs generated by different approaches in Table 5. After training on the whole sequence, Adapter+LAMOL cannot correctly convey the information provided in the input, suffering from grammar mistakes and missing key points. This could be attributed to interference from learning newly arriving tasks. While Adapter+CL successfully mitigates this problem by parameter isolation, our approach works similarly well with fewer parameters and generates better sequences without redundant information.
E2E NLG (#1): name[Strada], eatType[coffee shop], area[city centre]
Reference There is a coffee shop in the city centre called the Strada.
Adapter+CL Strada serves coffee, is a nice coffee shop, in city centre.
Adapter+LAMOL Strada is a coffee shop serving city centre food
Ours Strada is a coffee shop located in the city centre.
WikiSQL (#8): which team has pick 13 in round 2 ?
Reference select team from table where round = 2 and pick = 13
Adapter+CL select team from table where pick = 13 and round = round 2
Adapter+LAMOL select team from table where round = 2 (missing: and pick = 13)
Ours select team from table where pick = 13 and round = 2
Table 5: Output comparison after training on sequence #1 and #8. We visualized e2e and wiki as two representative
tasks and color redundant information in red, missing information in blue and grammar mistakes in orange.
6.2 The Growth of Compositional Modules
To illustrate the process of adding/reusing modules, we depict the model architecture at each stage in Fig 3, using sequence #4, which is the most challenging sequence containing similar tasks according to Table 2. Since the similarity between the second task (e2e) and the first task (hotel) is low (see Figure 4 in Appendix A), our framework automatically learns to add extra adapter modules in layers {6, 8, 9, 10, 11} before training on the second task. When the third task (rest) arrives, given its high similarity to the first task, our method correctly decides to reuse all modules used in the first task. Interestingly, the architecture for the fourth task is composed of modules shared with the first three tasks in layers {1, 2, 3, 4, 5, 7, 12}, a module shared with the second task in layer 6, a module shared with the first and third tasks in layer 8, and newly added modules in layers {9, 10, 11}. For the fifth task, our method reuses all modules used by the fourth task due to their high similarity. This demonstrates that our method adapts to different incoming tasks and is able to compose modules from different old tasks for new tasks. We also provide a comparison in Appendix B to demonstrate the effect of reusing modules from different transformer layers.

7 Conclusion
This work examined continual sequence generation with adaptive compositional modules, where we proposed hidden state mixing to adaptively compose old and new modules for new tasks and utilized pseudo experience replay to facilitate knowledge transfer. Experiments on various sequence generation tasks demonstrated that our method achieves better performance with higher parameter efficiency than previous state-of-the-art baselines, on both similar and dissimilar task sequences. Our work is also subject to a few limitations, such as the extra training time it introduces. In the future, we plan to investigate how to further speed up the decision stage and to generalize the current framework to more diverse NLP tasks such as text classification and machine translation.

Acknowledgment
We would like to thank the anonymous reviewers for their helpful comments, and the members of the Georgia Tech SALT group for their feedback. This work is funded in part by Salesforce and Cisco.

References
Craig Atkinson, Brendan McCane, Lech Szymanski, and Anthony Robins. 2018. Pseudo-recursal: Solving the catastrophic forgetting problem in deep neural networks.

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization.

Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Inigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gašić. 2018. MultiWOZ: A large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. arXiv preprint arXiv:1810.00278.

Jiaao Chen, Derek Tam, Colin Raffel, Mohit Bansal, and Diyi Yang. 2021. An empirical survey of data augmentation for limited data learning in NLP.

Jiaao Chen, Zichao Yang, and Diyi Yang. 2020. MixText: Linguistically-informed interpolation of hidden space for semi-supervised text classification. In ACL, pages 2147–2157.

Yung-Sung Chuang, Shang-Yu Su, and Yun-Nung Chen. 2020. Lifelong language knowledge distillation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2914–2924.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Beyza Ermis, Giovanni Zappella, Martin Wistuba, and Cedric Archambeau. 2022. Memory efficient continual learning for neural text classification.
Junliang Guo, Zhirui Zhang, Linli Xu, Boxing Chen, and Enhong Chen. 2021. Adaptive adapters: An efficient way to incorporate BERT into neural machine translation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:1740–1751.

Xu Han, Yi Dai, Tianyu Gao, Yankai Lin, Zhiyuan Liu, Peng Li, Maosong Sun, and Jie Zhou. 2020. Continual relation learning via episodic memory activation and reconsolidation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6429–6440, Online. Association for Computational Linguistics.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778.

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.

Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation.
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Jonas Pfeiffer, Aishwarya Kamath, Andreas Rücklé, Kyunghyun Cho, and Iryna Gurevych. 2021. AdapterFusion: Non-destructive task composition for transfer learning. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 487–503, Online. Association for Computational Linguistics.

Jonas Pfeiffer, Andreas Rücklé, Clifton Poth, Aishwarya Kamath, Ivan Vulić, Sebastian Ruder, Kyunghyun Cho, and Iryna Gurevych. 2020. AdapterHub: A framework for adapting transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 46–54, Online. Association for Computational Linguistics.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

Mark B Ring. 1998. Child: A first step towards continual learning. In Learning to learn, pages 261–292. Springer.

Andreas Rücklé, Gregor Geigle, Max Glockner, Tilman Beck, Jonas Pfeiffer, Nils Reimers, and Iryna Gurevych. 2021. AdapterDrop: On the efficiency of adapters in transformers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7930–7946, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. 2016. Progressive neural networks. arXiv preprint arXiv:1606.04671.

Abigail See, Peter J Liu, and Christopher D Manning. 2017. Get to the point: Summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368.

James Smith, Jonathan Balloch, Yen-Chang Hsu, and Zsolt Kira. 2021a. Memory-efficient semi-supervised continual learning: The world is its own replay buffer. arXiv preprint arXiv:2101.09536. Accepted for publication at IJCNN 2021.

James Smith, Yen-Chang Hsu, Jonathan Balloch, Yilin Shen, Hongxia Jin, and Zsolt Kira. 2021b. Always be dreaming: A new approach for data-free class-incremental learning. 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

Fan-Keng Sun, Cheng-Hao Ho, and Hung-Yi Lee. 2019. LAMOL: Language modeling for lifelong language learning. In International Conference on Learning Representations.

Jingyuan Sun, Shaonan Wang, Jiajun Zhang, and Chengqing Zong. 2020. Distill and replay for continual language learning. In Proceedings of the 28th International Conference on Computational Linguistics, pages 3569–3579.

Sebastian Thrun. 1998. Lifelong learning algorithms. In Learning to learn, pages 181–209. Springer.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Hong Wang, Wenhan Xiong, Mo Yu, Xiaoxiao Guo, Shiyu Chang, and William Yang Wang. 2019. Sentence embedding alignment for lifelong relation extraction. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 796–806, Minneapolis, Minnesota. Association for Computational Linguistics.

Zirui Wang, Sanket Vaibhav Mehta, Barnabas Poczos, and Jaime Carbonell. 2020. Efficient meta lifelong-learning with limited memory. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 535–548, Online. Association for Computational Linguistics.

Tsung-Hsien Wen, Milica Gašić, Nikola Mrkšić, Pei-Hao Su, David Vandyke, and Steve Young. 2015. Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1711–1721, Lisbon, Portugal. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Xiuwei Xu, Yifan Wang, Yu Zheng, Yongming Rao, Jie Zhou, and Jiwen Lu. 2022. Back to reality: Weakly-supervised 3D object detection with shape-guided label enhancement.
Dani Yogatama, Cyprien de Masson d'Autume, Jerome Connor, Tomas Kocisky, Mike Chrzanowski, Lingpeng Kong, Angeliki Lazaridou, Wang Ling, Lei Yu, Chris Dyer, et al. 2019. Learning and evaluating general linguistic intelligence. arXiv preprint arXiv:1901.11373.

Jaehong Yoon, Eunho Yang, Jeongtae Lee, and Sung Ju Hwang. 2017. Lifelong learning with dynamically expandable networks. arXiv preprint arXiv:1708.01547.

Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2SQL: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103.
A Supplementary Details and Results

Data and Metric Table 6 summarizes the datasets and metrics we used. All datasets use the public versions from prior work (Sun et al., 2019; Chuang et al., 2020).² Note that some large datasets (WikiSQL, CNN/DailyMail, E2E NLG, RNNLG (laptop)) are reduced to a smaller size by random sampling due to data imbalance.

²Datasets available at https://github.com/chho33/LAMOL and https://github.com/voidism/L2KD.

Dataset          Metric   # Train   # Test
E2E NLG          ROUGE    6000      2000
RNNLG (rest.)    ROUGE    6228      1039
RNNLG (hotel)    ROUGE    6446      1075
RNNLG (tv)       ROUGE    8442      1407
RNNLG (laptop)   ROUGE    7944      2649
WikiSQL          lfEM     6525      15878
CNN/DailyMail    ROUGE    6604      2250
MultiWOZ         dsEM     2536      1646

Table 6: Dataset statistics and metrics. Note that ROUGE refers to the mean of ROUGE-1, ROUGE-2 and ROUGE-L, lfEM stands for exact match of logical forms, and dsEM represents turn-based dialogue state exact match.

Task Sequences In the scenario of CL on dissimilar tasks, each task sequence also contains two or three similar natural language generation tasks, so the model cannot cheat by always adding new modules without detecting reusable ones.

Implementation Details We use GPT-2 (Radford et al., 2019) from HuggingFace Transformers (Wolf et al., 2020) as our backbone. We use the adapter architecture from Houlsby et al. (2019) in AdapterHub (Pfeiffer et al., 2020) with its default setting, in which the reduction factor of the bottleneck architecture is 16. All experiments are conducted on an NVIDIA RTX 2080 Ti with 11GB memory, with a maximum batch size of 4. Training on one task sequence takes 5 to 9 hours.

We use AdamW (Loshchilov and Hutter, 2019) as our optimizer. We select the learning rate from {1e-4, 1.75e-4, 3e-4} and set lr = 1.75e-4 for all tasks except WikiSQL, for which lr = 3e-4. For the decision stage, we train 6 epochs to make decisions. For the training stage, we select the best epoch number from {9, 12, 15}, using 9 for the similar scenario and 12 for the dissimilar scenario. The weight initialization parameter c is selected from {0.03, 0.05, 0.07} for the similar scenario and {0.12, 0.15, 0.17} for the dissimilar scenario. The loss coefficient γ is selected from {0.01, 0.05}, and η is set to 0.25. Following Sun et al. (2019), we use top-k sampling with k = 20 and set the pseudo-data sample rate to 0.2. In our preliminary experiments, increasing the replay frequency can further alleviate forgetting. Thus, for the approaches using pseudo experience replay in this work, we set half of the training batches to pseudo-examples whenever learning a new task.
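To make these settings concrete, here is a short sketch (ours, not the AdapterHub source) of a Houlsby-style bottleneck adapter with reduction factor 16 on GPT-2's hidden size of 768, together with the hyperparameter values and search grids listed above; the class and dictionary names are our own.

import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Houlsby-style adapter: down-project, non-linearity, up-project, residual."""
    def __init__(self, hidden_size=768, reduction_factor=16):
        super().__init__()
        bottleneck = hidden_size // reduction_factor  # 768 // 16 = 48
        self.down = nn.Linear(hidden_size, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, hidden_size)

    def forward(self, hidden_states):
        return hidden_states + self.up(self.act(self.down(hidden_states)))

# Hyperparameters as reported in this appendix (search grids in comments).
HPARAMS = {
    "optimizer": "AdamW",
    "lr": 1.75e-4,             # 3e-4 for WikiSQL; grid {1e-4, 1.75e-4, 3e-4}
    "decision_epochs": 6,
    "train_epochs": 9,         # similar scenario; 12 for dissimilar; grid {9, 12, 15}
    "weight_init_c": 0.05,     # example pick from {0.03, 0.05, 0.07} (similar)
                               # or {0.12, 0.15, 0.17} (dissimilar)
    "loss_coeff_gamma": 0.01,  # example pick from {0.01, 0.05}
    "eta": 0.25,               # weight of the L_gen term
    "top_k": 20,               # sampling for pseudo-data generation
    "pseudo_sample_rate": 0.2,
    "max_batch_size": 4,
}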
Note that the original design of Adapter+CL (Madotto et al., 2021) uses perplexity to distinguish which task each testing example belongs to. In this work, we ignore that part and assume that the task-id of each testing example is given during inference, for all baselines and our approach, to ensure a fair comparison.

Finetuning Results We provide the results of finetuning GPT-2 (Radford et al., 2019) and finetuning adapters (Houlsby et al., 2019) on all eight datasets in Table 7. Since Chuang et al. (2020) show that the generation loss L_gen can slightly increase finetuning performance on certain tasks, we also include the finetuning results after adding the L_gen loss.

Our results confirm that finetuning adapters can almost maintain the performance of finetuning the whole model. We also find that the performance of finetuning adapters can be improved by simply integrating the L_gen loss. This suggests that the performance of Adapter+CL could be naively improved by adding L_gen to the training loss. In that case, the average of mean scores for Adapter+CL could be improved to 64.3 on similar task sequences and
59.6 on dissimilar task sequences, which are still
significantly worse than our approach.
Results using Geometric Mean While the mean of all tasks' performance scores is commonly used (Sun et al., 2019; Mi et al., 2020; Madotto et al., 2021) to represent the overall performance on several tasks, it can be largely influenced by the absolute change of a single number. In this work, we also report the geometric mean as a supplementary metric for the overall performance on different tasks, which provides another perspective by reflecting relative change during comparison.
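As a toy illustration (with made-up scores) of why the geometric mean is informative: a collapse on a single task lowers it far more than it lowers the arithmetic mean, and a zero score leaves it undefined, which is why Table 8 contains "–" entries.

from math import prod

def arithmetic_mean(scores):
    return sum(scores) / len(scores)

def geometric_mean(scores):
    return prod(scores) ** (1 / len(scores))

a = [66, 67, 68, 72, 72]   # balanced performance across five tasks
b = [80, 81, 82, 86, 16]   # same arithmetic mean, but one task collapses

print(arithmetic_mean(a), geometric_mean(a))  # 69.0, about 68.9
print(arithmetic_mean(b), geometric_mean(b))  # 69.0, about 59.3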
Table 8 summarizes the final performance using the geometric mean. We observe the same trend as in Table 2, which demonstrates that our approach improves over the baselines comprehensively on all tasks, rather than merely benefiting from large absolute gains on some tasks.

Ablation Study Table 9 summarizes the full details of the ablation study conducted on sequences #1 and #8.

Detailed Final Performance Table 10 provides the final performance of each task on every sequence for our approach and Adapter+LAMOL. For Adapter+CL, the final results are in Table 7.

Task Similarity Figure 4 illustrates the task similarity between the five natural language generation tasks, calculated as the cosine similarity between each task's word frequency distribution.
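A minimal sketch (ours) of this word-frequency cosine similarity, which is also how task knowledge in generated outputs is quantified in Appendix B (Table 11); whitespace tokenization and the toy strings are illustrative assumptions.

from collections import Counter
from math import sqrt

def word_freq(texts):
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def cosine(p, q):
    vocab = set(p) | set(q)
    dot = sum(p.get(w, 0.0) * q.get(w, 0.0) for w in vocab)
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

# e.g. similarity between two tasks' corpora, or between one task's data and
# the sequences generated by a (modified) architecture:
print(round(cosine(word_freq(["name Strada eatType coffee shop area city centre"]),
                   word_freq(["Strada is a coffee shop in the city centre"])), 3))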
B Module Comparison
In order to demonstrate the compositional nature of our method, that is, that each module contains different knowledge required for solving each task, we also study the performance difference to quantify the effect of reusing different modules.

Method After training on task A, we specify a layer k, k = 1, 2, ..., 12, in which to add a new module for task B. Then we train the model on task B together with pseudo experience replay. After training on task B, we replace the new module with the old module from task A in layer k, and compare the performance on solving task B between the modified architecture and the original architecture. On the one hand, if the newly added module contains specific knowledge of task B, then replacing it will result in the absence of the corresponding features in the generated output. On the other hand, if the old module contains specific knowledge of task A, then using it will result in some features of task A being generated in the output.

Results Here we use laptop as task A and e2e as task B. We quantify the task knowledge contained in the generated output by calculating the cosine similarity of the word frequency distributions between a specific task's data and the generated output. In Table 11, we see that replacing the new module in layer 11 results in the most severe information loss for task B in the modified architecture, suggesting that the module in layer 11 contains the most important word-frequency information for task B. In the same way, we conclude that the module in layer 3 contains the least important word-frequency information for task B. This is consistent with previous findings (Jawahar et al., 2019) that bag-of-word information
Methods                    Finetune   EWC       LAMOL     Adapter+CL  Adapter+Drop  Adapter+LAMOL  Ours
Pseudo Experience Replay   ✗          ✗         ✓         ✗           ✗             ✓              ✓

Similar Tasks
  #1                       40.2       56.0      65.7      63.7        63.4          65.4           65.6
  #2                       35.6       47.9      66.3      63.7        63.4          65.5           65.8
  #3                       50.9       60.8      66.0      63.7        63.4          64.9           65.2
  #4                       43.1       57.7      66.1      63.7        63.4          64.7           65.2

Dissimilar Tasks
  #5                       –          –         54.3      53.7        53.4          47.8           54.6
  #6                       –          24.0      61.6      64.1        63.6          61.2           65.0
  #7                       16.8       36.1      53.4      53.5        52.8          51.3           54.3
  #8                       6.62       34.9      53.2      53.5        52.8          47.5           54.8

Table 8: Summary of final performance using the geometric mean, where "–" denotes that no valid geometric mean exists due to a zero score. We use two random seeds for each task sequence. Note that the final performance of Adapter+CL and Adapter+Drop is not affected by task ordering within the same group of tasks. For each sequence, we mark the best result in bold; LAMOL is not compared because its number of learnable parameters differs by an order of magnitude.
Sequence #1        e2e    rest   hotel  tv     laptop  Avg    Avg L.P.
Ours               51.7   66.7   67.7   72.4   71.9    66.1   2.24M
- Entropy loss     52.1   67.1   67.6   72.3   71.5    66.1   2.54M
- Weight Ini       49.6   64.7   64.8   70.4   71.3    64.2   7.09M
- Pseudo ER        25.6   36.6   39.9   42.8   71.2    43.2   2.08M

Sequence #8        cnn    hotel  wiki   e2e    woz     Avg    Avg L.P.
Ours               27.8   65.3   62.9   51.7   83.3    58.2   6.49M
- Entropy loss     27.8   64.8   62.6   49.8   82.9    57.6   6.49M
- Weight Ini       26.7   64.7   64.6   49.9   82.4    57.7   8.65M
- Pseudo ER        23.5   60.2   61.1   50.7   83.9    55.9   6.34M

Table 9: Ablation study on (i) entropy loss, (ii) weight initialization and (iii) pseudo experience replay. The upper part gives results for sequence #1 and the lower part for sequence #8. Note that "Avg" refers to the mean performance score over all tasks and "Avg L.P." to the mean number of learnable parameters.
Method - #1 e2e rest hotel tv laptop Avg
Adap+LAMOL 51.8 66.5 67.2 72.4 71.5 65.9
Ours 51.7 66.7 67.7 72.4 71.9 66.1
Method - #2 laptop tv hotel rest e2e Avg
Adap+LAMOL 74.8 75.2 65.9 66.0 49.3 66.2
Ours 64.7 74.5 51.5 73.5 49.7 66.5
Method - #3 rest tv e2e laptop hotel Avg
Adap+LAMOL 64.3 74.9 50.0 74.5 64.1 65.6
Ours 64.7 74.5 51.5 73.5 64.8 65.8
Method - #4 hotel e2e rest laptop tv Avg
Adap+LAMOL 66.4 50.9 65.8 73.0 70.0 65.2
Ours 66.4 51.3 66.2 74.2 70.6 65.7
Method - #5 woz cnn e2e rest hotel Avg
Adap+LAMOL 75.8 15.4 51.9 64.3 64.3 54.3
Ours 83.5 26.9 51.5 65.1 64.2 58.2
Method - #6 e2e wiki hotel woz rest Avg
Adap+LAMOL 53.4 47.9 64.6 80.4 64.7 62.2
Ours 50.9 64.3 65.1 84.1 64.8 65.9
Method - #7 hotel e2e woz wiki cnn Avg
Adap+LAMOL 66.0 48.5 77.5 55.4 25.8 54.6
Ours 67.0 50.9 83.5 64.1 25.9 58.3
Method - #8 cnn hotel wiki e2e woz Avg
Adap+LAMOL 16.5 65.2 52.5 51.4 83.4 53.8
Ours 27.8 65.3 62.9 51.7 83.3 58.2
Layer   Task A (O)   Task A (M)   Task B (O)   Task B (M)
1 59.6 72.5 95.1 92.5
2 60.2 72.3 95.0 93.3
3 60.1 71.3 95.1 93.6
4 60.0 70.2 95.1 93.4
5 60.0 68.9 95.2 91.3
6 59.8 72.6 95.1 88.3
7 60.0 71.2 95.0 86.2
8 59.9 72.6 95.0 81.9
9 59.6 76.7 95.0 83.8
10 59.9 74.1 95.2 81.2
11 59.9 74.5 95.0 80.3
12 59.7 75.5 94.9 82.0