NLP tasks and transformer architectures
The original transformer architecture was an encoder-decoder model that achieved state-of-the-art performance on machine translation and English constituency parsing. The authors concluded the paper by suggesting applications to other NLP tasks and even other modalities, such as audio, images, and video. This is indeed what happened: the work opened up deep learning to a plethora of transformer-based architectures of varying sizes and capabilities.
Another key work worth mentioning is Universal Language Model Fine-tuning for Text Classification, or ULMFiT, by Howard and Ruder3. While this paper is based on a recurrent architecture and came out after the original transformer paper, it showcased and popularised the concepts of language model pretraining and transfer learning to downstream tasks. In essence, this work provided a three-step recipe for training NLP models: pretraining on a large corpus (to build an initial understanding...