Transformers
The culmination of concepts such as attention, contextual embeddings, and recurrence-free architectures led to what we now call transformer architectures. The transformer architecture was presented in the seminal paper Attention is All You Need by Vaswani et al. back in 2017. This work marked a paradigm shift in the NLP space: it presented not just a powerful architecture but also a clever combination of recently developed concepts, allowing it to beat state-of-the-art models by a considerable margin across different benchmarks.
At its core, a transformer is a recurrence- and convolution-free, attention-based encoder-decoder architecture. It relies solely on the attention mechanism (hence the title) to learn both local and global dependencies, which enables massive parallelization along with other performance improvements on a range of NLP tasks.
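To make the core idea concrete, here is a minimal sketch of scaled dot-product attention, the building block the transformer relies on in place of recurrence or convolution. The function name, array shapes, and toy inputs below are illustrative choices, not code from the original paper:

```python
import numpy as np

def scaled_dot_product_attention(queries, keys, values):
    """Illustrative sketch: queries (n_q, d_k), keys (n_k, d_k), values (n_k, d_v)."""
    d_k = queries.shape[-1]
    # Similarity score between every query and every key, scaled by sqrt(d_k)
    scores = queries @ keys.T / np.sqrt(d_k)
    # Softmax over the key dimension turns scores into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output position is a weighted average of the value vectors
    return weights @ values, weights

# Toy example: 3 query positions attending over 4 key/value positions
q = np.random.randn(3, 8)
k = np.random.randn(4, 8)
v = np.random.randn(4, 16)
output, attn = scaled_dot_product_attention(q, k, v)
print(output.shape, attn.shape)  # (3, 16) (3, 4)
```

Because every query attends to every key in a single matrix multiplication, all positions in a sequence can be processed in parallel, which is exactly what removing recurrence buys the architecture.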
Overall architecture
Unlike earlier encoder-decoder architectures in the NLP domain, this work presented...