Update 11th Dec 2021: see what Andrej Karpathy is saying about transformers
I had been planning to read about the Transformers for a while, but I had to wait until I had a few more days to get my hands on it. This is a seminal concept in NLP and I am going to try to explain it in laymen terms as possible. I was overwhelmed with the level of details in transformers whenever I tried reading about it. So I will give you the building blocks of transformers so that you don’t hesitate to read further.
Its important to understand why Transformers are so popular. Recurrent Neural Networks,long short-term memory and gated RNNs are the popularly approaches used for Sequence Modelling tasks such as machine translation and language modeling. However, RNN/CNN handle sequences word-by-word in a sequential fashion. This sequentiality is an obstacle toward parallelization of the process. Moreover, when such sequences are too long, the model is prone to forgetting the content of distant positions in sequence or mix it with following positions’ content.
Recent works have achieved significant improvements in computational efficiency and model performance through factorization tricks and conditional computation. But they are not enough to eliminate the fundamental constraint of sequential computation. Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance inthe input or output sequences. In all but a few cases, however, such attention mechanisms are used in conjunction with a recurrent network.But, the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and has reached a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.
Now we know why we are using Transformers.
The Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence aligned RNNs or convolution
The main characteristics are:
Non-Sequential: sentences are processed as a whole rather than word by word.
(Masked)Self Attention: this is the newly introduced ‘unit’ used to compute similarity scores between words in a sentence.
Positional encoding: another innovation introduced to replace recurrence. The idea is to use fixed or learned weights that encode information related to a specific position of a token in a sentence.
Layer Normalization: a normalization technique that is used to stabilize the variance of activations in a layer. Neural net layers work best when input vectors have a uniform mean and std in each dimension.
Self-attention is a key component of the TransformersLet us distil how it works.
Self-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.
Say the following sentence is an input sentence we want to translate:
”The animal didn’t cross the street because it was too tired”
What does “it” in this sentence refer to? Is it referring to the street or to the animal? It’s a simple question to a human, but not as simple to an algorithm.
When the model is processing the word “it”, self-attention allows it to associate “it” with “animal”.
As the model processes each word (each position in the input sequence), self attention allows it to look at other positions in the input sequence for clues that can help lead to a better encoding for this word.
If you’re familiar with RNNs, think of how maintaining a hidden state allows an RNN to incorporate its representation of previous words/vectors it has processed with the current one it’s processing. Self-attention is the method the Transformer uses to bake the “understanding” of other relevant words into the one we’re currently processing.
So far, we have understood why transformers are being used and what are main components. lets talk about model architecture.
Here, the encoder maps an input sequence of symbol representations (x1, …, xn) to a sequenceof continuous representations z = (z1, …, zn). Given z, the decoder then generates an outputsequence (y1, …, ym) of symbols one element at a time. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next.The Transformer follows this overall architecture using stacked self-attention and point-wise, fullyconnected layers for both the encoder and decoder, shown in the left and right halves of Figure 1,respectively.
The idea of this post was to explain the what and why of Transformers and how it works. This was the seminal concept in NLP but it is so generic that it is expanding into other domains like vision as well. It allows you cross-pollinate your ideas even while working on completely different domains. I am pretty excited about the future of Transformers. For the longest time, I did not know why we are using transformers and why it outperforms other techniques. Once I understood it, it became go-to technique for NLP tasks. I hope it helped you to understand the concept and how it can be used in your own projects.
I highly recommend you to read the below sources as I have skipped a lot of details not to scare you away from transformers like I was scared whenever I tried to understand transformers.
[Attention all you need-*The seminal paper on transformers*](https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf)
[Transformers- Highly recommended for deep understanding](https://jalammar.github.io/illustrated-transformer/)
[Transformers vs RNN/LSTM](https://ai.stackexchange.com/questions/20075/why-does-the-transformer-do-better-than-rnn-and-lstm-in-long-range-context-depen)