Introduction
The Transformer is a deep learning architecture introduced in the paper “Attention is All You Need” by Vaswani et al., published in 2017.
The Transformer is based on the self-attention mechanism, which allows it to capture long-range dependencies in sequences more effectively than traditional recurrent neural networks (RNNs) or convolutional neural networks (CNNs). The key components of the Transformer are: the self-attention mechanism, the encoder-decoder architecture, multi-head attention, positional encoding, and feed-forward neural networks.
Self-Attention Mechanism
The self-attention mechanism allows the model to weigh the importance of different words in a sentence while encoding the sequence. It computes the attention scores for each word in the input sequence based on its relationships with other words. By attending to relevant words, the model can focus on the most informative parts of the sequence.
The first step in calculating self-attention is to create three vectors from each of the encoder’s input vectors (in this case, the embedding of each word). So for each word, we create a Query vector, a Key vector, and a Value vector. These vectors are created by multiplying the embedding by three weight matrices that are learned during training.
The second step in calculating self-attention is to calculate a score. Say we’re calculating the self-attention for the first word of an example sentence, “Thinking”. We need to score each word of the input sentence against this word. The score determines how much focus to place on other parts of the input sentence as we encode a word at a certain position.
The score is calculated by taking the dot product of the query vector with the key vector of the respective word we’re scoring. So if we’re processing the self-attention for the word in position #1, the first score would be the dot product of q1 and k1. The second score would be the dot product of q1 and k2.
The third and fourth steps are to divide the scores by 8 (the square root of the dimension of the key vectors used in the paper, 64; this scaling leads to more stable gradients, and other values are possible, but this is the default) and then pass the result through a softmax operation. Softmax normalizes the scores so they’re all positive and add up to 1.
The fifth step is to multiply each value vector by the softmax score.
The sixth step is to sum up the weighted value vectors. This produces the output of the self-attention layer at this position (for the first word).
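The six steps above can be sketched compactly. The following is a minimal NumPy illustration for a single attention head; the matrix sizes and random toy inputs are illustrative, not the dimensions used in the paper.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """X: (seq_len, d_model) word embeddings; W_*: learned projection matrices."""
    Q = X @ W_q                      # step 1: query vectors
    K = X @ W_k                      #         key vectors
    V = X @ W_v                      #         value vectors
    d_k = K.shape[-1]
    scores = Q @ K.T                 # step 2: dot-product scores (q_i . k_j)
    scores = scores / np.sqrt(d_k)   # step 3: scale by sqrt(d_k) (8 when d_k = 64)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # step 4: softmax
    return weights @ V               # steps 5-6: weighted sum of value vectors

# toy example: 2 "words" with d_model = 4 and d_k = d_v = 3
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 4))
Z = self_attention(X, rng.normal(size=(4, 3)), rng.normal(size=(4, 3)), rng.normal(size=(4, 3)))
print(Z.shape)  # (2, 3): one output vector per input position
```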
Multi-Head Attention
To capture different types of dependencies and relationships, the Transformer uses multi-head attention. It performs self-attention multiple times with different learned projection matrices, allowing the model to attend to various aspects of the input.
With multi-headed attention we have not only one, but multiple sets of Query/Key/Value weight matrices (the Transformer uses eight attention heads, so we end up with eight sets for each encoder/decoder). Each of these sets is randomly initialized. Then, after training, each set is used to project the input embeddings (or vectors from lower encoders/decoders) into a different representation subspace. If we do the same self-attention calculation we outlined above, just eight different times with different weight matrices, we end up with eight different Z matrices. We concatenate the matrices and then multiply them by an additional weight matrix WO to condense these eight down into a single matrix.
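A minimal sketch of that concatenate-and-project step, reusing the `self_attention` function from the sketch above; the head count matches the paper, but the toy dimensions are made up for illustration.

```python
import numpy as np

def multi_head_attention(X, heads, W_o):
    """heads: list of (W_q, W_k, W_v) tuples, one per attention head."""
    Z_per_head = [self_attention(X, W_q, W_k, W_v) for W_q, W_k, W_v in heads]
    Z_concat = np.concatenate(Z_per_head, axis=-1)   # concatenate the eight Z matrices
    return Z_concat @ W_o                            # condense back down to a single matrix

rng = np.random.default_rng(1)
d_model, d_k, n_heads = 4, 3, 8
X = rng.normal(size=(2, d_model))
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3)) for _ in range(n_heads)]
W_o = rng.normal(size=(n_heads * d_k, d_model))
print(multi_head_attention(X, heads, W_o).shape)  # (2, 4)
```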
Sequence-to-Sequence Model
A sequence-to-sequence model is a model that takes a sequence of items (words, letters, features of an image, etc.) and outputs another sequence of items. The model is composed of an encoder and a decoder. The encoder processes each item in the input sequence and compiles the information it captures into a vector (called the context). After processing the entire input sequence, the encoder sends the context over to the decoder, which begins producing the output sequence item by item. The context is a vector (an array of numbers, basically) in the case of machine translation. The encoder and decoder tend to both be recurrent neural networks. By design, an RNN takes two inputs at each time step: an input (in the case of the encoder, one word from the input sentence) and a hidden state. The word, however, needs to be represented by a vector. To transform a word into a vector, we turn to the class of methods called “word embedding” algorithms.

The context vector turned out to be a bottleneck for these types of models: it made it challenging for the models to deal with long sentences. A solution was proposed in Bahdanau et al., 2014 and Luong et al., 2015. These papers introduced and refined a technique called “attention”, which allows the model to focus on the relevant parts of the input sequence as needed.
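A minimal sketch of the classic encoder, assuming a toy tanh RNN: the whole input sequence is folded into a single context vector, which is all the decoder gets to see. Names and sizes are illustrative.

```python
import numpy as np

def rnn_encoder(embeddings, W_xh, W_hh):
    """Run a simple tanh RNN over the input and return the final hidden state (the context)."""
    h = np.zeros(W_hh.shape[0])
    for x in embeddings:                    # one word embedding per time step
        h = np.tanh(W_xh @ x + W_hh @ h)    # new hidden state from current input + previous state
    return h                                # the context vector handed to the decoder

rng = np.random.default_rng(2)
sentence = rng.normal(size=(5, 8))          # 5 words, embedding size 8
context = rnn_encoder(sentence, rng.normal(size=(16, 8)), rng.normal(size=(16, 16)))
print(context.shape)  # (16,) -- everything the decoder sees, hence the bottleneck
```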
Attention
An attention model differs from a classic sequence-to-sequence model in two main ways. First, the encoder passes a lot more data to the decoder: instead of passing only the last hidden state of the encoding stage, the encoder passes all of the hidden states to the decoder. Second, an attention decoder does an extra step before producing its output. In order to focus on the parts of the input that are relevant to this decoding time step, the decoder looks at the set of encoder hidden states it received, gives each hidden state a score, and multiplies each hidden state by its softmaxed score, thus amplifying hidden states with high scores and drowning out hidden states with low scores.
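A sketch of that extra step at a single decoding time step: score every encoder hidden state, softmax the scores, and take the weighted sum. Dot-product scoring is used here purely for simplicity; Bahdanau and Luong define their own (learned) scoring functions.

```python
import numpy as np

def attention_context(decoder_hidden, encoder_hiddens):
    scores = encoder_hiddens @ decoder_hidden            # one score per encoder hidden state
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()                    # softmaxed scores
    return weights @ encoder_hiddens                     # amplified / drowned-out mix of states

rng = np.random.default_rng(3)
encoder_hiddens = rng.normal(size=(5, 16))   # all hidden states, one per input word
decoder_hidden = rng.normal(size=16)
print(attention_context(decoder_hidden, encoder_hiddens).shape)  # (16,)
```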
Encoder-Decoder Architecture
The Transformer architecture consists of two main components: the encoder and the decoder. The encoder takes an input sequence and processes it, while the decoder generates an output sequence based on the encoded representation.
One detail in the architecture of the encoder that we need to mention before moving on is that each sub-layer (self-attention, feed-forward neural network) in each encoder has a residual connection around it and is followed by a layer-normalization step.
The encoder starts by processing the input sequence. The output of the top encoder is then transformed into a set of attention vectors K and V. These are used by each decoder in its “encoder-decoder attention” layer, which helps the decoder focus on appropriate places in the input sequence. The following steps repeat the process until a special symbol is reached, indicating that the transformer decoder has completed its output. The output of each step is fed to the bottom decoder in the next time step, and the decoders bubble up their decoding results just like the encoders did. And just like we did with the encoder inputs, we embed and add positional encoding to those decoder inputs to indicate the position of each word.
In the decoder, the self-attention layer is only allowed to attend to earlier positions in the output sequence. This is done by masking future positions (setting them to -inf) before the softmax step in the self-attention calculation.
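A small sketch of that causal mask: positions above the diagonal of the score matrix are set to -inf before the softmax, so each position can only attend to itself and earlier positions. The score matrix here is random toy data standing in for Q·Kᵀ.

```python
import numpy as np

seq_len = 4
scores = np.random.default_rng(4).normal(size=(seq_len, seq_len))   # raw Q.K^T scores
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)         # True above the diagonal
scores[mask] = -np.inf                                                # hide future positions

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)               # softmax per row
print(np.round(weights, 2))  # row i has zero weight on positions j > i
```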
The “Encoder-Decoder Attention” layer works just like multi-headed self-attention, except it creates its Queries matrix from the layer below it and takes the Keys and Values matrices from the output of the encoder stack.
The decoder stack outputs a vector of floats. How do we turn that into a word? That’s the job of the final Linear layer, which is followed by a Softmax layer. The Linear layer is a simple fully connected neural network that projects the vector produced by the stack of decoders into a much, much larger vector called a logits vector. The softmax layer then turns those scores into probabilities (all positive, all adding up to 1.0). The cell with the highest probability is chosen, and the word associated with it is produced as the output for this time step.
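A sketch of that final Linear + Softmax step: project the decoder output to vocabulary-sized logits, softmax them, and emit the highest-probability word. The tiny vocabulary and the random weights are made up for illustration.

```python
import numpy as np

vocab = ["I", "am", "a", "student", "<eos>"]
rng = np.random.default_rng(5)
decoder_output = rng.normal(size=16)              # vector of floats from the decoder stack
W_linear = rng.normal(size=(16, len(vocab)))      # fully connected projection to logits

logits = decoder_output @ W_linear                # the "logits vector"
probs = np.exp(logits - logits.max())
probs = probs / probs.sum()                       # all positive, sums to 1.0
print(vocab[int(np.argmax(probs))])               # word with the highest probability
```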
Layer Normalization
In traditional normalization techniques like Batch Normalization, the activations of a layer are normalized by computing the mean and variance over a batch of examples. This normalization helps stabilize and accelerate the training process, especially for deeper networks. However, it introduces a dependency on the batch size during training, which can be problematic in scenarios where batch sizes vary or during inference when processing individual samples.
Layer Normalization addresses this dependency by computing the mean and variance across all the units within a single layer for each training example. This means that normalization is done independently for each sample and does not rely on batch statistics.
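A minimal sketch of this per-example normalization: mean and variance are computed across the features of each individual row, so the result does not depend on the batch at all. The learnable scale and shift parameters are omitted for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """x: (batch, features). Statistics are taken per row (per example)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.array([[1.0, 2.0, 3.0],
              [10.0, 20.0, 30.0]])
print(layer_norm(x))  # each row is normalized independently of the other
```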
Positional Encoding
Since Transformers do not inherently have positional information like RNNs, positional encodings are added to the input embeddings. These positional encodings provide the model with information about the order of the elements in the input sequence, or the distance between different words in the sequence.
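A sketch of one common choice, the sinusoidal encoding from the original paper, which is simply added element-wise to the input embeddings; the dimensions below are toy values.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                         # positions 0..seq_len-1
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)   # one frequency per dimension pair
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])                      # even dimensions: sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])                      # odd dimensions: cosine
    return pe

embeddings = np.random.default_rng(6).normal(size=(10, 8))
inputs = embeddings + positional_encoding(10, 8)              # inject order information
```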
Training
Loss Function
Cross Entropy
The cross-entropy loss calculates the negative log-likelihood of the true class’s predicted probability.
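A worked toy example of that definition for a single prediction:

```python
import numpy as np

predicted_probs = np.array([0.1, 0.7, 0.2])   # model output after softmax
true_class = 1                                 # index of the correct class
loss = -np.log(predicted_probs[true_class])    # negative log-likelihood of the true class
print(loss)  # ~0.357; a perfect prediction (probability 1.0) would give loss 0
```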
Kullback–Leibler Divergence
Kullback-Leibler (KL) divergence, also known as relative entropy, is a measure of how one probability distribution diverges from another. KL divergence measures the average amount of information lost when using Q to approximate P. It is not symmetric.
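A toy sketch of D(P || Q) over discrete distributions (measured in nats here, since natural log is used); computing it in both directions shows the asymmetry.

```python
import numpy as np

def kl_divergence(p, q):
    """D(P || Q) = sum_i p_i * log(p_i / q_i) for discrete distributions p, q."""
    return np.sum(p * np.log(p / q))

p = np.array([0.5, 0.4, 0.1])
q = np.array([0.3, 0.3, 0.4])
print(kl_divergence(p, q), kl_divergence(q, p))  # two different values: not symmetric
```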
Decoding
Greedy Decoding
In greedy decoding, at each step of sequence generation, the model selects the most likely output token based on its predicted probability distribution. It chooses the token with the highest probability without considering the impact on future decisions. This means that the model makes locally optimal choices at each step without considering the global context of the entire sequence. For example, in machine translation, a model using greedy decoding will predict each target word one at a time, selecting the word with the highest probability given the source sentence and previously generated words. The process continues iteratively until an end-of-sentence token is generated.
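A minimal sketch of that loop. The `model_step` function and the tiny vocabulary are hypothetical stand-ins for a real trained model's next-token distribution.

```python
import numpy as np

vocab = ["<bos>", "the", "cat", "sat", "<eos>"]

def model_step(tokens, rng):
    """Hypothetical stand-in: returns a probability distribution over the vocabulary."""
    logits = rng.normal(size=len(vocab))
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

def greedy_decode(max_len=10, seed=7):
    rng = np.random.default_rng(seed)
    tokens = ["<bos>"]
    for _ in range(max_len):
        probs = model_step(tokens, rng)
        next_token = vocab[int(np.argmax(probs))]   # locally optimal choice at this step
        tokens.append(next_token)
        if next_token == "<eos>":                   # stop at the end-of-sentence token
            break
    return tokens

print(greedy_decode())
```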
Beam Search
In beam search, instead of selecting only the most likely token at each step, the algorithm maintains a fixed-size list, known as the “beam,” containing the most promising candidate sequences. The beam size determines how many candidate sequences are considered at each decoding step.
At the beginning of the decoding process, the beam is initialized with a single token representing the start of the sequence. At each step, the model generates the probabilities for the next possible tokens and expands the beam with the top-k most likely candidate sequences based on their cumulative probabilities. The k represents the beam size, and higher values of k result in a more diverse exploration of possibilities.
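A minimal sketch of that expansion loop with beam size k, ranking candidates by cumulative log-probability; `step_probs` and the tiny vocabulary are hypothetical stand-ins for a real model.

```python
import numpy as np

vocab = ["<bos>", "the", "cat", "sat", "<eos>"]
rng = np.random.default_rng(8)

def step_probs(tokens):
    """Hypothetical stand-in: returns a probability distribution over the vocabulary."""
    logits = rng.normal(size=len(vocab))
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

def beam_search(k=3, max_len=5):
    beam = [(["<bos>"], 0.0)]                          # (sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beam:
            if tokens[-1] == "<eos>":                  # finished sequences carry over unchanged
                candidates.append((tokens, score))
                continue
            for i, p in enumerate(step_probs(tokens)): # expand with every possible next token
                candidates.append((tokens + [vocab[i]], score + np.log(p)))
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]  # keep the top k
    return beam

for tokens, score in beam_search():
    print(round(score, 3), tokens)
```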
References
http://fancyerii.github.io/2019/03/09/transformer-illustrated/#%E6%A6%82%E8%BF%B0
https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/
http://jalammar.github.io/illustrated-transformer/
https://colah.github.io/posts/2015-09-Visual-Information/
https://www.countbayesie.com/blog/2017/5/9/kullback-leibler-divergence-explained