Author(s): Henry Yost (henry-AY), Jessy Garcia (jgarc826)
A Generative Pre-trained Transformer (GPT) is a type of artificial intelligence model that understands and generates human-like text. We will be using PyTorch's torch.nn (neural network) library, which provides the building blocks of the transformer architecture. The goal of jhGPT is to output linguistic text comparable to human writing; ultimately, we want the model to produce text indistinguishable from a human's. The model will support a range of languages, starting with English and expanding to other languages later. The majority and basis of the architecture come from Andrej Karpathy's nanoGPT GitHub repo; however, all analyses and text files are independent and licensed separately.
The picture above is the transformer architecture as described and depicted in Attention Is All You Need. In essence, a transformer is a type of artificial intelligence model that learns and analyzes patterns in large amounts of data to generate new output. Transformers are the current cutting-edge natural language processing (NLP) model, relying on a different kind of encoder-decoder architecture: previous encoder-decoder architectures relied mainly on Recurrent Neural Networks (RNNs), whereas Transformers remove recurrence entirely.
The figure below (the left half of the transformer) is the Encoder.
It is important to note that the embedding process happens only in the bottom-most encoder, not in every encoder. The encoder begins by converting the input tokens--words, subwords, or characters--into vectors using an embedding layer. These embeddings capture the semantic meaning of the tokens and represent them as numerical vectors.
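As a minimal sketch (the hyperparameter values here are illustrative, not jhGPT's actual configuration), torch.nn's embedding layer maps integer token IDs to dense vectors:

```python
import torch
import torch.nn as nn

vocab_size = 65   # e.g., number of unique characters in the training text (illustrative)
d_model = 384     # embedding dimension (illustrative)

embedding = nn.Embedding(vocab_size, d_model)

# A batch containing one sequence of 5 token IDs (e.g., character indices)
tokens = torch.tensor([[1, 5, 8, 3, 0]])
vectors = embedding(tokens)   # shape: (1, 5, 384) -- one vector per token
```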
Because Transformers lack a recurrence mechanism like that of RNNs, a mathematical approach must be applied to introduce position-specific patterns to each token in a sequence. This process is called 'positional encoding', in which a combination of sine and cosine functions is used to create a positional vector.
PE\left(pos,\ 2i\right) = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)
PE\left(pos,\ 2i+1\right) = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)
The equations and process of positional encoding will be further detailed and explored in Fundamentals of jh-GPT - A Deep-Dive into a Transformer-Based Language Model.
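As a sketch of how these formulas are commonly implemented (the dimensions are illustrative), the full positional matrix can be precomputed once and added to the token embeddings:

```python
import math
import torch

def positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Build the (max_len, d_model) sinusoidal positional encoding matrix."""
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)  # (max_len, 1)
    # 1 / 10000^(2i / d_model) for each even dimension index 2i
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions: sine
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions: cosine
    return pe

pe = positional_encoding(max_len=1024, d_model=384)  # added to embeddings before the first layer
```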
The encoder utilizes a specialized attention mechanism known as self-attention. Self-attention is how the model relates each element of the input to every other element. This step differs between models, as some are token-, word-, or character-based (jhGPT is character-based).
This mechanism allows the encoder to concentrate on various parts of the input sequence while processing each token. Attention scores are calculated based on a query, key, and value (QKV) concept. QKV is analogous to a basic retrieval system, like the ones behind many of the websites you use daily (see the sketch after this list).
- Query: A vector that represents a token from the input sequence in the attention mechanism.
- Key: A vector in the attention mechanism that corresponds to each token in the input sequence.
- Value: A vector associated with each key. The final output is a weighted combination of the values, where the value whose key scores the highest attention against the query contributes the most.
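A bare-bones sketch of scaled dot-product attention over Q, K, and V (tensor shapes and names are illustrative, not jhGPT's exact code):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """q, k, v: (batch, seq_len, d_k) tensors."""
    d_k = q.size(-1)
    # Attention scores: how strongly each query position attends to each key position
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (batch, seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)                # each row sums to 1
    return weights @ v                                 # values weighted by attention
```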
Fundamentals of jh-GPT - A Deep-Dive into a Transformer-Based Language Model will cover the self-attention mechanism in significantly more detail.
The final encoder layer outputs a set of vectors, each representing a deep contextual understanding of the input sequence. These output vectors become the input to the decoder in a Transformer model. The encoding process 'paves the path' for the decoder to produce output based on the words, tokens, or characters with the highest attention. Moreover, a unique characteristic of the encoder is that you can stack N encoder layers. Each layer is, in a sense, an independent neural network that can explore and learn different facets of attention, resulting in a significantly more diverse conclusion.
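In PyTorch, stacking N identical layers is straightforward with torch.nn's built-in modules (a sketch; the sizes are illustrative and jhGPT's own blocks may differ):

```python
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(d_model=384, nhead=6, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)  # N = 6 stacked layers
```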
The figure below (the right half of the transformer) is the Decoder.
The decoder in a Transformer model is responsible for generating text sequences and consists of sub-layers similar to the encoder's, including two multi-headed attention layers, a position-wise feed-forward layer, residual connections, and layer normalization. Each multi-headed attention layer has a distinct function, and the decoding process concludes with a linear layer and softmax function to determine word probabilities.
Operating in an autoregressive manner, the decoder begins with a start token and utilizes previously generated outputs along with rich contextual information from the encoder. This decoding process continues until it produces a token that signifies the end of output generation.
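Putting those sub-layers together, a single decoder block might look like the sketch below (names and sizes are illustrative; enc_out stands for the encoder's output, consumed by the cross-attention layer):

```python
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model=384, n_heads=6):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, enc_out, causal_mask):
        # Masked self-attention: each position may only attend to earlier positions
        attn, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + attn)                   # residual connection + layer norm
        # Cross-attention: queries from the decoder, keys/values from the encoder
        attn, _ = self.cross_attn(x, enc_out, enc_out)
        x = self.norm2(x + attn)
        return self.norm3(x + self.ff(x))          # position-wise feed-forward sub-layer
```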
The beginning of the decoder's process closely resembles the encoder's: the input is first processed through an embedding layer.
After the embedding stage, the input is processed through a positional encoding layer, which generates positional embeddings. These embeddings are then directed into the first multi-head attention layer of the decoder, where attention scores specific to the decoder's input are calculated.
This process resembles the self-attention mechanism in the encoder, but with an important distinction: it restricts positions from attending to future positions (masked self-attention). As a result, each word in the sequence remains uninfluenced by future tokens.
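This restriction is typically implemented as a causal (look-ahead) mask that blanks out every attention score above the diagonal before the softmax, for example:

```python
import torch

seq_len = 5
# True above the diagonal marks positions a token is NOT allowed to attend to
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

scores = torch.randn(seq_len, seq_len)                   # raw attention scores
scores = scores.masked_fill(causal_mask, float('-inf'))  # -inf becomes 0 after softmax
```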
The steps of the linear classifier and softmax will be covered significantly more in-depth in Fundamentals of jh-GPT - A Deep-Dive into a Transformer-Based Language Model.
The output from the final layer is converted into a predicted sequence using a linear layer followed by a softmax function to produce probabilities for each word in the vocabulary.
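As a sketch (the vocabulary size and model width are illustrative), this final step is a single linear projection followed by a softmax:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, vocab_size = 384, 65
lm_head = nn.Linear(d_model, vocab_size)

hidden = torch.randn(1, 10, d_model)   # decoder output for 10 positions
logits = lm_head(hidden)               # (1, 10, vocab_size)
probs = F.softmax(logits, dim=-1)      # probability distribution over the vocabulary
next_token = probs[0, -1].argmax()     # greedy choice for the next token
```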
During operation, the decoder appends the newly generated output to its existing input and continues the decoding process. At each step, the token with the highest probability is chosen as the output, and this iterative cycle continues until the model produces the specific token that marks the end of the sequence, commonly called the end token.
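A minimal greedy decoding loop, assuming a hypothetical model that maps token IDs to logits (jhGPT's actual generation code may differ), could look like:

```python
import torch

def generate(model, idx, max_new_tokens, end_token_id):
    """idx: (1, t) tensor of token IDs, starting with the start token."""
    for _ in range(max_new_tokens):
        logits = model(idx)                    # (1, t, vocab_size) -- hypothetical model
        next_id = logits[0, -1].argmax()       # greedy: pick the highest-probability token
        idx = torch.cat([idx, next_id.view(1, 1)], dim=1)  # append and feed back in
        if next_id.item() == end_token_id:     # stop once the end token appears
            break
    return idx
```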
References: nanoGPT (Andrej Karpathy, https://github.com/karpathy/nanoGPT); "Attention Is All You Need" (Vaswani et al., 2017).