Author(s): Henry Yost (henry-AY), Jessy Garcia (jgarc826)
A Generative Pre-trained Transformer (GPT) is a type of artificial intelligence model that understands and generates human-like text. We will be using PyTorch's torch.nn (neural network) library, which provides the building blocks of the transformer architecture. The goal of jhGPT is to output linguistic text comparable to human writing; ultimately, we want the model to produce text indistinguishable from a human's. The model will support a range of languages, starting with English and expanding to other languages later. The majority and basis of the architecture come from Andrej Karpathy's nanoGPT GitHub repo; however, all analyses and text files are independent and licensed separately.
The picture above is the transformer architecture as described and depicted in Attention Is All You Need. In essence, a transformer is a type of artificial intelligence model that learns and analyzes patterns in large amounts of data to generate new output. Transformers are the current cutting-edge natural language processing (NLP) model, relying on a different kind of encoder-decoder architecture: previous encoder-decoder architectures relied mainly on Recurrent Neural Networks (RNNs), whereas Transformers remove recurrence entirely.
The figure below (the left half of the transformer) is the Encoder.
It is important to note that the embedding process happens only in the bottom-most encoder, not in every encoder. The encoder begins by converting the input tokens--words, subwords, or characters--into vectors using an embedding layer. These embeddings capture the semantic meaning of the tokens and represent them as numerical vectors.
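As a minimal sketch (the hyperparameter values here are illustrative, not jhGPT's actual configuration), torch.nn's embedding layer maps integer token IDs to dense vectors:

```python
import torch
import torch.nn as nn

vocab_size = 65   # e.g., number of unique characters in the training text (illustrative)
d_model = 384     # embedding dimension (illustrative)

embedding = nn.Embedding(vocab_size, d_model)

# A batch containing one sequence of 5 token IDs (e.g., character indices)
tokens = torch.tensor([[1, 5, 8, 3, 0]])
vectors = embedding(tokens)   # shape: (1, 5, 384) -- one vector per token
```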
Because Transformers lack a recurrence mechanism like that of RNNs, a mathematical approach must be applied to introduce position-specific patterns to each token in a sequence. This process is called 'positional encoding', in which a combination of sine and cosine functions is used to create a positional vector.
PE\left(pos,\ 2i\right) = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)
PE\left(pos,\ 2i+1\right) = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)
The equations and process of positional encoding will be further detailed and explored in Fundamentals of jh-GPT - A Deep-Dive into a Transformer-Based Language Model.
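As a sketch of how these formulas are commonly implemented (the dimensions are illustrative), the full positional matrix can be precomputed once and added to the token embeddings:

```python
import math
import torch

def positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Build the (max_len, d_model) sinusoidal positional encoding matrix."""
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)  # (max_len, 1)
    # 1 / 10000^(2i / d_model) for each even dimension index 2i
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions: sine
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions: cosine
    return pe

pe = positional_encoding(max_len=1024, d_model=384)  # added to embeddings before the first layer
```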
The encoder utilizes a specialized attention mechanism known as self-attention. Self-attention is how the model relates each element of the input to every other element. This step differs between models, as some are token-, word-, or character-based (jhGPT is character-based).
This mechanism allows the encoder to concentrate on various parts of the input sequence while processing each token. Attention scores are calculated based on a query, key, and value (QKV) concept. QKV is analogous to a basic retrieval system, like the ones behind many of the websites you use daily (see the sketch after this list).
- Query: A vector that represents a token from the input sequence in the attention mechanism.
- Key: A vector in the attention mechanism that corresponds to each token in the input sequence.
- Value: A vector associated with each key. The final output is a weighted combination of the values, where the value whose key scores the highest attention against the query contributes the most.
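A bare-bones sketch of scaled dot-product attention over Q, K, and V (tensor shapes and names are illustrative, not jhGPT's exact code):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """q, k, v: (batch, seq_len, d_k) tensors."""
    d_k = q.size(-1)
    # Attention scores: how strongly each query position attends to each key position
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (batch, seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)                # each row sums to 1
    return weights @ v                                 # values weighted by attention
```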
Fundamentals of jh-GPT - A Deep-Dive into a Transformer-Based Language Model will cover the self-attention mechanism in significantly more detail.
The final encoder layer outputs a set of vectors, each representing a deep contextual understanding of the input sequence. These output vectors become the input to the decoder in a Transformer model. The encoding process 'paves the path' for the decoder to produce output based on the words, tokens, or characters with the highest attention. Moreover, a unique characteristic of the encoder is that you can stack N encoder layers. Each layer is, in a sense, an independent neural network that can explore and learn different facets of attention, resulting in a significantly more diverse conclusion.
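In PyTorch, stacking N identical layers is straightforward with torch.nn's built-in modules (a sketch; the sizes are illustrative and jhGPT's own blocks may differ):

```python
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(d_model=384, nhead=6, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)  # N = 6 stacked layers
```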
The figure below (the right half of the transformer) is the Decoder.
The decoder in a Transformer model is responsible for generating text sequences and consists of sub-layers similar to the encoder's, including two multi-headed attention layers, a position-wise feed-forward layer, residual connections, and layer normalization. Each multi-headed attention layer has a distinct function, and the decoding process concludes with a linear layer and softmax function to determine word probabilities.
Operating in an autoregressive manner, the decoder begins with a start token and utilizes previously generated outputs along with rich contextual information from the encoder. This decoding process continues until it produces a token that signifies the end of output generation.
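Putting those sub-layers together, a single decoder block might look like the sketch below (names and sizes are illustrative; enc_out stands for the encoder's output, consumed by the cross-attention layer):

```python
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model=384, n_heads=6):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, enc_out, causal_mask):
        # Masked self-attention: each position may only attend to earlier positions
        attn, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + attn)                   # residual connection + layer norm
        # Cross-attention: queries from the decoder, keys/values from the encoder
        attn, _ = self.cross_attn(x, enc_out, enc_out)
        x = self.norm2(x + attn)
        return self.norm3(x + self.ff(x))          # position-wise feed-forward sub-layer
```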
The beginning of the decoder's process closely resembles the encoder's: the input is first processed through an embedding layer.
After the embedding stage, the input is processed through a positional encoding layer, which generates positional embeddings. These embeddings are then directed into the first multi-head attention layer of the decoder, where attention scores specific to the decoder's input are calculated.
This process resembles the self-attention mechanism in the encoder, but with an important distinction: it restricts positions from attending to future positions (masked self-attention). As a result, each word in the sequence remains uninfluenced by future tokens.
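This restriction is typically implemented as a causal (look-ahead) mask that blanks out every attention score above the diagonal before the softmax, for example:

```python
import torch

seq_len = 5
# True above the diagonal marks positions a token is NOT allowed to attend to
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

scores = torch.randn(seq_len, seq_len)                   # raw attention scores
scores = scores.masked_fill(causal_mask, float('-inf'))  # -inf becomes 0 after softmax
```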
The steps of the linear classifier and softmax will be covered significantly more in-depth in Fundamentals of jh-GPT - A Deep-Dive into a Transformer-Based Language Model.
The output from the final layer is converted into a predicted sequence using a linear layer followed by a softmax function to produce probabilities for each word in the vocabulary.
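As a sketch (the vocabulary size and model width are illustrative), this final step is a single linear projection followed by a softmax:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, vocab_size = 384, 65
lm_head = nn.Linear(d_model, vocab_size)

hidden = torch.randn(1, 10, d_model)   # decoder output for 10 positions
logits = lm_head(hidden)               # (1, 10, vocab_size)
probs = F.softmax(logits, dim=-1)      # probability distribution over the vocabulary
next_token = probs[0, -1].argmax()     # greedy choice for the next token
```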
During operation, the decoder appends the newly generated output to its existing input and continues the decoding process. At each step, the token with the highest probability is chosen as the output, and this iterative cycle continues until the model produces the specific token that marks the end of the sequence, commonly called the end token.
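A minimal greedy decoding loop, assuming a hypothetical model that maps token IDs to logits (jhGPT's actual generation code may differ), could look like:

```python
import torch

def generate(model, idx, max_new_tokens, end_token_id):
    """idx: (1, t) tensor of token IDs, starting with the start token."""
    for _ in range(max_new_tokens):
        logits = model(idx)                    # (1, t, vocab_size) -- hypothetical model
        next_id = logits[0, -1].argmax()       # greedy: pick the highest-probability token
        idx = torch.cat([idx, next_id.view(1, 1)], dim=1)  # append and feed back in
        if next_id.item() == end_token_id:     # stop once the end token appears
            break
    return idx
```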
References: nanoGPT (Andrej Karpathy, https://github.com/karpathy/nanoGPT); "Attention Is All You Need" (Vaswani et al., 2017).