5. LLM Architecture
The goal of this fifth phase is very simple: develop the architecture of the full LLM. Put everything together, apply all the layers, and create all the functions to generate text or to translate text to IDs and back.
This architecture will be used both for training and for predicting text after the model has been trained.
LLM architecture example from https://github.com/rasbt/LLMs-from-scratch/blob/main/ch04/01_main-chapter-code/ch04.ipynb. A high-level representation of it:
Input (Tokenized Text): The process begins with tokenized text, which is converted into numerical representations.
Token Embedding and Positional Embedding Layer: The tokenized text is passed through a token embedding layer and a positional embedding layer; the positional embedding captures the position of each token in the sequence, which is critical for understanding word order.
Transformer Blocks: The model contains 12 transformer blocks, each with multiple layers. These blocks repeat the following sequence:
Masked Multi-Head Attention: Allows the model to focus on different parts of the input text at once.
Layer Normalization: A normalization step to stabilize and improve training.
Feed Forward Layer: Responsible for processing the information from the attention layer and making predictions about the next token.
Dropout Layers: These layers prevent overfitting by randomly dropping units during training.
Final Output Layer: The model outputs a 4×50,257-dimensional tensor, where 4 is the number of input tokens in this example and 50,257 is the size of the vocabulary. Each row in this tensor corresponds to a vector that the model uses to predict the next word in the sequence.
Goal: The objective is to take these embeddings and convert them back into text. Specifically, the last row of the output is used to generate the next word, represented as "forward" in this diagram.
Code representation
Let's explain it step by step:
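To keep the snippets below self-contained, here is a configuration dictionary along the lines of the GPT-2 "small" setup used in the referenced notebook (treat the exact key names as assumptions for these sketches):

```python
# Hypothetical configuration dictionary used by the sketches below
# (values correspond to the GPT-2 "small" setup described in this section).
GPT_CONFIG_124M = {
    "vocab_size": 50257,     # Vocabulary size (BPE tokenizer)
    "context_length": 1024,  # Maximum number of input tokens
    "emb_dim": 768,          # Embedding dimension
    "n_heads": 12,           # Number of attention heads
    "n_layers": 12,          # Number of transformer blocks
    "drop_rate": 0.1,        # Dropout rate
    "qkv_bias": False        # Whether the Q/K/V projections use a bias term
}
```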
GELU Activation Function
Purpose and Functionality
GELU (Gaussian Error Linear Unit): An activation function that introduces non-linearity into the model.
Smooth Activation: Unlike ReLU, which zeroes out negative inputs, GELU smoothly maps inputs to outputs, allowing for small, non-zero values for negative inputs.
Mathematical Definition (tanh approximation):
`GELU(x) ≈ 0.5 · x · (1 + tanh( √(2/π) · (x + 0.044715 · x³) ))`
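A minimal PyTorch implementation of the tanh approximation above (a sketch comparable to the class used in the referenced notebook):

```python
import math
import torch
import torch.nn as nn

class GELU(nn.Module):
    """GELU activation using the tanh approximation shown above."""
    def forward(self, x):
        return 0.5 * x * (1.0 + torch.tanh(
            math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3))
        ))
```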
FeedForward Neural Network
Tensor shapes have been added as comments to make the matrix dimensions easier to follow:
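(A sketch of this block, assuming the `GELU` class and `cfg` dictionary from the previous snippets.)

```python
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            # (batch, seq_len, emb_dim) -> (batch, seq_len, 4 * emb_dim)
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
            GELU(),  # non-linearity (defined above)
            # (batch, seq_len, 4 * emb_dim) -> (batch, seq_len, emb_dim)
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
        )

    def forward(self, x):
        return self.layers(x)
```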
Purpose and Functionality
Position-wise FeedForward Network: Applies a two-layer fully connected network to each position separately and identically.
Layer Details:
First Linear Layer: Expands the dimensionality from `emb_dim` to `4 * emb_dim`.
GELU Activation: Applies non-linearity.
Second Linear Layer: Reduces the dimensionality back to `emb_dim`.
Multi-Head Attention Mechanism
This was already explained in an earlier section.
Purpose and Functionality
Multi-Head Self-Attention: Allows the model to focus on different positions within the input sequence when encoding a token.
Key Components:
Queries, Keys, Values: Linear projections of the input, used to compute attention scores.
Heads: Multiple attention mechanisms running in parallel (`num_heads`), each with a reduced dimension (`head_dim`).
Attention Scores: Computed as the dot product of queries and keys, scaled and masked.
Masking: A causal mask is applied to prevent the model from attending to future tokens (important for autoregressive models like GPT).
Attention Weights: Softmax of the masked and scaled attention scores.
Context Vector: Weighted sum of the values, according to attention weights.
Output Projection: Linear layer to combine the outputs of all heads.
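For completeness, a compact causal multi-head attention sketch that matches the components listed above (see the earlier section for the full walkthrough; parameter names are assumptions for this sketch):

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads

        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)  # combines the heads' outputs
        self.dropout = nn.Dropout(dropout)
        # Causal mask: 1s above the diagonal mark future positions to hide
        self.register_buffer(
            "mask", torch.triu(torch.ones(context_length, context_length), diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, _ = x.shape
        # Project the input and split into heads: (b, num_heads, num_tokens, head_dim)
        queries = self.W_query(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        keys = self.W_key(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        values = self.W_value(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)

        # Scaled dot-product attention scores with the causal mask applied
        attn_scores = queries @ keys.transpose(2, 3)
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]
        attn_scores.masked_fill_(mask_bool, float("-inf"))
        attn_weights = torch.softmax(attn_scores / self.head_dim ** 0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Context vectors: weighted sum of the values, heads merged back together
        context = (attn_weights @ values).transpose(1, 2).reshape(b, num_tokens, self.d_out)
        return self.out_proj(context)
```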
Layer Normalization
Purpose and Functionality
Layer Normalization: A technique used to normalize the inputs across the features (embedding dimensions) for each individual example in a batch.
Components:
`eps`: A small constant (`1e-5`) added to the variance to prevent division by zero during normalization.
`scale` and `shift`: Learnable parameters (`nn.Parameter`) that allow the model to scale and shift the normalized output. They are initialized to ones and zeros, respectively.
Normalization Process:
Compute Mean (`mean`): Calculates the mean of the input `x` across the embedding dimension (`dim=-1`), keeping the dimension for broadcasting (`keepdim=True`).
Compute Variance (`var`): Calculates the variance of `x` across the embedding dimension, also keeping the dimension. The `unbiased=False` parameter ensures that the variance is calculated using the biased estimator (dividing by `N` instead of `N-1`), which is appropriate when normalizing over features rather than samples.
Normalize (`norm_x`): Subtracts the mean from `x` and divides by the square root of the variance plus `eps`.
Scale and Shift: Applies the learnable `scale` and `shift` parameters to the normalized output.
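Putting the steps above together, a sketch comparable to the notebook's class:

```python
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        # Normalize each token's features across the embedding dimension
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        # Learnable scale and shift restore the model's expressive power
        return self.scale * norm_x + self.shift
```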
Transformer Block
Tensor shapes have been added as comments to make the matrix dimensions easier to follow:
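(A sketch assuming the `MultiHeadAttention`, `FeedForward`, and `LayerNorm` classes and the `cfg` dictionary from the previous snippets; the tensor shape stays `(batch, seq_len, emb_dim)` throughout.)

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.att = MultiHeadAttention(
            d_in=cfg["emb_dim"],
            d_out=cfg["emb_dim"],
            context_length=cfg["context_length"],
            num_heads=cfg["n_heads"],
            dropout=cfg["drop_rate"],
            qkv_bias=cfg["qkv_bias"],
        )
        self.ff = FeedForward(cfg)
        self.norm1 = LayerNorm(cfg["emb_dim"])
        self.norm2 = LayerNorm(cfg["emb_dim"])
        self.drop_shortcut = nn.Dropout(cfg["drop_rate"])

    def forward(self, x):
        # First residual path: LayerNorm -> attention -> dropout -> add shortcut
        shortcut = x
        x = self.norm1(x)
        x = self.att(x)
        x = self.drop_shortcut(x)
        x = x + shortcut

        # Second residual path: LayerNorm -> feedforward -> dropout -> add shortcut
        shortcut = x
        x = self.norm2(x)
        x = self.ff(x)
        x = self.drop_shortcut(x)
        x = x + shortcut
        return x
```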
Purpose and Functionality
Composition of Layers: Combines multi-head attention, feedforward network, layer normalization, and residual connections.
Layer Normalization: Applied before the attention and feedforward layers for stable training.
Residual Connections (Shortcuts): Add the input of a layer to its output to improve gradient flow and enable training of deep networks.
Dropout: Applied after attention and feedforward layers for regularization.
Step-by-Step Functionality
First Residual Path (Self-Attention):
Input (`shortcut`): Save the original input for the residual connection.
Layer Norm (`norm1`): Normalize the input.
Multi-Head Attention (`att`): Apply self-attention.
Dropout (`drop_shortcut`): Apply dropout for regularization.
Add Residual (`x + shortcut`): Combine with the original input.
Second Residual Path (FeedForward):
Input (`shortcut`): Save the updated input for the next residual connection.
Layer Norm (`norm2`): Normalize the input.
FeedForward Network (`ff`): Apply the feedforward transformation.
Dropout (`drop_shortcut`): Apply dropout.
Add Residual (`x + shortcut`): Combine with the input from the first residual path.
GPTModel
Tensor shapes have been added as comments to make the matrix dimensions easier to follow:
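(A sketch that wires the previous components together, assuming the classes and `cfg` dictionary from the snippets above; the expected tensor shapes are noted as comments.)

```python
import torch
import torch.nn as nn

class GPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])
        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])]
        )
        self.final_norm = LayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False)

    def forward(self, in_idx):
        # in_idx: (batch, seq_len) tensor of token IDs
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)                                       # (batch, seq_len, emb_dim)
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))  # (seq_len, emb_dim)
        x = self.drop_emb(tok_embeds + pos_embeds)                              # (batch, seq_len, emb_dim)
        x = self.trf_blocks(x)                                                  # (batch, seq_len, emb_dim)
        x = self.final_norm(x)                                                  # (batch, seq_len, emb_dim)
        logits = self.out_head(x)                                               # (batch, seq_len, vocab_size)
        return logits
```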
Purpose and Functionality
Embedding Layers:
Token Embeddings (`tok_emb`): Converts token indices into embeddings. As a reminder, these are the weights given to each dimension of each token in the vocabulary.
Positional Embeddings (`pos_emb`): Adds positional information to the embeddings to capture the order of tokens. As a reminder, these are the weights given to a token according to its position in the text.
Dropout (`drop_emb`): Applied to the embeddings for regularization.
Transformer Blocks (`trf_blocks`): Stack of `n_layers` transformer blocks to process the embeddings.
Final Normalization (`final_norm`): Layer normalization before the output layer.
Output Layer (`out_head`): Projects the final hidden states to the vocabulary size to produce logits for prediction.
Number of Parameters to train
Having the GPT structure defined, it's possible to find out the number of parameters to train:
Step-by-Step Calculation
1. Embedding Layers: Token Embedding & Position Embedding
Layer: `nn.Embedding(vocab_size, emb_dim)`
Parameters: `vocab_size * emb_dim`
Layer: `nn.Embedding(context_length, emb_dim)`
Parameters: `context_length * emb_dim`
Total Embedding Parameters
`vocab_size * emb_dim + context_length * emb_dim`
2. Transformer Blocks
There are 12 transformer blocks, so we'll calculate the parameters for one block and then multiply by 12.
Parameters per Transformer Block
a. Multi-Head Attention
Components:
Query Linear Layer (`W_query`): `nn.Linear(emb_dim, emb_dim, bias=False)`
Key Linear Layer (`W_key`): `nn.Linear(emb_dim, emb_dim, bias=False)`
Value Linear Layer (`W_value`): `nn.Linear(emb_dim, emb_dim, bias=False)`
Output Projection (`out_proj`): `nn.Linear(emb_dim, emb_dim)`
Calculations:
Each of `W_query`, `W_key`, `W_value`: `emb_dim * emb_dim` parameters (no bias term).
Since there are three such layers: `3 * (emb_dim * emb_dim)`
Output Projection (`out_proj`): `emb_dim * emb_dim + emb_dim` (weights plus bias).
Total Multi-Head Attention Parameters: `4 * (emb_dim * emb_dim) + emb_dim`
b. FeedForward Network
Components:
First Linear Layer: `nn.Linear(emb_dim, 4 * emb_dim)`
Second Linear Layer: `nn.Linear(4 * emb_dim, emb_dim)`
Calculations:
First Linear Layer: `emb_dim * (4 * emb_dim) + 4 * emb_dim` (weights plus bias)
Second Linear Layer: `(4 * emb_dim) * emb_dim + emb_dim` (weights plus bias)
Total FeedForward Parameters: `8 * (emb_dim * emb_dim) + 5 * emb_dim`
c. Layer Normalizations
Components:
Two `LayerNorm` instances per block.
Each `LayerNorm` has `2 * emb_dim` parameters (scale and shift).
Calculations:
`2 * (2 * emb_dim) = 4 * emb_dim`
d. Total Parameters per Transformer Block
`(4 * emb_dim * emb_dim + emb_dim) + (8 * emb_dim * emb_dim + 5 * emb_dim) + 4 * emb_dim = 12 * emb_dim * emb_dim + 10 * emb_dim`
Total Parameters for All Transformer Blocks
`12 * (12 * emb_dim * emb_dim + 10 * emb_dim)`
3. Final Layers
a. Final Layer Normalization
Parameters: `2 * emb_dim` (scale and shift)
b. Output Projection Layer (out_head)
Layer: `nn.Linear(emb_dim, vocab_size, bias=False)`
Parameters: `emb_dim * vocab_size`
4. Summing Up All Parameters
Total parameters = embedding parameters + (12 × transformer block parameters) + final layer normalization parameters + output projection parameters.
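As a sanity check of the formulas above, the count can be reproduced programmatically (assuming the `GPTModel` class and configuration dictionary sketched earlier):

```python
# Build the model and count its trainable parameters.
model = GPTModel(GPT_CONFIG_124M)
total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters: {total_params:,}")
# With the GPT-2 "small" configuration above this yields 163,009,536 parameters,
# since the token embedding and the output head are counted separately (no weight tying).
```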
Generate Text
Having a model that predicts the next token like the one above, we just need to take the last token's values from the output (as they correspond to the predicted token), which give one value per entry in the vocabulary. We then use the softmax function to normalize these values into probabilities that sum to 1, and finally take the index of the largest entry, which is the index of the predicted word inside the vocabulary.
Code from https://github.com/rasbt/LLMs-from-scratch/blob/main/ch04/01_main-chapter-code/ch04.ipynb:
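A minimal greedy-decoding loop in that spirit (a sketch rather than the notebook's exact code):

```python
import torch

def generate_text_simple(model, idx, max_new_tokens, context_size):
    # idx: (batch, seq_len) tensor of token IDs holding the current context
    for _ in range(max_new_tokens):
        # Crop the context to the model's supported context length
        idx_cond = idx[:, -context_size:]
        with torch.no_grad():
            logits = model(idx_cond)            # (batch, seq_len, vocab_size)
        logits = logits[:, -1, :]               # keep only the last position (the predicted token)
        probas = torch.softmax(logits, dim=-1)  # normalize the vocabulary scores into probabilities
        idx_next = torch.argmax(probas, dim=-1, keepdim=True)  # greedy pick: index of the largest entry
        idx = torch.cat((idx, idx_next), dim=1)  # append the predicted token to the context
    return idx
```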