6. Pre-training & Loading models
Text Generation
To train a model, it must first be able to generate new tokens. The generated tokens are then compared with the expected ones, and the model is trained to produce the tokens it should generate.
Since the previous examples already predicted some tokens, that function can be reused for this purpose.
The goal of this sixth phase is simple: train the model from scratch. To do so, the previous LLM architecture is used together with loops that go over the data sets, using the defined loss function and optimizer to train all the parameters of the model.
Text Evaluation
To train correctly, the predictions obtained for each expected token must be measured. The goal of training is to maximize the likelihood of the correct token, which means increasing its probability relative to the other tokens.
To maximize the probability of the correct token, the weights of the model must be modified so that this probability increases. The weights are updated via backpropagation, which requires a loss function to minimize. In this case, the loss measures the difference between the prediction made and the desired one.
However, instead of working with the raw probabilities, the loss works with their natural logarithm (base e). So if the current probability predicted for the expected token is 7.4541e-05, its natural logarithm is approximately -9.5042. Then, for an entry with a context length of 5 tokens, for example, the model needs to predict 5 tokens: the first 4 are the last 4 tokens of the input (the input shifted by one position) and the fifth is the genuinely new one. Therefore, each entry yields 5 predictions in that case (the model does not know that the first 4 were already in the input), with 5 expected tokens and therefore 5 probabilities to maximize.
Therefore, after taking the natural logarithm of each predicted probability, the average is calculated and its sign is flipped (the result is called the cross entropy loss). That is the number to reduce as close to 0 as possible, because the natural logarithm of 1 is 0.
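The calculation described above can be sketched with plain Python. This is only an illustration: apart from the first value, which is the one used in the text, the probabilities are made-up numbers, and the variable names are assumptions rather than names from the original code.

```python
import math

# Probabilities the model assigned to the 5 *expected* tokens.
# Only the first value comes from the text; the rest are hypothetical.
target_probs = [7.4541e-05, 3.1061e-05, 1.1563e-05, 1.0337e-05, 5.6776e-05]

# 1) natural logarithm of each probability (all values are negative)
log_probs = [math.log(p) for p in target_probs]

# 2) average of the log probabilities
avg_log_prob = sum(log_probs) / len(log_probs)

# 3) flip the sign: this is the cross entropy loss to minimize
cross_entropy = -avg_log_prob

# A perfect model would assign probability 1 to every expected token,
# which gives a loss of 0 because ln(1) = 0.
perfect_loss = -math.log(1.0)

print(round(log_probs[0], 4), cross_entropy, perfect_loss)
```

The closer every probability gets to 1, the closer the loss gets to 0.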
Another way to measure how good the model is is perplexity, a metric that evaluates how well a probability model predicts a sample. In language modelling, it represents the model's uncertainty when predicting the next token in a sequence, and it is computed as the exponential of the cross entropy loss. For example, a perplexity of 48725 means that, when predicting a token, the model is about as unsure as if it had to choose uniformly among 48,725 tokens of the vocabulary.
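As a minimal sketch, perplexity can be obtained by exponentiating the cross entropy loss. The function name `perplexity` is an assumption made for this example.

```python
import math

def perplexity(cross_entropy_loss):
    """Perplexity is the exponential of the (natural-log) cross entropy."""
    return math.exp(cross_entropy_loss)

# A model that spreads its probability uniformly over V tokens has a
# cross entropy of ln(V), so its perplexity is V.
vocab_size = 48725
uniform_loss = math.log(vocab_size)
print(perplexity(uniform_loss))

# A perfect model (loss 0) has a perplexity of 1: no uncertainty at all.
print(perplexity(0.0))
```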
Pre-Train Example
This is the initial code proposed in https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/01_main-chapter-code/ch05.ipynb, sometimes slightly modified.
Let's see an explanation step by step.
Functions to transform text <--> ids
These are some simple functions that transform text into vocabulary ids and back. They are needed at the beginning of the text handling and at the end of the predictions.
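The idea can be illustrated with a toy example. Everything here is an assumption made for illustration: the tiny vocabulary, the whitespace splitting and the function names `text_to_token_ids` and `token_ids_to_text`. A real setup would rely on a trained tokenizer instead.

```python
# Toy vocabulary (hypothetical); unknown words map to "<unk>".
vocab = {"every": 0, "effort": 1, "moves": 2, "you": 3, "<unk>": 4}
id_to_token = {token_id: token for token, token_id in vocab.items()}

def text_to_token_ids(text):
    """Text -> ids: lowercase, split on whitespace, look up each word."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

def token_ids_to_text(token_ids):
    """Ids -> text: look up each id and join the words with spaces."""
    return " ".join(id_to_token[token_id] for token_id in token_ids)

ids = text_to_token_ids("Every effort moves you")
print(ids)
print(token_ids_to_text(ids))
```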
Generate text functions
In a previous section, a function simply picked the most probable token after getting the logits. However, this means that the same output is always generated for a given input, which makes it fully deterministic.
The following generate_text function applies the top-k, temperature and multinomial concepts.
top-k: the logits of all tokens except the top k are set to -inf. So, if k=3, only the 3 most probable tokens keep a value different from -inf before a decision is made.
temperature: every logit is divided by the temperature value. A value of 0.1 boosts the highest probability relative to the lowest ones, while a temperature of 5, for example, makes the distribution flatter. This helps to obtain the variation in responses we would like the LLM to have.
After applying the temperature, a softmax function is applied again so that the remaining tokens have a total probability of 1.
Finally, instead of choosing the token with the highest probability, the multinomial function is applied to sample the next token according to the final probabilities. So if token 1 has a probability of 70%, token 2 of 20% and token 3 of 10%, token 1 will be selected 70% of the time, token 2 20% of the time and token 3 10% of the time.
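The three steps above (top-k filtering, temperature scaling with softmax, and multinomial sampling) can be sketched without any dependency. This is not the generate_text function itself, only a simplified standalone version for a single list of logits. The names `top_k_temperature_probs` and `sample_next_token` are assumptions, and `random.choices` stands in for the multinomial sampling.

```python
import math
import random

def top_k_temperature_probs(logits, top_k=None, temperature=1.0):
    """Return sampling probabilities after top-k filtering and temperature."""
    if top_k is not None:
        # Keep the k largest logits, set the rest to -inf (ties may keep more).
        kth_largest = sorted(logits, reverse=True)[top_k - 1]
        logits = [l if l >= kth_largest else float("-inf") for l in logits]
    # Temperature must be > 0: <1 sharpens the distribution, >1 flattens it.
    scaled = [l / temperature for l in logits]
    # Numerically stable softmax; exp(-inf) is 0, so filtered tokens get 0.
    max_logit = max(scaled)
    exps = [math.exp(s - max_logit) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(logits, top_k=None, temperature=1.0, rng=random):
    """Sample a token index according to the final probabilities."""
    probs = top_k_temperature_probs(logits, top_k, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical logits for a 4-token vocabulary
sharp = top_k_temperature_probs(logits, top_k=2, temperature=0.1)
flat = top_k_temperature_probs(logits, top_k=2, temperature=5.0)
print(sharp, flat)
print(sample_next_token(logits, top_k=2, temperature=1.0))
```

With `top_k=2`, only the first two tokens can ever be sampled. The low temperature concentrates almost all the probability on the first token, while the high temperature brings the two remaining probabilities close together.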
Loss functions
The calc_loss_batch function calculates the cross entropy of the prediction of a single batch.
The calc_loss_loader gets the cross entropy of all the batches and calculates the average cross entropy.
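The relationship between the two functions can be sketched with a dependency-free analog. This is not the original PyTorch code: here a batch is assumed to be a pair of plain Python lists (one row of logits per prediction, plus the index of the expected token), and the optional `num_batches` parameter is an assumption used to limit how many batches are averaged.

```python
import math

def calc_loss_batch(logits_batch, targets):
    """Mean cross entropy of one batch: -log softmax(logits)[target]."""
    losses = []
    for logits, target in zip(logits_batch, targets):
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        losses.append(log_sum_exp - logits[target])
    return sum(losses) / len(losses)

def calc_loss_loader(batches, num_batches=None):
    """Average of the per-batch cross entropy over (some of) the batches."""
    batches = list(batches)
    if num_batches is not None:
        batches = batches[:num_batches]
    if not batches:
        return float("nan")
    return sum(calc_loss_batch(l, t) for l, t in batches) / len(batches)

# Hypothetical data: an undecided batch (uniform logits) and a confident one.
undecided = ([[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]], [1, 3])
confident = ([[10.0, 0.0, 0.0, 0.0]], [0])
print(calc_loss_batch(*undecided))             # ln(4) for uniform predictions
print(calc_loss_loader([undecided, confident]))
```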

Loading Data
The function create_dataloader_v1 was already discussed in a previous section.
Note how 90% of the text is used for training while the remaining 10% is used for validation, and both sets are stored in 2 different data loaders. Sometimes part of the data set is also set aside as a test set to better evaluate the performance of the model.
Both data loaders use the same batch size, maximum length, stride and number of workers (0 in this case). The main differences are the data used by each one, and that the validation loader neither drops the last batch nor shuffles the data, as this is not needed for validation purposes.
Also, because the stride is as big as the context length, the contexts used for training do not overlap (this reduces overfitting but also the size of the training data set).
Moreover, note that the batch size in this case is 2, so the data is processed in batches of 2 samples. The main goal of batching is to allow parallel processing and to reduce the memory consumption per batch.
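The effect of the 90/10 split and of the stride can be illustrated with a small sketch on a list of fake token ids. The helper names `split_train_val` and `make_windows` are assumptions and this is not the data loader code itself; it only mimics how a sliding window yields input/target pairs shifted by one token.

```python
def split_train_val(token_ids, train_ratio=0.9):
    """First 90% of the tokens for training, the rest for validation."""
    split_idx = int(train_ratio * len(token_ids))
    return token_ids[:split_idx], token_ids[split_idx:]

def make_windows(token_ids, max_length, stride):
    """Sliding window: targets are the inputs shifted one token to the right."""
    windows = []
    for start in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[start:start + max_length]
        targets = token_ids[start + 1:start + max_length + 1]
        windows.append((inputs, targets))
    return windows

token_ids = list(range(100))  # stand-in for a tokenized text
train_ids, val_ids = split_train_val(token_ids)

# stride == max_length: consecutive inputs do not overlap
no_overlap = make_windows(train_ids, max_length=4, stride=4)
# stride == 1: many more samples, but heavily overlapping
overlapping = make_windows(train_ids, max_length=4, stride=1)

print(len(train_ids), len(val_ids))
print(no_overlap[0], no_overlap[1])
print(len(no_overlap), len(overlapping))
```

A stride equal to the context length yields far fewer samples than a stride of 1, which is the trade-off mentioned above.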
Sanity Checks
The goal is to check that there are enough tokens for training, that the shapes are the expected ones, and to get some information about the number of tokens used for training and for validation.
Select device for training & pre calculations
The following code just selects the device to use and calculates a training loss and a validation loss (without having trained anything yet) as a starting point.
Training functions
The function generate_and_print_sample just takes a context and generates some tokens in order to get a feeling for how good the model is at that point. This is called by train_model_simple on each step.
The function evaluate_model is called as frequently as indicated to the training function, and it is used to measure the train loss and the validation loss at that point in the model training.
Then the big function train_model_simple is the one that actually trains the model. It expects:
The train data loader (with the data already separated and prepared for training)
The validator loader
The optimizer to use during training: this is the component that uses the gradients to update the parameters and reduce the loss. In this case, as you will see, AdamW is used, but there are many more.
optimizer.zero_grad() is called to reset the gradients on each round so that they do not accumulate.
The lr param is the learning rate, which determines the size of the steps taken during the optimization process when updating the model's parameters. A smaller learning rate means the optimizer makes smaller updates to the weights, which can lead to more precise convergence but might slow down training. A larger learning rate can speed up training but risks overshooting the minimum of the loss function (jumping over the point where the loss is minimized).
Weight decay adds a penalty on large weights. This encourages the optimizer to find solutions with smaller weights, balancing between fitting the data well and keeping the model simple, which helps prevent overfitting by discouraging the model from assigning too much importance to any single feature.
Traditional optimizers like SGD with L2 regularization couple weight decay with the gradient of the loss function. However, AdamW (a variant of the Adam optimizer) decouples weight decay from the gradient update, leading to more effective regularization.
The device to use for training
The number of epochs: Number of times to go over the training data
The evaluation frequency: how often to call evaluate_model
The evaluation iteration: the number of batches to use when evaluating the current state of the model when calling evaluate_model
The start context: the starting sentence to use when calling generate_and_print_sample
The tokenizer
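The difference between coupled and decoupled weight decay mentioned in the optimizer description above can be seen on a single scalar weight. This is a simplified sketch of the first Adam step only (moments start at 0), not the library implementation. The function name `adam_first_step` and all the numbers are assumptions, and the loss gradient is set to 0 so that only the effect of weight decay is visible.

```python
import math

def adam_first_step(w, grad, lr, weight_decay, decoupled,
                    beta1=0.9, beta2=0.999, eps=1e-8):
    """First Adam update (t=1) of one scalar weight, moments starting at 0."""
    if decoupled:
        # AdamW style: shrink the weight directly, independent of the gradient
        w = w * (1 - lr * weight_decay)
    else:
        # L2 style: the decay term is added to the gradient of the loss
        grad = grad + weight_decay * w
    m = (1 - beta1) * grad          # first moment estimate
    v = (1 - beta2) * grad * grad   # second moment estimate
    m_hat = m / (1 - beta1)         # bias correction for t=1
    v_hat = v / (1 - beta2)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)

lr, weight_decay = 0.1, 0.01
w_coupled = adam_first_step(1.0, grad=0.0, lr=lr, weight_decay=weight_decay, decoupled=False)
w_decoupled = adam_first_step(1.0, grad=0.0, lr=lr, weight_decay=weight_decay, decoupled=True)
print(w_coupled, w_decoupled)
```

In the coupled case the decay term goes through Adam's normalization, so the weight moves by roughly a full learning-rate step no matter how small the decay coefficient is. In the decoupled case the weight shrinks by exactly lr * weight_decay * w.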
Start training
Print training evolution
With the following function it's possible to print the evolution of the model while it was being trained.
Save the model
It's possible to save the model + optimizer if you want to continue training later:
Or just the model if you are only planning to use it:
Loading GPT2 weights
There are 2 quick scripts to load the GPT2 weights locally. For both, you can clone the repository https://github.com/rasbt/LLMs-from-scratch locally, then:
The script https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/01_main-chapter-code/gpt_generate.py will download all the weights and transform the formats from OpenAI to the ones expected by our LLM. The script is also prepared with the needed configuration and with the prompt: "Every effort moves you"
The script https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/02_alternative_weight_loading/weight-loading-hf-transformers.ipynb allows you to load any of the GPT2 weights locally (just change the CHOOSE_MODEL var) and predict text from some prompts.
