6. Pre-training & Loading models

Text Generation

To train a model, it must be able to generate new tokens. The generated tokens are then compared with the expected ones, and the model is trained to produce the tokens it is supposed to generate.

Since the previous examples already predicted some tokens, that function can be reused for this purpose.

Text Evaluation

To train correctly, the predictions obtained for each expected token need to be measured. The goal of training is to maximize the likelihood of the correct token, which means increasing its probability relative to the other tokens.

To maximize the probability of the correct token, the weights of the model must be modified. The weights are updated via backpropagation, which requires a loss function to minimize. In this case, the function measures the difference between the prediction made and the desired one.

However, instead of working with the raw predictions, it works with their logarithm (the natural logarithm, base e, in this case). So if the current prediction for the expected token was 7.4541e-05, its natural logarithm is approximately -9.5042. Then, for each entry with a context length of 5 tokens, for example, the model needs to predict 5 tokens: the first 4 are the last 4 tokens of the input and the fifth is the newly predicted one. Therefore, each entry yields 5 predictions (the model doesn't know that the first 4 were already in the input), with 5 expected tokens and therefore 5 probabilities to maximize.

Therefore, after taking the natural logarithm of each prediction, the average is calculated and the minus sign is removed (this is called cross entropy loss). That is the number to reduce as close to 0 as possible, because the natural logarithm of 1 is 0.
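As a rough illustration of the calculation described above, the following plain-Python sketch computes the loss from a handful of probabilities. Only the first probability (7.4541e-05) comes from the text; the other four values and the function name cross_entropy_from_probs are made up for the example.

```python
import math

def cross_entropy_from_probs(target_probs):
    """Average negative log-probability of the expected tokens."""
    log_probs = [math.log(p) for p in target_probs]  # natural logarithm of each prediction
    avg_log_prob = sum(log_probs) / len(log_probs)   # average over all the predictions
    return -avg_log_prob                             # remove the minus sign

# Probabilities the model assigned to the 5 expected tokens of one entry
# (only the first value comes from the text, the rest are hypothetical)
probs = [7.4541e-05, 3.1e-05, 1.2e-05, 1.0e-05, 4.8e-05]
loss = cross_entropy_from_probs(probs)
print(f"cross entropy loss: {loss:.4f}")

# A perfect model would assign probability 1 to every expected token,
# which gives a loss of 0
print(cross_entropy_from_probs([1.0] * 5) == 0)
```

The lower the probabilities given to the expected tokens, the larger the loss, so reducing the loss pushes those probabilities towards 1.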

Another way to measure how good the model is, is called perplexity. Perplexity is a metric used to evaluate how well a probability model predicts a sample. In language modelling, it represents the model's uncertainty when predicting the next token in a sequence. For example, a perplexity value of 48725 means that, when it needs to predict a token, the model is unsure which of 48,725 tokens in the vocabulary is the right one.
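Perplexity is commonly computed as the exponential of the cross entropy loss. The sketch below, in plain Python, shows why it can be read as "the number of tokens the model is hesitating between": a model that spreads its guess uniformly over N tokens ends up with a perplexity of N. The function name and the variable names are assumptions made for this example.

```python
import math

def perplexity(cross_entropy_loss):
    """Perplexity is the exponential of the (natural-log) cross entropy loss."""
    return math.exp(cross_entropy_loss)

# A model that guesses uniformly among `vocab_size` tokens assigns
# probability 1 / vocab_size to the expected token every time
vocab_size = 48725
uniform_loss = -math.log(1 / vocab_size)

print(f"loss: {uniform_loss:.4f}")
print(f"perplexity: {perplexity(uniform_loss):.1f}")
```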

Pre-Train Example

This is the initial code proposed in https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/01_main-chapter-code/ch05.ipynb, sometimes slightly modified.

Previous code used here but already explained in previous sections

Let's see an explanation step by step

Functions to transform text <--> ids

These are some simple functions that transform text from the vocabulary into ids and back. This is needed at the beginning of the text handling and at the end of the predictions.
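To give an idea of what such helpers do, here is a minimal sketch that uses a toy whitespace vocabulary instead of a real tokenizer. The function names (build_vocab, text_to_ids, ids_to_text) and the sample sentence are hypothetical and only serve the illustration.

```python
def build_vocab(text):
    """Map every distinct whitespace-separated word to an integer id."""
    words = sorted(set(text.split()))
    return {word: idx for idx, word in enumerate(words)}

def text_to_ids(text, vocab):
    """Text -> list of ids (done before feeding the model)."""
    return [vocab[word] for word in text.split()]

def ids_to_text(ids, vocab):
    """List of ids -> text (done on the ids predicted by the model)."""
    id_to_word = {idx: word for word, idx in vocab.items()}
    return " ".join(id_to_word[i] for i in ids)

vocab = build_vocab("every effort moves you forward")
ids = text_to_ids("every effort moves you", vocab)
print(ids)
print(ids_to_text(ids, vocab))
```

A real tokenizer works on sub-word units rather than whole words, but the round trip between text and ids follows the same idea.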

Generate text functions

A previous section used a function that simply picked the most probable token after getting the logits. However, this means that the same output is always generated for a given input, which makes it fully deterministic.

The generate_text function applies the top-k, temperature and multinomial concepts.

  • Top-k means that the logits of all tokens except the top k are set to -inf. So, if k=3, only the 3 most probable tokens keep a logit different from -inf (and therefore a non-zero probability) before a decision is made.

  • Temperature means that every logit is divided by the temperature value. A value of 0.1 increases the highest probability relative to the lowest ones, while a temperature of 5, for example, makes the distribution flatter. This helps to add the variation in responses we would like the LLM to have.

  • After applying the temperature, a softmax function is applied again so that the remaining tokens have a total probability of 1.

  • Finally, instead of choosing the token with the highest probability, the multinomial function is applied to sample the next token according to the final probabilities. So if token 1 had a 70% probability, token 2 a 20% and token 3 a 10%, token 1 will be selected 70% of the time, token 2 20% of the time and token 3 10% of the time.
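The four steps listed above can be sketched in plain Python as follows. This is not the generate_text function itself: the function name sample_next_token, the rng parameter and the logits are assumptions made for the example, and random.choices stands in for the multinomial sampling.

```python
import math
import random

def sample_next_token(logits, top_k=3, temperature=1.0, rng=random):
    """Pick the next token id from raw logits using top-k, temperature and sampling."""
    # 1. Top-k: every logit below the k-th largest one is set to -inf
    #    (tokens tied with the k-th value are all kept)
    threshold = sorted(logits, reverse=True)[top_k - 1]
    filtered = [x if x >= threshold else float("-inf") for x in logits]

    # 2. Temperature: divide the logits (values < 1 sharpen, values > 1 flatten)
    scaled = [x / temperature for x in filtered]

    # 3. Softmax: exp(-inf) is 0, so only the top-k tokens keep some probability
    max_logit = max(scaled)
    exps = [math.exp(x - max_logit) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # 4. Sampling: draw one token id according to the final probabilities
    token_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token_id, probs

logits = [4.0, 2.0, 1.0, 0.5, -1.0]  # made-up logits for a 5-token vocabulary
for temperature in (0.1, 1.0, 5.0):
    token_id, probs = sample_next_token(logits, top_k=3, temperature=temperature)
    print(temperature, token_id, [round(p, 3) for p in probs])
```

Running it a few times shows that a low temperature almost always returns token 0, while a high temperature spreads the choices over the 3 surviving tokens.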

A common alternative to top-k is top-p, also known as nucleus sampling. Instead of taking the k most probable tokens, it sorts the whole vocabulary by probability and sums the probabilities from the highest to the lowest until a threshold is reached.

Then, only those tokens are considered, according to their relative probabilities.

This removes the need to choose a number k, as the optimal k might be different in each case; only a threshold is needed.

Note that this improvement isn't included in the previous code.
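A minimal sketch of the top-p idea in plain Python could look like this. The function name top_p_filter, the threshold value and the probabilities are hypothetical; the sketch only shows the filtering step, after which the next token would be sampled from the returned probabilities.

```python
def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of most probable tokens whose probabilities reach top_p."""
    # Sort the token ids from the most to the least probable
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:  # threshold reached: stop adding tokens
            break

    # Renormalize so the kept tokens sum to 1; all the others get probability 0
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / cumulative
    return filtered

probs = [0.5, 0.3, 0.1, 0.06, 0.04]
filtered = top_p_filter(probs, top_p=0.85)
print([round(p, 3) for p in filtered])
```

With these numbers 3 tokens are needed to reach the threshold; with a more peaked distribution fewer tokens would be kept, which is the flexibility that a fixed k does not offer.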

Another way to improve the generated text is to use beam search instead of the greedy search used in this example. Unlike greedy search, which selects the most probable next word at each step and builds a single sequence, beam search keeps track of the top k highest-scoring partial sequences (called "beams") at each step. By exploring multiple possibilities simultaneously, it balances efficiency and quality, increasing the chances of finding a better overall sequence that the greedy approach might miss due to early, suboptimal choices.

Note that this improvement isn't included in the previous code.
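The following plain-Python sketch shows beam search on a tiny hand-written probability table, chosen so that the greedy choice is not the best overall sequence. The table, the token names and the function names (toy_model, beam_search) are all hypothetical; a real implementation would query the LLM for the next-token probabilities.

```python
import math

# Hypothetical next-token probabilities, keyed by the last token of the sequence
TABLE = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "x": 0.1},
}

def toy_model(seq):
    """Return log-probabilities of the next token given the sequence so far."""
    return {tok: math.log(p) for tok, p in TABLE[seq[-1]].items()}

def beam_search(next_log_probs, start, beam_width=2, steps=2):
    """Keep the `beam_width` best partial sequences at every step."""
    beams = [(0.0, [start])]  # (cumulative log-probability, sequence)
    for _ in range(steps):
        candidates = []
        for score, seq in beams:
            for token, log_p in next_log_probs(seq).items():
                candidates.append((score + log_p, seq + [token]))
        # Keep only the highest-scoring partial sequences
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beams

greedy = beam_search(toy_model, "<s>", beam_width=1)[0]  # beam width 1 == greedy search
best = beam_search(toy_model, "<s>", beam_width=2)[0]
print("greedy:", greedy[1], round(math.exp(greedy[0]), 2))
print("beam:  ", best[1], round(math.exp(best[0]), 2))
```

Greedy search commits to "a" because it is the most probable first token, while a beam of width 2 also keeps "b" alive and finds the sequence with the higher overall probability.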

Loss functions

The calc_loss_batch function calculates the cross entropy of the prediction of a single batch. The calc_loss_loader function gets the cross entropy of all the batches and calculates the average cross entropy.

Gradient clipping is a technique used to enhance training stability in large neural networks by setting a maximum threshold for gradient magnitudes. When gradients exceed this predefined max_norm, they are scaled down proportionally to ensure that updates to the model’s parameters remain within a manageable range, preventing issues like exploding gradients and ensuring more controlled and stable training.

Note that this improvement isn't included in the previous code.

Check the following example:
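As a dependency-free sketch of the idea (not the original example), the function below clips a flat list of gradients by their global L2 norm. The function name clip_gradients and the sample values are assumptions made for this illustration.

```python
import math

def clip_gradients(grads, max_norm):
    """Scale all gradients down proportionally if their global L2 norm exceeds max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]  # same direction, smaller magnitude
    return grads, total_norm

grads = [3.0, 4.0]  # the L2 norm of these gradients is 5.0
clipped, norm_before = clip_gradients(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, [round(g, 3) for g in clipped], round(norm_after, 3))
```

In PyTorch this is usually done with torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0), called between loss.backward() and optimizer.step().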

Loading Data

The function create_dataloader_v1 was already discussed in a previous section.

Note here how 90% of the text is used for training while the remaining 10% is used for validation, and both sets are stored in 2 different data loaders. Sometimes part of the data set is also set aside as a testing set to better evaluate the performance of the model.

Both data loaders use the same batch size, maximum length, stride and number of workers (0 in this case). The main differences are the data used by each one, and that the validation loader neither drops the last batch nor shuffles the data, as this isn't needed for validation purposes.

Also, the fact that the stride is as big as the context length means that there is no overlap between the contexts used for training (this reduces overfitting but also the size of the training data set).

Moreover, note that the batch size in this case is 2, so the data is processed in batches of 2 samples. The main goal of this is to allow parallel processing and reduce the memory consumption per batch.
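The split and the effect of the stride can be sketched in plain Python as follows. This is not create_dataloader_v1: the function names (split_train_val, sliding_windows) and the 20 made-up token ids are assumptions used to keep the example small.

```python
def split_train_val(data, train_ratio=0.9):
    """Use the first 90% of the data for training and the rest for validation."""
    split_idx = int(train_ratio * len(data))
    return data[:split_idx], data[split_idx:]

def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs; each target is its input shifted by one token."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[i:i + max_length]
        targets = token_ids[i + 1:i + max_length + 1]
        pairs.append((inputs, targets))
    return pairs

token_ids = list(range(20))  # made-up token ids
train_ids, val_ids = split_train_val(token_ids)

# stride == max_length: consecutive inputs do not overlap
pairs = sliding_windows(train_ids, max_length=4, stride=4)
for inputs, targets in pairs:
    print(inputs, "->", targets)

# A smaller stride yields more (overlapping) training samples
print(len(pairs), len(sliding_windows(train_ids, max_length=4, stride=1)))
```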

Sanity Checks

The goal is to check that there are enough tokens for training, that the shapes are the expected ones, and to get some info about the number of tokens used for training and for validation:

Select device for training & pre calculations

The following code just selects the device to use and calculates a training loss and a validation loss (without having trained anything yet) as a starting point.

Training functions

The function generate_and_print_sample just takes a context and generates some tokens in order to get a feeling for how good the model is at that point. It is called by train_model_simple after each epoch.

The function evaluate_model is called as frequently as indicated to the training function, and it's used to measure the training loss and the validation loss at that point of the model training.

Then the big function train_model_simple is the one that actually trains the model. It expects:

  • The train data loader (with the data already separated and prepared for training)

  • The validation data loader

  • The optimizer to use during training: This is the function that will use the gradients and will update the parameters to reduce the loss. In this case, as you will see, AdamW is used, but there are many more.

    • optimizer.zero_grad() is called to reset the gradients on each round to not accumulate them.

    • The lr param is the learning rate which determines the size of the steps taken during the optimization process when updating the model's parameters. A smaller learning rate means the optimizer makes smaller updates to the weights, which can lead to more precise convergence but might slow down training. A larger learning rate can speed up training but risks overshooting the minimum of the loss function (jump over the point where the loss function is minimized).

    • Weight decay modifies the loss calculation step by adding an extra term that penalizes large weights. This encourages the optimizer to find solutions with smaller weights, balancing fitting the data well against keeping the model simple. This helps prevent overfitting by discouraging the model from assigning too much importance to any single feature.

      • Traditional optimizers like SGD with L2 regularization couple weight decay with the gradient of the loss function. However, AdamW (a variant of Adam optimizer) decouples weight decay from the gradient update, leading to more effective regularization.

  • The device to use for training

  • The number of epochs: Number of times to go over the training data

  • The evaluation frequency: The frequency to call evaluate_model

  • The evaluation iteration: The number of batches to use when evaluating the current state of the model when calling evaluate_model

  • The start context: The starting sentence to use when calling generate_and_print_sample

  • The tokenizer
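To make the roles of the learning rate and of the decoupled weight decay more concrete, here is a simplified sketch of one AdamW update for a single scalar parameter, written in plain Python. The function name adamw_step and the state dictionary are hypothetical; a real training run would rely on torch.optim.AdamW rather than on this code.

```python
import math

def adamw_step(param, grad, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter (simplified sketch)."""
    beta1, beta2 = betas
    state["t"] += 1

    # Decoupled weight decay: shrink the weight directly, independently of the gradient
    param = param * (1 - lr * weight_decay)

    # Adam part: running averages of the gradient and of the squared gradient
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** state["t"])  # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])

    # The learning rate scales the size of the step
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)

# With no weight decay, the first step moves the weight by roughly lr
state = {"t": 0, "m": 0.0, "v": 0.0}
print(adamw_step(1.0, grad=2.0, state=state, lr=0.1, weight_decay=0.0))

# With a zero gradient, only the weight decay acts and shrinks the weight
state = {"t": 0, "m": 0.0, "v": 0.0}
print(adamw_step(1.0, grad=0.0, state=state, lr=0.1, weight_decay=0.1))
```

Because the decay is applied to the weight itself and not added to the gradient, it is not rescaled by the running averages, which is the decoupling mentioned in the list above.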

To improve the learning rate there are a couple of relevant techniques called linear warmup and cosine decay.

Linear warmup consists of defining an initial learning rate and a maximum one, and steadily increasing the learning rate from the former to the latter during the first training steps. Starting the training with smaller weight updates decreases the risk of the model encountering large, destabilizing updates during its training phase. Cosine decay is a technique that gradually reduces the learning rate following a half-cosine curve after the warmup phase, slowing weight updates to minimize the risk of overshooting the loss minima and to ensure training stability in later phases.

Note that these improvements aren't included in the previous code.
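A possible way to combine both techniques is sketched below in plain Python. The function name learning_rate_at, its parameters and the concrete values are assumptions made for the illustration; in a training loop the returned value would be written into the optimizer before each update.

```python
import math

def learning_rate_at(step, total_steps, warmup_steps, initial_lr, peak_lr, min_lr):
    """Learning rate for a given step: linear warmup followed by cosine decay."""
    if step < warmup_steps:
        # Linear warmup: move from initial_lr towards peak_lr in equal increments
        return initial_lr + (peak_lr - initial_lr) * step / warmup_steps
    # Cosine decay: follow half a cosine wave from peak_lr down to min_lr
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + (peak_lr - min_lr) * 0.5 * (1 + math.cos(math.pi * progress))

# Hypothetical schedule: 100 steps in total, the first 10 of them for warmup
for step in (0, 5, 10, 55, 100):
    lr = learning_rate_at(step, total_steps=100, warmup_steps=10,
                          initial_lr=1e-5, peak_lr=1e-3, min_lr=1e-5)
    print(step, f"{lr:.6f}")
```

The rate starts small, peaks when the warmup ends and then falls smoothly back to the minimum value by the last step.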

Start training

With the following function it's possible to print the evolution of the model while it was being trained.

Save the model

It's possible to save the model + optimizer if you want to continue training later:

Or just the model if you are only planning to use it:

Loading GPT2 weights

There are 2 quick scripts to load the GPT2 weights locally. For both you can clone the repository https://github.com/rasbt/LLMs-from-scratch locally, then:
