2. Data Sampling

Data Sampling is a crucial process in preparing data for training large language models (LLMs) like GPT. It involves organizing text data into input and target sequences that the model uses to learn how to predict the next word (or token) based on the preceding words. Proper data sampling ensures that the model effectively captures language patterns and dependencies.

Why Data Sampling Matters

LLMs such as GPT are trained to generate or predict text by understanding the context provided by previous words. To achieve this, the training data must be structured in a way that the model can learn the relationship between sequences of words and their subsequent words. This structured approach allows the model to generalize and generate coherent and contextually relevant text.

Key Concepts in Data Sampling

  1. Tokenization: Breaking down text into smaller units called tokens (e.g., words, subwords, or characters).

  2. Sequence Length (max_length): The number of tokens in each input sequence.

  3. Sliding Window: A method to create overlapping input sequences by moving a window over the tokenized text.

  4. Stride: The number of tokens the sliding window moves forward to create the next sequence.
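The four concepts above can be seen together in a minimal sketch. It uses a simple whitespace tokenizer as a stand-in for a real subword tokenizer:

```python
# Illustrative sketch of the four concepts above, using a whitespace
# tokenizer as a stand-in for a real subword (e.g. BPE) tokenizer.
text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."
tokens = text.split()          # 1. tokenization (words with attached punctuation)
max_length = 4                 # 2. sequence length
stride = 1                     # 4. stride: how far the window advances each step

# 3. sliding window: every max_length-token window is one input sample;
# we stop one token early so each window still has a next token to predict
windows = [tokens[i:i + max_length]
           for i in range(0, len(tokens) - max_length, stride)]
print(windows[0])  # ['Lorem', 'ipsum', 'dolor', 'sit']
```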

Step-by-Step Example

Let's walk through an example to illustrate data sampling.

Example Text

"Lorem ipsum dolor sit amet, consectetur adipiscing elit."

Tokenization

Assume we use a basic tokenizer that splits the text into words, keeping punctuation attached to the preceding word:

["Lorem", "ipsum", "dolor", "sit", "amet,", "consectetur", "adipiscing", "elit."]

Parameters

  • Max Sequence Length (max_length): 4 tokens

  • Sliding Window Stride: 1 token

Creating Input and Target Sequences

  1. Sliding Window Approach:

    • Input Sequences: Each input sequence consists of max_length tokens.

    • Target Sequences: Each target sequence consists of the tokens that immediately follow the corresponding input sequence.

  2. Generating Sequences:

    | Window Position | Input Sequence | Target Sequence |
    |---|---|---|
    | 1 | ["Lorem", "ipsum", "dolor", "sit"] | ["ipsum", "dolor", "sit", "amet,"] |
    | 2 | ["ipsum", "dolor", "sit", "amet,"] | ["dolor", "sit", "amet,", "consectetur"] |
    | 3 | ["dolor", "sit", "amet,", "consectetur"] | ["sit", "amet,", "consectetur", "adipiscing"] |
    | 4 | ["sit", "amet,", "consectetur", "adipiscing"] | ["amet,", "consectetur", "adipiscing", "elit."] |

  3. Resulting Input and Target Arrays:

    • Input: [["Lorem", "ipsum", "dolor", "sit"], ["ipsum", "dolor", "sit", "amet,"], ["dolor", "sit", "amet,", "consectetur"], ["sit", "amet,", "consectetur", "adipiscing"]]

    • Target: [["ipsum", "dolor", "sit", "amet,"], ["dolor", "sit", "amet,", "consectetur"], ["sit", "amet,", "consectetur", "adipiscing"], ["amet,", "consectetur", "adipiscing", "elit."]]
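The input and target arrays can be generated with a short loop. Note that each target sequence is simply the input sequence shifted one token to the right:

```python
# Reproduce the input/target pairs from the table above.
# A whitespace-tokenized list stands in for real subword tokens.
tokens = ["Lorem", "ipsum", "dolor", "sit", "amet,",
          "consectetur", "adipiscing", "elit."]
max_length, stride = 4, 1

inputs, targets = [], []
for i in range(0, len(tokens) - max_length, stride):
    inputs.append(tokens[i:i + max_length])           # window of 4 tokens
    targets.append(tokens[i + 1:i + max_length + 1])  # same window shifted by 1

print(inputs[0])   # ['Lorem', 'ipsum', 'dolor', 'sit']
print(targets[0])  # ['ipsum', 'dolor', 'sit', 'amet,']
```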

Visual Representation

| Token Position | Token |
|---|---|
| 1 | Lorem |
| 2 | ipsum |
| 3 | dolor |
| 4 | sit |
| 5 | amet, |
| 6 | consectetur |
| 7 | adipiscing |
| 8 | elit. |

Sliding Window with Stride 1:

  • First Window (Positions 1-4): ["Lorem", "ipsum", "dolor", "sit"] → Target: ["ipsum", "dolor", "sit", "amet,"]

  • Second Window (Positions 2-5): ["ipsum", "dolor", "sit", "amet,"] → Target: ["dolor", "sit", "amet,", "consectetur"]

  • Third Window (Positions 3-6): ["dolor", "sit", "amet,", "consectetur"] → Target: ["sit", "amet,", "consectetur", "adipiscing"]

  • Fourth Window (Positions 4-7): ["sit", "amet,", "consectetur", "adipiscing"] → Target: ["amet,", "consectetur", "adipiscing", "elit."]

Understanding Stride

  • Stride of 1: The window moves forward by one token each time, resulting in highly overlapping sequences. This can lead to better learning of contextual relationships but may increase the risk of overfitting since similar data points are repeated.

  • Stride of 2: The window moves forward by two tokens each time, reducing overlap. This decreases redundancy and computational load but might miss some contextual nuances.

  • Stride Equal to max_length: The window moves forward by the entire window size, resulting in non-overlapping sequences. This minimizes data redundancy but may limit the model's ability to learn dependencies across sequences.

Example with Stride of 2:

Using the same tokenized text and max_length of 4:

  • First Window (Positions 1-4): ["Lorem", "ipsum", "dolor", "sit"] → Target: ["ipsum", "dolor", "sit", "amet,"]

  • Second Window (Positions 3-6): ["dolor", "sit", "amet,", "consectetur"] → Target: ["sit", "amet,", "consectetur", "adipiscing"]

  • Third Window (Positions 5-8): ["amet,", "consectetur", "adipiscing", "elit."] → Target: ["consectetur", "adipiscing", "elit.", "sed"] (the example text ends at token 8, so the final target token "sed" assumes the text continues; with no continuation this window would be dropped)
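The trade-off between stride settings can be quantified with a small helper (a sketch; `count_windows` is a hypothetical name, not from the linked notebook):

```python
def count_windows(num_tokens: int, max_length: int, stride: int) -> int:
    """Number of complete input/target pairs a sliding window yields.

    Each pair needs max_length input tokens plus one extra token
    for the final target position.
    """
    return max(0, (num_tokens - max_length - 1) // stride + 1)

# 8 tokens, window of 4:
print(count_windows(8, 4, stride=1))  # 4 (heavy overlap)
print(count_windows(8, 4, stride=2))  # 2 (the third window above needs a 9th token)
print(count_windows(8, 4, stride=4))  # 1 (non-overlapping)
```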

Code Example

Let's make this concrete with a code example based on https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb:
