Building Large Language Models: A Deep Dive

Building Large Language Models: An Overview

Large Language Models (LLMs), like ChatGPT, Claude, Gemini, and Llama, are revolutionizing how we interact with technology. This article provides an overview of how these models work, covering key components and practical considerations.

Key Components for Training LLMs

Several key components are crucial for successfully training LLMs:

  • Architecture: The underlying neural network structure (typically based on transformers).
  • Training Loss & Algorithm: How the model learns and optimizes its performance.
  • Data: The information the model is trained on.
  • Evaluation: Measuring progress and performance.
  • Systems: How to efficiently run these large models on modern hardware.

Why Data, Evaluation, and Systems Matter Most

While architecture and training algorithms receive significant attention, especially in academia, data, evaluation, and systems are arguably the most critical factors in practice. These areas are the primary focus of industry efforts.

Pretraining vs. Post-training

LLM development involves two primary phases:

  • Pretraining: Training a language model on the vast amount of text available on the internet.
  • Post-training: Fine-tuning the pretrained model to function as an AI assistant (a more recent trend).

GPT-3 and GPT-2 are examples of pretrained models, while ChatGPT is a post-trained model.

Pretraining in Detail: Language Modeling

The Task and the Loss

At a high level, language models are probability distributions over sequences of tokens or words. They assign probabilities to sentences, reflecting their likelihood of being uttered by a human or found online. For instance, the sentence "The mouse ate the cheese" would receive a higher probability than "The cheese ate the mouse."

Language models are generative models because you can sample from their probability distribution to generate new sentences.

Most modern LLMs use autoregressive language models, which decompose the probability of a sentence into the product of probabilities of each word given the preceding words. This is based on the chain rule of probability.

The task of an autoregressive language model is simply predicting the next word in a sequence.
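
In symbols, the chain-rule factorization looks like this (standard notation, with x_t denoting the token at position t):

```latex
% Autoregressive factorization via the chain rule of probability
P(x_1, \ldots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \ldots, x_{t-1})
```

Training then minimizes the cross-entropy (negative log-likelihood) of each true next token under the model's predicted distribution.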

Tokenization: Breaking Down Text

Tokenizers are essential for converting text into a numerical representation that LLMs can process. Tokens are more general than whole words because they can handle typos and languages, such as Thai, that don't use spaces between words.

Byte Pair Encoding (BPE) is a common tokenization algorithm. It starts by assigning each character its own token, then iteratively merges the most frequent pair of adjacent tokens until a desired vocabulary size is reached.
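
To make the merge loop concrete, here is a toy BPE trainer in Python. It is a minimal sketch (byte-level details, whitespace handling, and special tokens are omitted), not a production tokenizer:

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Toy BPE: start from characters, repeatedly merge the most frequent adjacent pair."""
    # Represent each word as a list of single-character tokens.
    words = [list(word) for word in corpus.split()]
    merges = []
    for _ in range(num_merges):
        # Count all adjacent token pairs across the corpus.
        pair_counts = Counter()
        for word in words:
            for a, b in zip(word, word[1:]):
                pair_counts[(a, b)] += 1
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged token.
        merged_token = best[0] + best[1]
        new_words = []
        for word in words:
            i, out = 0, []
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(merged_token)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return merges

print(train_bpe("low lower lowest newer newest", num_merges=5))
```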

Evaluation: Measuring Progress

Perplexity

Perplexity is a metric often used to evaluate language models. It is essentially the exponentiated average per-token loss and reflects how "hesitant" the model is when predicting the next token. A lower perplexity indicates better performance.
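
In code, perplexity is simply the exponential of the mean per-token cross-entropy loss. A minimal sketch, assuming the per-token losses are already available:

```python
import math

def perplexity(token_losses):
    """Perplexity = exp(average per-token negative log-likelihood)."""
    return math.exp(sum(token_losses) / len(token_losses))

# Example: an average loss of ~1.0 nat per token gives a perplexity of ~2.72,
# i.e. the model is roughly "hesitating" between ~2.7 tokens at each step.
print(perplexity([0.9, 1.1, 1.0, 1.0]))
```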

NLP Benchmarks

LLMs are often evaluated on standard NLP benchmarks, such as HELM and the Hugging Face Open LLM Leaderboard. These benchmarks consist of various tasks, including question answering, where the model's ability to generate the correct answer is assessed.

Evaluation Challenges

Evaluation is complex and faces challenges like:

  • Inconsistencies: Different evaluation methods can lead to different results.
  • Train-test contamination: Ensuring the test data wasn't included in the training data.

Data: Fueling the Models

Collecting and Cleaning Internet Data

LLMs are often trained on vast amounts of internet data. However, this data is often noisy and requires significant processing. The steps involved include:

  • Web Crawling: Downloading content from the internet.
  • Text Extraction: Extracting text from HTML.
  • Filtering: Removing undesirable content (e.g., not safe for work, harmful content, PII).
  • Deduplication: Removing duplicate content.
  • Heuristic Filtering: Removing low-quality documents based on simple rules (see the sketch after this list).
  • Model-based Filtering: Training a classifier to identify high-quality content.
  • Domain Classification: Categorizing data into different domains (e.g., entertainment, books, code).
  • High-Quality Data Fine-tuning: Training on high-quality data (like Wikipedia) at the end of pretraining.
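
As a rough illustration of the deduplication and heuristic-filtering steps, here is a toy sketch; the exact-hash approach and the thresholds are illustrative choices, not what any particular production pipeline uses:

```python
import hashlib

def deduplicate(docs):
    """Exact deduplication via content hashing (real pipelines also use fuzzy/near-dedup)."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

def heuristic_filter(doc, min_words=50, max_symbol_ratio=0.1):
    """Keep documents that look like natural text; thresholds are illustrative."""
    words = doc.split()
    if len(words) < min_words:
        return False
    symbol_ratio = sum(not c.isalnum() and not c.isspace() for c in doc) / max(len(doc), 1)
    return symbol_ratio <= max_symbol_ratio

corpus = ["a long natural-language document ...", "a long natural-language document ...", "### ### ###"]
cleaned = [d for d in deduplicate(corpus) if heuristic_filter(d, min_words=3)]
```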

Data is Key

Collecting high-quality data is a significant part of training LLMs. Some even argue it's the most critical aspect.

Scaling Laws: The Power of Scale

Scaling laws describe the relationship between model size, training data, and performance. They show that larger models trained on more data generally perform better.

By analyzing scaling laws, you can predict how much performance will improve by increasing model size or training data.

Optimizing Training Resources

Scaling laws can also help optimize the allocation of training resources. For example, the Chinchilla paper found that the compute-optimal ratio is roughly 20 training tokens for every model parameter.
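
As a back-of-the-envelope example of that 20-tokens-per-parameter rule of thumb:

```python
def chinchilla_optimal_tokens(num_params: float, tokens_per_param: int = 20) -> float:
    """Rule-of-thumb compute-optimal token count from the Chinchilla finding."""
    return num_params * tokens_per_param

# A 70B-parameter model would want roughly 1.4 trillion training tokens.
print(f"{chinchilla_optimal_tokens(70e9):.2e} tokens")  # 1.40e+12
```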

The "bitter lesson" suggests favoring architectures that can exploit ever-increasing computation, and investing effort in systems and data rather than in minor architectural tweaks.

Post-Training: Alignment and Reinforcement Learning

The Need for Alignment

Post-training is crucial for aligning LLMs with human preferences and making them useful AI assistants. Without alignment, models may not follow instructions correctly or may generate undesirable content.

Supervised FineTuning (SFT)

SFT involves fine-tuning a pretrained LLM on desired answers collected from humans. This is essentially language modeling on human-generated responses.
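
A minimal sketch of how SFT is often set up with a Hugging Face-style causal LM: the prompt tokens are masked out of the loss so that cross-entropy is computed only on the response (-100 is PyTorch's ignore index). The model call in the final comment is a placeholder:

```python
import torch

def sft_labels(input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Mask prompt positions so the loss is computed only on the response tokens."""
    labels = input_ids.clone()
    labels[:, :prompt_len] = -100  # -100 is ignored by PyTorch's cross-entropy loss
    return labels

input_ids = torch.tensor([[101, 2023, 2003, 1037, 7099, 3437, 102]])  # toy token ids
labels = sft_labels(input_ids, prompt_len=3)
# loss = model(input_ids, labels=labels).loss  # with a Hugging Face-style causal LM
```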

Scaling Data Collection with LLMs

Since human data is expensive, LLMs can be used to generate synthetic data for SFT. In this fine-tuning stage, data quality matters more than quantity, but a solid base of examples for the model to learn from is still vital.

Reinforcement Learning from Human Feedback (RLHF)

RLHF aims to maximize human preference rather than simply clone human behavior. Labelers compare two model-generated answers and select the preferred one, and the model is then fine-tuned to produce more of the preferred outputs.
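
In the classic RLHF recipe, these comparisons are first used to fit a reward model; a common choice is a Bradley-Terry-style loss on each preference pair (standard notation: y_w is the preferred answer, y_l the rejected one):

```latex
% Reward-model loss for a preference pair (y_w preferred over y_l for prompt x)
\mathcal{L}_{\mathrm{RM}} = -\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)
```

The policy is then fine-tuned, typically with a reinforcement learning algorithm such as PPO, to maximize this reward while staying close (via a KL penalty) to the SFT model.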

Direct Preference Optimization (DPO)

DPO is a simplification of RLHF that avoids the complexities of reinforcement learning. It directly maximizes the probability of generating preferred outputs and minimizes the probability of generating non-preferred outputs.
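
The DPO objective expresses this directly in terms of the policy and a frozen reference model (usually the SFT model), with no separate reward model or RL loop:

```latex
% DPO loss for a preference pair; beta controls how far the policy may move from the reference
\mathcal{L}_{\mathrm{DPO}} = -\log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)
```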

Evaluating PostTraining

Evaluating post-trained models is challenging because their outputs are often open-ended. Common approaches include:

  • Chatbot Arena: Users blindly interact with two chatbots and vote for the one that gives the better answer.
  • Using LLMs for Evaluation: Using another LLM to compare the outputs of two models and determine which one is better. However, care must be taken to account for biases, such as a preference for longer outputs.

Systems and Hardware Optimizations

Training LLMs requires substantial computational resources, so system-level optimizations are crucial. These models run on GPUs, which are built for fast matrix multiplication, and as the amount of computation grows, using GPUs efficiently and minimizing communication overhead becomes increasingly important.

Low Precision

Reducing the precision of floating-point numbers (e.g., using 16-bit floats instead of 32-bit floats) can significantly improve performance by reducing memory consumption and communication overhead.
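
A minimal PyTorch-style sketch of mixed-precision training with automatic casting to bfloat16; the model, data, and learning rate are placeholders, and a CUDA GPU is assumed:

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()        # placeholder model (assumes a CUDA device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(8, 1024, device="cuda")           # placeholder batch

# Run the forward pass in bfloat16 where it is safe; parameters stay in float32.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```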

Operator Fusion

Operator fusion combines multiple operations into a single fused kernel. This reduces the amount of data transferred between the GPU's compute cores and its global memory, leading to performance improvements. In PyTorch, torch.compile performs this kind of fusion automatically where possible.
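
A small illustration of what torch.compile can fuse; the function below is an arbitrary example, and actual speedups depend on the model and hardware:

```python
import torch

def mlp_block(x, w1, w2):
    # Eagerly, each operation launches its own kernel and round-trips through global memory.
    return torch.nn.functional.gelu(x @ w1) @ w2

compiled_mlp = torch.compile(mlp_block)  # fuses eligible operations into fewer kernels

x, w1, w2 = torch.randn(256, 1024), torch.randn(1024, 4096), torch.randn(4096, 1024)
out = compiled_mlp(x, w1, w2)
```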

Conclusion

Building LLMs is a complex process involving architecture, training, data, evaluation, and systems. While architecture and training algorithms are important, data, evaluation, and systems are arguably the most critical factors in practice. By understanding these key components, you can gain a deeper appreciation for the power and potential of LLMs.