From GPT-1 to GPT-3: A New Era in NLP

Trace the GPT journey: how unsupervised pre-training, scaling model sizes, and few-shot prompting reshaped modern NLP.


In 2018, OpenAI introduced GPT-1 (Generative Pre-Trained Transformer) in a paper titled “Improving Language Understanding by Generative Pre-Training” by Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. This work laid the foundation for a new paradigm in large-scale language models: pre-train first on a large corpus, and then fine-tune on specific tasks. While the original Transformer architecture (Vaswani et al., 2017) introduced a powerful sequence-to-sequence model, GPT-1 showed that focusing on a decoder-only model with generative pre-training could yield remarkable results across various downstream language tasks.

Historical Context

Before GPT-1, language models were often trained from scratch for each new task or used word embeddings like Word2Vec or GloVe. GPT-1’s generative pre-training approach changed this dynamic by showing how a single large model could be adapted to multiple tasks by leveraging unsupervised data at scale.

Key Contributions

[Figure: Transformer decoder architecture used by GPT-1]

Unsupervised Pre-Training

Formally, let a sequence of tokens be $\mathbf{w} = (w_1, w_2, \dots, w_n)$. GPT-1 models the probability distribution:

$$p(\mathbf{w}) = \prod_{t=1}^{n} p\bigl(w_t \mid w_1, w_2, \dots, w_{t-1}\bigr)$$

To compute $p(w_t \mid w_1, \dots, w_{t-1})$, GPT-1 uses masked self-attention, which ensures that the prediction for position $t$ depends only on positions $< t$.
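
To make this concrete, here is a minimal PyTorch sketch (a simplification, not GPT-1’s actual implementation) of single-head attention with a causal mask; the learned query/key/value projections and multi-head structure of the real model are omitted:

```python
import torch
import torch.nn.functional as F

def causal_self_attention(x):
    """Single-head self-attention with a causal (unidirectional) mask.

    x: (seq_len, d_model) token representations. For brevity we reuse x as
    queries, keys, and values; a real model would first apply learned
    projections W_Q, W_K, W_V and use multiple heads.
    """
    seq_len, d_model = x.shape
    scores = x @ x.T / d_model ** 0.5                 # (seq_len, seq_len) attention scores
    mask = torch.tril(torch.ones(seq_len, seq_len))   # lower-triangular: 1 where attention is allowed
    scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)               # each row sums to 1 over positions <= t
    return weights @ x

x = torch.randn(5, 16)            # 5 tokens, 16-dimensional embeddings
out = causal_self_attention(x)    # position t mixes only tokens 0..t
print(out.shape)                  # torch.Size([5, 16])
```

The `-inf` entries above the diagonal become zeros after the softmax, so information from future tokens can never leak backwards, which is exactly the constraint the factorization above requires.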

Suppose you prompt GPT-1 with: “In a hole in the ground there lived a …”

The model processes each token sequentially:

  1. While processing the first token “In” → attention is only on “In”.
  2. While processing the second token “a” → attention is on “In” and “a”.
  3. While processing the third token “hole” → attention is on “In”, “a”, and “hole”.
  4. … and so on.

At each step $t$, the model predicts $w_t$ by attending to all tokens $w_1, \ldots, w_{t-1}$, producing a probability distribution over the next token. This auto-regressive decoding is the hallmark of GPT-style models.
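
The decoding loop itself is simple. The sketch below uses a fixed random scorer as a stand-in for a trained language model (an assumption purely for illustration); a real GPT would replace `next_token_logits` with a forward pass of the Transformer over the whole prefix:

```python
import torch
import torch.nn.functional as F

vocab_size = 50
torch.manual_seed(0)

# Stand-in "language model": a fixed random matrix mapping the most recent
# token to next-token logits. A real GPT would instead run the Transformer
# over the entire prefix w_1..w_{t-1}.
W = torch.randn(vocab_size, vocab_size)

def next_token_logits(prefix):
    one_hot = F.one_hot(prefix[-1], num_classes=vocab_size).float()
    return W @ one_hot

prefix = torch.tensor([3, 17, 42])       # token ids of the prompt
for _ in range(5):                       # generate 5 more tokens
    logits = next_token_logits(prefix)
    probs = F.softmax(logits, dim=-1)    # p(w_t | w_1, ..., w_{t-1})
    next_id = torch.argmax(probs)        # greedy decoding; one could also sample
    prefix = torch.cat([prefix, next_id.unsqueeze(0)])

print(prefix.tolist())                   # prompt followed by 5 generated token ids
```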

The training objective is to maximize the log-likelihood of the observed tokens:

$$\mathcal{L}_{\text{LM}}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{n_i} \log p_\theta\bigl(w_t^{(i)} \mid w_{<t}^{(i)}\bigr)$$

where $\theta$ represents the model parameters, $i$ indexes the training sequences in the corpus, $N$ is the number of sequences (books, or partitions of books), and $n_i$ is the length of the $i$-th sequence.
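
In code, this objective is just the familiar next-token cross-entropy. The sketch below uses random stand-in logits; in a real training loop they would come from the model’s forward pass, and $\mathcal{L}_{\text{LM}}$ would be maximized (equivalently, the cross-entropy minimized) by gradient descent on $\theta$:

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 100, 8
torch.manual_seed(0)

tokens = torch.randint(0, vocab_size, (seq_len,))  # observed token ids w_1..w_n
# Stand-in for the model's outputs: logits[t] scores the token at position t+1
# given tokens 0..t. In a real run these come from the Transformer.
logits = torch.randn(seq_len - 1, vocab_size)

targets = tokens[1:]                               # each position predicts the *next* token
nll = F.cross_entropy(logits, targets)             # mean negative log-likelihood
log_likelihood = -nll * (seq_len - 1)              # sum over t of log p(w_t | w_<t)
print(nll.item(), log_likelihood.item())
```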

Key properties during pre-training:

  1. Unidirectional Masking: Each token can only attend to tokens before it.
  2. No Additional Supervision: The model is just predicting the next word from unlabeled text.
  3. Positional Embeddings: The model adds positional information to each token embedding.
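
Property 3 can be illustrated in a few lines: the input to the first Transformer block is the sum of a learned token embedding and a learned positional embedding. The dimensions below are chosen to roughly match GPT-1 and are illustrative only:

```python
import torch
import torch.nn as nn

vocab_size, max_len, d_model = 40_000, 512, 768   # roughly GPT-1-scale dimensions
token_emb = nn.Embedding(vocab_size, d_model)     # one learned vector per vocabulary entry
pos_emb = nn.Embedding(max_len, d_model)          # GPT-1 learns position vectors (no sinusoids)

token_ids = torch.tensor([[11, 4093, 25, 7]])     # (batch=1, seq_len=4), arbitrary ids
positions = torch.arange(token_ids.size(1))       # [0, 1, 2, 3]

# Input to the first Transformer block: token embedding + position embedding
h0 = token_emb(token_ids) + pos_emb(positions)
print(h0.shape)                                   # torch.Size([1, 4, 768])
```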

Training Specifics

GPT-1 is a 12-layer, decoder-only Transformer with 768-dimensional hidden states and 12 attention heads, totaling roughly 117M parameters. It was pre-trained on the BookCorpus dataset of about 7,000 unpublished books, using a byte-pair-encoding vocabulary of roughly 40,000 merges and a 512-token context window.

Fine-Tuning Methodology

Once GPT-1 is pre-trained on a massive text corpus, it’s adapted to downstream tasks by adding a small set of task-specific parameters (often a linear layer on top of the final hidden representation).

Let $\mathbf{x} = (x_1, x_2, \dots, x_m)$ be the input tokens for a downstream task (e.g., a movie review whose sentiment we want to predict). The model encodes them into hidden states $\mathbf{h} = (h_1, \dots, h_m)$. For classification tasks with $K$ labels, one typically takes the final hidden state $h_m$, corresponding to the $\langle\text{EOS}\rangle$ (end-of-sequence) token, and computes:

$$p(y = k \mid \mathbf{x}) = \text{Softmax}\bigl(W \cdot h_m + b\bigr)_k$$

where $W$ and $b$ are newly added fine-tuning parameters. The fine-tuning loss is typically the cross-entropy over the classification label $y$:

$$\mathcal{L}_{\text{fine-tune}}(\theta, W, b) = -\log p_\theta(y \mid \mathbf{x})$$

Other tasks such as sequence labeling or NLI adapt similarly, either by conditioning on the final hidden state or by adding minimal additional layers.
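
A minimal sketch of the classification head described above, with a random vector standing in for the hidden state $h_m$ that the pre-trained Transformer would actually produce:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, num_labels = 768, 2      # e.g. binary sentiment classification

# Newly added fine-tuning parameters W and b, packaged as a linear layer.
classifier = nn.Linear(d_model, num_labels)

# Stand-in for h_m, the final hidden state at the <EOS> position.
# In practice this comes from the pre-trained GPT-1 model.
h_m = torch.randn(1, d_model)

logits = classifier(h_m)                            # W·h_m + b
loss = F.cross_entropy(logits, torch.tensor([1]))   # -log p(y | x) for gold label y = 1
loss.backward()   # here only W and b receive gradients; full fine-tuning also updates θ
print(loss.item())
```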

[Figure: GPT-1 fine-tuning setup with a task-specific output layer]

GPT-2: Scaling Up

Building on the success of GPT-1, OpenAI released GPT-2 in February 2019 with the paper “Language Models are Unsupervised Multitask Learners.” GPT-2 primarily demonstrated that scaling up the size of the model and training data can further boost performance on downstream tasks—often even in a zero-shot setting (where the model is not specifically fine-tuned on the task).

Key Advances in GPT-2

  1. Larger Model Sizes: GPT-2 was released in multiple variants, ranging from 117M parameters up to 1.5B parameters in its largest public version. At the time, the largest variant was a significant leap from GPT-1’s ~117M parameters. GPT-2 also doubled the context window, from 512 tokens to 1,024 tokens.
  2. More Data: Instead of just BookCorpus, GPT-2 was trained on WebText, a dataset of about 8 million web pages (roughly 40 GB of text) scraped from outbound links in Reddit posts with at least 3 karma. This broadened the domain diversity (news, blogs, forums, etc.).
  3. Zero-Shot and Few-Shot Capabilities: GPT-2 showed surprisingly strong performance on tasks like text completion, reading comprehension, translation, and summarization without explicit fine-tuning, simply by conditioning on the right prompts (see the sketch after this list).
  4. Improved Generation Quality: Thanks to the larger model and more diverse training data, GPT-2 could produce coherent multi-paragraph text on a variety of topics that is often hard to distinguish from human-written text.
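
One convenient way to reproduce this prompting behaviour today is through the Hugging Face transformers library, which hosts the released GPT-2 weights. This is a rough sketch, assuming the library and a PyTorch backend are installed (it is not part of the original GPT-2 release):

```python
# pip install torch transformers
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")   # smallest released GPT-2 variant
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Zero-shot: no fine-tuning, just condition on a prompt and let the model continue it.
prompt = "The tower, completed in 1889 in Paris, is called the"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

output = model.generate(
    input_ids,
    max_new_tokens=8,
    do_sample=False,                       # greedy decoding, for reproducibility
    pad_token_id=tokenizer.eos_token_id,   # silence the missing-pad-token warning
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```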

Model Architecture and Training Details

Architecturally, GPT-2 keeps GPT-1’s decoder-only design but moves layer normalization to the input of each sub-block (Pre-Norm) and adds a final layer normalization after the last block. The largest variant has 48 layers and 1,600-dimensional hidden states, uses a 50,257-token byte-pair-encoding vocabulary, and was trained with a 1,024-token context window on the WebText corpus described above.

Impact of GPT-2

GPT-2’s release sparked broader public awareness of large language models’ capabilities—and the potential risks. Researchers began exploring zero-shot and few-shot prompting more deeply, realizing these large Transformer models could generalize to tasks they weren’t explicitly trained on. This phenomenon started a wave of even larger models.

GPT-3: A Quantum Leap in Scale

In June 2020, OpenAI introduced GPT-3 with the paper “Language Models are Few-Shot Learners.” GPT-3 took the scaling hypothesis to new heights, introducing models up to 175B parameters, massively outperforming previous language models on a variety of tasks through in-context learning (also known as few-shot prompting).

Key Innovations in GPT-3

  1. Massive Scale: With up to 175B parameters in the largest version (dubbed GPT-3 175B), GPT-3 was over 100× larger than GPT-2’s biggest publicly released model (1.5B parameters). It also doubled the context window again, from 1,024 tokens to 2,048 tokens.
  2. Few-Shot Prompting: GPT-3 showed that one could prompt the model with just a handful of examples (a “few-shot” approach) and get impressive performance on tasks like machine translation, question answering, arithmetic, and even code generation, without fine-tuning or additional gradient updates (see the prompt-construction sketch after this list).
  3. Broad Task Coverage: GPT-3 excelled at a wide range of tasks beyond typical NLP benchmarks, including writing coherent essays, answering trivia questions, and even performing certain logical and arithmetic tasks.
  4. In-Context Learning: Rather than updating the model weights, GPT-3 uses the context of the user’s input plus a few examples to “learn” how to perform a task at inference time. This approach essentially treats the prompt itself as the “programming” of the model.
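
That “programming” is literally just string construction. The sketch below assembles a few-shot translation prompt in the style of the examples from the GPT-3 paper; the resulting string would be sent to the model as-is, with no gradient updates:

```python
# Assemble a few-shot prompt: the "training examples" live entirely in the
# context window; the model's weights never change.
examples = [
    ("sea otter", "loutre de mer"),
    ("peppermint", "menthe poivrée"),
    ("cheese", "fromage"),
]
query = "plush giraffe"

prompt = "Translate English to French:\n"
for english, french in examples:
    prompt += f"{english} => {french}\n"
prompt += f"{query} =>"

print(prompt)
# GPT-3 infers the task from the pattern of demonstrations and continues
# the text with the translation of the final query.
```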

Architecture and Training

GPT-3 uses essentially the same Pre-Norm, decoder-only architecture as GPT-2, scaled up: the 175B model has 96 layers, 12,288-dimensional hidden states, and 96 attention heads, and it alternates dense and locally banded sparse attention patterns across layers. It was trained on roughly 300B tokens drawn from a filtered Common Crawl, WebText2, two book corpora, and English Wikipedia.

Few-Shot vs. Zero-Shot Performance

One of GPT-3’s biggest revelations was its flexibility across prompting regimes:

  1. Zero-shot: the model is given only a natural-language description of the task, with no examples.
  2. One-shot: the prompt includes a single demonstration of the task.
  3. Few-shot: the prompt includes a handful of demonstrations (typically 10–100, as many as fit in the 2,048-token context window); performance generally improves with more demonstrations and with model size.

This in-context learning approach has become a cornerstone of modern language model usage, reducing the need for large labeled datasets and fine-tuning.

Conclusion and Looking Ahead

The GPT series—GPT-1, GPT-2, and GPT-3—ushered in a new era of large language models:

  1. GPT-1 introduced generative pre-training using a decoder-only Transformer architecture (Post-Norm).
  2. GPT-2 demonstrated that scaling up both model size and training data yields surprisingly robust zero-shot capabilities—and it switched to a Pre-Norm block structure for improved stability.
  3. GPT-3 took scaling to an unprecedented level (175B parameters), showcasing few-shot (in-context) learning across a broad range of NLP tasks with no additional fine-tuning steps.

These developments changed the paradigm for how we build and use NLP systems. Modern research continues to push the limits with even larger models (e.g., GPT-3.5, GPT-4, and beyond), more efficient training techniques, and ongoing exploration of the ethical considerations these powerful generative models raise.

Next Steps