T5: The Text-to-Text Transfer Transformer

Explore how T5 (Text-to-Text Transfer Transformer) simplifies NLP with a unified text-to-text approach.

T5 Cover

Introduction

The T5 (Text-to-Text Transfer Transformer), introduced by Google Research in the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer", presents a unified framework where every NLP task is framed as a text-to-text problem. Unlike traditional models where classification, regression, and sequence generation require distinct architectures, T5 treats everything as text generation.

This approach allows T5 to be used for tasks such as text classification (e.g., sentiment analysis), summarization, question answering, and machine translation, all with the same model, training objective, and decoding procedure.

Architecture of T5

T5 is built upon the Transformer model, but with crucial modifications. Here’s an overview of its key architectural changes:

Transformer Backbone

T5 retains the encoder-decoder structure of the original Transformer but makes several refinements:

  * A simplified layer normalization that only rescales activations (no additive bias), applied before each sub-block with a residual connection around it.
  * Relative position biases added to the attention logits in place of absolute positional encodings (described below).
  * Minor changes to the feed-forward and embedding layers, covered in the following subsections.

Mathematically, if $X$ is the input sequence, the T5 encoder processes it as:

H = \text{Encoder}(X)

where $H$ is the final hidden representation passed to the decoder.

The decoder generates the output sequence autoregressively, predicting one token at a time:

y_t = \text{Decoder}(H, y_{<t})

where $y_t$ is the predicted token at time step $t$, given the previous tokens $y_{<t}$.
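
To make the encoder/decoder split concrete, here is a minimal sketch, assuming the Hugging Face transformers library and the public t5-small checkpoint, that runs the encoder once and then decodes greedily one token at a time (model.generate does essentially this under the hood):

import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Encoder: X -> H (one forward pass over the full input sequence)
enc = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt")
encoder_outputs = model.get_encoder()(**enc)          # H = Encoder(X)

# Decoder: generate y_t autoregressively, conditioned on H and y_<t
decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
for _ in range(20):
    logits = model(encoder_outputs=encoder_outputs,
                   decoder_input_ids=decoder_input_ids).logits
    next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # greedy choice of y_t
    decoder_input_ids = torch.cat([decoder_input_ids, next_token], dim=-1)
    if next_token.item() == model.config.eos_token_id:
        break

print(tokenizer.decode(decoder_input_ids[0], skip_special_tokens=True))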

Relative Position Bias

Instead of the absolute positional encodings used in the original Transformer, T5 incorporates relative position embeddings following the formula:

\text{Attention}(Q, K, V) = \text{softmax} \left( \frac{QK^T}{\sqrt{d_k}} + B \right) V

where:

  * $Q$, $K$, and $V$ are the query, key, and value matrices,
  * $d_k$ is the dimensionality of the keys, and
  * $B$ is a learned bias matrix whose entries depend only on the relative distance between the query and key positions.

This relative bias allows the model to better generalize to different sequence lengths.
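
As a toy illustration, the sketch below implements an additive relative position bias as a lookup over clipped relative distances; note that T5 itself uses a logarithmic bucketing scheme with separate biases per attention head, which is omitted here for brevity:

import torch
import torch.nn.functional as F

def attention_with_relative_bias(Q, K, V, rel_bias_table, max_distance=8):
    """Toy scaled dot-product attention with an additive relative position bias.

    Q, K, V: tensors of shape (seq_len, d_k)
    rel_bias_table: learnable tensor of shape (2 * max_distance + 1,)
    """
    seq_len, d_k = Q.shape
    # Relative distance between every query position i and key position j, clipped to a range
    pos = torch.arange(seq_len)
    rel = (pos[None, :] - pos[:, None]).clamp(-max_distance, max_distance) + max_distance
    B = rel_bias_table[rel]                              # (seq_len, seq_len) bias matrix
    scores = Q @ K.T / d_k ** 0.5 + B                    # QK^T / sqrt(d_k) + B
    return F.softmax(scores, dim=-1) @ V

# Example usage with random inputs
Q = K = V = torch.randn(5, 16)
bias_table = torch.zeros(17, requires_grad=True)         # learned during training
out = attention_with_relative_bias(Q, K, V, bias_table)
print(out.shape)  # torch.Size([5, 16])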

Scaling and Efficient Feedforward Networks

The feed-forward sub-layer is a simple two-layer network with no bias terms. The original T5 uses a ReLU activation here, while the improved T5 v1.1 checkpoints replace it with a gated GeLU variant; in its simplest form:

\text{FFN}(x) = W_2 \cdot \text{GELU}(W_1 x)

where $W_1$ and $W_2$ are weight matrices.

Additionally, T5 shares a single token embedding matrix between the encoder, the decoder, and the decoder's output projection, reducing memory consumption.
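
A minimal PyTorch sketch of this feed-forward block, mirroring the bias-free linear layers used in the Hugging Face T5 implementation (class and parameter names here are illustrative):

import torch
import torch.nn as nn

class T5StyleFFN(nn.Module):
    """Two-layer feed-forward block: FFN(x) = W2 * act(W1 x), with no bias terms."""
    def __init__(self, d_model=512, d_ff=2048, activation=nn.ReLU()):
        super().__init__()
        self.wi = nn.Linear(d_model, d_ff, bias=False)   # W1
        self.wo = nn.Linear(d_ff, d_model, bias=False)   # W2
        self.act = activation                            # ReLU in T5, GELU-based in T5 v1.1

    def forward(self, x):
        return self.wo(self.act(self.wi(x)))

x = torch.randn(2, 10, 512)                              # (batch, seq_len, d_model)
print(T5StyleFFN()(x).shape)                             # torch.Size([2, 10, 512])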

Pre-Training of T5

C4 Dataset

T5 is primarily pre-trained on the C4 (Colossal Clean Crawled Corpus), a massive dataset of approximately 750GB of cleaned text from the Common Crawl. This large-scale corpus provides diverse linguistic coverage.

Note: The dataset has been released as part of TensorFlow Datasets.

Span Corruption Procedure

T5 follows a self-supervised pre-training approach called "Span Corruption", where spans of tokens are randomly replaced with a single sentinel token (mask token), and the model learns to reconstruct the missing tokens. In T5’s span corruption:

  1. Identify Spans: Randomly sample contiguous spans of tokens to mask; in the T5 paper, about 15% of the tokens are corrupted, with an average span length of 3 tokens.
  2. Replace Each Span: Each span is replaced in the encoder input with a special <extra_id_k> token. If multiple spans are masked, we use distinct identifiers <extra_id_0>, <extra_id_1>, etc.
  3. Decoder Target: The decoder must produce these missing spans in the correct order, separated by the same <extra_id_k> markers.

For example, consider the sentence: "The Eiffel Tower is located in Paris."

A possible corrupted version for pre-training might be: "The <extra_id_0> is located in <extra_id_1>."

And the expected output is: "<extra_id_0> Eiffel Tower <extra_id_1> Paris."
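
The sketch below, using the t5-small tokenizer, shows how such an input/target pair is tokenized; the masking here is written out by hand as a simplified illustration rather than the paper's exact sampling procedure:

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")

# Corrupted encoder input: masked spans replaced by sentinel tokens
corrupted_input = "The <extra_id_0> is located in <extra_id_1>."
# Decoder target: the dropped spans, each introduced by its sentinel
# (T5's preprocessing also appends a final sentinel, <extra_id_2>, to mark the end)
target = "<extra_id_0> Eiffel Tower <extra_id_1> Paris <extra_id_2>"

input_ids = tokenizer(corrupted_input, return_tensors="pt").input_ids
labels = tokenizer(target, return_tensors="pt").input_ids

print(tokenizer.convert_ids_to_tokens(input_ids[0]))
print(tokenizer.convert_ids_to_tokens(labels[0]))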

Pre-Training Objective Function

Let $\mathbf{x} = (x_1, x_2, \ldots, x_N)$ be the original uncorrupted token sequence. During pre-training, we corrupt $\mathbf{x}$ by replacing some contiguous spans with special tokens, resulting in the corrupted sequence $\mathbf{x}^{\text{mask}}$.

The target output sequence, which the decoder must generate autoregressively, contains the dropped spans in order, each preceded by its sentinel token, e.g.:

\mathbf{y} = (\texttt{<extra\_id\_0>}, w_1, \ldots, w_m, \texttt{<extra\_id\_1>}, z_1, \ldots, z_k, \dots)

where $(w_1, \ldots, w_m), (z_1, \ldots, z_k), \dots$ are the spans originally removed from $\mathbf{x}$.

The T5 model (parameterized by $\theta$) computes the conditional probability

p_\theta(\mathbf{y} \mid \mathbf{x}^{\text{mask}}) = \prod_{t=1}^{|\mathbf{y}|} p_\theta(y_t \mid y_{<t}, \mathbf{x}^{\text{mask}}).

We train T5 to maximize the log-likelihood of the correct sequence $\mathbf{y}$, or equivalently minimize the negative log-likelihood (cross-entropy):

\mathcal{L}(\theta) = -\sum_{t=1}^{|\mathbf{y}|} \log \, p_\theta(y_t \mid y_{<t}, \mathbf{x}^{\text{mask}}).

This span corruption objective encourages the model to learn how to generate missing pieces of text conditioned on the rest of the sequence, a crucial skill for downstream text-to-text tasks.
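
In practice, the Hugging Face implementation computes exactly this cross-entropy when labels are passed to the model. A minimal sketch, reusing the Eiffel Tower example and the public t5-small checkpoint:

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Corrupted input x^mask and target y (the dropped spans)
inputs = tokenizer("The <extra_id_0> is located in <extra_id_1>.", return_tensors="pt")
labels = tokenizer("<extra_id_0> Eiffel Tower <extra_id_1> Paris <extra_id_2>",
                   return_tensors="pt").input_ids

# Passing labels makes the model return the token-level cross-entropy L(theta)
outputs = model(input_ids=inputs.input_ids,
                attention_mask=inputs.attention_mask,
                labels=labels)
print(outputs.loss)   # -sum_t log p_theta(y_t | y_<t, x^mask), averaged over target tokens
outputs.loss.backward()  # gradients for one pre-training step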

Fine-Tuning of T5

Fine-tuning in T5 follows a straightforward approach: the input text and target text are formatted according to the task, and the model is trained on supervised data.

Task-Specific Prompting

Each task is formatted as a text-to-text mapping using prefix prompts. Examples:

Sentiment Analysis

Input: "sst2 sentence: This movie was absolutely wonderful!" → Output: "positive"

Summarization

Input: "summarize: <article text>" → Output: "<short summary>"

Question Answering

Input: "question: Where is the Eiffel Tower located? context: The Eiffel Tower is located in Paris." → Output: "Paris"

Translation (English → French)

Input: "translate English to French: My name is Anna." → Output: "Je m'appelle Anna."
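
To illustrate the unified interface, the sketch below (again assuming the public t5-small checkpoint, which was trained on these prefixes) feeds several differently prefixed tasks through the same model and decoding call:

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# One model, several tasks, distinguished only by their text prefixes
prompts = [
    "sst2 sentence: This movie was absolutely wonderful!",
    "translate English to French: My name is Anna.",
    "question: Where is the Eiffel Tower located? context: The Eiffel Tower is located in Paris.",
]

batch = tokenizer(prompts, return_tensors="pt", padding=True)
output_ids = model.generate(**batch, max_new_tokens=32)

for prompt, out in zip(prompts, output_ids):
    print(prompt, "->", tokenizer.decode(out, skip_special_tokens=True))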

Fine-Tuning Objective Function

Each task is cast in a text-to-text format:

  1. Task Prefix: A short prefix that tells T5 what to do (e.g., “translate English to German:”, “summarize:”, “sst2 sentence:” for sentiment).
  2. Input Sequence: The actual text input or prompt.
  3. Output Sequence: The expected text output (labels, summaries, translations, etc.).

During fine-tuning, for each training example $i$ we minimize the same token-level cross-entropy:

\min_\theta \; - \sum_{t=1}^{|\mathbf{y}^{(i)}|} \log \, p_\theta\bigl(y_t^{(i)} \mid y_{<t}^{(i)}, \mathbf{x}^{(i)}\bigr)

where $\mathbf{x}^{(i)}$ is the task-specific input (with a prefix), and $\mathbf{y}^{(i)}$ is the desired output text. The parameters $\theta$ are initialized from the pre-trained T5 weights.

Through this process, T5 transfers its learned span corruption capabilities to a wide variety of tasks. Because it is fully autoregressive in the decoder, it can produce free-form text (summaries, translations, etc.), while also being able to generate short “labels” (like sentiment classes).
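
As a minimal illustration, fine-tuning reduces to the same labels-based loss plus an optimizer step. This is a toy single-example loop, not the paper's full recipe (which uses the Adafactor optimizer and large batches):

import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")   # theta initialized from pre-training
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# A single supervised (input, target) pair in text-to-text format
inputs = tokenizer("translate English to German: The house is wonderful.",
                   return_tensors="pt")
labels = tokenizer("Das Haus ist wunderbar.", return_tensors="pt").input_ids

model.train()
for step in range(3):                      # a few toy gradient steps
    loss = model(input_ids=inputs.input_ids,
                 attention_mask=inputs.attention_mask,
                 labels=labels).loss       # cross-entropy over the target tokens
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"step {step}: loss = {loss.item():.4f}")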

Performance of T5

Benchmarks on NLP Tasks

At the time of its release, T5 achieved state-of-the-art results on several NLP benchmarks:

| Task                | Dataset        | Metric   | T5 Score |
|---------------------|----------------|----------|----------|
| Text Classification | SST-2          | Accuracy | 96.4%    |
| Summarization       | CNN/Daily Mail | ROUGE-L  | 41.1     |
| Machine Translation | WMT14 En-Fr    | BLEU     | 43.1     |
| QA                  | SQuAD v1.1     | F1-score | 94.8     |

T5 matched or outperformed contemporaneous BERT- and GPT-based models across many of these benchmarks while handling them all with a single text-to-text model, showcasing its generalization capabilities.

Model Variants and Scaling

T5 comes in several sizes, all sharing the same architecture and training procedure:

  * T5-Small (~60M parameters)
  * T5-Base (~220M parameters)
  * T5-Large (~770M parameters)
  * T5-3B (~3B parameters)
  * T5-11B (~11B parameters)

Performance improves steadily with scale, with T5-11B producing the best results reported in the paper.

Conclusion

T5’s text-to-text formulation makes it an elegant and powerful model for a wide variety of NLP tasks. It unifies different NLP tasks under a single model and achieves state-of-the-art performance on many benchmarks.

For further exploration, you can experiment with Hugging Face's T5 implementation:

# !pip install datasets transformers[sentencepiece]
# !pip install sentencepiece
# !pip install torch

from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load the smallest T5 checkpoint and its SentencePiece tokenizer
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# The task prefix ("summarize:") tells T5 which task to perform
input_text = "summarize: The Eiffel Tower is a famous landmark in France."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

# Generate the output autoregressively and decode it back to text
output_ids = model.generate(input_ids, max_new_tokens=32)
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(output_text)  # Example output (may vary): a short paraphrase of the input sentence

T5's flexibility, efficiency, and performance make it one of the most widely used Transformer-based models today. 🚀

Next Steps