T5: The Text-to-Text Transfer Transformer

Explore how T5 (Text-to-Text Transfer Transformer) simplifies NLP with a unified text-to-text approach.

T5 Cover

Introduction

The T5 (Text-to-Text Transfer Transformer), introduced by Google Research in the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer", presents a unified framework where every NLP task is framed as a text-to-text problem. Unlike traditional models where classification, regression, and sequence generation require distinct architectures, T5 treats everything as text generation.

This approach allows T5 to be used for tasks such as text classification (e.g., sentiment analysis), summarization, question answering, and machine translation, all with the same model, training objective, and decoding procedure.

Architecture of T5

T5 is built upon the Transformer model, but with crucial modifications. Here’s an overview of its key architectural changes:

Transformer Backbone

T5 retains the encoder-decoder structure of the original Transformer but makes several refinements:

  * A simplified layer normalization that only rescales activations (no additive bias), applied before each sub-block with a residual connection around it.
  * Relative position biases added to the attention logits in place of absolute positional encodings (described below).
  * Minor changes to the feed-forward and embedding layers, covered in the following subsections.

Mathematically, if $X$ is the input sequence, the T5 encoder processes it as:

H = \text{Encoder}(X)

where $H$ is the final hidden representation passed to the decoder.

The decoder generates the output sequence autoregressively, predicting one token at a time:

y_t = \text{Decoder}(H, y_{<t})

where $y_t$ is the predicted token at time step $t$, given the previous tokens $y_{<t}$.
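
To make the encoder/decoder split concrete, here is a minimal sketch, assuming the Hugging Face transformers library and the public t5-small checkpoint, that runs the encoder once and then decodes greedily one token at a time (model.generate does essentially this under the hood):

import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Encoder: X -> H (one forward pass over the full input sequence)
enc = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt")
encoder_outputs = model.get_encoder()(**enc)          # H = Encoder(X)

# Decoder: generate y_t autoregressively, conditioned on H and y_<t
decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
for _ in range(20):
    logits = model(encoder_outputs=encoder_outputs,
                   decoder_input_ids=decoder_input_ids).logits
    next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # greedy choice of y_t
    decoder_input_ids = torch.cat([decoder_input_ids, next_token], dim=-1)
    if next_token.item() == model.config.eos_token_id:
        break

print(tokenizer.decode(decoder_input_ids[0], skip_special_tokens=True))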

Relative Position Bias

Instead of the absolute positional encodings used in the original Transformer, T5 incorporates relative position embeddings following the formula:

\text{Attention}(Q, K, V) = \text{softmax} \left( \frac{QK^T}{\sqrt{d_k}} + B \right) V

where:

  * $Q$, $K$, and $V$ are the query, key, and value matrices,
  * $d_k$ is the dimensionality of the keys, and
  * $B$ is a learned bias matrix whose entries depend only on the relative distance between the query and key positions.

This relative bias allows the model to better generalize to different sequence lengths.
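
As a toy illustration, the sketch below implements an additive relative position bias as a lookup over clipped relative distances; note that T5 itself uses a logarithmic bucketing scheme with separate biases per attention head, which is omitted here for brevity:

import torch
import torch.nn.functional as F

def attention_with_relative_bias(Q, K, V, rel_bias_table, max_distance=8):
    """Toy scaled dot-product attention with an additive relative position bias.

    Q, K, V: tensors of shape (seq_len, d_k)
    rel_bias_table: learnable tensor of shape (2 * max_distance + 1,)
    """
    seq_len, d_k = Q.shape
    # Relative distance between every query position i and key position j, clipped to a range
    pos = torch.arange(seq_len)
    rel = (pos[None, :] - pos[:, None]).clamp(-max_distance, max_distance) + max_distance
    B = rel_bias_table[rel]                              # (seq_len, seq_len) bias matrix
    scores = Q @ K.T / d_k ** 0.5 + B                    # QK^T / sqrt(d_k) + B
    return F.softmax(scores, dim=-1) @ V

# Example usage with random inputs
Q = K = V = torch.randn(5, 16)
bias_table = torch.zeros(17, requires_grad=True)         # learned during training
out = attention_with_relative_bias(Q, K, V, bias_table)
print(out.shape)  # torch.Size([5, 16])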

Scaling and Efficient Feedforward Networks

The feed-forward sub-layer is a simple two-layer network with no bias terms. The original T5 uses a ReLU activation here, while the improved T5 v1.1 checkpoints replace it with a gated GeLU variant; in its simplest form:

\text{FFN}(x) = W_2 \cdot \text{GELU}(W_1 x)

where $W_1$ and $W_2$ are weight matrices.

Additionally, T5 shares a single token embedding matrix between the encoder, the decoder, and the decoder's output projection, reducing memory consumption.
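
A minimal PyTorch sketch of this feed-forward block, mirroring the bias-free linear layers used in the Hugging Face T5 implementation (class and parameter names here are illustrative):

import torch
import torch.nn as nn

class T5StyleFFN(nn.Module):
    """Two-layer feed-forward block: FFN(x) = W2 * act(W1 x), with no bias terms."""
    def __init__(self, d_model=512, d_ff=2048, activation=nn.ReLU()):
        super().__init__()
        self.wi = nn.Linear(d_model, d_ff, bias=False)   # W1
        self.wo = nn.Linear(d_ff, d_model, bias=False)   # W2
        self.act = activation                            # ReLU in T5, GELU-based in T5 v1.1

    def forward(self, x):
        return self.wo(self.act(self.wi(x)))

x = torch.randn(2, 10, 512)                              # (batch, seq_len, d_model)
print(T5StyleFFN()(x).shape)                             # torch.Size([2, 10, 512])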

Pre-Training of T5

C4 Dataset

T5 is primarily pre-trained on the C4 (Colossal Clean Crawled Corpus), a massive dataset of approximately 750GB of cleaned text from the Common Crawl. This large-scale corpus provides diverse linguistic coverage.

Note: The dataset has been released as part of TensorFlow Datasets.

Span Corruption Procedure

T5 follows a self-supervised pre-training approach called "Span Corruption", where spans of tokens are randomly replaced with a single sentinel token (mask token), and the model learns to reconstruct the missing tokens. In T5’s span corruption:

  1. Identify Spans: Randomly sample contiguous spans of tokens to mask; in the T5 paper, about 15% of the tokens are corrupted, with an average span length of 3 tokens.
  2. Replace Each Span: Each span is replaced in the encoder input with a special <extra_id_k> token. If multiple spans are masked, we use distinct identifiers <extra_id_0>, <extra_id_1>, etc.
  3. Decoder Target: The decoder must produce these missing spans in the correct order, separated by the same <extra_id_k> markers.

For example, consider the sentence: "The Eiffel Tower is located in Paris."

A possible corrupted version for pre-training might be: "The <extra_id_0> is located in <extra_id_1>."

And the expected output is: "<extra_id_0> Eiffel Tower <extra_id_1> Paris."
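
The sketch below, using the t5-small tokenizer, shows how such an input/target pair is tokenized; the masking here is written out by hand as a simplified illustration rather than the paper's exact sampling procedure:

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")

# Corrupted encoder input: masked spans replaced by sentinel tokens
corrupted_input = "The <extra_id_0> is located in <extra_id_1>."
# Decoder target: the dropped spans, each introduced by its sentinel
# (T5's preprocessing also appends a final sentinel, <extra_id_2>, to mark the end)
target = "<extra_id_0> Eiffel Tower <extra_id_1> Paris <extra_id_2>"

input_ids = tokenizer(corrupted_input, return_tensors="pt").input_ids
labels = tokenizer(target, return_tensors="pt").input_ids

print(tokenizer.convert_ids_to_tokens(input_ids[0]))
print(tokenizer.convert_ids_to_tokens(labels[0]))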

Pre-Training Objective Function

Let $\mathbf{x} = (x_1, x_2, \ldots, x_N)$ be the original uncorrupted token sequence. During pre-training, we corrupt $\mathbf{x}$ by replacing some contiguous spans with special tokens, resulting in the corrupted sequence $\mathbf{x}^{\text{mask}}$.

The target output sequence, which the decoder must generate autoregressively, contains the dropped spans in order, each preceded by its sentinel token, e.g.:

\mathbf{y} = (\texttt{<extra\_id\_0>}, w_1, \ldots, w_m, \texttt{<extra\_id\_1>}, z_1, \ldots, z_k, \dots)

where $(w_1, \ldots, w_m), (z_1, \ldots, z_k), \dots$ are the spans originally removed from $\mathbf{x}$.

The T5 model (parameterized by $\theta$) computes the conditional probability

p_\theta(\mathbf{y} \mid \mathbf{x}^{\text{mask}}) = \prod_{t=1}^{|\mathbf{y}|} p_\theta(y_t \mid y_{<t}, \mathbf{x}^{\text{mask}}).

We train T5 to maximize the log-likelihood of the correct sequence $\mathbf{y}$, or equivalently minimize the negative log-likelihood (cross-entropy):

\mathcal{L}(\theta) = -\sum_{t=1}^{|\mathbf{y}|} \log \, p_\theta(y_t \mid y_{<t}, \mathbf{x}^{\text{mask}}).

This span corruption objective encourages the model to learn how to generate missing pieces of text conditioned on the rest of the sequence, a crucial skill for downstream text-to-text tasks.
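
In practice, the Hugging Face implementation computes exactly this cross-entropy when labels are passed to the model. A minimal sketch, reusing the Eiffel Tower example and the public t5-small checkpoint:

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Corrupted input x^mask and target y (the dropped spans)
inputs = tokenizer("The <extra_id_0> is located in <extra_id_1>.", return_tensors="pt")
labels = tokenizer("<extra_id_0> Eiffel Tower <extra_id_1> Paris <extra_id_2>",
                   return_tensors="pt").input_ids

# Passing labels makes the model return the token-level cross-entropy L(theta)
outputs = model(input_ids=inputs.input_ids,
                attention_mask=inputs.attention_mask,
                labels=labels)
print(outputs.loss)   # -sum_t log p_theta(y_t | y_<t, x^mask), averaged over target tokens
outputs.loss.backward()  # gradients for one pre-training step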

Fine-Tuning of T5

Fine-tuning in T5 follows a straightforward approach: the input text and target text are formatted according to the task, and the model is trained on supervised data.

Task-Specific Prompting

Each task is formatted as a text-to-text mapping using prefix prompts. Examples:

Sentiment Analysis

Input: "sst2 sentence: This movie was absolutely wonderful!" → Output: "positive"

Summarization

Input: "summarize: <article text>" → Output: "<short summary>"

Question Answering

Input: "question: Where is the Eiffel Tower located? context: The Eiffel Tower is located in Paris." → Output: "Paris"

Translation (English → French)

Input: "translate English to French: My name is Anna." → Output: "Je m'appelle Anna."
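
To illustrate the unified interface, the sketch below (again assuming the public t5-small checkpoint, which was trained on these prefixes) feeds several differently prefixed tasks through the same model and decoding call:

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# One model, several tasks, distinguished only by their text prefixes
prompts = [
    "sst2 sentence: This movie was absolutely wonderful!",
    "translate English to French: My name is Anna.",
    "question: Where is the Eiffel Tower located? context: The Eiffel Tower is located in Paris.",
]

batch = tokenizer(prompts, return_tensors="pt", padding=True)
output_ids = model.generate(**batch, max_new_tokens=32)

for prompt, out in zip(prompts, output_ids):
    print(prompt, "->", tokenizer.decode(out, skip_special_tokens=True))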

Fine-Tuning Objective Function

Each task is cast in a text-to-text format:

  1. Task Prefix: A short prefix that tells T5 what to do (e.g., “translate English to German:”, “summarize:”, “sst2 sentence:” for sentiment).
  2. Input Sequence: The actual text input or prompt.
  3. Output Sequence: The expected text output (labels, summaries, translations, etc.).

During fine-tuning, for each training example $i$ we minimize the same token-level cross-entropy:

\min_\theta \; - \sum_{t=1}^{|\mathbf{y}^{(i)}|} \log \, p_\theta\bigl(y_t^{(i)} \mid y_{<t}^{(i)}, \mathbf{x}^{(i)}\bigr)

where $\mathbf{x}^{(i)}$ is the task-specific input (with a prefix), and $\mathbf{y}^{(i)}$ is the desired output text. The parameters $\theta$ are initialized from the pre-trained T5 weights.

Through this process, T5 transfers its learned span corruption capabilities to a wide variety of tasks. Because it is fully autoregressive in the decoder, it can produce free-form text (summaries, translations, etc.), while also being able to generate short “labels” (like sentiment classes).
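
As a minimal illustration, fine-tuning reduces to the same labels-based loss plus an optimizer step. This is a toy single-example loop, not the paper's full recipe (which uses the Adafactor optimizer and large batches):

import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")   # theta initialized from pre-training
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# A single supervised (input, target) pair in text-to-text format
inputs = tokenizer("translate English to German: The house is wonderful.",
                   return_tensors="pt")
labels = tokenizer("Das Haus ist wunderbar.", return_tensors="pt").input_ids

model.train()
for step in range(3):                      # a few toy gradient steps
    loss = model(input_ids=inputs.input_ids,
                 attention_mask=inputs.attention_mask,
                 labels=labels).loss       # cross-entropy over the target tokens
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"step {step}: loss = {loss.item():.4f}")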

Performance of T5

Benchmarks on NLP Tasks

At the time of its release, T5 achieved state-of-the-art results on several NLP benchmarks:

| Task                | Dataset        | Metric   | T5 Score |
|---------------------|----------------|----------|----------|
| Text Classification | SST-2          | Accuracy | 96.4%    |
| Summarization       | CNN/Daily Mail | ROUGE-L  | 41.1     |
| Machine Translation | WMT14 En-Fr    | BLEU     | 43.1     |
| QA                  | SQuAD v1.1     | F1-score | 94.8     |

T5 matched or outperformed contemporaneous BERT- and GPT-based models across many of these benchmarks while handling them all with a single text-to-text model, showcasing its generalization capabilities.

Model Variants and Scaling

T5 comes in several sizes, all sharing the same architecture and training procedure:

  * T5-Small (~60M parameters)
  * T5-Base (~220M parameters)
  * T5-Large (~770M parameters)
  * T5-3B (~3B parameters)
  * T5-11B (~11B parameters)

Performance improves steadily with scale, with T5-11B producing the best results reported in the paper.

Conclusion

T5’s text-to-text formulation makes it an elegant and powerful model for a wide variety of NLP tasks. It unifies different NLP tasks under a single model and achieves state-of-the-art performance on many benchmarks.

For further exploration, you can experiment with Hugging Face's T5 implementation:

# !pip install datasets transformers[sentencepiece]
# !pip install sentencepiece
# !pip install torch

from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load the smallest T5 checkpoint and its SentencePiece tokenizer
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# The task prefix ("summarize:") tells T5 which task to perform
input_text = "summarize: The Eiffel Tower is a famous landmark in France."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

# Generate the output autoregressively and decode it back to text
output_ids = model.generate(input_ids, max_new_tokens=32)
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(output_text)  # Example output (may vary): a short paraphrase of the input sentence

T5's flexibility, efficiency, and performance make it one of the most widely used Transformer-based models today. 🚀

Next Steps