Sinusoidal Positional Encoding in the Transformer

Learn how sine and cosine functions bring sequence order to Transformer models with detailed explanations and examples.


Introduction

The Transformer architecture, introduced by Vaswani et al. in 2017, revolutionized natural language processing by utilizing self-attention mechanisms instead of traditional recurrent or convolutional neural networks. One critical component of the Transformer is the positional encoding, which injects information about the position of each token in a sequence. This article delves into the sinusoidal positional encoding method, providing detailed explanations, equations, and illustrative examples using small vectors.

The Necessity of Positional Encoding

Transformers process input sequences in parallel rather than sequentially. While this parallelism enhances computational efficiency, it also means that the model lacks inherent information about the order of tokens. In tasks like language translation or text summarization, understanding the sequence order is crucial. Positional encoding compensates for this by providing a way to include positional information in the input embeddings.

Sinusoidal Positional Encoding Explained

Sinusoidal positional encoding adds position-specific patterns to the embeddings using sine and cosine functions of different frequencies. This method is deterministic and requires no additional learned parameters, making it efficient and effective for capturing positional relationships.

Mathematical Formulation

For a sequence of length $N$ and an embedding dimension $d_{\text{model}}$, the positional encoding $\text{PE}$ at position $pos$ is defined for each dimension pair index $i$ as:

$$
\begin{aligned}
\text{PE}_{(pos,\,2i)} &= \sin\left(\frac{pos}{10000^{\frac{2i}{d_{\text{model}}}}}\right) \\
\text{PE}_{(pos,\,2i+1)} &= \cos\left(\frac{pos}{10000^{\frac{2i}{d_{\text{model}}}}}\right)
\end{aligned}
$$

where:

- $pos$ is the position of the token in the sequence (starting from 0),
- $i$ indexes the sine/cosine pair, so $2i$ and $2i + 1$ are the even and odd embedding dimensions,
- $d_{\text{model}}$ is the embedding dimension.
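
As a quick illustration, here is a minimal NumPy sketch of this formula. The function name `sinusoidal_positional_encoding` and its signature are illustrative choices (and it assumes an even $d_{\text{model}}$), not code from the original paper.

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions: int, d_model: int) -> np.ndarray:
    """Return an (n_positions, d_model) matrix of sinusoidal positional encodings."""
    # i indexes the sine/cosine pairs: dimensions 2i and 2i+1 share one frequency.
    i = np.arange(d_model // 2)                  # shape: (d_model // 2,)
    denominators = 10000 ** (2 * i / d_model)    # one scaling factor per pair
    positions = np.arange(n_positions)[:, None]  # shape: (n_positions, 1)
    angles = positions / denominators            # shape: (n_positions, d_model // 2)

    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions: cosine
    return pe
```

In practice this matrix is typically precomputed once for the maximum sequence length and then sliced as needed, since it depends only on position and dimension, not on the input tokens.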

Key Properties

- Each pair of dimensions $(2i, 2i + 1)$ corresponds to a sinusoid with its own wavelength; the wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$.
- All values lie in the range $[-1, 1]$, the same scale as typical token embeddings.
- For any fixed offset $k$, $\text{PE}_{(pos + k)}$ can be expressed as a linear function of $\text{PE}_{(pos)}$, which makes relative positions easy for the model to represent.

Step-by-Step Example with Small Vectors

Let's illustrate sinusoidal positional encoding with a simple example.

Parameters

- Embedding dimension: $d_{\text{model}} = 4$
- Sequence length: $N = 5$ (positions $pos = 0$ through $4$)

Calculating the Positional Encodings

We compute the positional encoding for each position $pos$ (from 0 to 4) and each dimension (from 0 to 3).

Understanding the Dimension Indices

In the positional encoding formula, we use $2i$ and $2i + 1$ to index into the dimensions of the embedding. This means that for each $i$, we have two dimensions:

- dimension $2i$, which receives the sine value
- dimension $2i + 1$, which receives the cosine value

Since our embedding dimension is $d_{\text{model}} = 4$, the possible values of $i$ are:

- $i = 0$, covering dimensions 0 and 1
- $i = 1$, covering dimensions 2 and 3

This covers all dimensions from 0 to $d_{\text{model}} - 1$.
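
A tiny sketch makes the pairing concrete, assuming the $d_{\text{model}} = 4$ setting of this example:

```python
d_model = 4  # embedding dimension from the example

# Each pair index i supplies a sine value (dimension 2i) and a cosine value (dimension 2i + 1).
for i in range(d_model // 2):
    print(f"i = {i}: sin -> dimension {2 * i}, cos -> dimension {2 * i + 1}")

# Output:
# i = 0: sin -> dimension 0, cos -> dimension 1
# i = 1: sin -> dimension 2, cos -> dimension 3
```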


Compute the Denominators

For each $i$, calculate the scaling factor (denominator):

$$\text{Denominator}_i = 10000^{\frac{2i}{d_{\text{model}}}}$$

For $i = 0$:

$$\frac{2i}{d_{\text{model}}} = \frac{2 \times 0}{4} = 0, \qquad \text{Denominator}_0 = 10000^{0} = 1$$

For $i = 1$:

$$\frac{2i}{d_{\text{model}}} = \frac{2 \times 1}{4} = 0.5, \qquad \text{Denominator}_1 = 10000^{0.5} = 100$$

Calculate θ Values

For each position $pos$ and index $i$, compute:

$$\theta_{(pos,\,i)} = \frac{pos}{\text{Denominator}_i}$$

Compute Sine and Cosine Values

$$\text{PE}_{(pos,\,2i)} = \sin\left(\theta_{(pos,\,i)}\right), \qquad \text{PE}_{(pos,\,2i+1)} = \cos\left(\theta_{(pos,\,i)}\right)$$
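
The three steps above can be combined in a short NumPy sketch for this small example ($d_{\text{model}} = 4$, positions 0 through 4); the variable names are illustrative.

```python
import numpy as np

d_model = 4
positions = np.arange(5)  # pos = 0, 1, 2, 3, 4

for i in range(d_model // 2):
    denominator = 10000 ** (2 * i / d_model)  # 1 for i = 0, 100 for i = 1
    theta = positions / denominator           # theta_(pos, i) for every position
    print(f"i = {i}, denominator = {denominator:g}")
    print(f"  sin -> dimension {2 * i}:     {np.round(np.sin(theta), 5)}")
    print(f"  cos -> dimension {2 * i + 1}: {np.round(np.cos(theta), 5)}")
```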

Positional Encodings for Each Position

Let's compute the positional encodings step by step.

Position $pos = 0$

For $i = 0$: $\theta_{(0,0)} = 0 / 1 = 0$, so $\sin(0) = 0$ and $\cos(0) = 1$.
For $i = 1$: $\theta_{(0,1)} = 0 / 100 = 0$, so $\sin(0) = 0$ and $\cos(0) = 1$.

$$\text{PE}_{(0)} = [0,\ 1,\ 0,\ 1]$$

Position $pos = 1$

For $i = 0$: $\theta_{(1,0)} = 1 / 1 = 1$, so $\sin(1) \approx 0.8415$ and $\cos(1) \approx 0.5403$.
For $i = 1$: $\theta_{(1,1)} = 1 / 100 = 0.01$, so $\sin(0.01) \approx 0.0100$ and $\cos(0.01) \approx 0.99995$.

$$\text{PE}_{(1)} \approx [0.8415,\ 0.5403,\ 0.0100,\ 0.99995]$$

Similarly, we can calculate the encodings for $pos = 2$, $pos = 3$, and $pos = 4$; the results appear in the matrix below.

Assembling the Positional Encoding Vectors

For each position $pos$, the positional encoding vector is:

$$\text{PE}_{(pos)} = \left[\text{PE}_{(pos,0)},\ \text{PE}_{(pos,1)},\ \text{PE}_{(pos,2)},\ \text{PE}_{(pos,3)}\right]$$

Positional Encoding Matrix

| Position ($pos$) | Dimension 0 ($\sin$) | Dimension 1 ($\cos$) | Dimension 2 ($\sin$) | Dimension 3 ($\cos$) |
|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 1 |
| 1 | 0.8415 | 0.5403 | 0.00999983 | 0.99995 |
| 2 | 0.9093 | -0.4161 | 0.0199987 | 0.99980 |
| 3 | 0.1411 | -0.9899 | 0.0299955 | 0.99955 |
| 4 | -0.7568 | -0.6536 | 0.0399893 | 0.99920 |
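
The whole matrix can be reproduced with a few lines of NumPy (a standalone variant of the earlier sketch); the printed values, rounded to four decimal places, match the table.

```python
import numpy as np

n_positions, d_model = 5, 4
i = np.arange(d_model // 2)
angles = np.arange(n_positions)[:, None] / 10000 ** (2 * i / d_model)

pe = np.zeros((n_positions, d_model))
pe[:, 0::2] = np.sin(angles)  # even dimensions
pe[:, 1::2] = np.cos(angles)  # odd dimensions

print(np.round(pe, 4))
# Output (rounded to four decimal places):
# [[ 0.      1.      0.      1.    ]
#  [ 0.8415  0.5403  0.01    1.    ]
#  [ 0.9093 -0.4161  0.02    0.9998]
#  [ 0.1411 -0.99    0.03    0.9996]
#  [-0.7568 -0.6536  0.04    0.9992]]
```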

Interpretation

- Dimensions 0 and 1 (the $i = 0$ pair) oscillate quickly as the position increases, distinguishing nearby positions from one another.
- Dimensions 2 and 3 (the $i = 1$ pair) change slowly because of the larger denominator (100), encoding coarser, longer-range position information.
- Position 0 always receives $[0, 1, 0, 1]$, since $\sin(0) = 0$ and $\cos(0) = 1$.

Together, the fast- and slow-varying sinusoids give every position a distinct pattern across the embedding dimensions.

Incorporating Positional Encodings into the Transformer

The positional encodings are added to the token embeddings before being fed into the Transformer layers:

$$\text{Input}_{\text{Transformer}} = \text{Embedding} + \text{Positional Encoding}$$

This addition allows the model to consider both the semantic meaning of the tokens and their positions in the sequence.
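
As a minimal sketch of this step, the snippet below adds the precomputed encodings to a batch of (here, randomly generated stand-in) token embeddings; the shapes and variable names are illustrative. Note that in the original paper the token embeddings are also multiplied by $\sqrt{d_{\text{model}}}$ before this addition, a detail omitted here for brevity.

```python
import numpy as np

batch_size, seq_len, d_model = 2, 5, 4

# Stand-in for learned token embeddings of shape (batch, sequence length, d_model).
embeddings = np.random.randn(batch_size, seq_len, d_model)

# Sinusoidal positional encodings for the seq_len positions, as computed above.
i = np.arange(d_model // 2)
angles = np.arange(seq_len)[:, None] / 10000 ** (2 * i / d_model)
pe = np.zeros((seq_len, d_model))
pe[:, 0::2], pe[:, 1::2] = np.sin(angles), np.cos(angles)

# The (seq_len, d_model) encoding broadcasts across the batch dimension.
transformer_input = embeddings + pe  # shape: (batch_size, seq_len, d_model)
```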

Advantages of Sinusoidal Positional Encoding

- No additional parameters: the encodings are fixed, so they add nothing to model size or training time.
- Extrapolation: the functions are defined for any position, so the model can, in principle, handle sequences longer than those seen during training.
- Unique, smooth patterns: every position receives a distinct combination of fast- and slow-varying sinusoids that changes smoothly with position.

Conclusion

Sinusoidal positional encoding is a cornerstone of the Transformer architecture, enabling it to capture sequence order without compromising the parallelism that makes it so efficient. By leveraging sine and cosine functions of varying frequencies, this approach elegantly embeds both local and global positional information into token embeddings.