Sinusoidal Positional Encoding in the Transformer
Learn how sine and cosine functions bring sequence order to Transformer models with detailed explanations and examples.

Introduction
The Transformer architecture, introduced by Vaswani et al. in 2017, revolutionized natural language processing by utilizing self-attention mechanisms instead of traditional recurrent or convolutional neural networks. One critical component of the Transformer is the positional encoding, which injects information about the position of each token in a sequence. This article delves into the sinusoidal positional encoding method, providing detailed explanations, equations, and illustrative examples using small vectors.
The Necessity of Positional Encoding
Transformers process input sequences in parallel rather than sequentially. While this parallelism enhances computational efficiency, it also means that the model lacks inherent information about the order of tokens. In tasks like language translation or text summarization, understanding the sequence order is crucial. Positional encoding compensates for this by providing a way to include positional information in the input embeddings.
Sinusoidal Positional Encoding Explained
Sinusoidal positional encoding adds position-specific patterns to the embeddings using sine and cosine functions of different frequencies. This method is deterministic and requires no additional learned parameters, making it efficient and effective for capturing positional relationships.
Mathematical Formulation
For a sequence of length $L$ and an embedding dimension $d_{\text{model}}$, the positional encoding at position $pos$ for pair index $i$ is defined as:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

where:
- $pos$ is the token's position in the sequence (starting from 0).
- $i$ is the index of the pair of sine and cosine functions.
- $2i$ and $2i + 1$ are the dimension indices in the embedding (from 0 to $d_{\text{model}} - 1$).
- $d_{\text{model}}$ is the model's embedding size.
- $10000^{2i/d_{\text{model}}}$ scales the frequencies.
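As a concrete sketch (not code from the original paper; the function and variable names are illustrative), the formula can be implemented in a few lines of NumPy:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sinusoidal positional encodings.

    Assumes d_model is even so that sine/cosine pairs fill all dimensions.
    """
    positions = np.arange(seq_len)[:, np.newaxis]        # shape (seq_len, 1)
    two_i = np.arange(0, d_model, 2)                     # even dimension indices 2i
    denominators = np.power(10000.0, two_i / d_model)    # 10000^(2i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(positions / denominators)       # even dimensions use sine
    pe[:, 1::2] = np.cos(positions / denominators)       # odd dimensions use cosine
    return pe
```

Calling `sinusoidal_positional_encoding(5, 4)` reproduces the 5 × 4 example worked out later in this article.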
Key Properties
- Different Frequencies: Each pair of sine and cosine functions corresponds to a unique frequency.
- Relative Positioning: Enables the model to learn relative positions between tokens, because the encoding at position $pos + k$ is a fixed linear transformation (a rotation within each sine/cosine pair) of the encoding at position $pos$; see the sketch after this list.
- Infinite Sequences: Can generalize to sequence lengths longer than those seen during training.
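To make the relative-positioning point concrete, the following sketch (illustrative variable names, assuming the formulation above) checks that within each sine/cosine pair, the encoding at position $pos + k$ is a fixed rotation of the encoding at position $pos$, with a rotation angle that depends only on the offset $k$:

```python
import numpy as np

d_model, i, k = 4, 1, 3                       # embedding size, pair index, position offset
omega = 1.0 / 10000 ** (2 * i / d_model)      # angular frequency of pair i

def pair(pos):
    """Sine/cosine pair of dimensions (2i, 2i+1) at a given position."""
    return np.array([np.sin(pos * omega), np.cos(pos * omega)])

# Rotation by k * omega maps the pair at position pos onto the pair at pos + k.
rotation = np.array([[ np.cos(k * omega), np.sin(k * omega)],
                     [-np.sin(k * omega), np.cos(k * omega)]])

for pos in range(10):
    assert np.allclose(rotation @ pair(pos), pair(pos + k))
```

Because this rotation depends only on the offset $k$ and not on $pos$, relative positions correspond to fixed linear maps that the model can learn to exploit.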
Step-by-Step Example with Small Vectors
Let's illustrate sinusoidal positional encoding with a simple example.
Parameters
- Sequence Length ($L$): 5 tokens.
- Embedding Dimension ($d_{\text{model}}$): 4 dimensions.
Calculating the Positional Encodings
We compute the positional encoding for each position (from 0 to 4) and each dimension (from 0 to 3).
Understanding the Dimension Indices
In the positional encoding formula, we use $2i$ and $2i + 1$ to index into the dimensions of the embedding. This means that for each $i$, we have two dimensions:
- Even Dimension ($2i$): Uses the sine function.
- Odd Dimension ($2i + 1$): Uses the cosine function.
Since our embedding dimension is $d_{\text{model}} = 4$, the possible values of $i$ are:
- For $i = 0$:
  - $2i = 0$ (Dimension 0)
  - $2i + 1 = 1$ (Dimension 1)
- For $i = 1$:
  - $2i = 2$ (Dimension 2)
  - $2i + 1 = 3$ (Dimension 3)
This covers all dimensions from 0 to $d_{\text{model}} - 1 = 3$.
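A minimal snippet (illustrative only) makes this index bookkeeping explicit:

```python
d_model = 4
for i in range(d_model // 2):
    print(f"i = {i}: sine -> dimension {2 * i}, cosine -> dimension {2 * i + 1}")
# i = 0: sine -> dimension 0, cosine -> dimension 1
# i = 1: sine -> dimension 2, cosine -> dimension 3
```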

Compute the Denominators
For each $i$, calculate the scaling factor (denominator) $10000^{2i/d_{\text{model}}}$:
- For $i = 0$: $10000^{0/4} = 10000^{0} = 1$
- For $i = 1$: $10000^{2/4} = 10000^{0.5} = 100$
Calculate the Angle Values
For each position $pos$ and index $i$, compute:

$$\frac{pos}{10000^{2i/d_{\text{model}}}}$$

Compute Sine and Cosine Values
- For Even Dimensions ($2i$): $PE_{(pos,\,2i)} = \sin\!\left(\dfrac{pos}{10000^{2i/d_{\text{model}}}}\right)$
- For Odd Dimensions ($2i + 1$): $PE_{(pos,\,2i+1)} = \cos\!\left(\dfrac{pos}{10000^{2i/d_{\text{model}}}}\right)$
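A quick check of the two denominators (a sketch; variable names are illustrative):

```python
d_model = 4
denominators = [10000 ** (2 * i / d_model) for i in range(d_model // 2)]
print(denominators)  # [1.0, 100.0]
```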
Positional Encodings for Each Position
Let's compute the positional encodings step by step.
Position $pos = 0$
- For $i = 0$ (Dimensions 0 and 1, Denominator = 1):
  - Dimension 0 ($\sin$): $\sin(0/1) = \sin(0) = 0$
  - Dimension 1 ($\cos$): $\cos(0/1) = \cos(0) = 1$
- For $i = 1$ (Dimensions 2 and 3, Denominator = 100):
  - Dimension 2 ($\sin$): $\sin(0/100) = \sin(0) = 0$
  - Dimension 3 ($\cos$): $\cos(0/100) = \cos(0) = 1$
Position $pos = 1$
- For $i = 0$ (Denominator = 1):
  - Dimension 0: $\sin(1/1) = \sin(1) \approx 0.8415$
  - Dimension 1: $\cos(1/1) = \cos(1) \approx 0.5403$
- For $i = 1$ (Denominator = 100):
  - Dimension 2: $\sin(1/100) = \sin(0.01) \approx 0.00999983$
  - Dimension 3: $\cos(1/100) = \cos(0.01) \approx 0.99995$
Similarly, we can calculate the encodings for $pos = 2$, $pos = 3$, and $pos = 4$; a short script that does this is shown below.
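The remaining positions can be computed with a short script (a NumPy sketch of the same formula; names are illustrative), which reproduces the matrix assembled below:

```python
import numpy as np

seq_len, d_model = 5, 4
pe = np.zeros((seq_len, d_model))
for pos in range(seq_len):
    for i in range(d_model // 2):
        denom = 10000 ** (2 * i / d_model)         # 1 for i = 0, 100 for i = 1
        pe[pos, 2 * i] = np.sin(pos / denom)       # even dimension: sine
        pe[pos, 2 * i + 1] = np.cos(pos / denom)   # odd dimension: cosine

print(np.round(pe, 5))
```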
Assembling the Positional Encoding Vectors
For each position $pos$, the positional encoding vector is:

$$PE_{pos} = \left[\, PE_{(pos,\,0)},\; PE_{(pos,\,1)},\; PE_{(pos,\,2)},\; PE_{(pos,\,3)} \,\right]$$
Positional Encoding Matrix
| Position ($pos$) | Dimension 0 ($\sin(pos/1)$) | Dimension 1 ($\cos(pos/1)$) | Dimension 2 ($\sin(pos/100)$) | Dimension 3 ($\cos(pos/100)$) |
|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 1 |
| 1 | 0.8415 | 0.5403 | 0.00999983 | 0.99995 |
| 2 | 0.9093 | -0.4161 | 0.0199987 | 0.99980 |
| 3 | 0.1411 | -0.9899 | 0.0299955 | 0.99955 |
| 4 | -0.7568 | -0.6536 | 0.0399893 | 0.99920 |
Interpretation
- First Pair of Dimensions (0 & 1): Rapidly changing due to a smaller denominator (1), capturing fine positional differences.
- Second Pair of Dimensions (2 & 3): Slowly changing due to a larger denominator (100), capturing broader positional trends.
- Combination: Provides both local and global positional information.
Incorporating Positional Encodings into the Transformer
The positional encodings are added to the token embeddings before being fed into the Transformer layers:

$$\text{input}_{pos} = \text{embedding}_{pos} + PE_{pos}$$
This addition allows the model to consider both the semantic meaning of the tokens and their positions in the sequence.
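A minimal sketch of this step, assuming a (seq_len, d_model) matrix of token embeddings and reusing the sinusoidal_positional_encoding helper sketched earlier (both names are illustrative, not a specific library API):

```python
import numpy as np

seq_len, d_model = 5, 4
rng = np.random.default_rng(0)

token_embeddings = rng.normal(size=(seq_len, d_model))   # stand-in for learned token embeddings
pe = sinusoidal_positional_encoding(seq_len, d_model)    # helper from the earlier sketch

transformer_input = token_embeddings + pe                # element-wise sum, shape unchanged
print(transformer_input.shape)                           # (5, 4)
```

Because the encoding depends only on position and not on the tokens themselves, the same matrix can be reused (broadcast) across every sequence in a batch.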
Advantages of Sinusoidal Positional Encoding
- Deterministic: No additional parameters to learn, reducing complexity.
- Generalization: Can handle sequences longer than those seen during training.
- Relative Positioning: Facilitates learning of relative positions between tokens.
Conclusion
Sinusoidal positional encoding is a cornerstone of the Transformer architecture, enabling it to capture sequence order without compromising the parallelism that makes it so efficient. By leveraging sine and cosine functions of varying frequencies, this approach elegantly embeds both local and global positional information into token embeddings.