Part 2
Completed

Sinusoidal, Learned, RoPE, and ALiBi Positional Methods

What I Built

Implemented and compared four major positional encoding strategies: sinusoidal (Vaswani et al.), learned positional embeddings, Rotary Position Embedding (RoPE), and Attention with Linear Biases (ALiBi). Evaluated each on length generalization, extrapolation capability, and training stability.
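As a reference point for the simplest of the four methods, here is a minimal sketch of the fixed sinusoidal table from Vaswani et al. The function name, argument names, and use of NumPy are assumptions made for illustration, not the project's actual code.

```python
import numpy as np


def sinusoidal_encoding(seq_len: int, d_model: int, base: float = 10000.0) -> np.ndarray:
    """Return a (seq_len, d_model) table of fixed sinusoidal position encodings.

    Hypothetical helper: even columns hold sin, odd columns hold cos,
    and each sin/cos pair shares one frequency.
    """
    if d_model % 2 != 0:
        raise ValueError("d_model must be even")
    positions = np.arange(seq_len)[:, None]                      # (seq_len, 1)
    # One frequency per pair: base^(-2i / d_model) for i = 0 .. d_model/2 - 1
    inv_freq = base ** (-np.arange(0, d_model, 2) / d_model)     # (d_model/2,)
    angles = positions * inv_freq[None, :]                       # (seq_len, d_model/2)
    table = np.zeros((seq_len, d_model))
    table[:, 0::2] = np.sin(angles)
    table[:, 1::2] = np.cos(angles)
    return table
```

Because the table is a closed-form function of position, it can be evaluated for any sequence length, whereas a learned embedding table has no row for positions beyond the maximum length it was trained with.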

Key Concepts

  • Sinusoidal Encoding
  • Learned Positional Embeddings
  • Rotary Position Embedding (RoPE)
  • ALiBi
  • Length Extrapolation
  • Relative Position

Architecture

1. Positional Encoding Layer
2. RoPE Rotary Kernel
3. ALiBi Bias Generator
4. Extrapolation Tester
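To give a sense of what a bias generator like the third component might compute, the sketch below builds the ALiBi additive bias for causal attention. It assumes the slope schedule from the ALiBi paper (a geometric sequence starting at 2^(-8/n) for n heads) and handles only power-of-two head counts. The function names are hypothetical and not taken from the project.

```python
import numpy as np


def alibi_slopes(num_heads: int) -> np.ndarray:
    """Per-head slopes: a geometric sequence starting at 2^(-8 / num_heads)."""
    if num_heads < 1 or num_heads & (num_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    start = 2.0 ** (-8.0 / num_heads)
    return start ** np.arange(1, num_heads + 1)


def alibi_bias(num_heads: int, seq_len: int) -> np.ndarray:
    """Return a (num_heads, seq_len, seq_len) bias to add to attention logits.

    Entry [h, i, j] is -slope_h * (i - j) for keys at or before the query.
    Future positions (j > i) get 0 here and are assumed to be removed by the
    usual causal mask.
    """
    pos = np.arange(seq_len)
    distance = np.maximum(pos[:, None] - pos[None, :], 0)   # query index minus key index
    return -alibi_slopes(num_heads)[:, None, None] * distance[None, :, :]
```

The bias has no learned parameters and depends only on the query-key distance, so generating it for a longer sequence simply extends the same linear penalty.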

Results

RoPE showed the best extrapolation: at 2x the training length, perplexity degraded by less than 5%. ALiBi demonstrated superior training stability on long sequences.
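For readers who want to reproduce this kind of comparison, the sketch below shows one way the degradation figure could be computed. It assumes "degradation" means the relative increase in perplexity over the training-length baseline. The write-up does not state the exact definition, and both function names are hypothetical.

```python
import math
from typing import Sequence


def perplexity(token_nlls: Sequence[float]) -> float:
    """Perplexity = exp(mean per-token negative log-likelihood, natural log)."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


def relative_degradation(ppl_at_train_len: float, ppl_at_longer_len: float) -> float:
    """Fractional perplexity increase relative to the training-length baseline.

    A return value of 0.05 corresponds to a 5% degradation.
    """
    return (ppl_at_longer_len - ppl_at_train_len) / ppl_at_train_len
```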

Key Learnings

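  • RoPE rotates queries and keys by position, so attention scores depend only on relative offset, which makes it inherently better at length generalization
  • ALiBi's simplicity is deceptive: it adds only a fixed linear distance penalty to attention scores, yet performs remarkably well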
  • Sinusoidal encodings still have value for fixed-length tasks
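The relative-position property behind the first learning can be written out with the standard RoPE identity. This is a sketch for a single two-dimensional pair, where q and k are the query and key sub-vectors, m and n are their positions, and θ is the pair's rotation frequency. The symbols are notation chosen here, not taken from the project code.

```latex
\[
R(\alpha) =
\begin{pmatrix}
\cos\alpha & -\sin\alpha \\
\sin\alpha & \cos\alpha
\end{pmatrix},
\qquad
R(\alpha)^{\top} R(\beta) = R(\beta - \alpha)
\]
\[
\bigl(R(m\theta)\,q\bigr)^{\top} \bigl(R(n\theta)\,k\bigr)
  = q^{\top} R(m\theta)^{\top} R(n\theta)\, k
  = q^{\top} R\bigl((n - m)\theta\bigr)\, k
\]
```

The score depends on the positions only through n - m, so a given offset looks the same anywhere in the sequence. This does not by itself guarantee extrapolation, since offsets larger than any seen in training still produce rotation angles the model has not encountered.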

Challenges

  • Implementing RoPE efficiently without materializing full position matrices
  • Comparing methods fairly across different sequence lengths
  • Understanding why some methods fail at extrapolation
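On the first challenge, one common way to avoid materializing rotation matrices is to store only cosine and sine tables and apply the rotation elementwise to adjacent channel pairs. The sketch below assumes that pairing convention (channels 2i and 2i+1 rotate together) and a frequency base of 10000. The function names and the NumPy implementation are illustrative assumptions rather than the project's kernel.

```python
import numpy as np


def rope_tables(seq_len: int, head_dim: int, base: float = 10000.0):
    """Return (cos, sin) tables of shape (seq_len, head_dim // 2).

    This is the only position-dependent state stored: no
    (head_dim x head_dim) rotation matrix is built for any position.
    """
    if head_dim % 2 != 0:
        raise ValueError("head_dim must be even")
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)    # (head_dim/2,)
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]      # (seq_len, head_dim/2)
    return np.cos(angles), np.sin(angles)


def apply_rope(x: np.ndarray, cos: np.ndarray, sin: np.ndarray) -> np.ndarray:
    """Rotate each adjacent pair (x[..., 2i], x[..., 2i+1]) by its position angle.

    x is a float array of shape (seq_len, head_dim); cos and sin come from
    rope_tables with matching seq_len and head_dim.
    """
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out
```

Applied to both queries and keys before the dot product, this costs a few elementwise multiplies per token, and the tables can be regenerated for any sequence length when testing extrapolation.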