Part 2
Completed
Sinusoidal, Learned, RoPE, and ALiBi Positional Methods
What I Built
Implemented and compared four major positional encoding strategies: sinusoidal (Vaswani et al.), learned positional embeddings, Rotary Position Embedding (RoPE), and Attention with Linear Biases (ALiBi). Evaluated each on length generalization, extrapolation capability, and training stability.
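As a point of reference for the first of the four strategies, here is a minimal sketch of the sinusoidal table from Vaswani et al., written in plain Python so it runs without a tensor library. The function name `sinusoidal_encoding` and its parameters are illustrative assumptions, not names from the project code.

```python
import math

def sinusoidal_encoding(num_positions, d_model):
    """Return a num_positions x d_model table of sinusoidal encodings."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even")
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model // 2):
            # Each sin/cos pair shares one frequency; frequencies shrink
            # geometrically as the pair index i grows.
            angle = pos / (10000 ** (2 * i / d_model))
            row.append(math.sin(angle))  # dimension 2i
            row.append(math.cos(angle))  # dimension 2i + 1
        table.append(row)
    return table

pe = sinusoidal_encoding(num_positions=8, d_model=4)
```

Because the table is a pure function of the position index, it can be generated for any sequence length without retraining, whereas a learned embedding table only has rows for the positions it was trained with.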
Key Concepts
Sinusoidal Encoding, Learned Positional Embeddings, Rotary Position Embedding (RoPE), ALiBi, Length Extrapolation, Relative Position
Architecture
1. Positional Encoding Layer
2. RoPE Rotary Kernel
3. ALiBi Bias Generator
4. Extrapolation Tester
Results
RoPE showed the best extrapolation: at 2x the training length, perplexity degraded by less than 5%. ALiBi demonstrated superior training stability on long sequences.
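One way to compute a degradation figure like the one above is sketched below. It assumes "perplexity degradation" means the relative increase in perplexity between the training length and the longer evaluation length; that definition, the helper names, and the per-token losses are all illustrative assumptions, and the numbers are made up to exercise the functions rather than taken from the experiments.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def relative_degradation(ppl_train_len, ppl_long):
    """Fractional perplexity increase beyond the training length."""
    return (ppl_long - ppl_train_len) / ppl_train_len

# Made-up per-token losses (NOT measured results), only to show the arithmetic.
nll_at_train_len = [2.0, 2.0, 2.0, 2.0]
nll_at_2x_len = [2.04, 2.04, 2.04, 2.04]

deg = relative_degradation(perplexity(nll_at_train_len),
                           perplexity(nll_at_2x_len))
print(f"relative degradation: {deg:.1%}")
```

Evaluating every method with the same metric on the same held-out text, at both lengths, keeps the comparison between them consistent.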
Key Learnings
- RoPE's relative nature makes it inherently better at length generalization
- ALiBi's simplicity is deceptive: it performs remarkably well
- Sinusoidal encodings still have value for fixed-length tasks
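To make the point about ALiBi's simplicity concrete, the sketch below builds the whole bias in a few lines: each head gets a fixed slope, and the attention logit for a query at position i and a key at position j is shifted by minus the slope times (i - j). It assumes a causal model and a power-of-two head count, for which the slopes form the geometric sequence described in the ALiBi paper. The function names are illustrative, not taken from the project code.

```python
def alibi_slopes(num_heads):
    """Per-head slopes 2^(-8/n), 2^(-16/n), ... for n heads (n a power of two)."""
    if num_heads < 1 or (num_heads & (num_heads - 1)) != 0:
        raise ValueError("this sketch only handles power-of-two head counts")
    ratio = 2.0 ** (-8.0 / num_heads)
    return [ratio ** (h + 1) for h in range(num_heads)]

def alibi_bias(num_heads, seq_len):
    """bias[h][i][j] = -slope_h * (i - j) for j <= i; future keys get -inf."""
    neg_inf = float("-inf")
    return [
        [[-m * (i - j) if j <= i else neg_inf for j in range(seq_len)]
         for i in range(seq_len)]
        for m in alibi_slopes(num_heads)
    ]

# The bias is added to the raw attention scores before the softmax.
bias = alibi_bias(num_heads=8, seq_len=4)
```

Nothing here is learned and nothing depends on a maximum length, so the same function produces a valid bias for sequences longer than any seen in training.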
Challenges
- Implementing RoPE efficiently without materializing full position matrices
- Comparing methods fairly across different sequence lengths
- Understanding why some methods fail at extrapolation
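The first challenge above can be illustrated with a small sketch: instead of building a d x d block-diagonal rotation matrix per position, each pair of dimensions is rotated directly using its cosine and sine. This is a plain-Python illustration under assumed names (`rope_rotate`, `dot`) and arbitrary example vectors, not the project's kernel. It pairs adjacent dimensions, which is one common convention; some implementations pair the first and second halves of the vector instead.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate each pair (x[2i], x[2i+1]) by the angle position * base^(-2i/d)."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(d // 2):
        theta = position * base ** (-2 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        # Apply the 2x2 rotation elementwise; no d x d matrix is ever built.
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Arbitrary example query and key vectors.
q = [0.3, -1.2, 0.8, 0.5]
k = [1.1, 0.4, -0.7, 0.9]

# Same offset (3) at very different absolute positions.
score_a = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_b = dot(rope_rotate(q, 105), rope_rotate(k, 102))
```

The two scores agree up to floating-point error, because the dot product of two rotated vectors depends only on the difference between their rotation angles. That is the relative-position property referred to under Key Learnings.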