FlashAttention Comparison

Implemented FlashAttention and FlashAttention-2 from scratch, understanding the tiling, softmax rescaling, and memory access patterns that make it IO-aware. Benchmarked against standard attention across sequence lengths, batch sizes, and hardware configurations.

View Code

What I Built

Key Concepts

FlashAttentionTilingSoftmax RescalingIO-AwarenessMemory Access PatternsKernel Fusion

Architecture

Tiled Attention Kernel

Softmax Online Normalizer

Memory Scheduler

Benchmark Suite

Results

2.4x speedup and 10x memory reduction at 8k sequence length. Speedup increases with sequence length, reaching 7.8x at 32k.

Key Learnings

Memory access patterns matter more than FLOPs
Online softmax normalization is the key algorithmic insight
Hardware-aware design beats naive algorithmic improvements

Challenges

Implementing efficient tiling without materializing full attention matrix
Handling edge cases in block sizes
Profiling memory bandwidth vs. compute utilization

Back to Roadmap