Part 3
Completed
Scaled Dot-Product Attention
What I Built
Built scaled dot-product attention from scratch with Q, K, V projections, softmax normalization, and scaling factor. Analyzed attention patterns on synthetic tasks (copying, sorting, pattern matching) to understand what attention learns at the single-head level.
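As a rough illustration of the pieces named above, here is a minimal single-head sketch in plain Python. It is not the project's actual code: the function names (`attention`, `matmul`, `softmax`), the list-of-rows matrix layout, and the choice to avoid a tensor library are all assumptions made to keep the example self-contained.

```python
import math

def softmax(row):
    """Softmax over a list of floats, shifted by the max for stability."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention (illustrative sketch).

    x: (seq_len, d_model); w_q, w_k: (d_model, d_k); w_v: (d_model, d_v).
    Returns (output, weights), where weights has shape (seq_len, seq_len).
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)  # Q/K/V projections
    d_k = len(k[0])
    scores = matmul(q, transpose(k))                          # raw similarity scores
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]                # each row sums to 1
    return matmul(weights, v), weights                        # weighted sum of values

# Toy input: three tokens, identity projections (hypothetical values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
eye = [[1.0, 0.0], [0.0, 1.0]]
output, weights = attention(x, eye, eye, eye)
```

With identity projections, the first token attends equally to itself and to the third token, since both have the same dot product with its query.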
Key Concepts
- Query-Key-Value
- Softmax Normalization
- Scaling Factor
- Attention Weights
- Pattern Matching
Architecture
1. Q/K/V Projection Layers
2. Attention Score Calculator
3. Softmax Normalizer
4. Output Combiner
Results
A single attention head achieves 99.8% accuracy on the copy task. The attention maps show clear diagonal and vertical patterns for different syntactic roles.
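To make the two map shapes concrete, the sketch below builds them by hand rather than by training. The queries and keys are one-hot position vectors, and `SEQ_LEN`, `SHARPNESS`, and the helper names are hypothetical choices for illustration, not values from the project. When each query matches the key at its own position, the map is diagonal (the pattern a copy task needs). When every query matches the same key, the map is a single vertical stripe.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys):
    """Scaled dot-product attention weights, one row per query."""
    d_k = len(keys[0])
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys])
        for q in queries
    ]

def one_hot(index, size, value):
    return [value if i == index else 0.0 for i in range(size)]

def peak_columns(weights):
    """Column index of the largest weight in each row."""
    return [row.index(max(row)) for row in weights]

SEQ_LEN = 4
SHARPNESS = 4.0  # hypothetical magnitude; larger values give sharper maps

keys = [one_hot(j, SEQ_LEN, SHARPNESS) for j in range(SEQ_LEN)]

# Diagonal map: query i is aligned with key i, so each position attends to itself.
copy_queries = [one_hot(i, SEQ_LEN, SHARPNESS) for i in range(SEQ_LEN)]
diagonal_map = attention_weights(copy_queries, keys)

# Vertical map: every query is aligned with key 0, so all rows attend to column 0.
fixed_queries = [one_hot(0, SEQ_LEN, SHARPNESS) for _ in range(SEQ_LEN)]
vertical_map = attention_weights(fixed_queries, keys)

print(peak_columns(diagonal_map))
print(peak_columns(vertical_map))
```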
Key Learnings
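- Scaling the scores by 1/sqrt(d_k) is essential for training stability: without it, dot products grow with d_k and push softmax into saturated regions where gradients vanish
- Attention weights are interpretable: patterns correspond to linguistic functions
- Single heads specialize in specific syntactic relations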
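The first learning can be checked with a small experiment. The sketch below draws a random query and random keys from a standard normal distribution (an assumption, as are the helper name `peak_weight` and the chosen dimensions) and reports the largest attention weight with and without the scaling. Without scaling, the peak weight tends toward 1 as d_k grows, which is the saturation that hurts gradients.

```python
import math
import random

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def peak_weight(d_k, n_keys=8, scale=True, seed=0):
    """Largest softmax weight for one random query against n_keys random keys."""
    rng = random.Random(seed)  # fixed seed so both settings see the same vectors
    q = [rng.gauss(0, 1) for _ in range(d_k)]
    keys = [[rng.gauss(0, 1) for _ in range(d_k)] for _ in range(n_keys)]
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    if scale:
        scores = [s / math.sqrt(d_k) for s in scores]
    return max(softmax(scores))

for d_k in (4, 64, 512):
    unscaled = peak_weight(d_k, scale=False)
    scaled = peak_weight(d_k, scale=True)
    print(f"d_k={d_k:4d}  unscaled peak={unscaled:.3f}  scaled peak={scaled:.3f}")
```

Dividing by sqrt(d_k) acts like raising the softmax temperature, so the scaled peak weight is never larger than the unscaled one for the same scores.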
Challenges
- Numerical stability in softmax for large attention scores
- Understanding attention head specialization
- Visualizing attention patterns effectively
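For the softmax stability issue, one common remedy is to subtract the row maximum before exponentiating, which leaves the result unchanged mathematically but keeps every exponent at or below zero. The sketch below contrasts that with a naive version. Whether the project used exactly this fix is an assumption, and the function names are illustrative.

```python
import math

def naive_softmax(scores):
    # math.exp raises OverflowError once a score exceeds roughly 709.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def stable_softmax(scores):
    # Shifting by the max keeps every exponent <= 0, so exp never overflows.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

large_scores = [1000.0, 999.0, 998.0]

try:
    naive_softmax(large_scores)
    naive_ok = True
except OverflowError:
    naive_ok = False

weights = stable_softmax(large_scores)
print("naive succeeded:", naive_ok)
print("stable weights sum to", sum(weights))
```

On moderate inputs the two versions agree, so the shift can be applied unconditionally.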