Part 3
Completed

Scaled Dot-Product Attention

Built scaled dot-product attention from scratch with Q, K, V projections, softmax normalization, and scaling factor. Analyzed attention patterns on synthetic tasks (copying, sorting, pattern matching) to understand what attention learns at the single-head level.

What I Built

Built scaled dot-product attention from scratch with Q, K, V projections, softmax normalization, and scaling factor. Analyzed attention patterns on synthetic tasks (copying, sorting, pattern matching) to understand what attention learns at the single-head level.

Key Concepts

Query-Key-ValueSoftmax NormalizationScaling FactorAttention WeightsPattern Matching

Architecture

1
Q/K/V Projection Layers
2
Attention Score Calculator
3
Softmax Normalizer
4
Output Combiner

Results

Single attention head achieves 99.8% accuracy on copy tasks. Attention maps reveal clear diagonal and vertical patterns for different syntactic roles.

Key Learnings

  • The scaling factor sqrt(d_k) is essential for training stability
  • Attention weights are interpretable—patterns correspond to linguistic functions
  • Single heads specialize on specific syntactic relations

Challenges

  • Numerical stability in softmax for large attention scores
  • Understanding attention head specialization
  • Visualizing attention patterns effectively