Scaled Dot-Product Attention

Built scaled dot-product attention from scratch, including the Q, K, and V projections, the 1/sqrt(d_k) scaling of the scores, and softmax normalization. Analyzed attention patterns on synthetic tasks (copying, sorting, pattern matching) to understand what a single attention head learns.


What I Built

Key Concepts

Query-Key-Value
Softmax Normalization
Scaling Factor
Attention Weights
Pattern Matching

Architecture

Q/K/V Projection Layers

Attention Score Calculator

Softmax Normalizer

Output Combiner
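
The four components above can be sketched as one forward pass. This is a minimal, dependency-free illustration rather than the project's actual code: the helper names (`matmul`, `transpose`, `random_matrix`), the random weights, and the dimensions are all assumptions chosen to keep the example self-contained.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns of a matrix."""
    return [list(col) for col in zip(*m)]

def softmax(row):
    """Softmax over one row of scores (max subtracted for stability)."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention.

    x: (seq_len, d_model); w_q, w_k: (d_model, d_k); w_v: (d_model, d_v).
    Returns (output, weights).
    """
    # 1. Q/K/V projection layers
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    # 2. Attention score calculator: Q K^T / sqrt(d_k)
    d_k = len(k[0])
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, transpose(k))]
    # 3. Softmax normalizer: each row becomes a distribution over positions
    weights = [softmax(row) for row in scores]
    # 4. Output combiner: weighted sum of the value vectors
    return matmul(weights, v), weights

def random_matrix(rows, cols, rng):
    """Hypothetical helper: matrix of standard-normal entries."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# Illustrative sizes and untrained random weights (assumptions, not project settings).
rng = random.Random(0)
seq_len, d_model, d_k = 4, 8, 8
x = random_matrix(seq_len, d_model, rng)
w_q, w_k, w_v = (random_matrix(d_model, d_k, rng) for _ in range(3))

output, weights = attention(x, w_q, w_k, w_v)
print(len(output), len(output[0]))  # sequence length and value dimension
```

Returning the weights alongside the output is a design choice that makes the attention map available for inspection without recomputing it.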

Results

A single attention head achieves 99.8% accuracy on the copy task. Attention maps show clear diagonal and vertical patterns for different syntactic roles.
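
One plausible reading of the diagonal pattern is that each position attends mostly to itself, which passes its own value vector through nearly unchanged, as a copy task would require. The sketch below illustrates that idea on a hand-written attention map; the weights, the values, and the helpers `diagonal_mass` and `render` are hypothetical and are not taken from the project's measurements.

```python
def apply_attention(weights, values):
    """Weighted sum of value rows: output[i] = sum_j weights[i][j] * values[j]."""
    return [
        [sum(w * v[c] for w, v in zip(w_row, values)) for c in range(len(values[0]))]
        for w_row in weights
    ]

def diagonal_mass(weights):
    """Average attention weight that each position places on itself."""
    return sum(weights[i][i] for i in range(len(weights))) / len(weights)

def render(weights):
    """Text heatmap: denser characters mean larger weights."""
    shades = " .:*#"
    return "\n".join(
        "".join(shades[min(int(w * len(shades)), len(shades) - 1)] for w in row)
        for row in weights
    )

# Hypothetical attention map with most of the weight on the diagonal.
weights = [
    [0.94, 0.02, 0.02, 0.02],
    [0.02, 0.94, 0.02, 0.02],
    [0.02, 0.02, 0.94, 0.02],
    [0.02, 0.02, 0.02, 0.94],
]
# Hypothetical value vectors, one per position.
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, -1.0]]

copied = apply_attention(weights, values)
print(render(weights))
print(round(diagonal_mass(weights), 2))
```

A vertical pattern would instead show up as one column carrying most of the weight in every row, meaning all positions attend to the same token.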

Key Learnings

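Dividing the scores by sqrt(d_k) is essential for training stability: without it, dot products grow with d_k and push the softmax into saturated regions where gradients are very small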
Attention weights are interpretable: patterns correspond to linguistic functions
Single heads specialize on specific syntactic relations
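
The effect of the sqrt(d_k) scaling can be seen in a small experiment. If the components of a query and a key are independent with zero mean and unit variance, their dot product has variance d_k, so raw scores spread out as d_k grows and the softmax concentrates almost all weight on one position. The sketch below compares the largest attention weight with and without scaling; the random vectors, the seed, and d_k = 256 are assumptions for illustration only.

```python
import math
import random

def softmax(scores):
    """Softmax with the maximum subtracted for numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Illustrative setup: one query and a handful of keys with unit-variance entries.
rng = random.Random(0)
d_k, seq_len = 256, 8
query = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
keys = [[rng.gauss(0.0, 1.0) for _ in range(d_k)] for _ in range(seq_len)]

raw = [dot(query, k) for k in keys]          # standard deviation around sqrt(d_k)
scaled = [s / math.sqrt(d_k) for s in raw]   # standard deviation around 1

peak_raw = max(softmax(raw))
peak_scaled = max(softmax(scaled))
print(f"max weight without scaling: {peak_raw:.3f}")
print(f"max weight with scaling:    {peak_scaled:.3f}")
```

Dividing by a constant greater than 1 acts like raising the softmax temperature, so the scaled distribution is always flatter than the unscaled one for the same scores.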

Challenges

Numerical stability in softmax for large attention scores
Understanding attention head specialization
Visualizing attention patterns effectively
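
A common remedy for the softmax stability issue is to subtract the row maximum before exponentiating. This leaves the result unchanged mathematically, because softmax is invariant to adding a constant to every score, but keeps every exponent at or below zero. The sketch below contrasts the two versions; it is an illustration of the standard trick, not necessarily how the project handled it, and the example scores are made up.

```python
import math

def naive_softmax(scores):
    """Direct definition; math.exp overflows for large scores."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def stable_softmax(scores):
    """Subtract the maximum first, so the largest exponent is exp(0) = 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scores large enough to overflow a double-precision exp.
large_scores = [1000.0, 999.0, 998.0]

try:
    naive_softmax(large_scores)
    naive_ok = True
except OverflowError:
    naive_ok = False

weights = stable_softmax(large_scores)
print("naive softmax succeeded:", naive_ok)
print("stable softmax weights sum to", sum(weights))
```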
