Part 4
Completed
Build a Transformer Decoder Block
What I Built
Built a complete transformer decoder block with masked self-attention, a feed-forward network (FFN), residual connections, layer normalization, and dropout. Implemented both the original Post-LN and the modern Pre-LN variants, and profiled memory and compute usage per component.
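As a rough illustration of how the two variants differ, here is a minimal NumPy sketch of a single decoder block in inference mode. It is not the project's actual code: the function names, the `params` dictionary layout, the ReLU activation, and the omission of dropout, biases, and learned LayerNorm scale/shift are all assumptions made to keep the example short.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance
    # (learned scale/shift omitted in this sketch).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def causal_self_attention(x, w_qkv, w_out, n_heads):
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    q, k, v = np.split(x @ w_qkv, 3, axis=-1)  # each (seq_len, d_model)

    def to_heads(t):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = to_heads(q), to_heads(k), to_heads(v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    # True above the diagonal marks future positions that must be hidden.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    out = (weights @ v).transpose(1, 0, 2).reshape(seq_len, d_model)
    return out @ w_out

def feed_forward(x, w1, w2):
    # Two linear maps with a ReLU in between (activation choice is an assumption).
    return np.maximum(x @ w1, 0.0) @ w2

def decoder_block(x, params, n_heads, pre_ln=True):
    attn = lambda h: causal_self_attention(h, params["w_qkv"], params["w_out"], n_heads)
    ffn = lambda h: feed_forward(h, params["w1"], params["w2"])
    if pre_ln:
        # Pre-LN: normalize the sublayer input; the residual path stays untouched.
        x = x + attn(layer_norm(x))
        x = x + ffn(layer_norm(x))
    else:
        # Post-LN: normalize after adding the residual.
        x = layer_norm(x + attn(x))
        x = layer_norm(x + ffn(x))
    return x
```

The only difference between the variants is where `layer_norm` sits relative to the residual addition. In training, dropout would typically be applied to each sublayer output before it is added back to the residual.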
Key Concepts
- Masked Self-Attention
- Feed-Forward Network
- Residual Connections
- Layer Normalization
- Pre-LN vs Post-LN
- Dropout
Architecture
1. Masked Multi-Head Attention
2. Feed-Forward Network
3. Residual Connections
4. Layer Normalization
5. Dropout Layers
Results
The Pre-LN variant trains 40% faster than Post-LN and reaches a lower final loss. The FFN holds 66% of the block's parameters but accounts for only 20% of compute during inference.
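The parameter share can be checked with simple arithmetic. The sketch below assumes the common choice of an FFN hidden size of 4 × `d_model` and ignores biases, LayerNorm parameters, and embeddings; the write-up does not state the actual dimensions, so `d_model=512` and the helper name `block_param_counts` are purely illustrative.

```python
def block_param_counts(d_model, d_ff=None):
    # Assumption: d_ff defaults to 4 * d_model, a widely used ratio.
    if d_ff is None:
        d_ff = 4 * d_model
    attention = 4 * d_model * d_model  # W_q, W_k, W_v, W_o
    ffn = 2 * d_model * d_ff           # up-projection and down-projection
    return {
        "attention": attention,
        "ffn": ffn,
        "ffn_share": ffn / (attention + ffn),
    }

counts = block_param_counts(512)
print(counts["attention"], counts["ffn"], round(counts["ffn_share"], 3))
```

Under these assumptions the FFN share is 8/12 of the weight matrices, or about two thirds, regardless of `d_model`, which lines up with the 66% figure.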
Key Learnings
- Pre-normalization makes deep transformer stacks markedly easier to train stably
- Residual connections enable gradient flow through deep stacks
- FFN layers are parameter-heavy but compute-light during inference
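The gradient-flow point can be made concrete with a short derivation. The notation is an assumption for illustration: $x_l$ is the input to sublayer $l$, $F_l$ is that sublayer (attention or FFN), $L$ is the total number of sublayers, and $\mathcal{L}$ is the loss, written here for the Pre-LN form.

```latex
\begin{aligned}
x_{l+1} &= x_l + F_l\big(\mathrm{LN}(x_l)\big) \\
x_L &= x_l + \sum_{k=l}^{L-1} F_k\big(\mathrm{LN}(x_k)\big) \\
\frac{\partial \mathcal{L}}{\partial x_l}
  &= \frac{\partial \mathcal{L}}{\partial x_L}
     \left( I + \frac{\partial}{\partial x_l}
     \sum_{k=l}^{L-1} F_k\big(\mathrm{LN}(x_k)\big) \right)
\end{aligned}
```

The identity term $I$ gives the loss gradient a direct path back to every layer, independent of how small the sublayer Jacobians become. In Post-LN, each normalization sits on the residual path itself, so that direct path is rescaled at every layer.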
Challenges
- Implementing causal masking efficiently
- Choosing between Pre-LN and Post-LN architectures
- Balancing model depth vs. width
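For the causal masking challenge, one common approach is to build the upper-triangular mask once and slice it per sequence length instead of rebuilding it on every forward pass. The sketch below is one possible way to do this in NumPy, not the project's implementation; `MAX_LEN`, `_FULL_MASK`, and `masked_softmax` are hypothetical names.

```python
import numpy as np

MAX_LEN = 1024  # hypothetical maximum sequence length

# Built once: True above the diagonal marks future positions.
_FULL_MASK = np.triu(np.ones((MAX_LEN, MAX_LEN), dtype=bool), k=1)

def masked_softmax(scores):
    # scores: (..., seq_len, seq_len) attention logits.
    rows, cols = scores.shape[-2:]
    mask = _FULL_MASK[:rows, :cols]  # a view, no new allocation
    scores = np.where(mask, -np.inf, scores)
    # Subtract the row maximum for numerical stability; each row keeps
    # at least its diagonal entry, so the maximum is always finite.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)
```

Because the mask broadcasts over any leading batch and head dimensions, the same cached array serves every head in the block.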