Part 4
Completed

Build a Transformer Decoder Block

What I Built

Built a complete transformer decoder block with masked self-attention, a feed-forward network, residual connections, layer normalization, and dropout. Implemented both the original Post-LN and the modern Pre-LN variants, and profiled memory and compute usage per component.

Key Concepts

  • Masked Self-Attention
  • Feed-Forward Network
  • Residual Connections
  • Layer Normalization
  • Pre-LN vs Post-LN
  • Dropout

Architecture

1. Masked Multi-Head Attention
2. Feed-Forward Network
3. Residual Connections
4. Layer Normalization
5. Dropout Layers
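As a rough illustration of how these five components might fit together, here is a minimal NumPy sketch of a forward pass with a switch between Pre-LN and Post-LN ordering. It is not the project's actual code: the function names, the parameter dictionary `p`, the ReLU activation, the 0.02 init scale, and the omission of biases and learned layer-norm scale/shift are all assumptions made to keep the example short.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token over the feature axis (learned scale/shift omitted).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract row max for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def masked_attention(x, w_qkv, w_out, n_heads):
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    q, k, v = np.split(x @ w_qkv, 3, axis=-1)  # each (seq_len, d_model)

    def heads(t):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = heads(q), heads(k), heads(v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    # True above the diagonal marks future positions a token must not see.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    out = softmax(scores) @ v  # (n_heads, seq_len, d_head)
    return out.transpose(1, 0, 2).reshape(seq_len, d_model) @ w_out

def feed_forward(x, w1, w2):
    # ReLU is an assumption here; GELU is a common alternative.
    return np.maximum(x @ w1, 0.0) @ w2

def dropout(x, rate, rng):
    # Inverted dropout; disabled when no generator is passed (inference).
    if rng is None or rate == 0.0:
        return x
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

def decoder_block(x, p, n_heads, pre_ln=True, rate=0.0, rng=None):
    def attn(h):
        return dropout(masked_attention(h, p["w_qkv"], p["w_out"], n_heads), rate, rng)

    def ffn(h):
        return dropout(feed_forward(h, p["w1"], p["w2"]), rate, rng)

    if pre_ln:
        # Pre-LN: normalize the sublayer input; the residual path stays untouched.
        x = x + attn(layer_norm(x))
        x = x + ffn(layer_norm(x))
    else:
        # Post-LN: normalize after adding the residual.
        x = layer_norm(x + attn(x))
        x = layer_norm(x + ffn(x))
    return x

def init_params(d_model, d_ff, rng):
    s = 0.02  # illustrative init scale
    return {
        "w_qkv": rng.normal(0, s, (d_model, 3 * d_model)),
        "w_out": rng.normal(0, s, (d_model, d_model)),
        "w1": rng.normal(0, s, (d_model, d_ff)),
        "w2": rng.normal(0, s, (d_ff, d_model)),
    }

rng = np.random.default_rng(0)
params = init_params(d_model=16, d_ff=64, rng=rng)
x = rng.normal(size=(5, 16))  # 5 tokens, 16 features
y_pre = decoder_block(x, params, n_heads=4, pre_ln=True)
y_post = decoder_block(x, params, n_heads=4, pre_ln=False)
```

The only difference between the two variants is where `layer_norm` sits relative to the residual addition. In the Pre-LN branch the input `x` reaches the output through additions alone.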

Results

The Pre-LN variant trained 40% faster than Post-LN and reached a better final loss. The FFN holds 66% of the parameters but accounted for only 20% of compute during inference.
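The parameter figure can be sanity-checked with simple arithmetic. The sketch below assumes the common choice `d_ff = 4 * d_model` and counts weight matrices only (no biases or layer-norm parameters); the sizes 512 and 2048 are illustrative, not the project's actual dimensions. The FLOP count is a rough matmul-only model and is not expected to reproduce the profiled 20% figure, which depends on the measurement setup. It only shows one reason the FFN's share of compute can fall below its share of parameters: the attention score computation grows with context length while the FFN cost per token does not.

```python
def block_param_counts(d_model, d_ff):
    # Weight matrices only; biases and layer-norm parameters are ignored.
    attn = 4 * d_model * d_model  # W_q, W_k, W_v, W_o
    ffn = 2 * d_model * d_ff      # up-projection and down-projection
    return attn, ffn

def per_token_flops(d_model, d_ff, context_len):
    # Rough matmul-only estimate, counting 2 FLOPs per multiply-add.
    attn_proj = 2 * 4 * d_model * d_model
    attn_scores = 2 * 2 * context_len * d_model  # Q.K^T and weights @ V
    ffn = 2 * 2 * d_model * d_ff
    return attn_proj + attn_scores, ffn

d_model, d_ff = 512, 2048  # assumed sizes with d_ff = 4 * d_model
attn_p, ffn_p = block_param_counts(d_model, d_ff)
print(f"FFN parameter share: {ffn_p / (attn_p + ffn_p):.1%}")  # 66.7%

for context_len in (128, 2048, 8192):
    attn_f, ffn_f = per_token_flops(d_model, d_ff, context_len)
    share = ffn_f / (attn_f + ffn_f)
    print(f"context {context_len}: FFN FLOP share {share:.1%}")
```

With `d_ff = 4 * d_model` the FFN has 8·d_model² weights against 4·d_model² for attention, so its parameter share is exactly two thirds regardless of `d_model`.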

Key Learnings

  • Pre-normalization is crucial for training deep transformers
  • Residual connections enable gradient flow through deep stacks
  • FFN layers are parameter-heavy but compute-light during inference

Challenges

  • Implementing causal masking efficiently
  • Choosing between Pre-LN and Post-LN architectures
  • Balancing model depth vs. width
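
On the causal masking challenge, one common approach (not necessarily the one used in this project) is to build a single boolean mask once for the maximum sequence length, then slice it for each batch and let broadcasting apply it across batch and head dimensions. The names `build_future_mask`, `masked_softmax`, and the maximum length of 1024 are hypothetical.

```python
import numpy as np

def build_future_mask(max_len):
    # True above the diagonal: position i may not attend to any j > i.
    return np.triu(np.ones((max_len, max_len), dtype=bool), k=1)

def masked_softmax(scores, future):
    # scores: (..., seq_len, seq_len); the 2-D mask broadcasts over leading axes.
    t = scores.shape[-1]
    masked = np.where(future[:t, :t], -np.inf, scores)
    # Every row keeps its diagonal entry, so the row max is always finite.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)  # exp(-inf) == 0 removes future positions
    return weights / weights.sum(axis=-1, keepdims=True)

FUTURE = build_future_mask(1024)  # built once, reused for every call

scores = np.random.default_rng(0).normal(size=(2, 4, 6, 6))  # (batch, heads, seq, seq)
weights = masked_softmax(scores, FUTURE)
```

Slicing a precomputed mask avoids rebuilding a `seq_len × seq_len` array on every forward pass, and a boolean mask takes less memory than a float mask of the same shape.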