Build a Transformer Decoder Block

Constructed a complete transformer decoder block with masked self-attention, a feed-forward network, residual connections, layer normalization, and dropout. Implemented both the original Post-LN variant and the modern Pre-LN variant. Profiled memory and compute usage per component.

What I Built

Key Concepts

Masked Self-Attention
Feed-Forward Network
Residual Connections
Layer Normalization
Pre-LN vs Post-LN
Dropout

Architecture

Masked Multi-Head Attention

Feed-Forward Network

Residual Connections

Layer Normalization

Dropout Layers
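
The five components above can be wired together in a few lines. The following is a minimal NumPy sketch, not the project's actual code: the function names, the weight dictionary layout, the ReLU activation, and the layer norm without learned scale and shift are all assumptions made to keep the example short. The `pre_ln` flag switches between the two orderings.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position over the feature axis (no learned scale/shift here).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def dropout(x, p, rng):
    # Inverted dropout. rng=None is treated as inference mode (identity).
    if rng is None or p == 0.0:
        return x
    keep = rng.random(x.shape) >= p
    return x * keep / (1.0 - p)

def masked_self_attention(x, w, n_heads):
    # x: (seq, d_model); w holds (d_model, d_model) matrices "q", "k", "v", "o".
    seq, d_model = x.shape
    d_head = d_model // n_heads

    def split(t):  # (seq, d_model) -> (heads, seq, d_head)
        return t.reshape(seq, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w["q"]), split(x @ w["k"]), split(x @ w["v"])
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    # Causal mask: position i may not attend to positions j > i.
    future = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    out = (weights @ v).transpose(1, 0, 2).reshape(seq, d_model)
    return out @ w["o"]

def feed_forward(x, w):
    # Position-wise two-layer MLP with ReLU (activation choice is an assumption).
    hidden = np.maximum(x @ w["w1"] + w["b1"], 0.0)
    return hidden @ w["w2"] + w["b2"]

def decoder_block(x, w, n_heads, pre_ln=True, p_drop=0.1, rng=None):
    def attn(h):
        return dropout(masked_self_attention(h, w, n_heads), p_drop, rng)

    def ffn(h):
        return dropout(feed_forward(h, w), p_drop, rng)

    if pre_ln:
        # Pre-LN: normalize the sublayer input; the residual path stays untouched.
        x = x + attn(layer_norm(x))
        x = x + ffn(layer_norm(x))
    else:
        # Post-LN: normalize after adding the residual.
        x = layer_norm(x + attn(x))
        x = layer_norm(x + ffn(x))
    return x

def init_weights(d_model, d_ff, rng, scale=0.02):
    # Hypothetical helper: small random weights, zero biases.
    shapes = {"q": (d_model, d_model), "k": (d_model, d_model),
              "v": (d_model, d_model), "o": (d_model, d_model),
              "w1": (d_model, d_ff), "w2": (d_ff, d_model)}
    w = {name: rng.normal(0.0, scale, shape) for name, shape in shapes.items()}
    w["b1"] = np.zeros(d_ff)
    w["b2"] = np.zeros(d_model)
    return w
```

Passing `rng=None` disables dropout, which makes the block deterministic and easy to check for causality: perturbing the last token should leave the outputs at earlier positions unchanged.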

Results

The Pre-LN variant trains 40% faster than Post-LN and reaches a better final loss. The FFN holds 66% of the block's parameters but accounts for only 20% of compute during inference.
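
The parameter share can be checked with simple arithmetic. The sketch below assumes a common configuration that the page does not state: `d_ff = 4 * d_model`, four attention projections with biases, and two layer norms with learned scale and shift. The `d_model = 512` value and the function name are illustrative only.

```python
def block_param_counts(d_model, d_ff=None):
    # Assumed layout: Q, K, V, O projections with biases; a two-layer FFN with
    # biases; two layer norms, each with a scale and a shift vector.
    if d_ff is None:
        d_ff = 4 * d_model
    attention = 4 * (d_model * d_model + d_model)
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    layer_norms = 2 * (2 * d_model)
    return {"attention": attention, "ffn": ffn, "layer_norms": layer_norms}

counts = block_param_counts(512)
total = sum(counts.values())
ffn_share = counts["ffn"] / total
print(f"FFN share of block parameters: {ffn_share:.1%}")
```

Under these assumptions the FFN holds roughly two thirds of the block's parameters, in line with the 66% figure. The share shifts if the FFN expansion ratio differs from 4.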

Key Learnings

Pre-normalization is crucial for training deep transformers
Residual connections enable gradient flow through deep stacks
FFN layers are parameter-heavy but compute-light during inference

Challenges

Implementing causal masking efficiently
Choosing between Pre-LN and Post-LN architectures
Balancing model depth vs. width
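
One common way to keep causal masking cheap is to build the mask once for the maximum sequence length and slice it per call, instead of rebuilding it on every forward pass. This is a minimal NumPy sketch of that idea, not necessarily the approach used in this project; `build_causal_bias` and `causal_softmax` are hypothetical names.

```python
import numpy as np

def build_causal_bias(max_len):
    # Additive mask: 0 where attention is allowed, -inf for future positions.
    future = np.triu(np.ones((max_len, max_len), dtype=bool), k=1)
    return np.where(future, -np.inf, 0.0)

def causal_softmax(scores, bias):
    # scores: (..., seq, seq). The cached bias is sliced to the current length
    # and broadcast over any leading batch/head axes.
    seq = scores.shape[-1]
    masked = scores + bias[:seq, :seq]
    masked = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

# Build once, reuse for every sequence up to max_len.
bias = build_causal_bias(max_len=16)
scores = np.random.default_rng(0).normal(size=(2, 4, 4))  # (heads, seq, seq)
weights = causal_softmax(scores, bias)
```

Each row of `weights` sums to 1, and every entry above the diagonal is exactly 0, so no position receives information from later positions.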
