Part 20
Completed

Sparse Autoencoders and Circuits

What I Built

Implemented sparse autoencoders to decompose transformer activations into interpretable features. Traced computational circuits for specific tasks (induction, copy, relation extraction). Identified monosemantic and polysemantic neurons.
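
The write-up does not give the autoencoder's exact architecture or loss, so the following is only a minimal sketch of one common formulation: a ReLU encoder, a linear decoder, and an L1 sparsity penalty added to the squared reconstruction error. The 512 and 4096 sizes come from the Results section; the function names, the initialization scale, and the `l1_coeff` value are illustrative assumptions.

```python
import numpy as np

def init_sae(d_model=512, d_features=4096, seed=0):
    """Randomly initialize a sparse autoencoder (sizes taken from the Results section)."""
    rng = np.random.default_rng(seed)
    W_enc = rng.normal(0.0, 0.02, size=(d_model, d_features))
    W_dec = rng.normal(0.0, 0.02, size=(d_features, d_model))
    # Unit-norm decoder rows: a common trick so the L1 penalty cannot be
    # reduced simply by shrinking activations and growing decoder weights.
    W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)
    return {
        "W_enc": W_enc,
        "b_enc": np.zeros(d_features),
        "W_dec": W_dec,
        "b_dec": np.zeros(d_model),
    }

def sae_forward(params, x):
    """Encode activations x of shape (batch, d_model) into sparse features, then decode."""
    pre = (x - params["b_dec"]) @ params["W_enc"] + params["b_enc"]
    f = np.maximum(0.0, pre)                       # ReLU keeps features non-negative
    x_hat = f @ params["W_dec"] + params["b_dec"]  # linear reconstruction
    return f, x_hat

def sae_loss(params, x, l1_coeff=1e-3):
    """Squared reconstruction error plus an L1 penalty on feature activations."""
    f, x_hat = sae_forward(params, x)
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(f), axis=1))
    return recon + l1_coeff * sparsity

# Usage on random stand-in activations (real inputs would be transformer activations).
params = init_sae()
x = np.random.default_rng(1).normal(size=(8, 512))
features, x_hat = sae_forward(params, x)
loss = sae_loss(params, x)
```

Training would minimize `sae_loss` with a gradient-based optimizer; that step is omitted here because it needs an autodiff framework.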

Key Concepts

  • Sparse Autoencoders
  • Mechanistic Interpretability
  • Computational Circuits
  • Monosemantic Neurons
  • Polysemantic Neurons
  • Feature Decomposition

Architecture

  1. Sparse Autoencoder
  2. Feature Extractor
  3. Circuit Tracer
  4. Activation Analyzer
  5. Neuron Classifier

Results

Decomposed 512-dimensional activations into 4096 sparse features (an 8x overcomplete dictionary) with 95% reconstruction. Identified clear induction heads and copy circuits. 30% of neurons are monosemantic.
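
The metric behind the reconstruction figure is not stated. One plausible reading is the fraction of activation variance explained by the reconstruction, usually reported alongside the average number of active features per input. The sketch below shows both under that assumption; the function names are illustrative.

```python
import numpy as np

def fraction_variance_explained(x, x_hat):
    """1 - (residual sum of squares / total sum of squares around the per-dimension mean)."""
    residual = np.sum((x - x_hat) ** 2)
    total = np.sum((x - x.mean(axis=0)) ** 2)
    return float(1.0 - residual / total)

def mean_l0(features):
    """Average number of non-zero features per input (lower means sparser)."""
    return float(np.mean(np.sum(features > 0, axis=1)))
```

Under this reading, a value of 0.95 would mean the reconstruction leaves 5% of the activation variance unexplained, while always predicting the mean activation would score 0.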

Key Learnings

  • Sparse autoencoders reveal surprisingly interpretable features
  • Induction heads are a fundamental circuit in transformers
  • Polysemanticity is the norm, not the exception
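
One common way to spot induction heads is to feed the model a random block of tokens repeated twice and check whether a head, at each position in the second copy, attends to the token that followed the same token in the first copy. The sketch below scores a single head's attention pattern this way. It assumes the pattern is already available as a (seq, seq) array with rows as query positions; how the pattern is extracted from the model is not shown, and the function name is illustrative.

```python
import numpy as np

def induction_score(attn, block_len):
    """Mean attention from each second-copy position to the token after its first-copy match.

    attn: (seq, seq) attention pattern for one head, rows = queries, cols = keys,
          computed on a sequence made of a random block repeated twice.
    """
    seq = attn.shape[0]
    if attn.shape != (seq, seq) or seq != 2 * block_len:
        raise ValueError("expected a (2*block_len, 2*block_len) attention pattern")
    queries = np.arange(block_len, seq)   # positions in the second copy
    keys = queries - block_len + 1        # one step after the earlier occurrence
    return float(attn[queries, keys].mean())
```

A score near 1 indicates a strongly induction-like head; a head that spreads attention uniformly over earlier positions scores much lower.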

Challenges

  • Evaluating feature interpretability objectively
  • Scaling sparse autoencoders to large models
  • Understanding polysemantic neurons
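
The criterion used to label neurons as monosemantic or polysemantic is not given. As one hedged illustration of how the evaluation could be made more objective, the sketch below assumes each neuron's top-activating examples have already been tagged with a concept label, and calls the neuron monosemantic when a single label dominates. The labels, the function name, and the 0.9 threshold are all hypothetical.

```python
from collections import Counter

def classify_neuron(top_example_labels, purity_threshold=0.9):
    """Classify a neuron from the concept labels of its top-activating examples.

    Returns (kind, dominant_label, purity), where purity is the share of
    examples carrying the most common label.
    """
    if not top_example_labels:
        raise ValueError("need at least one labeled example")
    counts = Counter(top_example_labels)
    dominant_label, count = counts.most_common(1)[0]
    purity = count / len(top_example_labels)
    kind = "monosemantic" if purity >= purity_threshold else "polysemantic"
    return kind, dominant_label, purity
```

The result depends heavily on how the examples are labeled and where the threshold sits, which is part of why objective interpretability evaluation is listed as a challenge.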