Part 20
Completed
Sparse Autoencoders and Circuits
What I Built
Implemented sparse autoencoders to decompose transformer activations into interpretable features. Traced computational circuits for specific tasks (induction, copy, relation extraction). Identified monosemantic and polysemantic neurons.
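A minimal sketch of the sparse autoencoder idea, written in plain Python so it runs without dependencies. The ReLU encoder, linear decoder, and L1 sparsity penalty are a common formulation and an assumption here, not a record of the exact architecture used in this project. The names `sae_forward`, `sae_loss`, and `l1_coeff`, and the toy sizes, are hypothetical.

```python
import random

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode x into non-negative features, then reconstruct x from them."""
    pre = matvec(W_enc, x)                                     # (n_features,)
    features = [max(0.0, p + b) for p, b in zip(pre, b_enc)]  # ReLU
    recon = [r + b for r, b in zip(matvec(W_dec, features), b_dec)]
    return features, recon

def sae_loss(x, features, recon, l1_coeff):
    """Mean squared reconstruction error plus an L1 penalty on the features."""
    mse = sum((a - b) ** 2 for a, b in zip(x, recon)) / len(x)
    l1 = sum(abs(f) for f in features)
    return mse + l1_coeff * l1

# Toy sizes (assumed for illustration); the project used 512 -> 4096.
d_model, n_features = 4, 16
rng = random.Random(0)
W_enc = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(n_features)]
W_dec = [[rng.gauss(0, 0.1) for _ in range(n_features)] for _ in range(d_model)]
b_enc = [0.0] * n_features
b_dec = [0.0] * d_model

x = [rng.gauss(0, 1) for _ in range(d_model)]
features, recon = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, features, recon, l1_coeff=1e-3)
```

Training would minimize `sae_loss` over many activation vectors. The L1 term is what pushes most features to zero on any given input, so each input is explained by only a few active features.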
Key Concepts
Sparse Autoencoders, Mechanistic Interpretability, Computational Circuits, Monosemantic Neurons, Polysemantic Neurons, Feature Decomposition
Architecture
1. Sparse Autoencoder
2. Feature Extractor
3. Circuit Tracer
4. Activation Analyzer
5. Neuron Classifier
Results
Decomposed 512-dimensional activations into 4096 sparse features (an 8x overcomplete dictionary) with 95% reconstruction. Identified clear induction heads and copy circuits. 30% of neurons are monosemantic.
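The write-up does not say which metric the reconstruction figure refers to. One common choice is the fraction of variance explained, and the sketch below assumes it for illustration, alongside the average number of active features per input as a sparsity measure. Both function names are hypothetical.

```python
def fraction_variance_explained(originals, reconstructions):
    """1 - (residual sum of squares / total sum of squares around the mean)."""
    n = len(originals)
    dim = len(originals[0])
    mean = [sum(x[j] for x in originals) / n for j in range(dim)]
    total = sum((x[j] - mean[j]) ** 2 for x in originals for j in range(dim))
    resid = sum((x[j] - r[j]) ** 2
                for x, r in zip(originals, reconstructions)
                for j in range(dim))
    return 1.0 - resid / total

def mean_active_features(feature_rows, eps=1e-8):
    """Average count of features above eps per input (often called L0)."""
    active = sum(sum(1 for f in row if f > eps) for row in feature_rows)
    return active / len(feature_rows)

originals = [[0.0, 0.0], [2.0, 4.0]]
perfect = fraction_variance_explained(originals, originals)
mean_only = fraction_variance_explained(originals, [[1.0, 2.0], [1.0, 2.0]])
sparsity = mean_active_features([[0.0, 1.0, 0.0], [2.0, 3.0, 0.0]])
```

Under this metric a perfect reconstruction scores 1.0 and always predicting the mean activation scores 0.0, so the two numbers are best reported together: high reconstruction is easy if many features are allowed to fire.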
Key Learnings
- Sparse autoencoders reveal surprisingly interpretable features
- Induction heads are a fundamental circuit in transformers
- Polysemanticity is the norm, not the exception
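A minimal sketch of one way to score a head for induction behavior. It assumes the input is a random block of `block_len` tokens repeated twice with no leading special token, and that `attn` is a single head's attention pattern as a list of rows (query position by key position). An induction head at position `t` in the second copy attends to the token that followed the previous occurrence of the current token, which sits at position `t - block_len + 1`. The function name and toy patterns are hypothetical.

```python
def induction_score(attn, block_len):
    """Mean attention from each second-copy position t to t - block_len + 1."""
    seq_len = len(attn)
    weights = [attn[t][t - block_len + 1] for t in range(block_len, seq_len)]
    return sum(weights) / len(weights)

block_len, seq_len = 3, 6

# Idealized induction head: second-copy positions put all weight on the
# induction target; first-copy positions attend to position 0.
ideal = [[1.0 if (t >= block_len and j == t - block_len + 1)
          or (t < block_len and j == 0) else 0.0
          for j in range(seq_len)] for t in range(seq_len)]

# Baseline: uniform causal attention over all earlier positions and itself.
uniform = [[1.0 / (t + 1) if j <= t else 0.0 for j in range(seq_len)]
           for t in range(seq_len)]

ideal_score = induction_score(ideal, block_len)
uniform_score = induction_score(uniform, block_len)
```

Comparing each head's score against the uniform baseline gives a simple ranking; heads that score far above it are candidates for closer circuit tracing.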
Challenges
- Evaluating feature interpretability objectively
- Scaling sparse autoencoders to large models
- Understanding polysemantic neurons
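One crude way to make the interpretability question measurable is a label-purity score over a neuron's top-activating examples. The sketch below is an assumption for illustration, not the classification method used in this project: the concept labels, the 0.9 threshold, and the function names are all hypothetical, and the labels themselves would still need to come from a human or another model.

```python
from collections import Counter

def label_purity(top_labels):
    """Share of top-activating examples that carry the most common label."""
    counts = Counter(top_labels)
    return counts.most_common(1)[0][1] / len(top_labels)

def classify_neuron(top_labels, threshold=0.9):
    """Call a neuron monosemantic if one label dominates its top examples."""
    if label_purity(top_labels) >= threshold:
        return "monosemantic"
    return "polysemantic"

clean = classify_neuron(["digit"] * 10)
mixed = classify_neuron(["digit", "digit", "color", "name"])
```

The result depends heavily on the label granularity and the threshold, which is part of why objective evaluation is hard.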