Sparse Autoencoders and Circuits

Implemented sparse autoencoders to decompose transformer activations into interpretable features. Traced computational circuits for specific tasks (induction, copying, relation extraction). Classified neurons as monosemantic or polysemantic.

What I Built

Key Concepts

Sparse Autoencoders
Mechanistic Interpretability
Computational Circuits
Monosemantic Neurons
Polysemantic Neurons
Feature Decomposition

Architecture

Sparse Autoencoder

Feature Extractor

Circuit Tracer

Activation Analyzer

Neuron Classifier
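
To make the Sparse Autoencoder component concrete, here is a minimal NumPy sketch of the forward pass and loss. The source does not specify the architecture, so the ReLU encoder, the L1 sparsity penalty, the unit-norm decoder initialization, and all names (SparseAutoencoder, l1_coeff) are assumptions for illustration. Only the 512 and 4096 dimensions come from the results reported on this page. The training loop is omitted.

```python
import numpy as np


class SparseAutoencoder:
    """Illustrative ReLU sparse autoencoder (assumed design, not the project's code).

    f     = ReLU((x - b_dec) @ W_enc + b_enc)   # sparse feature activations
    x_hat = f @ W_dec + b_dec                   # reconstruction
    """

    def __init__(self, d_in=512, d_hidden=4096, l1_coeff=1e-3, seed=0):
        rng = np.random.default_rng(seed)
        # Each decoder row is a unit-norm dictionary direction in activation space.
        W_dec = rng.standard_normal((d_hidden, d_in))
        W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)
        self.W_dec = W_dec
        # Start the encoder as the transpose of the decoder.
        self.W_enc = W_dec.T.copy()
        self.b_enc = np.zeros(d_hidden)
        self.b_dec = np.zeros(d_in)
        self.l1_coeff = l1_coeff  # hypothetical sparsity weight

    def encode(self, x):
        # ReLU keeps feature activations non-negative.
        return np.maximum(0.0, (x - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, f):
        return f @ self.W_dec + self.b_dec

    def loss(self, x):
        f = self.encode(x)
        x_hat = self.decode(f)
        # Squared reconstruction error plus L1 penalty, both averaged over the batch.
        mse = np.mean(np.sum((x - x_hat) ** 2, axis=1))
        l1 = np.mean(np.sum(np.abs(f), axis=1))
        return mse + self.l1_coeff * l1, f, x_hat


# Usage on random stand-in data (real inputs would be transformer activations).
sae = SparseAutoencoder()
x = np.random.default_rng(1).standard_normal((8, 512))
total, f, x_hat = sae.loss(x)
```

At random initialization the features are not yet sparse. In this formulation, sparsity would come from minimizing the L1 term during training.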

Results

Decomposed 512-dimensional activations into 4096 sparse features (an 8x overcomplete dictionary) with 95% reconstruction. Identified clear induction heads and copy circuits. 30% of neurons are monosemantic.
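
The page does not say which metric the 95% reconstruction figure refers to. One common choice is the fraction of variance explained, usually reported alongside the average number of active features per input. The sketch below shows both under that assumption; the function names are hypothetical.

```python
import numpy as np


def fraction_variance_explained(x, x_hat):
    """1 - residual sum of squares / total sum of squares around the per-dimension mean."""
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    residual = np.sum((x - x_hat) ** 2)
    total = np.sum((x - x.mean(axis=0)) ** 2)
    return 1.0 - residual / total


def mean_active_features(f, eps=0.0):
    """Average count of features with activation above eps per input (the L0 norm)."""
    return float(np.mean(np.sum(np.asarray(f) > eps, axis=1)))


# Toy check: a perfect reconstruction scores 1, predicting the mean scores 0.
x = np.array([[1.0, 2.0], [3.0, 6.0]])
perfect = fraction_variance_explained(x, x)
baseline = fraction_variance_explained(x, np.tile(x.mean(axis=0), (2, 1)))
l0 = mean_active_features(np.array([[0.0, 1.0, 0.0, 2.0], [0.0, 0.0, 0.0, 3.0]]))
```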

Key Learnings

Sparse autoencoders reveal surprisingly interpretable features
Induction heads are a fundamental circuit in transformers
Polysemanticity is the norm, not the exception
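
One common way to spot induction heads is to feed the model a random token block repeated twice and measure how much each head attends from a token in the second copy to the token that followed its earlier occurrence. The sketch below computes that score from a single head's attention matrix. It assumes the attention pattern has already been extracted. The name induction_score and the repeated-block setup are illustrative assumptions, not the method this project is known to have used.

```python
import numpy as np


def induction_score(attn, block_len):
    """Mean attention from second-copy position i to position i - block_len + 1.

    attn is a (2 * block_len, 2 * block_len) attention pattern for one head,
    with rows as query positions and columns as key positions.
    """
    attn = np.asarray(attn, dtype=float)
    seq_len = 2 * block_len
    if attn.shape != (seq_len, seq_len):
        raise ValueError("attn must be square with side 2 * block_len")
    queries = np.arange(block_len, seq_len)
    # The token at i also occurred at i - block_len; an induction head
    # attends to the token right after that earlier occurrence.
    targets = queries - block_len + 1
    return float(attn[queries, targets].mean())


# Toy patterns for a block of 4 tokens repeated twice.
T = 4
ideal = np.zeros((2 * T, 2 * T))
ideal[np.arange(T), np.arange(T)] = 1.0                 # first copy: attend to self
ideal[np.arange(T, 2 * T), np.arange(1, T + 1)] = 1.0   # second copy: induction target
uniform = np.tril(np.ones((2 * T, 2 * T))) / np.arange(1, 2 * T + 1)[:, None]

ideal_score = induction_score(ideal, T)
uniform_score = induction_score(uniform, T)
```

A head that scores close to 1 on such sequences is a strong induction-head candidate, while uniform causal attention scores much lower.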

Challenges

Evaluating feature interpretability objectively
Scaling sparse autoencoders to large models
Understanding polysemantic neurons
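
One simple heuristic for separating monosemantic from polysemantic neurons is label purity: annotate a neuron's top-activating examples with concept labels and check how dominant the most common label is. The sketch below is illustrative only. The page does not describe the criterion actually used, the 0.8 threshold is an arbitrary assumption, and the labels would themselves need human or automated annotation, which is where much of the evaluation difficulty lies.

```python
from collections import Counter


def label_purity(labels):
    """Fraction of top-activating examples that share the most common label."""
    if not labels:
        raise ValueError("need at least one labelled example")
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def classify_neuron(top_labels, threshold=0.8):
    """Call a neuron monosemantic when one label dominates (threshold is an assumption)."""
    return "monosemantic" if label_purity(top_labels) >= threshold else "polysemantic"


# Hypothetical concept labels for two neurons' top-activating examples.
clean = ["dna"] * 9 + ["legal"]
mixed = ["dna", "legal", "dna", "html", "legal"]
clean_result = classify_neuron(clean)
mixed_result = classify_neuron(mixed)
```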