Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Source: Anthropic / Transformer Circuits Date: 2023-01-01
Summary
This work applies sparse dictionary learning, implemented as a sparse autoencoder over model activations, to recover features that are more stable and more interpretable than raw neurons. It is one of the main routes by which internal activations might be turned into something closer to a manipulable vocabulary of primitives.
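The core mechanic can be sketched as a sparse autoencoder: activations are encoded into an overcomplete, nonnegative feature vector under an L1 sparsity penalty, then linearly decoded back. The sketch below is a minimal illustration of that objective; the dimensions, the L1 coefficient, and the random initialization are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Minimal sparse-autoencoder sketch of dictionary learning over
# activations. All sizes and the L1 coefficient are assumed for
# illustration, not taken from the paper.

rng = np.random.default_rng(0)

d_model = 16    # activation dimension (assumed)
d_feats = 64    # overcomplete dictionary size (assumed)
l1_coeff = 1e-3 # sparsity penalty weight (assumed)

W_enc = rng.normal(scale=0.1, size=(d_feats, d_model))
b_enc = np.zeros(d_feats)
W_dec = rng.normal(scale=0.1, size=(d_model, d_feats))
b_dec = np.zeros(d_model)

def encode(x):
    # ReLU encoder: nonnegative feature activations, which the
    # L1 penalty below pushes toward sparsity
    return np.maximum(0.0, W_enc @ x + b_enc)

def decode(f):
    # Linear decoder: reconstruct the activation as a sparse
    # combination of dictionary directions (columns of W_dec)
    return W_dec @ f + b_dec

def loss(x):
    f = encode(x)
    x_hat = decode(f)
    recon = np.sum((x - x_hat) ** 2)         # reconstruction error
    sparsity = l1_coeff * np.sum(np.abs(f))  # L1 penalty on features
    return recon + sparsity

x = rng.normal(size=d_model)
f = encode(x)
```

Training minimizes `loss` over many activation samples; the interpretable "features" are then the decoder directions together with their per-input activations `f`.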
Why it matters here
If NNPL wants interpretable latent primitives rather than opaque activation sludge, sparse-feature discovery is one of the best available candidate substrates.