Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Source: Anthropic / Transformer Circuits
Date: 2023-01-01

Summary

This work applies sparse dictionary learning (training sparse autoencoders on model activations) to recover features that are more stable and more interpretable than raw neurons, which tend to be polysemantic. It is one of the main routes by which internal activations might be turned into something closer to a manipulable vocabulary of primitives.
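The core object can be sketched in a few lines: an overcomplete encoder maps an activation vector to sparse feature coefficients, a linear decoder reconstructs the activation, and training balances reconstruction error against an L1 sparsity penalty. This is a minimal illustrative sketch, not the paper's implementation; the dimensions, initialization, and the `lam` coefficient are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_dict = 64, 512   # hypothetical sizes; dictionary is overcomplete (d_dict > d_model)
lam = 1e-3                  # illustrative L1 sparsity coefficient

# Randomly initialized parameters stand in for trained ones.
W_enc = rng.normal(0.0, 0.1, (d_dict, d_model))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(0.0, 0.1, (d_model, d_dict))
b_dec = np.zeros(d_model)

def encode(x):
    """Sparse feature coefficients: f = ReLU(W_enc @ x + b_enc)."""
    return np.maximum(0.0, W_enc @ x + b_enc)

def decode(f):
    """Linear reconstruction of the activation from dictionary features."""
    return W_dec @ f + b_dec

def loss(x):
    """Reconstruction error plus an L1 penalty that encourages sparse features."""
    f = encode(x)
    x_hat = decode(f)
    return np.sum((x - x_hat) ** 2) + lam * np.sum(np.abs(f))

x = rng.normal(size=d_model)  # stand-in for a model activation vector
f = encode(x)
```

The ReLU zeroes out roughly half of the feature coefficients even at random initialization; training with the L1 term pushes sparsity much further, so that each surviving feature can be inspected as a candidate interpretable unit.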

Why it matters here

If NNPL wants interpretable latent primitives rather than opaque activation sludge, sparse-feature discovery is one of the best available candidate substrates.