Toy Models of Superposition

Source: Anthropic / Transformer Circuits Date: 2022-09-14

Summary

This essay argues that neural networks can pack many more features into a representation space than a naive one-neuron-one-feature picture would suggest. The resulting superposition explains why internal representations are powerful but also treacherously entangled: features share directions, so reading one feature off a network's activations picks up interference from others.
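A minimal sketch of the packing effect (not the paper's trained toy model): if feature directions are chosen nearly at random in a space with fewer dimensions than features, pairwise overlaps stay well below 1, so many features can coexist at the cost of small mutual interference. The feature count, dimension, and random directions below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_dims = 50, 20  # more features than dimensions

# Random unit feature directions: nearly orthogonal in moderate dimension.
W = rng.normal(size=(n_features, n_dims))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Interference between distinct features = off-diagonal overlaps W W^T.
overlaps = W @ W.T
np.fill_diagonal(overlaps, 0.0)
max_interference = np.abs(overlaps).max()

print(f"{n_features} features in {n_dims} dims, "
      f"max pairwise overlap = {max_interference:.2f}")
```

Because the maximum overlap is far below 1, each feature can still be decoded approximately, which is the trade-off superposition exploits.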

Why it matters here

NNPL cannot assume that coordinates or even individual learned features have stable symbolic meanings. Any direct internal language has to contend with packed, overlapping, basis-dependent structure.