Neural-Native Programming via Direct Interfaces to Transformer Internal Layers

Question

Suppose we stop treating text tokens as the only respectable programming surface for language models. What would it take to define a model-facing language whose primary carrier is a latent program object coupled directly to transformer internals?

Short answer

The idea is credible as a research program, but only if it is made much more severe than the fashionable version.

A workable first thesis is not “let the model emit vibes in vector space.” It is:

define a latent IR with explicit semantics and a deterministic compiler
use layer-specific read heads such as tuned-lens-style translators to decode it from internal states
use carefully bounded write primitives such as activation engineering, contrastive activation addition, or small learned adapters to influence generation
evaluate it with execution-based benchmarks such as HumanEval, MBPP, and APPS
treat the latent artifact itself as a security boundary

Without those constraints, the project degenerates into a steganographic side channel that occasionally compiles. One can admire the bandwidth and still refuse to call it a language.

What looks genuinely promising

1. Residual-stream interfaces are a plausible substrate

The transformer-circuits framework and the Geva key-value-memory paper make the central intuition legible: transformer blocks repeatedly read from and write to a shared residual workspace. That does not by itself give us a language, but it does make “internal program state” a coherent target rather than pure mysticism.

2. Tuned read heads are more serious than naive unembedding

The logit lens is the charmingly reckless ancestor. The tuned lens is the grown-up version. If NNPL wants a stable read interface, it should begin with layer-specific translators and explicit ablations instead of pretending that one global unembedding is a transparent window into all layers.

3. Discrete or typed latent spines deserve priority

The best design instinct in the source set is to prefer VQ-style codes, typed latent graphs, or vector-symbolic structures with explicit binding and cleanup over unconstrained continuous streams. Latent Programmer, Grammar VAE, HRR, HDC, Ross Gayler’s VSA account, and the newer Doug work all point in this direction. Pure continuity is easy to optimize and miserable to debug.

4. Activation-level write paths already exist

Activation engineering, contrastive activation addition, PPLM, pyvene, LoRA, and tracing toolkits such as nnsight make it realistic to define a write protocol without retraining an entire foundation model from scratch. The real question is not whether writing is possible. It plainly is. The real question is whether the resulting interface can be made stable, local, and auditable enough to deserve trust.

5. Execution-based evaluation is mandatory

HumanEval, MBPP, and APPS give the right shape of criterion: not prettiness, not reconstruction loss, but whether the compiled artifact actually works. If NNPL cannot beat or at least match token-code baselines under those measures, its elegance will remain mostly decorative.

What still looks shaky

1. Internal representations remain entangled

Toy Models of Superposition and sparse-feature work are helpful partly because they deny us the lazy fantasy that internal coordinates just are symbols. Whatever NNPL becomes, it must survive feature packing, basis drift, and model-version instability.

2. The interface could collapse back into “just another tokenizer”

A discrete codebook can improve robustness and auditability, but it can also become a glorified private vocabulary. The point of NNPL is not to replace English tokens with worse tokens. The point is to exploit geometry, composition, and internal locality in ways ordinary tokenization cannot.

3. Safety gets harder, not easier, unless the boundary is explicit

Representation engineering, inference-time intervention, and latent-knowledge work all support the uncomfortable claim that models may contain usable internal structure that ordinary outputs do not surface faithfully. This makes NNPL powerful and also rather dangerous. If the latent object is not logged, validated, and monitored, the compiled output will be only a partial record of what actually happened.

Most sensible initial experiment

Pick one model family and freeze it.
Define a very small typed latent IR: arithmetic, branching, function calls, and a tiny effect discipline.
Compile that IR to Python.
Train layer-specific read heads to decode candidate IR states from selected residual-stream sites.
Compare three write methods: no write path, activation engineering, and a small LoRA-style interface.
Evaluate on a narrow HumanEval/MBPP slice with perturbation tests on the latent state.
Log every latent artifact and reject anything that fails structural or type validation before compilation.

That experiment would not prove NNPL in the grand sense. It would, however, force the right early answer to the uncomfortable question: is there a real interface here, or merely a tasteful hallucination? The tighter staged version now lives in neural-native-programming-research-program.

Where this fits in the harness wiki

This topic belongs here because it suggests a deeper control-plane boundary than ordinary chat or even non-linear-interface-options-for-next-harness. Most of the wiki so far asks how to structure work around an LLM. NNPL asks whether some of that structure should eventually move below token text and into a model-facing IR that the harness can inspect, compile, and constrain. In that sense it is adjacent to formal-cognition-loop and theorem-proving-as-cognitive-kernel: another candidate narrowing space, but one aimed at neural execution rather than logical proof.

Verdict for now

Pursue it, but with suspicion.

The credible version is:

typed latent IR first
execution and auditability first
mid-layer interface selection by experiment, not aesthetics
explicit monitors at the latent boundary

The non-credible version is:

pure continuous latent streams
no compiler or validator
no benchmark comparison against token baselines
mystical talk about models “thinking in vectors” as though that settled anything

Read this with neural-native-programming, neural-native-programming-research-program, formal-cognition-loop, theorem-proving-as-cognitive-kernel, non-linear-interface-options-for-next-harness, new-harness-design-notes, harness-engineering, and formal-methods-for-agent-harnesses.

Agent Harness Wiki

Explorer

Neural-Native Programming via Direct Interfaces to Transformer Internal Layers

Question

Short answer

What looks genuinely promising

1. Residual-stream interfaces are a plausible substrate

2. Tuned read heads are more serious than naive unembedding

3. Discrete or typed latent spines deserve priority

4. Activation-level write paths already exist

5. Execution-based evaluation is mandatory

What still looks shaky

1. Internal representations remain entangled

2. The interface could collapse back into “just another tokenizer”

3. Safety gets harder, not easier, unless the boundary is explicit

Most sensible initial experiment

Where this fits in the harness wiki

Verdict for now

Graph View

Table of Contents

Backlinks

Agent Harness Wiki

Explorer

Neural-Native Programming via Direct Interfaces to Transformer Internal Layers

Question

Short answer

What looks genuinely promising

1. Residual-stream interfaces are a plausible substrate

2. Tuned read heads are more serious than naive unembedding

3. Discrete or typed latent spines deserve priority

4. Activation-level write paths already exist

5. Execution-based evaluation is mandatory

What still looks shaky

1. Internal representations remain entangled

2. The interface could collapse back into “just another tokenizer”

3. Safety gets harder, not easier, unless the boundary is explicit

Most sensible initial experiment

Where this fits in the harness wiki

Verdict for now

Related pages

Graph View

Table of Contents

Backlinks