SYNTHESIS NOTE
Model Architecture and Internals · Reasoning, Retrieval, and Evaluation · Training, RL, and Test-Time Scaling

Can models treat long prompts as external code environments?

Do language models handle vastly longer inputs by offloading context to a Python REPL and querying it programmatically, rather than fitting everything into the transformer's attention window?

Synthesis note · 2026-02-23 · sourced from Inference time scaling
How should we allocate compute budget at inference time? How should researchers navigate LLM reasoning research?

Context rot, the degradation in output quality as context lengthens, affects even frontier models like GPT-5. Extending context windows is an arms race: each increase buys more capacity but does not solve the underlying problem that attention-based processing degrades with length. Recursive Language Models (RLMs) sidestep the problem by changing where the context lives.

The key insight: long prompts should not be fed into the transformer directly. Instead, they should be treated as part of an external environment that the model can symbolically interact with. In the RLM implementation, the prompt is stored as a variable in a Python REPL. The model reads, filters, chunks, and queries its context through code execution rather than token-space attention.
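
As a rough illustration of "the prompt is a variable, not an input", the sketch below keeps a long prompt inside an execution namespace and only returns truncated printed output to the caller. The class name `ContextREPL`, the variable name `context`, and the `max_output_chars` cap are assumptions made for this example; they are not taken from the RLM implementation, and `exec` is used here without any sandboxing.

```python
import contextlib
import io


class ContextREPL:
    """Toy stand-in for the REPL: holds the prompt as a variable, never as model input."""

    def __init__(self, prompt: str, max_output_chars: int = 500):
        # The long prompt lives here, outside the transformer's context window.
        self.namespace = {"context": prompt}
        self.max_output_chars = max_output_chars

    def run(self, code: str) -> str:
        """Execute a snippet and return its printed output, truncated to the cap."""
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(code, self.namespace)  # unsafe outside a sandbox; illustration only
        return buffer.getvalue()[: self.max_output_chars]


# A 900,000-character prompt that is only ever touched through code.
repl = ContextREPL("line one\nline two\n" * 50_000)
size_report = repl.run("print(len(context))")
peek = repl.run("print(context[:17])")
```

In this sketch the model would only ever see the strings returned by `run`, so the amount of text entering its context is bounded by `max_output_chars` regardless of how long the stored prompt is.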

Two mechanisms make this work:

Model priors enable context filtering without reading the full context. The model uses its existing knowledge to construct targeted queries: regex searches for keywords, printing specific line ranges to inspect, and narrowing the search space based on its understanding of the task. It doesn't need to attend to 100K tokens to find the relevant 500. This is analogous to how humans skim a long document: prior knowledge guides where to look.
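
The kind of query described above might look like the following sketch: a keyword regex narrows a large context to a few candidate lines, and a line-range print inspects their surroundings. The helper names `grep_lines` and `show_lines` and the synthetic log data are hypothetical, chosen only for illustration.

```python
import re


def grep_lines(context: str, pattern: str, limit: int = 5) -> list[tuple[int, str]]:
    """Return up to `limit` (line_number, line) pairs whose text matches `pattern`."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for number, line in enumerate(context.splitlines(), start=1):
        if regex.search(line):
            hits.append((number, line))
            if len(hits) == limit:
                break
    return hits


def show_lines(context: str, start: int, end: int) -> str:
    """Return lines start..end (1-indexed, inclusive) for closer inspection."""
    return "\n".join(context.splitlines()[start - 1 : end])


# Synthetic context: 10,000 filler lines with one relevant line buried inside.
lines = [f"log entry {i}: nothing notable" for i in range(1, 10_001)]
lines[6_999] = "log entry 7000: invoice total was 42 EUR"
context = "\n".join(lines)

# Task understanding ("the question is about an invoice") guides the pattern.
hits = grep_lines(context, r"invoice|total")
neighbourhood = show_lines(context, 6_999, 7_001)
```

Only the handful of matching lines and the three-line neighbourhood would need to reach the model; the other lines are never read.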

Recursive sub-calls defer unbounded reasoning chains. When the task requires reasoning over multiple chunks, the model spawns sub-RLM calls, each operating on a manageable portion of the context. The decomposition is dynamic: the model decides how to partition based on what it observes rather than following a predefined chunking strategy.
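
The recursive structure can be sketched as below, with two loud caveats. First, the partition here is a fixed split near the midpoint, used only as a stand-in; in an RLM the model itself chooses the partition after inspecting the context. Second, `keyword_model` is a deterministic stub in place of a language model. All names (`recursive_answer`, `ModelFn`, `max_chars`, `max_depth`) are hypothetical.

```python
from typing import Callable

# Hypothetical signature: (query, text) -> answer. A real system would call an LLM here.
ModelFn = Callable[[str, str], str]


def recursive_answer(query: str, context: str, call_model: ModelFn,
                     max_chars: int = 2_000, depth: int = 0, max_depth: int = 8) -> str:
    """Answer `query` over `context`, recursing while the context is too large."""
    if len(context) <= max_chars or depth >= max_depth:
        return call_model(query, context[:max_chars])

    # Stand-in partition: split at the last line break before the midpoint.
    # In an RLM the model picks the partition based on what it observes.
    middle = context.rfind("\n", 0, len(context) // 2)
    middle = middle if middle > 0 else len(context) // 2
    partial_answers = [
        recursive_answer(query, part, call_model, max_chars, depth + 1, max_depth)
        for part in (context[:middle], context[middle:])
    ]
    # Combine the short sub-answers with one more call; the raw context is not re-read.
    return call_model(query, "\n".join(partial_answers))


seen_sizes = []  # records how much text each stub call actually received


def keyword_model(query: str, text: str) -> str:
    """Deterministic stub standing in for an LLM: returns lines mentioning the query."""
    seen_sizes.append(len(text))
    return "\n".join(line for line in text.splitlines() if query in line)


filler = [f"filler line {i}" for i in range(2_000)]
filler[1_234] = "the launch code is 1234"
answer = recursive_answer("launch code", "\n".join(filler), keyword_model)
```

In this toy run every individual call receives at most `max_chars` characters, even though the full context is far larger; the depth guard exists only to keep the sketch from recursing without bound.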

Results: RLMs handle inputs up to two orders of magnitude beyond model context windows. On shorter prompts (within context limits), RLMs still dramatically outperform base models and common long-context scaffolds, including context compaction. The cost per query is comparable or lower because the model processes only the relevant portions of the context rather than attending to everything.

This connects to "Can models precompute answers before users ask questions?" as a second reframing of compute allocation: sleep-time compute asks WHEN to compute (before vs. during the query); RLMs ask WHERE to keep the data (the model's context vs. an external environment). Both reject the default of "stuff everything into the context window at query time."

Original note title

recursive language models treat long prompts as external environment enabling programmatic interaction 100x beyond context windows