SYNTHESIS NOTE

Where do memorization errors arise in chain-of-thought reasoning?

Explores whether memorization in language model reasoning can be localized to specific token sources and which sources dominate error patterns during long generations.

Synthesis note · 2026-02-23 · sourced from Memory

STIM (Source-aware Token-level Identification of Memorization) argues that memorization in long CoT generations must be identified at the token level, not the sequence level. A single faulty token — produced by memorization rather than reasoning — can trigger cascading errors through subsequent steps. Existing metrics report a single score for the entire sequence, missing where and why individual tokens go wrong.

Three distinct memorization sources influence each token:

  1. Local memorization — frequent continuations of immediately preceding tokens. The model generates the next token based on statistical co-occurrence with its local context, not reasoning. This is the dominant error source, responsible for up to 67% of wrong tokens.

  2. Mid-range memorization — tokens that frequently co-occur with the generation prefix. The model has seen this pattern in pretraining and reproduces it, even when the current reasoning context requires a different continuation.

  3. Long-range memorization — frequent co-occurrence with tokens in the input prompt. The prompt triggers a familiar pattern from pretraining that overrides the reasoning chain.
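The three-way attribution above can be illustrated with a minimal sketch. This note does not describe how STIM computes its per-source scores, so the sketch takes them as given inputs. The function names, the dictionary keys, and the toy values are all hypothetical and are not results from the paper.

```python
from collections import Counter

# Hypothetical labels for the three sources named in this note.
SOURCES = ("local", "mid_range", "long_range")


def dominant_source(scores):
    """Return the source with the highest memorization score for one token."""
    return max(SOURCES, key=lambda source: scores[source])


def source_shares(token_scores, is_wrong):
    """Fraction of wrong tokens whose dominant source is each of the three."""
    wrong_scores = [s for s, wrong in zip(token_scores, is_wrong) if wrong]
    counts = Counter(dominant_source(s) for s in wrong_scores)
    total = len(wrong_scores)
    return {src: (counts[src] / total if total else 0.0) for src in SOURCES}


# Toy input (made up for illustration): one score dict per generated token,
# plus a flag marking whether that token was judged wrong.
token_scores = [
    {"local": 0.9, "mid_range": 0.2, "long_range": 0.1},
    {"local": 0.1, "mid_range": 0.7, "long_range": 0.3},
    {"local": 0.8, "mid_range": 0.1, "long_range": 0.2},
    {"local": 0.2, "mid_range": 0.1, "long_range": 0.6},
]
is_wrong = [True, True, True, False]

shares = source_shares(token_scores, is_wrong)
```

A share computed this way is the kind of quantity behind a statement like "local memorization accounts for up to 67% of wrong tokens", though the exact attribution rule used by STIM may differ from a simple arg-max.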

Key distributional findings:

This connects to the broader reasoning trace reliability cluster. Building on the related note "Which sentences actually steer a reasoning trace?", STIM adds a complementary mechanism: specific tokens at the sub-sentence level carry memorization-driven influence that can derail even well-structured reasoning chains. The failure is more granular than the thought level: it operates at individual tokens.

The practical implication: high memorization scores are strong indicators of reasoning failures (measured via Precision@k and Recall@k). This offers a potential diagnostic tool for identifying where reasoning chains are unreliable, independent of whether the final answer is correct. The diagnostic also bears directly on the faithfulness problem raised in the related note "Do language models actually use their reasoning steps?": STIM's memorization scores provide a token-level mechanism for faithfulness failure. Memorized tokens are causally unnecessary (the answer was determined by pattern-matching, not reasoning) and causally insufficient (the memorized continuation may diverge from the reasoning the chain appears to perform).
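For readers unfamiliar with the metrics, Precision@k and Recall@k can be sketched as follows. This is the standard ranking-metric definition applied to tokens ranked by memorization score. The exact evaluation protocol used by STIM is not given in this note, so the function name, the inputs, and the toy values are assumptions for illustration only.

```python
def precision_recall_at_k(scores, is_wrong, k):
    """Rank tokens by memorization score and score the top k against error labels.

    scores   : one memorization score per token (assumed to be given)
    is_wrong : one boolean per token, True if the token was judged wrong
    k        : number of top-scoring tokens to inspect
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # Token indices sorted from highest to lowest memorization score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top_k = ranked[:k]
    hits = sum(1 for i in top_k if is_wrong[i])
    total_wrong = sum(1 for wrong in is_wrong if wrong)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / total_wrong if total_wrong else 0.0
    return precision, recall


# Toy values, made up for illustration.
toy_scores = [0.9, 0.1, 0.8, 0.3, 0.7]
toy_is_wrong = [True, False, False, False, True]

p_at_2, r_at_2 = precision_recall_at_k(toy_scores, toy_is_wrong, k=2)
```

High precision at small k would mean the highest-scoring tokens are mostly wrong tokens; high recall would mean most wrong tokens appear among the top-scoring ones.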

Original note title

token-level memorization in CoT reasoning has three distinct sources and local memorization causes up to 67 percent of reasoning errors