Do thought anchors correspond mechanistically to planning tokens in RL?
This explores whether the 'thought anchors' found by interpretability work (the sentences that actually steer a reasoning trace) are the same thing, mechanistically, as the 'planning tokens' that RL training optimizes — i.e. is one phenomenon described twice, under two methodologies?
The corpus doesn't prove thought anchors and planning tokens are identical, but it lines up three independent lines of evidence that all point at the same small set of sentences, which is the interesting part.
Start with what each side found on its own. The interpretability route used counterfactual resampling, attention analysis, and causal suppression, and all three converged on the same answer: a sparse handful of planning and backtracking sentences carry most of the steering power in a trace (Which sentences actually steer a reasoning trace?). The RL route, watching eight models train, found that learning splits into two phases: first execution gets nailed down, then strategic planning becomes the bottleneck. Planning-token entropy keeps rising while execution entropy flattens, and concentrating optimization on those planning tokens yields significant gains (Does RL training follow a predictable two-phase learning sequence?). Two different methods, two different questions, and both land on 'the planning moments are where the leverage is.' That convergence is the strongest case that they're tracking the same structure.
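To make the entropy comparison concrete, here is a minimal sketch of how planning-token and execution-token entropy could be measured side by side. The function names, the toy distributions, and the planning/execution labels are illustrative assumptions, not taken from the cited work; in practice the labels would have to come from some annotation or classifier over the trace.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_entropy_by_group(step_probs, is_planning):
    """Average entropy separately over planning and execution positions.

    step_probs  : list of next-token distributions, one per position
    is_planning : list of booleans, True where the position is a planning token
                  (hypothetical labels; how to obtain them is out of scope here)
    """
    groups = {"planning": [], "execution": []}
    for probs, planning in zip(step_probs, is_planning):
        key = "planning" if planning else "execution"
        groups[key].append(token_entropy(probs))
    return {
        key: (sum(values) / len(values) if values else float("nan"))
        for key, values in groups.items()
    }

# Toy data: the model is undecided at planning positions and
# nearly deterministic at execution positions.
step_probs = [
    [0.25, 0.25, 0.25, 0.25],  # planning
    [0.97, 0.01, 0.01, 0.01],  # execution
    [0.50, 0.50, 0.00, 0.00],  # planning
    [1.00, 0.00, 0.00, 0.00],  # execution
]
is_planning = [True, False, True, False]

result = mean_entropy_by_group(step_probs, is_planning)
print(result)
```

Tracking these two averages across training checkpoints is one way the "planning entropy rises, execution entropy flattens" pattern could be checked on a given model.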
A third, lower-level result tightens the link. Tokens like 'Wait' and 'Therefore' show sharp spikes in mutual information with the correct answer, and suppressing them hurts reasoning while suppressing equal numbers of random tokens does not (Do reflection tokens carry more information about correct answers?). Those are exactly the lexical markers of backtracking and transition that the thought-anchors work flags as pivots, so the 'anchor' you find by causal surgery and the 'planning token' you find by information theory look like the same physical thing in the sequence.
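The mutual-information idea can be illustrated with a plug-in estimate over discrete samples. This is a deliberately simplified stand-in: it pairs a binary "token present" flag with a binary "answer correct" flag, whereas the cited result concerns the model's internal representations at those tokens. The function name and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Plug-in estimate of I(X;Y) in bits from a list of (x, y) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    count_x = Counter(x for x, _ in pairs)
    count_y = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), expressed with raw counts
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi

# Toy case 1: the token's presence perfectly predicts correctness.
informative = [(1, 1)] * 5 + [(0, 0)] * 5
# Toy case 2: presence and correctness are independent.
uninformative = [(1, 1), (1, 0), (0, 1), (0, 0)]

mi_informative = mutual_information(informative)      # 1.0 bit
mi_uninformative = mutual_information(uninformative)  # 0.0 bits
print(mi_informative, mi_uninformative)
```

A real analysis would also need the random-token control described above: the same estimate computed for an equal number of arbitrary tokens, so that a spike at 'Wait' or 'Therefore' can be told apart from a baseline.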
But here's the twist the corpus adds, and it's a real one: RL may not be creating these anchors at all, only learning when to fire them. Evidence suggests base models already hold reasoning strategies in latent form, and RL post-training optimizes deployment timing rather than capability: hybrid models recover 91% of the gains by routing tokens only, and the activation vectors for reasoning strategies exist before any RL touches the model (Does RL post-training create reasoning or just deploy it?). If that's right, thought anchors are pre-existing structures and 'planning tokens in RL' is just the training signal that learns to place them well. That is correspondence, but not because RL built them.
And worth holding next to all of this: a skeptical result argues reasoning traces are stylistic mimicry, generated identically to any other output, with invalid traces routinely producing correct answers. In other words, the tokens correlate with answers via learned formatting, not functional computation (Do reasoning traces actually cause correct answers?). That's in tension with the causal-suppression evidence, and the honest read is that the field hasn't fully reconciled 'these specific sentences are causally load-bearing' with 'traces in general aren't.' One way to thread it: the anchors are causal, the connective filler between them isn't. If you want the upstream version of the same question, lookahead and reinforcement-pretraining work shows planning structure can be planted into the data and into next-token prediction long before any RL stage (Can embedding future information in training data improve planning? and Can next-token prediction become a reasoning task with RL?), which suggests 'planning tokens' aren't owned by RL at all.
Sources (7 notes)
Counterfactual resampling, attention analysis, and causal suppression all identify planning and backtracking sentences as thought anchors—sparse critical points that guide subsequent reasoning. These are functional pivots, not noise.
Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.
Specific tokens like "Wait" and "Therefore" show sharp spikes in mutual information with correct answers. Suppressing them harms reasoning while suppressing an equal number of random tokens does not, and representation recycling improves accuracy by 20%.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.
R1's intermediate tokens carry no special execution semantics and are generated identically to other LLM output. Invalid traces frequently produce correct answers, which the note takes as evidence that traces are not causally necessary: they correlate with answers via learned formatting, not functional reasoning.
TRELAWNEY augments training data with special tokens encapsulating future information, allowing models to learn goal-conditioned generation using standard infrastructure. Results show improved planning, algorithmic reasoning, and story generation without modifying architecture or training procedures.
Reinforcement Pre-Training transforms next-token prediction into a reasoning task by providing verifiable rewards from the corpus itself, eliminating reward hacking and enabling inference-time scaling during pretraining. This suggests token-level reasoning patterns during pretraining strengthen downstream RL fine-tuning.