INQUIRING LINE

Can reinforcement learning close the gap between LLM reasoning and action?

This explores whether RL can turn an LLM's ability to *reason about* a problem into the ability to reliably *act* on it — the 'knowing-doing' gap — and what the corpus says RL actually changes inside the model when it tries.


This reads the question as the knowing-doing gap: an LLM can often articulate the right reasoning yet fail to execute it as competent action. The most direct evidence that RL can close this is Think-In Games (Can language modeling close the knowing-doing gap in AI?), where LLMs generate language-guided policies that get refined by environmental feedback. Declarative knowledge ('I know what should happen') becomes procedural competence ('I can make it happen'), while staying explainable at each step. In a similar spirit, RLAG (Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?) shows that RL internalizes coherent knowledge better than supervised fine-tuning because it rewards reasoning quality rather than token-level correctness: it teaches the model to *use* knowledge, not just recite it.

But here's the twist a curious reader might not expect: a cluster of papers argues RL doesn't add reasoning at all; it selects from what's already there. RLVR improves which solutions get sampled without expanding the set of solvable problems, and base models actually beat RL-tuned models at high sampling budgets (Does RLVR actually expand what models can reason about?). The training-dynamics work shows that a single training example, or even spurious rewards, can trigger the gains in models with suitable pretraining (What does reward learning actually do to model reasoning?), and five independent methods all converge on the same conclusion: post-training *elicits* latent capability rather than creating it (Do base models already contain hidden reasoning ability?). Mechanistically, RL updates only 5–30% of parameters, in structured subnetworks that are nearly identical across random seeds (Does reinforcement learning update only a small fraction of parameters?). It's a precise selection, not a rewrite.

Reframe the question through that lens and the answer sharpens: RL closes the reasoning-action gap not by teaching new reasoning but by converting reasoning the model already has into reliable, feedback-shaped behavior. The reasoning was latent; action requires committing to it under environmental consequences, and that's what reward provides.

There are real ceilings, though. Reasoning LLMs 'wander' rather than search systematically, so success drops exponentially as problems deepen (Why do reasoning LLMs fail at deeper problem solving?). RL that only narrows sampling won't fix a model whose exploration lacks validity, effectiveness, and necessity. And the gap may be semantic at root: LLMs reason through token associations, not symbolic logic, and collapse when meaning is stripped away (Do large language models reason symbolically or semantically?). RL can't reward its way past a representational limit.

Worth knowing for where this goes next: action competence may not require weight updates at all. AgentFly reaches 87.88% on GAIA validation by doing RL-style credit assignment entirely through episodic memory, leaving the LLM's parameters frozen (Can agents learn continuously from experience without updating weights?). If reasoning is already latent and RL mostly selects, then closing the reasoning-action gap might increasingly happen in memory and tool orchestration (Can modular cognitive tools unlock reasoning without training?) rather than in the weights, which quietly reframes what 'reinforcement learning' even needs to be.


Sources (10 notes)

Can language modeling close the knowing-doing gap in AI?

Think-In Games demonstrates that when LLMs generate language-guided policies refined by environmental feedback, they develop procedural competence while retaining explainability. The approach dramatically reduces data demands and makes agent reasoning transparent at every step.

Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?

RLAG rewards both answer accuracy and explanation rationality by cycling between augmented and unaugmented generation, progressively internalizing coherent knowledge structures. This outperforms SFT because it prioritizes reasoning quality over token-level correctness.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
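
To make the pass@k argument concrete, here is a minimal sketch using the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The per-problem counts below are hypothetical, chosen only to illustrate the crossover; they are not results from the cited work.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n samples drawn, c of them correct."""
    if n - c < k:
        # Fewer than k incorrect samples, so any k draws include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

N = 100  # samples per problem (assumed for illustration)

# Hypothetical correct-sample counts per problem, out of N.
base_correct = [5, 5, 5, 5]    # base model: rarely right, but on every problem
rl_correct = [90, 90, 90, 0]   # tuned model: usually right on three, never on the fourth

def mean_pass_at_k(correct_counts, k):
    """Average pass@k over a list of per-problem correct counts."""
    return sum(pass_at_k(N, c, k) for c in correct_counts) / len(correct_counts)

for k in (1, 10, 100):
    print(k, round(mean_pass_at_k(base_correct, k), 3),
          round(mean_pass_at_k(rl_correct, k), 3))
```

In this toy setup the tuned model wins at k = 1 (0.675 against 0.05) but loses at k = 100 (0.75 against 1.0), because the problem it never solves stays unsolved no matter how many samples are drawn. That is the shape of the narrowing claim, under these made-up numbers.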

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL updates only 5–30% of parameters, a sparsity that arises without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.
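
A figure like 5–30% can be read as the fraction of weights that differ between a base checkpoint and its RL-tuned counterpart. The sketch below shows one way such a fraction might be measured; the function name, the tolerance parameter, and the toy weight lists are all assumptions for illustration, not the cited study's procedure.

```python
def update_fraction(base, tuned, tol=0.0):
    """Fraction of parameters whose value changed by more than tol."""
    if len(base) != len(tuned):
        raise ValueError("checkpoints must have the same number of parameters")
    changed = sum(1 for b, t in zip(base, tuned) if abs(t - b) > tol)
    return changed / len(base)

# Toy flattened weight vectors (hypothetical values).
base_weights = [0.10, -0.20, 0.30, 0.40, -0.50, 0.60, 0.70, -0.80, 0.90, 1.00]
tuned_weights = [0.10, -0.20, 0.35, 0.40, -0.50, 0.60, 0.70, -0.75, 0.90, 1.00]

fraction = update_fraction(base_weights, tuned_weights)
print(f"{fraction:.0%} of parameters updated")
```

Here two of ten weights move, so the update fraction is 20%. On real checkpoints the same comparison would run over every tensor, and a small nonzero tolerance may be needed to ignore numerical noise.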

Why do reasoning LLMs fail at deeper problem solving?

Current reasoning models lack the three properties of systematic exploration: validity, effectiveness, and necessity. This causes success probability to drop exponentially with problem depth, making medium problems solvable but deep problems catastrophically harder.
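
A simplified way to see why depth hurts so much: assume, purely for illustration, that each reasoning step succeeds independently with the same probability p. This independence assumption is not taken from the cited work; it is only the simplest model that produces exponential decay.

```latex
P(\text{success at depth } d) = p^{d}, \qquad
p = 0.9:\quad p^{5} \approx 0.59,\quad p^{20} \approx 0.12,\quad p^{50} \approx 0.005
```

Even a 90% per-step success rate leaves roughly a 12% chance of finishing a 20-step problem under this model, which matches the pattern of medium problems being solvable while deep ones become far harder.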

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Can agents learn continuously from experience without updating weights?

AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.
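
The idea of improving a policy through memory alone can be sketched with a tiny case bank that records outcomes and reuses whichever action has paid off best. This is a hypothetical simplification, not AgentFly's implementation: the class name, exact-match lookup, and mean-reward scoring are all assumptions, and a real system would retrieve similar cases rather than identical ones.

```python
from collections import defaultdict

class CaseMemory:
    """Hypothetical episodic case bank keyed by (state, action)."""

    def __init__(self):
        # (state, action) -> [reward sum, visit count]
        self._stats = defaultdict(lambda: [0.0, 0])

    def write(self, state, action, reward):
        """Record the outcome of one episode; no model weights are touched."""
        entry = self._stats[(state, action)]
        entry[0] += reward
        entry[1] += 1

    def best_action(self, state, candidates):
        """Pick the candidate with the highest mean recorded reward."""
        def mean_reward(action):
            total, count = self._stats.get((state, action), (0.0, 0))
            return total / count if count else 0.0
        return max(candidates, key=mean_reward)

memory = CaseMemory()
memory.write("find paper date", "web_search", 1.0)
memory.write("find paper date", "guess", 0.0)
memory.write("find paper date", "web_search", 1.0)

choice = memory.best_action("find paper date", ["guess", "web_search"])
print(choice)  # web_search
```

Credit assignment here lives entirely in the stored statistics: behavior changes because the memory changes, while the underlying model stays frozen.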

Can modular cognitive tools unlock reasoning without training?

Four cognitive tools implemented as sandboxed LLM calls improved GPT-4.1 on AIME2024 from 26.7% to 43.3% without any RL training. Modularity enforces operation isolation that pure prompting cannot guarantee, eliciting pre-existing reasoning capability.
