Can language modeling close the knowing-doing gap in AI?
Current LLMs reason well but act poorly in interactive tasks, while RL agents act well but cannot explain themselves. Can reformulating decision-making as language modeling with environmental feedback bridge this fundamental split?
A central paradox in current AI is that LLMs excel at complex reasoning (math, code) yet often fail at simple interactive tasks that young children perform effortlessly. Conversely, traditional RL agents acquire procedural knowledge through environmental interaction but operate as black boxes. The split is between declarative knowledge (knowing about something, which LLMs do well) and procedural knowledge (knowing how to do something, which RL agents do well).
Think-In Games (TiG, arXiv:2508.21365) reformulates decision-making as a language modeling task. The LLM generates language-guided policies, and these policies are refined iteratively through online reinforcement learning based on environmental feedback. The result: LLMs develop procedural understanding through direct interaction with the game environment while retaining their inherent reasoning and explanatory abilities. Critically, the policy is language, so the agent can explain its decisions at every step.
The architectural move is consequential. Traditional RL outputs actions; the policy is opaque. TiG outputs language describing actions; the policy is transparent. The environmental reward refines the language-policy directly — the language IS the policy parameterization. This means the agent's procedural competence becomes inspectable in the way declarative knowledge already was.
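The loop described above can be sketched as a toy, assuming a fixed set of candidate text outputs stands in for the LLM. This is not TiG's training code: `CANDIDATES`, `REWARD`, `parse_action`, `train`, the reward values, and the group-mean baseline are all hypothetical names and choices made for illustration. A real system would sample text from an LLM and update its weights instead of a small logit vector.

```python
import math
import random

# Hypothetical policy outputs: each is a short rationale ending in an action tag.
CANDIDATES = [
    "Our tower is under siege and two allies are nearby. Action: defend_tower",
    "The enemy carry is out of position, so a fight favors us. Action: team_fight",
    "Dragon spawns soon and we have lane priority. Action: secure_objective",
]

# Assumed environment feedback: one scalar reward per parsed action.
REWARD = {"defend_tower": 0.2, "team_fight": 0.0, "secure_objective": 1.0}


def parse_action(text: str) -> str:
    """Extract the action tag that follows the last 'Action:' in the text."""
    return text.rsplit("Action:", 1)[1].strip()


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(steps=300, group_size=8, lr=0.5, seed=0):
    """REINFORCE over text outputs, using the group mean reward as baseline."""
    rng = random.Random(seed)
    logits = [0.0] * len(CANDIDATES)  # stand-in for LLM parameters
    for _ in range(steps):
        probs = softmax(logits)
        # Sample a group of text outputs from the current policy.
        idxs = rng.choices(range(len(CANDIDATES)), weights=probs, k=group_size)
        # The environment only sees the action parsed out of the text.
        rewards = [REWARD[parse_action(CANDIDATES[i])] for i in idxs]
        baseline = sum(rewards) / group_size
        grad = [0.0] * len(logits)
        for i, r in zip(idxs, rewards):
            advantage = r - baseline
            for j in range(len(logits)):
                # Gradient of log softmax: indicator minus probability.
                grad[j] += advantage * ((1.0 if j == i else 0.0) - probs[j])
        logits = [l + lr * g / group_size for l, g in zip(logits, grad)]
    return softmax(logits)


if __name__ == "__main__":
    final_probs = train()
    best = max(range(len(CANDIDATES)), key=final_probs.__getitem__)
    # The learned policy is itself a readable sentence with its rationale.
    print(CANDIDATES[best])
```

The point of the sketch is that what gets reinforced is a sentence containing both the rationale and the action, so the object that reward shapes remains readable throughout training.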
Two consequences follow. First, data and computational demands are dramatically lower than with conventional RL methods: because the LLM brings strong priors about which policies are reasonable, RL training only needs to refine those priors against environmental signal rather than learn from scratch. Second, step-by-step natural language explanations for decisions improve transparency and interpretability, so the same property that makes LLMs trustworthy in declarative tasks now extends to procedural ones.
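The first consequence can be illustrated with a deliberately simple model, assuming a softmax policy over 20 candidate policies of which exactly one is rewarded. The sketch compares how many exact policy-gradient steps are needed when starting from a uniform policy versus one whose initial logits already favor the good candidate. The function name `steps_to_threshold`, the reward vector, the prior strength of 3.0, and the 0.9 threshold are all assumptions chosen for illustration, not figures from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def steps_to_threshold(init_logits, rewards, lr=0.5, threshold=0.9,
                       max_steps=10_000):
    """Count exact policy-gradient steps until the best action is likely.

    Uses the expected gradient of J = sum_i p_i * r_i with respect to
    logit j, which is p_j * (r_j - J), so the run is deterministic.
    """
    logits = list(init_logits)
    best = max(range(len(rewards)), key=rewards.__getitem__)
    for step in range(max_steps):
        probs = softmax(logits)
        if probs[best] >= threshold:
            return step
        expected = sum(p * r for p, r in zip(probs, rewards))
        logits = [l + lr * p * (r - expected)
                  for l, p, r in zip(logits, probs, rewards)]
    return max_steps


if __name__ == "__main__":
    rewards = [0.0] * 19 + [1.0]           # one good policy among 20
    from_scratch = [0.0] * 20              # uniform start, no prior
    with_prior = [0.0] * 19 + [3.0]        # assumed prior favoring the good one
    print("steps from scratch:", steps_to_threshold(from_scratch, rewards))
    print("steps with prior:  ", steps_to_threshold(with_prior, rewards))
```

In this toy setting the prior-initialized policy always needs fewer updates, because it starts further along the same optimization path. How large the saving is in a real game environment is an empirical question this sketch does not answer.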
The deeper claim is about the nature of intelligence: declarative and procedural knowledge are not categorically separate substrates that need joining — they can be unified if procedural competence is parameterized in the same medium (language) as declarative competence. The reward gradient refines the language; the language is the procedure.
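One way to make "the reward gradient refines the language" concrete is the standard policy-gradient (REINFORCE) identity written over generated tokens. This is the textbook form and not necessarily the exact objective TiG optimizes. The symbols are assumptions introduced here: s is the textual description of the game state, y is the generated text with tokens y_t, R(s, y) is the environmental reward, and π_θ is the LLM.

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid s)}
    \left[ R(s, y) \sum_{t=1}^{|y|}
      \nabla_\theta \log \pi_\theta\!\left(y_t \mid s, y_{<t}\right) \right]
```

Under this form, every token of the explanation and the action it names receives credit from the same scalar reward, which is what ties the procedure to its description.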
This connects to the note "Why do language models fail to act on their own reasoning?": the knowing-doing gap (declarative ≠ procedural in current LLMs) is exactly what TiG addresses. Where the greedy-agents paper diagnoses the gap as architectural, TiG argues it is a training-objective gap that RL on a language-policy can close. Both find that reinforcement learning fine-tuning (RLFT) narrows the gap; TiG provides the mechanism for why.
For MOBA-game macro-level reasoning specifically, the LLM brings strategic-thinking priors; RL refines them against game outcomes; explanations come for free.
Inquiring lines that use this note as a source (8)
This note is a source for these synthesized inquiries.
- How does the knowing-doing gap widen as tasks become more complex?
- What role does natural language play in breaking reinforcement learning performance plateaus?
- What data presentation structures enable LLMs to learn decision-making from examples?
- How do knowing and doing diverge in LLM decision-making?
- Can reinforcement learning close the gap between LLM reasoning and action?
- How does the knowing-doing gap relate to Potemkin understanding?
- What distinguishes communicative acts from operational actions in agentic LLMs?
- How do perception and execution gaps limit current AI agent performance?
Related concepts in this collection (4)
- Why do language models fail to act on their own reasoning?
  LLMs produce correct explanations far more often than they produce correct actions. What causes this knowing-doing gap, and can training methods close it?
  diagnoses the gap; TiG provides the architectural fix (language-as-policy + RL)
- Does thinking emerge when agents choose between learned sub-policies?
  Can we formally understand thinking as the selection of pre-existing sub-policies during reinforcement learning? This explores whether thinking requires new capabilities or just the right conditions to activate what's already there.
  TiG instantiates this theoretical result: LLM sub-policies (strategic-reasoning patterns) become selectable through RL refinement
- Can agent deployment itself generate training signals automatically?
  Can we extract learning signals from the natural next-states that agents encounter during real deployment (user replies, tool outputs, test verdicts) rather than relying on separate annotation pipelines? This reframes how agents improve continuously.
  TiG specializes the next-state signal pattern to language-policy refinement in game environments
- How does treating LLMs as multi-step agents change what we can optimize?
  Instead of optimizing single prompt-response pairs, what happens when we model LLM agents as temporally-extended decision processes? The question matters because it shifts what becomes trainable.
  TiG is a concrete agentic-RL implementation in the new POMDP framing
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Think in Games: Learning to Reason in Games via Reinforcement Learning with Large Language Models
- Cognitive Architectures for Language Agents
- ReAct: Synergizing Reasoning and Acting in Language Models
- Dynamic Planning with a LLM
- AI Agents vs. Agentic AI: A Conceptual Taxonomy, Applications and Challenges
- Reinforced Language Models for Sequential Decision Making
- Integrating Large Language Models and Reinforcement Learning for Non-Linear Reasoning
- On the Limits of Innate Planning in Large Language Models
Original note title
RL bridges the declarative-procedural knowledge gap by reformulating decision-making as language modeling with environmental feedback