INQUIRING LINE

What makes a model fail to activate relevant skills from its own harness?

This explores why a model that already holds a relevant capability — a reasoning step, a stored fact, a usable skill — fails to fire it at the moment it's needed, rather than why it lacks the capability at all.


The gap here is between what a model *can* do and what it actually *does* in the moment: the failure is not a missing capability but one sitting unused inside the model's own repertoire. The corpus is surprisingly consistent that this is a real and distinct failure mode. The clearest statement is that many reasoning failures are inference bottlenecks, not knowledge gaps: models possess the relevant facts but won't activate them without explicit prompting. Adding subtle emphasis recovers 15.3 percentage points of accuracy, and forcing the model to enumerate preconditions recovers 6-9 points (see: Why do language models fail to use knowledge they possess?). The skill is in there; the trigger to reach for it isn't.
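The precondition-enumeration idea can be sketched as a small prompt wrapper. This is a minimal illustration only: the function name `with_precondition_check`, the wording of the instruction, and the sample question are all assumptions made here, not the prompt used in the cited work.

```python
def with_precondition_check(question: str) -> str:
    """Wrap a question so the model is asked to list preconditions first.

    Hypothetical helper: the instruction text is illustrative and is not
    taken from the cited note or paper.
    """
    return (
        "Before answering, list every precondition or implicit constraint "
        "that must hold for an answer to be valid.\n"
        "Then answer, checking the answer against each item you listed.\n\n"
        f"Question: {question}"
    )


# Illustrative question, invented for this sketch.
question = "What is the fastest way to get a broken bicycle to the repair shop?"
prompt = with_precondition_check(question)
print(prompt)
```

The wrapper changes nothing about what the model knows; it only adds a trigger for the model to surface constraints before it commits to an answer.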

The sharpest version of this is the knowing-doing gap. Models generate the correct rationale 87% of the time but follow their own reasoning only 64% of the time: they narrate the right move and then act greedily against it (see: Why do language models fail to act on their own reasoning?). So one answer to 'what makes activation fail' is that knowing and doing run on separate tracks. Producing a plan doesn't guarantee executing it, and frequency bias and greediness pull the model toward familiar-but-wrong actions. This connects to a deeper architectural point: only after post-training do models start treating their own outputs as actions that shape what comes next, closing the perception-action loop that lets a skill actually get *deployed* rather than just described (see: Do models recognize their own outputs as actions shaping future inputs?).
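One way to make the knowing-doing gap measurable is to log, per episode, whether the rationale was correct and whether the action followed it. The sketch below assumes both rates are computed over the same set of episodes, which the source does not confirm; the `Episode` record and the toy data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    rationale_correct: bool         # stated reasoning identified the right action
    action_matches_rationale: bool  # the action taken agreed with that reasoning


def knowing_doing_rates(episodes):
    """Return (knowing rate, doing rate, gap) over a shared set of episodes."""
    if not episodes:
        raise ValueError("need at least one episode")
    n = len(episodes)
    knowing = sum(e.rationale_correct for e in episodes) / n
    doing = sum(e.action_matches_rationale for e in episodes) / n
    return knowing, doing, knowing - doing


# Toy data invented for illustration, not results from the cited work.
toy = [
    Episode(True, True),
    Episode(True, False),  # right plan narrated, different action taken
    Episode(True, True),
    Episode(False, False),
]
knowing, doing, gap = knowing_doing_rates(toy)
print(knowing, doing, gap)
```

Under the same shared-denominator assumption, the reported 87% and 64% would correspond to a gap of 23 percentage points.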

A second cluster says the context itself can suppress activation. When a model's own prior errors fill its context window, performance degrades non-linearly: the model conditions on its mistakes and keeps reaching for the wrong thing, and scaling the model doesn't fix it (see: Do models fail worse when their own errors fill the context?). The skill is intact, but a polluted context biases which skill gets invoked. There's a related structural ceiling in interactive settings. Models are dramatically worse at *active* reasoning (asking the right question, probing) than at passive reasoning, and SFT, DPO, and Tree-of-Thought barely move it, which suggests some activation failures are baked into how the model engages and are not fixable by prompting or fine-tuning (see: Why do models fail at asking good questions during interaction?).
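To see what "asking the right question" means quantitatively, consider a number-guessing game like the one summarized in the sources. The sketch below scores a yes/no question by its expected information gain in bits, assuming a uniform prior over the remaining candidates. That metric is an assumption chosen for illustration and may not match the information-gain measure reported in the cited work.

```python
import math


def expected_information_gain(candidates, predicate):
    """Expected bits gained from a yes/no question under a uniform prior.

    For a deterministic yes/no answer this equals the entropy of the answer.
    """
    n = len(candidates)
    if n == 0:
        raise ValueError("no candidates left")
    yes = sum(1 for c in candidates if predicate(c))
    gain = 0.0
    for count in (yes, n - yes):
        if count:
            p = count / n
            gain -= p * math.log2(p)
    return gain


numbers = list(range(1, 101))
halving = expected_information_gain(numbers, lambda x: x <= 50)
single_guess = expected_information_gain(numbers, lambda x: x == 7)
print(halving)                 # 1.0 bit: the question splits the set evenly
print(round(single_guess, 3))  # about 0.081 bits: one candidate tested at a time
```

A question that halves the candidate set is worth a full bit, while guessing a single value is worth far less. A model whose questions drift toward the second kind as rounds progress would show falling information gain even if it can state the halving strategy when asked.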

The most useful reframe in the corpus is that 'skills' often need to be made explicit and situated before a frozen model will use them. Extracting natural-language rules from context into reusable skills lifts frozen-model reasoning with no weight updates (see: Can frozen models learn better by extracting context into skills?). A recurring finding is that skills authored *offline* fail because they're divorced from the exact runtime situation. Authoring a skill inside the agent's own loop, grounded in immediate feedback and runtime validation, is what closes the gap between having a skill and invoking it correctly (see: Does creating skills inside the agent loop eliminate mismatches?). In other words, activation fails when the skill isn't anchored to the moment it's supposed to fire.
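A rough sketch of what "anchoring a skill to the moment it fires" could look like in code: each skill carries a natural-language rule plus a trigger, and it is registered only after a runtime validation step passes. Every name here (`Skill`, `SkillRegistry`, `add_if_validated`, `prompt_for`) is hypothetical and is not the API of MUSE-Autoskill or any other cited system.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Skill:
    name: str
    rule: str                       # natural-language rule injected into the prompt
    applies: Callable[[str], bool]  # trigger tied to the runtime task text


class SkillRegistry:
    def __init__(self) -> None:
        self._skills: List[Skill] = []

    def add_if_validated(self, skill: Skill, validate: Callable[[Skill], bool]) -> bool:
        """Register the skill only if the caller's runtime check passes."""
        if validate(skill):
            self._skills.append(skill)
            return True
        return False

    def prompt_for(self, task: str) -> str:
        """Prepend the rules whose trigger matches this task, if any."""
        rules = [s.rule for s in self._skills if s.applies(task)]
        if not rules:
            return task
        listed = "\n".join(f"- {r}" for r in rules)
        return f"Rules that apply here:\n{listed}\n\nTask: {task}"


registry = SkillRegistry()
csv_skill = Skill(
    name="csv-header-check",
    rule="Check that the header row matches the expected columns before parsing.",
    applies=lambda task: "csv" in task.lower(),
)
# Stand-in validation: a real loop would run the skill against live feedback.
added = registry.add_if_validated(csv_skill, validate=lambda s: bool(s.rule.strip()))
print(registry.prompt_for("Parse this CSV export"))
```

The design choice worth noting is that the rule is surfaced by the task at hand rather than left for the model to recall unprompted, so the trigger lives in the harness instead of in the model's own initiative.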

Worth knowing: the line between 'failed to activate an existing skill' and 'never had the skill' is itself blurry and task-dependent. For standard reasoning, reinforcement learning mostly just *activates latent* abilities already present in the base model, but for deep multi-step planning it generates genuinely new strategies the base model can't reach even with heavy sampling (see: Does reinforcement learning create new reasoning abilities or activate existing ones?). And training can actively *break* activation: overly hard RLVR samples teach degenerate shortcuts that contaminate skills the model already had (see: Do overly hard RLVR samples actually harm model capabilities?). So the same harness that holds a skill can be the thing that learns to route around it.
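The source note on overly hard RLVR samples attributes the damage to group-relative normalization, which treats rare accidental successes as high-advantage trajectories. The arithmetic can be sketched as follows, assuming the common form of the normalization (reward minus group mean, divided by group standard deviation) with binary rewards. The group size and the `eps` term are illustrative choices, not values from the cited work.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the sampled group
    return [(r - mean) / (std + eps) for r in rewards]


# Nearly impossible problem: one lucky success among eight samples.
hard = group_relative_advantages([1.0] + [0.0] * 7)
# Moderately hard problem: half of the samples succeed.
balanced = group_relative_advantages([1.0] * 4 + [0.0] * 4)

print(round(hard[0], 3))      # about 2.646
print(round(balanced[0], 3))  # about 1.0
```

With one success in eight, the lone successful trajectory gets an advantage of roughly 2.65, against roughly 1.0 when half the group succeeds. If that lone success came from a shortcut such as repeating an answer, the shortcut is what receives the largest update.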


Sources 9 notes

Why do language models fail to use knowledge they possess?

Models possess relevant knowledge but fail to activate it without explicit prompting. Adding subtle emphasis recovers 15.3 percentage points accuracy, and forcing enumeration of preconditions recovers 6-9 points, showing the bottleneck is in constraint inference, not storage.

Why do language models fail to act on their own reasoning?

LLMs generate correct reasoning 87% of the time but follow it only 64% of the time. Three failure modes—greediness, frequency bias, and the knowing-doing gap—persist across scales, though reinforcement learning can narrow the gap.

Do models recognize their own outputs as actions shaping future inputs?

Post-trained language models exhibit a measurable shift where they recognize their outputs become their own future inputs, closing an action-perception loop absent in pretraining. Evidence includes 3-4x lower output entropy on-policy and behavioral signatures of trajectory recognition.

Do models fail worse when their own errors fill the context?

Error accumulation in context causes non-linear performance degradation in long-horizon tasks. Model scaling does not fix this; only test-time compute through thinking models reduces the effect by preventing error-contaminated context from biasing reasoning.

Why do models fail at asking good questions during interaction?

GPT-4o achieves only 35% on interactive number guessing, with information gains collapsing from 7.7% to 2.5% as rounds progress. SFT, DPO, and Tree-of-Thought interventions provide minimal improvement, suggesting the deficit is structural rather than a prompting or fine-tuning problem.

Can frozen models learn better by extracting context into skills?

Extracting natural-language rules from context into reusable skills improves frozen model reasoning without weight updates. On CL-bench, this lifts GPT-4.1 from 11.1% to 16.5%, with skills transferable across model backbones.

Does creating skills inside the agent loop eliminate mismatches?

MUSE-Autoskill demonstrates that invoking skill creation from within the agent's reasoning loop grounds new skills in exact task context, immediate feedback, and runtime validation. In-loop skills reach 87.94% task accuracy and transfer to other agents with minimal loss, eliminating the situated context problem of offline authoring.

Does reinforcement learning create new reasoning abilities or activate existing ones?

For standard reasoning tasks, RL activates latent abilities already present in base models. For complex planning requiring multi-step coordination, RL generates genuinely novel strategies inaccessible to base models even with extensive sampling.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a capability analyst. The question: **What makes a model fail to activate relevant skills from its own harness?** — treat this as still-open, because the failure mode itself may have shifted as model scale, training regimes, and harness complexity evolved.

What a curated library found — and when (dated claims, not current truth):
Findings span 2022–2026; many are recent but already dated in fast-moving capability work.
- Models generate correct reasoning 87% of the time but follow it only 64% of the time, often acting greedily against their own rationale: a knowing-doing gap where narration and action decouple (~2026).
- Inference-time skill augmentation (extracting rules from context into reusable skills) lifts frozen-model reasoning without weight updates (~2025).
- Context pollution from prior errors non-linearly degrades performance; the skill remains but is suppressed by self-conditioning (~2024).
- Active reasoning (probing, question-asking) is dramatically harder than passive reasoning; SFT, DPO, Tree-of-Thought barely move it (~2025).
- Post-training shifts models from passive prediction to enaction — treating their own outputs as actions that shape what comes next — enabling skill deployment (~2026).

Anchor papers (verify; mind their dates):
- 2026-03 arXiv:2603.29025 "The Model Says Walk: How Surface Heuristics Override Implicit Constraints in LLM Reasoning"
- 2026-05 arXiv:2605.27366 "MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation"
- 2026-05 arXiv:2605.25459 "From Simulation to Enaction: Post-trained Language Models Recognize and React to their own"
- 2024-12 arXiv:2412.02674 "Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models"

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For every finding above, judge whether newer inference engines (batch decoding, speculative sampling), tooling (step-wise grounding frameworks, memory architectures like persistent KV caches), orchestration (multi-agent loops with shared skill registries), or evaluations (live task grounding vs. synthetic benchmarks) have since RELAXED or OVERTURNED it. Separate the durable question (likely: *why does knowing ≠ doing even in post-trained models?*) from the perishable limitation (e.g., *frozen models can't activate skills* — may be obsolete if dynamic routing or in-context skill binding now works). Cite what resolved it; say plainly where a constraint still holds.

(2) **Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months.** Has any recent paper shown that the knowing-doing gap closes under specific training or deployment regimes? Does active reasoning now work in certain harness designs?

(3) **Propose 2 research questions that ASSUME the regime may have moved:** e.g., *If post-training now closes the enaction loop, does the gap shift from narration-to-action to long-horizon multi-skill sequencing?* or *Do runtime skill-binding mechanisms (e.g., dynamic skill selection from a registry) eliminate context-pollution bottlenecks that static augmentation leaves open?*

Cite arXiv IDs; flag anything you cannot ground in a real paper.
