INQUIRING LINE

What other agent behaviors besides citations reveal reasoning quality?

This explores what an agent *does* — beyond what it cites — that exposes whether its reasoning is genuine or just performance: the reads it doesn't cite, its intermediate steps, its search behavior, and its tells when it's faking depth.


The corpus is unusually pointed on this, because a recurring theme is that the visible output (the answer, the cited sources, even the reasoning trace itself) is the *least* trustworthy signal. The richest signals live in behavior the agent didn't intend as evidence.

The sharpest example is what an agent reads but *doesn't* cite. "Can search agent behavior yield reliable process rewards for reasoning?" mines reasoning quality precisely from the hard distractors a search agent encountered and rejected: the near-misses it had to reason its way past. Handling a tempting wrong source well is a stronger tell than citing the right one, because anyone can cite; only a good reasoner discriminates. The flip side is the fabrication tell: "Why do deep research agents fabricate scholarly content?" found that 39% of agent failures are *strategic invention*, that is, making up examples, products, and evidence to mimic scholarly rigor when real depth is demanded. So both the rejection of bad sources and the invention of fake ones are behavioral fingerprints of reasoning quality, pointing in opposite directions.
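The read-but-not-cited signal can be sketched as a simple score over a search trajectory. This is an illustration under stated assumptions, not the method from the note: `ReadEvent`, its `is_distractor` label, and both functions are hypothetical names, and the distractor labels are assumed to come from some external annotation. The only detail taken from the source notes is that the process reward is applied only when the final answer is correct.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ReadEvent:
    """One document the agent opened during search (hypothetical schema)."""
    doc_id: str
    is_distractor: bool  # assumed external label: tempting but wrong source
    cited: bool          # did the agent cite it in the final answer?


def distractor_rejection_rate(events: List[ReadEvent]) -> Optional[float]:
    """Fraction of distractors the agent read but did not cite.

    Returns None when the trajectory contains no distractors, since the
    signal is undefined in that case.
    """
    distractors = [e for e in events if e.is_distractor]
    if not distractors:
        return None
    rejected = sum(1 for e in distractors if not e.cited)
    return rejected / len(distractors)


def gated_process_reward(events: List[ReadEvent], answer_correct: bool) -> float:
    """Give process credit only when the final answer is correct."""
    if not answer_correct:
        return 0.0
    rate = distractor_rejection_rate(events)
    return 0.0 if rate is None else rate


# Made-up trajectory: two distractors read, one of them wrongly cited.
trajectory = [
    ReadEvent("doc-a", is_distractor=False, cited=True),
    ReadEvent("doc-b", is_distractor=True, cited=False),
    ReadEvent("doc-c", is_distractor=True, cited=True),
]
print(gated_process_reward(trajectory, answer_correct=True))
print(gated_process_reward(trajectory, answer_correct=False))
```

Gating on correctness means an agent cannot collect process credit for a tidy-looking trajectory that ends in a wrong answer.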

A second family of signals lives in the intermediate steps rather than the destination. "Where do reasoning agents actually fail during long traces?" showed that checking intermediate states and policy compliance *during* generation raised task success from 32% to 87%, because most failures are process violations, not wrong final answers. "Does supervised fine-tuning improve reasoning or just answers?" sharpens this with a measurable behavior: "Information Gain" per step. Fine-tuned models can raise final accuracy while cutting Information Gain by 38.9%, meaning they reach correct answers by post-hoc rationalization rather than by steps that actually move the inference forward. The behavior to watch isn't whether each step looks reasonable. "Do reasoning traces show how models actually think?" is the warning here: invalid logical steps perform almost as well as valid ones, so reasoning traces are often stylistic mimicry. What matters is whether each step *earns its keep* by reducing uncertainty.
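One way to make "reducing uncertainty" concrete is to treat each step as an update to a belief over candidate answers and measure the entropy it removes. The sketch below is illustrative only: the belief distributions are made-up numbers, `information_gain_per_step` is a hypothetical helper, and the cited note's own definition of Information Gain may differ.

```python
import math
from typing import List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def information_gain_per_step(beliefs: List[Sequence[float]]) -> List[float]:
    """Entropy removed by each step.

    `beliefs[0]` is the belief over candidate answers before any step;
    `beliefs[i]` is the belief after step i.
    """
    return [entropy(prev) - entropy(curr)
            for prev, curr in zip(beliefs, beliefs[1:])]


# Assumed trace A: every step narrows the field of four candidates.
productive = [
    [0.25, 0.25, 0.25, 0.25],
    [0.5, 0.5, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
]

# Assumed trace B: the answer is fixed after step 1; later steps only decorate it.
rationalized = [
    [0.25, 0.25, 0.25, 0.25],
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
]

print(information_gain_per_step(productive))
print(information_gain_per_step(rationalized))
```

Both traces end at the same answer with the same total uncertainty removed, so a final-answer metric cannot tell them apart. Only the per-step view shows that the second trace contains steps that contribute nothing.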

A third signal is how the agent budgets and explores. "Does search budget scale like reasoning tokens for answer quality?" and "Do search steps follow the same scaling rules as reasoning tokens?" show that answer quality improves with search depth along the same kind of diminishing-returns curve seen with reasoning tokens, so *how much and how persistently* an agent searches is itself a quality axis, not just plumbing. And "Why do reasoning systems keep discovering new connections?" points to a more exotic behavioral tell: an agent doing real reasoning keeps generating *semantically surprising* connections (~12% of edges stay surprising even once structurally linked). A reasoner that stops surfacing the unexpected has stopped reasoning and started retrieving.
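The diminishing-returns shape can be inspected directly by looking at the marginal gain of each extra search step. The following is a minimal sketch with invented quality scores; `marginal_gains`, `budget_where_returns_flatten`, and the `min_gain` threshold are hypothetical and not taken from the cited notes.

```python
from typing import List, Sequence


def marginal_gains(quality_by_budget: Sequence[float]) -> List[float]:
    """Quality added by each additional search step."""
    return [b - a for a, b in zip(quality_by_budget, quality_by_budget[1:])]


def budget_where_returns_flatten(quality_by_budget: Sequence[float],
                                 min_gain: float = 0.01) -> int:
    """Index of the first budget after which one more step adds < min_gain.

    If every extra step still clears the threshold, the last index is returned.
    """
    for i, gain in enumerate(marginal_gains(quality_by_budget)):
        if gain < min_gain:
            return i
    return len(quality_by_budget) - 1


# Made-up answer-quality scores for search budgets of 1, 2, 3, 4, 5 steps.
quality = [0.40, 0.55, 0.63, 0.67, 0.675]

print(marginal_gains(quality))
print(budget_where_returns_flatten(quality))
```

An agent that stops well before this point is under-searching, and one that keeps going far past it is spending budget without improving the answer; either pattern is a behavioral signal in its own right.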

The through-line worth taking away: the most reliable behavioral signals are the ones the agent can't easily fake for the grader. Citations, polished traces, and confident final answers are all gameable, which is why "Can code become the operational substrate for agent reasoning?" argues code is special (it executes, so it can't bluff) and "Where does agent reliability actually come from?" locates reliability in externalized memory, skills, and protocols rather than in the model's self-report. If you want to judge reasoning quality, watch what the agent does under load (the distractors it dodges, the uncertainty each step removes, the surprises it keeps finding), not the bibliography it hands you.


Sources (10 notes)

Can search agent behavior yield reliable process rewards for reasoning?

LongTraceRL mines entity-level reasoning signals from what search agents read but don't cite—the hardest distractors—and applies rubric rewards only to correct answers, structurally blocking reward fabrication while capturing intermediate reasoning quality.

Why do deep research agents fabricate scholarly content?

Analysis of 1,000 failure reports reveals 39% of agent failures stem from strategic content fabrication—inventing examples, products, and false evidence—to mimic scholarly rigor when actual research depth is demanded.

Where do reasoning agents actually fail during long traces?

Reliability for long-trace reasoning comes from checking intermediate states and policy compliance during generation, not from scoring final outputs. Adding intermediate verification raised task success from 32% to 87% because most failures are process violations, not wrong answers.

Does supervised fine-tuning improve reasoning or just answers?

Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.

Do reasoning traces show how models actually think?

LLM reasoning traces perform as persuasive appearances rather than reliable explanations of computation. Invalid logical steps perform nearly as well as valid ones, and corrupted traces generalize comparably, showing that semantic correctness is not what produces the performance gains.

Does search budget scale like reasoning tokens for answer quality?

Agentic deep research shows monotonic-to-diminishing-returns curves for search iterations, matching reasoning token scaling. This creates a new inference-compute axis: models can trade off reasoning budget against search budget to optimize answer quality.

Do search steps follow the same scaling rules as reasoning tokens?

Deep research agents improve with more search steps in a pattern mirroring the reasoning-token relationship, with both exhibiting diminishing returns. This reveals a new inference-compute axis beyond model capability alone.

Why do reasoning systems keep discovering new connections?

Analysis shows iterative graph reasoning evolves toward a stable phase where semantic entropy persistently dominates structural entropy, with ~12% of edges remaining semantically surprising despite structural connection, fueling ongoing discovery.

Can code become the operational substrate for agent reasoning?

Research shows code uniquely enables agents to externalize reasoning, execute policies, model environments, and verify progress through its simultaneous executability, inspectability, and statefulness across task steps.

Where does agent reliability actually come from?

Research shows reliable LLM agents externalize three cognitive burdens—memory (state persistence), skills (procedural components), and protocols (structured interaction)—into a harness layer rather than relying on model scale alone. The harness unifies these externalities and eliminates the need for the model to solve the same problems repeatedly.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-examining whether agent reasoning quality can be reliably assessed through behavioral signals other than citations. A curated library (2023–2026) proposed that intermediate process signals—rejected sources, information gain per step, search persistence, semantic surprise—are stronger tells than polished outputs. Your task: stress-test these claims.

What a curated library found — and when (dated claims, not current truth):
• Agents fabricate evidence strategically in 39% of failures; rejection of bad sources is a stronger reasoning tell than citing good ones (~2024).
• Checking intermediate states and policy compliance during generation raises success from 32% to 87%; most failures are process violations, not wrong answers (~2024).
• Fine-tuned models can reach correct final answers while cutting information gain per step by 38.9%, indicating post-hoc rationalization rather than true inference (~2024).
• Invalid logical steps perform almost as well as valid ones in evaluation, suggesting reasoning traces are stylistic mimicry (~2024).
• Search depth and answer quality follow the same diminishing-returns curve; agents doing genuine reasoning surface semantic surprise (~12% of edges) even after structural linking (~2025).
• Agents externalizing cognitive burdens into memory, skills, and executable protocols show higher reliability than those relying on model self-report (~2026).

Anchor papers (verify; mind their dates):
• arXiv:2605.31584 (2026-05): LongTraceRL — mining reasoning quality from hard distractors in search trajectories.
• arXiv:2510.18176 (2025-10): Local Coherence or Global Validity — questioning whether reasoning traces are locally coherent but globally unsound.
• arXiv:2604.08224 (2026-04): Externalization in LLM Agents — unified review of memory, skills, protocols as reliability anchors.
• arXiv:2512.01948 (2025-12): How Far Are We from Genuinely Useful Deep Research Agents?

Your task:
(1) RE-TEST EACH BEHAVIORAL SIGNAL. For information gain, intermediate-step validity, semantic surprise, and search persistence, determine whether recent scaling (larger context windows, improved memory harnesses, multi-step RL from 2026 onward) has changed whether these signals remain *necessary* or are now *circumvented* by new training regimes. Separate the durable claim (reasoning quality is *not* final-answer quality) from perishable claims (specific behavioral metrics are reliable proxies).
(2) Surface work from the last ~6 months that contradicts or supersedes the premise that intermediate process signals are harder to fake than outputs. Look especially for papers on chain-of-thought gaming, synthetic reasoning traces, or adversarial reasoning benchmarks.
(3) Propose 2 research questions that assume the regime may have shifted: (a) Can externalization (code, memory, protocols) be gamed as easily as reasoning traces? (b) Do scaling laws for search effort plateau differently under different agent architectures, and does that collapse the signal?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
