What training regimes confound surface mechanisms with their actual causes?

This explores training setups where the thing that visibly improves — a benchmark score, a fluent reasoning chain, a confident answer — is mistaken for the underlying skill it's assumed to reflect, when the real driver is something shallower (format, memorization, or a shortcut).

This reads the question as: which training methods make us mistake a surface signal (a higher score, a longer reasoning trace, a confident answer) for the deeper capability we think we're cultivating? The corpus has a recurring answer: the most common confound is that training teaches the *shape* of good output rather than the reasoning that's supposed to produce it. The cleanest demonstration is instruction tuning. Models trained on semantically empty or even deliberately *wrong* instructions perform comparably to models trained on correct ones (reported as 43% against a random baseline of 42.6%), so what transferred wasn't task understanding at all, just familiarity with the output space (Does instruction tuning teach task understanding or output format?). The instruction looks like the cause of competence; the format is.

The same illusion shows up in reinforcement learning, in two flavors. First, RL doesn't build new reasoning so much as amplify one format already latent in pretraining, and which format wins depends on model scale, not necessarily on which one performs best. The effect is largely hidden when you start from an opaque proprietary checkpoint (Does RL training collapse format diversity in pretrained models?). Second, training on near-impossible RLVR problems produces apparent gains that are actually degenerate shortcuts such as answer repetition and skipped computation, because group-relative normalization treats a rare lucky success as a high-advantage trajectory and reinforces it, contaminating capabilities the model already had (Do overly hard RLVR samples actually harm model capabilities?). Most pointed of all: behavioral activation of genuine reasoning and benchmark improvement are *separable phenomena*. The score can climb from memorizing contaminated test data while the reasoning machinery does something entirely different, and both can be true at once (Can genuine reasoning activation coexist with contaminated benchmarks?).
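To see why a rare lucky success gets reinforced so strongly, here is a minimal sketch of group-relative normalization. It assumes the GRPO-style form that standardizes each reward against its own group's mean and standard deviation; the function name, group sizes, and reward values are illustrative assumptions, not details taken from the cited note.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group: (r - mean) / (std + eps).

    Assumed GRPO-style normalization; eps only guards against a zero std
    when every sample in the group gets the same reward.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    variance = sum((r - mean) ** 2 for r in rewards) / n  # population variance
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]

# Moderately hard prompt: 4 of 8 sampled answers are correct.
balanced = group_relative_advantages([1, 1, 1, 1, 0, 0, 0, 0])

# Near-impossible prompt: 1 of 64 sampled answers happens to be correct.
lucky = group_relative_advantages([1] + [0] * 63)

print(round(balanced[0], 2))  # 1.0
print(round(lucky[0], 2))     # 7.94
```

Under these assumptions the lone success in the 64-sample group carries roughly eight times the advantage of a correct sample in the balanced group, regardless of whether that success came from sound reasoning or from a repeated or guessed answer.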

Fine-tuning produces a subtler version of the same confound, one that targets the reasoning trace itself. After fine-tuning, chain-of-thought becomes *performative rather than functional*: you can truncate it early, paraphrase it, or swap in filler, and the final answer often doesn't change, meaning the visible reasoning steps are decreasingly the actual cause of the answer (Does fine-tuning disconnect reasoning steps from final answers?). A reader trusting the explanation as the mechanism would be reading a story the model tells after the fact. Binary-reward training adds another confound: rewarding only correctness teaches confident guessing, because the reward never penalizes a wrong answer for being confident. Calibration (a real signal of what the model knows) decays even as accuracy looks fine, and the remedy the corpus reports is adding a proper scoring term such as the Brier score (Does binary reward training hurt model calibration?). And asymmetric, utility-weighted loss can strengthen decision-making while quietly *weakening* the representation learning underneath, so a model that chooses better has actually learned less (Can utility-weighted training loss actually harm model performance?).
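The truncation and filler tests described above can be sketched as an invariance measurement: perturb the trace, re-ask for the answer, and count how often nothing changes. This is a minimal sketch, assuming a hypothetical `answer_fn(question, steps)` interface; the two toy "models" and the examples are invented for illustration, and the paraphrase test is omitted because it needs a real paraphraser.

```python
def truncate(steps, keep_fraction=0.5):
    """Early termination: keep only the first part of the trace."""
    keep = max(1, int(len(steps) * keep_fraction))
    return steps[:keep]

def fill(steps, token="..."):
    """Filler substitution: replace every step with a contentless token."""
    return [token] * len(steps)

def invariance_rate(answer_fn, examples, perturb):
    """Fraction of examples whose answer is unchanged after perturbing the trace."""
    unchanged = sum(
        answer_fn(question, perturb(steps)) == answer_fn(question, steps)
        for question, steps in examples
    )
    return unchanged / len(examples)

# Hypothetical stand-in whose answer is read off the last reasoning step.
def functional_model(question, steps):
    return steps[-1]

# Hypothetical stand-in that ignores the trace and recalls a stored answer.
MEMORIZED = {"2+3": "5", "4*2": "8"}

def performative_model(question, steps):
    return MEMORIZED[question]

examples = [
    ("2+3", ["add 2 and 3", "2+3=5", "5"]),
    ("4*2", ["multiply 4 by 2", "4*2=8", "8"]),
]

for name, model in [("functional", functional_model),
                    ("performative", performative_model)]:
    print(name,
          invariance_rate(model, examples, truncate),
          invariance_rate(model, examples, fill))
```

A high invariance rate is the warning sign: the answers survive damage to the trace, so the trace is not what produces them. Both toy models give the same correct answers on unperturbed input, which is why accuracy alone cannot tell them apart.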

What ties these together is a methodological point the corpus makes explicit: representational evidence alone shows correlation, never causation. Only pairing it with causal intervention (locate the candidate feature, then verify it actually drives the behavior) produces a real mechanistic claim (mechanistic-understanding-of-llms-requires-both-representational-and-causal-ca). That's the through-line. Every regime above fails the same test in a different costume: it optimizes a proxy that *correlates* with the capability (output format, benchmark number, trace length, confidence) and we read the proxy as the cause. The unexpected takeaway is that the danger scales with opacity. The format-collapse effect is "largely hidden when starting from proprietary pretrained models," and benchmark gains hide contamination by construction, so the training regimes most likely to fool you are precisely the ones whose internals you can't inspect.
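The gap between representational and causal evidence can be shown on a toy example. This is a minimal sketch under invented assumptions: `toy_model` is a hypothetical two-unit "model" whose output reads only unit 0, while unit 1 merely tracks the same signal. The first measurement plays the role of a probe, the second the role of an ablation.

```python
import random

random.seed(0)

def toy_model(hidden):
    """Hypothetical model: the output depends on hidden[0] only."""
    return 1 if hidden[0] > 0 else 0

def make_hidden():
    signal = random.gauss(0, 1)
    # hidden[1] follows the same signal with small noise but is never read.
    return [signal, signal + random.gauss(0, 0.1)]

def agreement_with_sign(states, unit):
    """Representational evidence: how often a unit's sign predicts the output."""
    hits = sum((h[unit] > 0) == (toy_model(h) == 1) for h in states)
    return hits / len(states)

def flip_rate_when_ablated(states, unit):
    """Causal evidence: how often zeroing a unit changes the output."""
    flips = 0
    for h in states:
        ablated = list(h)
        ablated[unit] = 0.0
        flips += toy_model(ablated) != toy_model(h)
    return flips / len(states)

states = [make_hidden() for _ in range(2000)]
for unit in (0, 1):
    print(unit,
          agreement_with_sign(states, unit),
          flip_rate_when_ablated(states, unit))
```

In this setup unit 1 predicts the output on the large majority of samples, yet ablating it never changes the answer, while ablating unit 0 does. A probe alone would have flagged both units as "the feature"; only the intervention separates the cause from the correlate.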

Sources 8 notes

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions, with a reported 43% against a random baseline of 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Can genuine reasoning activation coexist with contaminated benchmarks?

RLVR activates genuine reasoning patterns through RL training while benchmark improvements may reflect data memorization on contaminated datasets. These operate at different measurement levels and can coexist without contradiction.

Does fine-tuning disconnect reasoning steps from final answers?

Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
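A small worked example shows why the binary reward leaves confidence unconstrained and why a Brier term pins it down. This is a sketch under stated assumptions: the reward is taken to be correctness minus a weighted squared error between stated confidence and correctness, and `p_correct`, `stated_conf`, and `brier_weight` are illustrative names, not the note's notation.

```python
def expected_reward(p_correct, stated_conf, brier_weight):
    """Expected reward when the answer is right with probability p_correct
    and the model reports confidence stated_conf.

    Assumed reward: correct - brier_weight * (stated_conf - correct) ** 2,
    where correct is 1 for a right answer and 0 for a wrong one.
    """
    expected_brier = (p_correct * (stated_conf - 1.0) ** 2
                      + (1.0 - p_correct) * stated_conf ** 2)
    return p_correct - brier_weight * expected_brier

GRID = [i / 100 for i in range(101)]  # candidate confidence values

def best_confidence(p_correct, brier_weight):
    """Confidence on the grid that maximizes expected reward."""
    return max(GRID, key=lambda q: expected_reward(p_correct, q, brier_weight))

def reward_spread(p_correct, brier_weight):
    """How much expected reward varies across all stated confidences."""
    values = [expected_reward(p_correct, q, brier_weight) for q in GRID]
    return max(values) - min(values)

p = 0.6
print(reward_spread(p, 0.0))    # 0.0: binary reward ignores stated confidence
print(best_confidence(p, 1.0))  # 0.6: with the Brier term, honest confidence wins
```

With the weight at zero, claiming 99% confidence earns exactly as much as claiming 60%, so nothing discourages overconfidence. With the Brier term the expected reward works out to p² − (q − p)², which peaks where stated confidence q equals the true success rate p.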

Can utility-weighted training loss actually harm model performance?

Asymmetric loss functions correctly incentivize choosing but degrade representation learning by reducing gradient signals for substantive feature acquisition. Training with symmetric loss then adjusting predictions post-hoc outperforms direct utility-weighted training on the same utility objective.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a mechanistic interpreter testing whether training-regime confounds have persisted, shifted, or dissolved. The core question remains: which training methods make us mistake a surface signal (score, trace, format) for the actual capability it's supposed to build?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat as perishable unless re-validated:

• Instruction tuning transfers output-format distribution, not task understanding; models trained on wrong instructions perform comparably to those trained on correct ones (reported 43% vs a random baseline of 42.6%, ~2023).
• RL post-training amplifies a single pretraining format determined by model scale, not performance rank; effect invisible in proprietary checkpoints (~2025).
• Overly-hard RLVR samples produce degenerate shortcuts (answer repetition, skipped steps); group-relative normalization treats rare success as high-value trajectory (~2025).
• Behavioral activation (reasoning machinery) and benchmark improvement are separable; score climbs from test-data memorization while reasoning does something else entirely (~2025).
• Fine-tuning makes chain-of-thought performative: truncate, paraphrase, or swap reasoning steps without changing final answer; faithfulness decays independent of accuracy (~2024).

Anchor papers (verify; mind their dates):
- arXiv:2305.11383 (2023): Instruction tuning and format transfer
- arXiv:2504.07912 (2025): RL post-training format amplification
- arXiv:2411.15382 (2024): Fine-tuning and CoT faithfulness
- arXiv:2605.28388 (2026): Sample difficulty in RLVR mechanisms

Your task:
(1) RE-TEST EACH CONSTRAINT. For format-transfer, post-training format-locking, CoT degradation, and RLVR shortcuts: has newer work (last 6 months, post-2026-05) shown these relax via better initialization, multi-objective RL, mechanistic steering, or interpretability tooling? Where does each confound still appear to hold? Separate the durable question ("how do we validate causation in training?") from the perishable limit ("instruction tuning cannot teach semantics").

(2) Surface the strongest CONTRADICTING or SUPERSEDING work. Look for papers claiming format *can* encode reasoning, or that RLVR shortcuts don't generalize, or CoT remains faithful under certain conditions.

(3) Propose 2 research questions that assume the regime may have shifted: e.g., "Does causal steering + mechanistic probing recover genuine reasoning from fine-tuned models?" or "Can multi-objective RL with representational constraints prevent format-locking?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
