INQUIRING LINE

How does the Learning Law explain why all examples should contribute equally?

This explores the idea—rooted in the 'learning mechanics' view of deep learning—that training treats examples as interchangeable contributors to an aggregate signal, and tests that premise against corpus findings showing examples are emphatically *not* equal.


This reads the question as being about the 'learning mechanics' frame, where training is modeled like statistical physics: you predict average-case behavior from aggregate statistics rather than tracking any individual data point. In that view, the natural assumption is that each example is one more sample drawn from a distribution, and the law that governs learning describes how those samples average out across training dynamics, not what any single one does (Can deep learning theory unify around training dynamics?). That's the 'equal contribution' intuition: no example is special, and the macroscopic curve is what matters.
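
The 'equal contribution' intuition can be made concrete as uniform averaging of per-example gradients. The sketch below is a minimal illustration, assuming a one-parameter least-squares model; the function names and the toy data are hypothetical and do not come from any of the cited notes.

```python
from typing import List, Sequence

def per_example_grads(w: float, xs: Sequence[float], ys: Sequence[float]) -> List[float]:
    """Gradient of 0.5 * (w*x - y)**2 with respect to w, one value per example."""
    return [(w * x - y) * x for x, y in zip(xs, ys)]

def aggregate(grads: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average of per-example gradients (weights are normalized here)."""
    total = sum(weights)
    return sum(g * wt for g, wt in zip(grads, weights)) / total

# Hypothetical toy data: three examples, model starts at w = 0.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 9.0]
grads = per_example_grads(0.0, xs, ys)

# Equal contribution: every example gets the same weight.
uniform = aggregate(grads, [1.0, 1.0, 1.0])

# Unequal contribution: the third example is up-weighted fourfold.
reweighted = aggregate(grads, [1.0, 1.0, 4.0])
```

Under uniform weights the update depends only on the aggregate, which is the average-case picture. Changing one weight changes the update direction's magnitude, which is the degree of freedom the rest of this note argues matters.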

The interesting thing is that the rest of the corpus argues almost the opposite, and that tension is the real answer here. A single training example in RLVR can lift math accuracy from 36% to 73.6% and keep improving test performance for 1,400 steps after training accuracy has already hit 100% (Can a single training example unlock mathematical reasoning?). Critique fine-tuning gets RLVR-level reasoning activation from exactly one problem (Can a single problem unlock reasoning through solution critique?). If examples contributed equally, one couldn't carry that much weight. The reconciliation is that these examples aren't *teaching* in the average-case sense; they're *activating* latent capability already in the base model. The aggregate-statistics law governs what gets learned; a single well-chosen signal can flip a switch.

Once you accept that examples differ in value, the question becomes which ones to weight. Optimal experimental design beats similarity-based retrieval for few-shot selection precisely because it picks examples that maximally reduce uncertainty rather than treating them uniformly (Can optimal experimental design improve few-shot example selection?). Deliberately inducing the model to *err* on certain few-shot examples, then having it articulate the principle behind the mistake, beats showing clean examples, so an example's contribution depends on what error it surfaces, not just its presence (Does learning from mistakes improve in-context learning?).
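
To see why uncertainty reduction and similarity can pick different examples, here is a minimal sketch of greedy variance-reduction selection. It assumes a linear-Gaussian surrogate with a unit prior and unit label noise, which is a stand-in chosen for illustration; it is not the GO or SAL algorithm from the cited note, and all names and vectors are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]

def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))

def matvec(m: Matrix, v: Vector) -> Vector:
    return [dot(row, v) for row in m]

def posterior_update(cov: Matrix, x: Vector, noise: float = 1.0) -> Matrix:
    """Rank-one (Sherman-Morrison) update of a symmetric Gaussian posterior
    covariance after observing a label for feature vector x."""
    cx = matvec(cov, x)
    denom = noise + dot(x, cx)
    n = len(x)
    return [[cov[i][j] - cx[i] * cx[j] / denom for j in range(n)] for i in range(n)]

def predictive_variance(cov: Matrix, targets: List[Vector]) -> float:
    """Sum of predictive variances t^T cov t over the target (test) points."""
    return sum(dot(t, matvec(cov, t)) for t in targets)

def greedy_select(cands: List[Vector], targets: List[Vector], budget: int) -> List[int]:
    """Greedily pick candidate indices that most reduce target-set variance."""
    dim = len(cands[0])
    cov = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    chosen: List[int] = []
    for _ in range(budget):
        remaining = [i for i in range(len(cands)) if i not in chosen]
        best = min(
            remaining,
            key=lambda i: predictive_variance(posterior_update(cov, cands[i]), targets),
        )
        chosen.append(best)
        cov = posterior_update(cov, cands[best])
    return chosen

# Hypothetical embeddings: candidate 1 is a near-duplicate of candidate 0.
candidates = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
targets = [[1.0, 0.0], [0.0, 1.0]]
picked = greedy_select(candidates, targets, budget=2)
```

In this toy setup the greedy rule skips the near-duplicate at index 1 and picks the two complementary candidates, whereas ranking by similarity to the first target alone would favor indices 0 and 1.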

And contribution isn't even a property of the example alone: it depends on the learner. Teacher-refined data that is objectively higher quality *degrades* a student when it exceeds the student's learning frontier; students do best when they filter refinements against their own statistical profile (Does teacher-refined data always improve student model performance?). So the same example can contribute positively to one model and negatively to another. The honest synthesis: the learning-mechanics law explains why training looks like aggregate dynamics, but it does not license treating examples as equal. Corpus evidence repeatedly shows that selection, ordering, error content, and learner compatibility dominate. The 'equal contribution' premise is a modeling convenience, not a finding.
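
Learner-aware filtering can be sketched as a simple gate on the student's own score. This is a rough illustration under stated assumptions: the acceptance rule (a maximum allowed score drop), the `student_score` callable, and the toy scorer are all hypothetical, and the cited note does not specify this criterion.

```python
from typing import Callable, List, Tuple

def filter_refinements(
    pairs: List[Tuple[str, str]],
    student_score: Callable[[str], float],
    max_drop: float,
) -> List[str]:
    """Keep a teacher refinement only when the student's own score for it is
    no more than `max_drop` below its score for the original response."""
    kept: List[str] = []
    for original, refined in pairs:
        if student_score(original) - student_score(refined) <= max_drop:
            kept.append(refined)
        else:
            # Assumed rule: treat the refinement as beyond the student's
            # frontier and fall back to the original response.
            kept.append(original)
    return kept

def toy_score(text: str) -> float:
    """Toy stand-in for a student log-likelihood: more words, lower score."""
    return -float(len(text.split()))

pairs = [
    ("x = 4", "so x = 4"),
    ("x = 4", "by the quadratic formula and discarding the negative root x = 4"),
]
selected = filter_refinements(pairs, toy_score, max_drop=2.0)
```

Here the short refinement passes the gate and the long one is rejected in favor of the original. In practice the scorer would be something like the student's mean token log-probability, which is again an assumption rather than a detail from the source.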

If you want the deepest cut, look at how outcome-based RL sharpens the policy unevenly across solved and unsolved problems: the reward signal comes from the solved problems, yet the resulting sharpening also reduces diversity on the unsolved ones (Does outcome-based RL diversity loss spread across unsolved problems?). That's a concrete mechanism for *why* equal weighting fails even when the aggregate law holds.
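
A toy calculation shows how sharpening on a solved problem can reduce diversity on an unsolved one. The sketch assumes a softmax policy over three answer strategies whose logits are fully shared between the two problems, a deliberately extreme assumption since real policies share parameters only partially. The rewards, step size, and names are hypothetical.

```python
import math
from typing import List

def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(probs: List[float]) -> float:
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def expected_pg(logits: List[float], rewards: List[float]) -> List[float]:
    """Exact expected policy gradient of E[reward] w.r.t. softmax logits:
    d/d(logit_k) = pi_k * (r_k - sum_a pi_a * r_a)."""
    probs = softmax(logits)
    baseline = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - baseline) for p, r in zip(probs, rewards)]

logits = [0.0, 0.0, 0.0]            # shared by both problems (assumption)
rewards_solved = [1.0, 0.0, 0.0]    # strategy 0 reaches the right answer
rewards_unsolved = [0.0, 0.0, 0.0]  # no strategy is ever rewarded

entropy_before = entropy(softmax(logits))
for _ in range(50):
    g_solved = expected_pg(logits, rewards_solved)
    g_unsolved = expected_pg(logits, rewards_unsolved)  # contributes nothing
    logits = [z + a + b for z, a, b in zip(logits, g_solved, g_unsolved)]
entropy_after = entropy(softmax(logits))
```

The unsolved problem supplies a zero gradient at every step, so the whole update comes from the solved problem. Because the logits are shared, the policy's entropy on the unsolved problem falls anyway, which is the unequal-signal effect described above in its simplest form.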


Sources (7 notes)

Can deep learning theory unify around training dynamics?

Research shows learning mechanics is consolidating as a unified frame for deep learning, modeled on classical and statistical mechanics. It prioritizes average-case predictions, training dynamics, and aggregate statistics over worst-case bounds, mirroring how physics addresses macroscopic systems.

Can a single training example unlock mathematical reasoning?

A single example in RLVR boosts math performance from 36% to 73.6% and enables test accuracy to improve for 1,400 steps after training accuracy reaches 100%, revealing that minimal activation signals unlock latent reasoning capability.

Can a single problem unlock reasoning through solution critique?

Critique Fine-Tuning achieves reasoning activation comparable to RLVR using only one problem and teacher-generated critiques of varied solutions, with no reinforcement learning. This demonstrates that exposure to correct versus incorrect reasoning on a specific problem is the sufficient activation signal.

Can optimal experimental design improve few-shot example selection?

AIPD frames demonstration selection as budgeted active learning, choosing examples that maximally reduce test-set uncertainty. Two algorithms (GO and SAL) outperformed similarity-based methods across small, medium, and large language models.

Does learning from mistakes improve in-context learning?

LEAP demonstrates that models achieve better performance on reasoning and math tasks by intentionally erring on few-shot examples, reflecting on mistakes, and deriving explicit task-specific principles—without additional labeled data or fine-tuning.

Does teacher-refined data always improve student model performance?

Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.

Does outcome-based RL diversity loss spread across unsolved problems?

RL that rewards only final answer correctness sharpens the policy globally, concentrating probability mass on correct trajectories for solved problems while simultaneously reducing diversity on unsolved ones. Historical exploration (training diversity via UCB-style bonuses) and batch exploration (test-time diversity via repetition penalties) require structurally different mechanisms.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst evaluating whether the premise 'all training examples should contribute equally' still holds in contemporary LLM research. The question: does equal weighting of examples remain a viable assumption, or has capability progress dissolved it?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026, with the key tension emerging across this window:

• A single well-chosen example can lift math reasoning from 36% to 73.6% and sustain improvement 1,400 steps past training saturation (2025-04, RLVR path).
• Critique fine-tuning on ONE problem unlocks reasoning activation comparable to RLVR, with no reinforcement learning (2025-06).
• Optimal experimental design beats uniform similarity-based retrieval for few-shot selection, showing examples differ in uncertainty-reduction value (2024-04).
• Error-inducing few-shot examples outperform clean ones; contribution depends on what mistake an example surfaces, not on its presence alone (2024-02).
• Teacher-refined high-quality data *degrades* student models when it exceeds their learning frontier; same example contributes positively to one learner, negatively to another (2025-04 implied; student-model-selection path).
• Outcome-based RL distributes gradient signals unequally across solved/unsolved problems and transfers diversity loss between them (2025-09).

Anchor papers (verify; mind their dates):
• arXiv:2504.20571 (2025-04) — RLVR, single-example reasoning activation.
• arXiv:2506.03295 (2025-06) — Critique fine-tuning on one problem.
• arXiv:2404.08846 (2024-04) — Optimal experimental design for in-context prompt selection.
• arXiv:2509.06941 (2025-09) — Outcome-based RL and diversity-loss transfer.

Your task:

(1) RE-TEST EACH CONSTRAINT. For every bullet above, determine whether newer optimization (e.g., training-time critique, hierarchical RL, curriculum learning via learned selection), model scaling, or evaluation harnesses have since *relaxed* the need for example cherry-picking or *confirmed* that non-uniform weighting is now mandatory. Separate the durable question ("Do examples differ in learning value?") from the perishable claim ("Current methods require manual selection"). State plainly where equal weighting still appears to hold in any regime.

(2) Surface the strongest *contradicting* or *superseding* work from the last ~6 months that argues for or re-establishes uniform example contribution under new conditions (e.g., newer loss functions, sampling strategies, or theoretical results).

(3) Propose 2 research questions that *assume* the regime has moved: one on automated discovery of high-value examples; one on learner-aware weighting without manual profiling.

Cite arXiv IDs; flag anything you cannot ground in a real paper.
