How does information asymmetry between teacher and student create the learning signal?
This explores why a teacher needs to *know something the student doesn't* — like the correct answer — for teaching to produce any useful correction at all, and what that asymmetry costs.
The corpus is unusually clear on the mechanism, and the core claim is almost arithmetic: if teacher and student share identical uncertainty, the teacher has nothing to correct toward. The signal *is* the gap. A teacher with privileged access to correct answers or verifier output can produce a corrective gradient pointing from where the student is to where it should be. Strip that access away and pedagogical correction becomes impossible, because both parties are equally lost (see "Why does teacher-student information asymmetry enable learning signals?").
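The "almost arithmetic" claim can be illustrated with a minimal sketch. It assumes a distillation-style setup in which the student is trained by cross-entropy against the teacher's output distribution; the source only speaks of a "corrective gradient", so this loss, the three-way answer space, and the function names are illustrative assumptions rather than the method any cited note describes. Under that assumption the gradient with respect to the student's logits is simply the student's probabilities minus the teacher's, so it vanishes when the two agree.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_gradient(student_logits, teacher_probs):
    """Gradient of cross-entropy H(teacher, student) w.r.t. the student logits.

    For a softmax student this is p_student - p_teacher (an assumed,
    standard distillation objective, not taken from the source).
    """
    student_probs = softmax(student_logits)
    return [s - t for s, t in zip(student_probs, teacher_probs)]


# A student that is maximally unsure among three candidate answers.
student_logits = [0.0, 0.0, 0.0]

# Teacher with no privileged information: same uncertainty as the student.
uninformed_teacher = softmax(student_logits)

# Teacher with privileged access: it knows the answer is index 1.
informed_teacher = [0.0, 1.0, 0.0]

grad_none = distillation_gradient(student_logits, uninformed_teacher)
grad_signal = distillation_gradient(student_logits, informed_teacher)

print("no asymmetry:", grad_none)      # every component is zero
print("with asymmetry:", grad_signal)  # negative at index 1, positive elsewhere
```

A gradient-descent step subtracts this gradient, so the informed teacher raises the student's logit for the correct answer and lowers the others, while the uninformed teacher leaves the student exactly where it was.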
But the corpus immediately complicates the happy version of this. The very asymmetry that makes correction possible also shapes *how* the student learns to talk. Teachers conditioned on correct answers and verifier output produce confident, concise traces, and students inherit that confidence wholesale, including in situations where they should be hedging. The result is a quiet trade: sharper in-domain performance bought with worse generalization to out-of-distribution problems that actually call for epistemic caution (see "Does richer teacher context hurt student generalization?"). So the teacher's extra information leaks into the student as borrowed certainty, not just borrowed answers.
The interesting twist is that more privileged information doesn't monotonically help. Teacher-refined data can *degrade* a student when the refinement sits beyond the student's learning frontier: objectively higher quality, but incompatible with where the student currently is. The fix is for the student to filter refinements against its own statistical profile, keeping only what it can absorb (see "Does teacher-refined data always improve student model performance?"). And in some regimes the student overtakes the teacher entirely. Walmart's small cross-encoders, trained on sufficiently large augmented datasets of teacher-labeled queries, beat the LLMs that labeled their data, because exposure to a far broader input distribution, smoothed by the teacher's predictions, let them generalize past what the teacher itself could do (see "Can smaller models outperform their LLM teachers with enough data?"). The teacher's role there is less oracle than scaffold.
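One way to picture the student-side filter is the sketch below. The source says only that students should filter refinements using "their own statistical profile"; the concrete choice here (comparing the student's mean per-token log-probability on the original and refined text, with a tolerance called `max_drop`) is a hypothetical stand-in, as are the `Candidate` type and the precomputed scores. It is not presented as the cited work's actual criterion.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One training example with an original and a teacher-refined version.

    The log-prob fields are assumed to be mean per-token log-probabilities
    computed under the current student (hypothetical, precomputed inputs).
    """
    original: str
    refined: str
    student_logprob_original: float
    student_logprob_refined: float


def select_target(candidate: Candidate, max_drop: float = 0.5) -> str:
    """Keep the refinement only if the student finds it roughly as likely
    as the original; otherwise fall back to the original text.

    max_drop is an assumed tolerance, in nats per token.
    """
    drop = candidate.student_logprob_original - candidate.student_logprob_refined
    return candidate.refined if drop <= max_drop else candidate.original


within_reach = Candidate("short answer", "cleaner short answer", -1.2, -1.4)
beyond_frontier = Candidate("short answer", "dense expert rewrite", -1.2, -3.9)

print(select_target(within_reach))     # cleaner short answer
print(select_target(beyond_frontier))  # short answer
```

The design choice worth noting is that the gate is defined by the student's own scores, not by any external quality judgment: a refinement can be better in absolute terms and still be rejected.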
What you didn't know you wanted to know: this same asymmetry logic shows up where there's no teacher at all. An agent's own shifting beliefs can serve as the privileged signal. The log-ratio of its probability estimate for the solution from one turn to the next becomes a dense intrinsic reward, with no critic or process reward model required (see "Can an agent's own beliefs guide credit assignment without critics?"). And feedback itself carries two separable kinds of information, evaluative (how good was that?) and directive (which way to move?). A scalar reward captures the first but discards the second, which a richer signal such as token-level distillation can recover (see "Can scalar rewards capture all the information in agent feedback?"). The throughline: a learning signal is always an *information differential*, whether between teacher and student, between an agent's beliefs across time, or between how well an action performed and how it should change. Where that gap vanishes, so does the teaching. Tellingly, LLMs perform well when one model controls all interlocutors but fail systematically when simulated agents hold *private* information, because omniscient setups let them skip the grounding work that real asymmetry demands (see "Why do LLMs fail when simulating agents with private information?").
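The belief-based reward mentioned above can be sketched in a few lines. This assumes the agent reports, after each turn, a probability that it assigns to the correct solution, and that the per-turn reward is the log of the ratio between consecutive estimates. The function name and the example numbers are hypothetical, and the sketch omits whatever else the actual ΔBelief-RL training loop involves.

```python
import math


def belief_log_ratio_rewards(beliefs):
    """Per-turn rewards from a sequence of belief estimates.

    beliefs[t] is the agent's probability for the correct solution after
    turn t (beliefs[0] is the prior). Turn t earns log(beliefs[t] / beliefs[t-1]).
    """
    if any(not (0.0 < p <= 1.0) for p in beliefs):
        raise ValueError("beliefs must be probabilities in (0, 1]")
    return [math.log(later / earlier) for earlier, later in zip(beliefs, beliefs[1:])]


# Hypothetical trajectory: prior, then the belief after each of three questions.
beliefs = [0.05, 0.10, 0.08, 0.40]
rewards = belief_log_ratio_rewards(beliefs)

for turn, reward in enumerate(rewards, start=1):
    print(f"turn {turn}: reward {reward:+.3f}")

# The rewards telescope: their sum equals log(final belief / initial belief).
print(f"total: {sum(rewards):+.3f}")
```

Because the terms telescope, the dense per-turn rewards add up to the overall change in belief, so a turn is credited exactly for the share of that change it produced. In this example the second question lowered the belief and therefore receives a negative reward.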
Sources 7 notes
Social meta-learning requires information asymmetry—the teacher's access to correct answers or verifier output—to generate meaningful corrective signals. Without this asymmetry, teacher and student share identical uncertainty, making pedagogical correction impossible.
Teachers conditioned on correct answers and verifier output produce confident, concise traces that students inherit. This style suppresses uncertainty expression, optimizing in-domain performance while degrading generalization to out-of-distribution problems that require epistemic caution.
Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.
Walmart's student cross-encoders outperformed their LLM teachers when trained on sufficiently large augmented datasets of teacher-labeled queries. The student's broader input distribution exposure, smoothed by teacher predictions, enabled better generalization than the teacher achieved.
ΔBelief-RL uses log-ratios of sequential probability estimates to assign per-turn credit without critic networks or process reward models. Tested on 20 Questions, smaller models trained this way matched or exceeded prior SOTA and larger baselines while generalizing beyond training.
Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.
Research shows LLMs perform well when one model controls all interlocutors but fail systematically when agents possess private information. This reveals that apparent social competence relies on grounding work that models skip in omniscient settings.