Does teacher-refined data always improve student model performance?
Explores whether higher-quality training data from teacher models uniformly benefits student models, or if compatibility with the student's current learning state matters for effective instruction.
Standard instruction-tuning pipelines assume a simple chain: the teacher refines the training data, the student trains on the refined data, and the student improves. Selective Reflection-Tuning challenges this with a compatibility argument: data quality is relative to the student, not absolute. A response "improved" by a GPT-4 teacher may introduce knowledge complexity or reasoning patterns that conflict with the student's current knowledge state, producing a degraded training signal even though the response is higher quality in absolute terms.
The fix: after teacher refinement, have the student model evaluate each refined sample and decide whether to incorporate it. The student's own statistics on each sample, computed with its current weights, serve as the selection criterion: what it finds tractable and useful at its present state. Teacher-refined data the student can't process effectively is filtered out; compatible refinements are retained.
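A minimal sketch of student-side selection, under stated assumptions. The note does not specify the statistic, so this example assumes a perplexity-style ratio: how much the instruction helps the student predict the refined response. The names `StudentLogProbs`, `RefinedSample`, `compatibility_ratio`, `select_compatible`, and the `max_ratio` threshold are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

# Hypothetical interface: per-token log-probabilities of `response` under the
# student, conditioned on `instruction` (or on nothing when it is None).
StudentLogProbs = Callable[[str, Optional[str]], Sequence[float]]


@dataclass
class RefinedSample:
    instruction: str
    refined_response: str


def mean_nll(logprobs: Sequence[float]) -> float:
    """Average negative log-likelihood per token."""
    if not logprobs:
        raise ValueError("need at least one token log-probability")
    return -sum(logprobs) / len(logprobs)


def compatibility_ratio(student: StudentLogProbs, sample: RefinedSample) -> float:
    """Conditioned loss divided by unconditioned loss (assumed statistic).

    A ratio below 1 means the instruction makes the refined response easier
    for the student to predict; a ratio above 1 means it does not help.
    """
    conditioned = mean_nll(student(sample.refined_response, sample.instruction))
    unconditioned = mean_nll(student(sample.refined_response, None))
    return conditioned / unconditioned


def select_compatible(
    student: StudentLogProbs,
    samples: List[RefinedSample],
    max_ratio: float = 1.0,  # assumed cutoff, to be tuned per student
) -> List[RefinedSample]:
    """Keep only the refinements the student scores as tractable."""
    return [s for s in samples if compatibility_ratio(student, s) <= max_ratio]
```

In practice `student` would wrap a forward pass of the actual student checkpoint, so the same teacher-refined pool can yield different selections for different students or training stages.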
The underlying argument is metacognitive: the appropriate training signal for a model at capability level T is not the best possible response in absolute terms but the best response compatible with the model's current learning frontier. Overshooting in data quality creates a mismatch analogous to teaching advanced calculus before arithmetic is solid: the instruction is correct, but the student can't absorb it.
This adds a dimension to the SFT quality literature. Correctness of training targets is necessary but not sufficient; compatibility with the specific student's current distribution is also required. A data-quality pipeline that doesn't account for student compatibility will produce inconsistent results across different model sizes, initializations, and training stages.
Connects to "Does supervised fine-tuning actually improve reasoning quality?": both identify SFT quality failures; this paper adds that even "better" data in absolute terms can degrade performance if the student-compatibility dimension is ignored.
Teacher benchmark scores don't predict teaching effectiveness (OpenThoughts): In SFT data curation for reasoning models, QwQ-32B outperforms DeepSeek-R1 as a teacher despite scoring lower on target reasoning benchmarks. This extends the student-compatibility argument: even the choice of teacher is not just about absolute quality. A weaker-performing model may produce responses whose reasoning patterns are more compatible with the student's learning frontier. Additional findings: selecting a few high-quality question sources beats source diversity (the top 1-2 sources outperform the top 8-16); difficulty-based and response-length filtering outperform embedding-based or fastText filters; and sampling 16 answers per question is an effective scaling strategy, since growing the dataset 16x through multi-answer sampling drives significant gains.
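The three curation findings above can be sketched as one small pipeline. This is an illustration under stated assumptions, not the OpenThoughts implementation: the `Teacher` callable, the `source_quality` scores, the `keep_fraction` cutoff, and the choice to keep the questions with the longest probe answers are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: question -> one sampled teacher answer.
Teacher = Callable[[str], str]


def curate(
    questions_by_source: Dict[str, List[str]],
    source_quality: Dict[str, float],  # assumed precomputed per-source score
    teacher: Teacher,
    top_sources: int = 2,
    keep_fraction: float = 0.5,
    answers_per_question: int = 16,
) -> List[Tuple[str, str]]:
    """Return (question, answer) pairs for SFT."""
    # 1. Source selection: keep only the highest-rated question sources.
    best = sorted(source_quality, key=source_quality.get, reverse=True)[:top_sources]
    pool = [q for source in best for q in questions_by_source[source]]

    # 2. Response-length filtering: rank questions by the word count of one
    #    probe answer and keep the longest (direction is an assumption here).
    probe_len = {q: len(teacher(q).split()) for q in pool}
    ranked = sorted(pool, key=lambda q: probe_len[q], reverse=True)
    kept = ranked[: max(1, int(len(ranked) * keep_fraction))]

    # 3. Multi-answer sampling: several teacher answers per kept question.
    return [(q, teacher(q)) for q in kept for _ in range(answers_per_question)]
```

With a stochastic teacher, step 3 yields distinct answers for the same question, which is what lets the dataset grow without adding new questions.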
Inquiring lines that use this note as a source 58
- Does extended exoskeleton use eventually produce meaningful skill transfer?
- How do self-generated preference pairs from a strong teacher compare to human feedback?
- Why do proprietary models improve with training while open-source models decline?
- Can curated demonstrations compensate for smaller or simpler training environments?
- Why does self-generated training data outperform externally sourced data?
- Does selecting examples from multiple complexity levels outperform selecting only high-quality examples?
- How does distributional distance from pre-training relate to model difficulty?
- Why does training data format matter more than domain content?
- Why do easy training examples contribute less to model generalization than hard ones?
- Can gradient-based influence scores beat difficulty metrics for identifying valuable training data?
- When does knowledge distillation produce student models superior to teachers?
- Why does self-generated training data outperform externally curated domain examples?
- How should learning environments balance error prevention with pedagogical value?
- Can the serving loop itself become the primary training data source?
- What conditions make training diversity better than individual expert quality?
- When do aggregated imperfect demonstrations fail to outperform the best expert?
- Can selecting the right data subset outperform training on everything?
- Why does curriculum learning with tight budgets beat fixed-budget approaches?
- What makes training data quality more important than quantity for reasoning?
- How should researchers evaluate whether correct model outputs reflect real structural learning?
- Can curriculum degradation of document quality accelerate policy learning?
- How does training data distribution determine what models can learn?
- Why does mixing reasoning traces from different teachers destabilize learning?
- Can backward transfer measurements reliably predict optimal multi-task training order?
- What makes utility-weighted training backfire in machine learning systems?
- What training data contamination rates threaten model safety most practically?
- Does prompt performance vary by how well training data covers the domain?
- Does foundational model training or user priors more strongly shape final outputs?
- Does self-generated training data reduce a model's capability diversity?
- Why do weaker models generate better training data than stronger models?
- Does training data format matter more than who generates it?
- Why do weaker teacher models sometimes produce better training signals than stronger ones?
- What filtering criteria best identify student-compatible refinements from teacher models?
- Can self-training drift be prevented by applying student compatibility filtering?
- How much task-similar finetuning data does test-time training actually need?
- How does information asymmetry between teacher and student create the learning signal?
- What specific qualities make some demonstrations more effective for agency training?
- How does a challenger's escalating difficulty function as curriculum?
- How do difficulty metrics relate to the true value of training examples?
- Can explicit reflection during AI-assisted work improve transfer of learning?
- Can personalized AI learning systems actually widen rather than narrow educational gaps?
- Can thought quality alone be trusted to guide model training?
- How should training data be constructed to preserve teacher-student information gaps?
- What makes policy self-distillation more effective than external teacher distillation?
- Can teachers trained under uncertainty constraints distill better generalizing students?
- Why do adaptive curriculum schemes outperform static difficulty filters?
- How does difficulty-adaptive curriculum learning change which samples get selected during training?
- How does the optimal difficulty band shift as the model's capabilities improve during training?
- Why does curriculum order matter when information theory says data order is irrelevant?
- Why does information asymmetry between teacher and student enable effective feedback learning?
- Do text-space skills transfer learning across different frontier models?
- Why do students learn better from explanations than from solving problems from scratch?
- Does pretraining data size matter less than base model scale for finetuning?
- Which finetuning method works best across different task and data regimes?
- How do finetuning and pretraining improvements differ in their effects on model capabilities?
- Do sample-level similarities between pretraining and downstream tasks explain the frequency effect?
- How does the Learning Law explain why all examples should contribute equally?
- Can mid-tier models benefit more from self-generated harness updates than others?
Related concepts in this collection 2
- Does supervised fine-tuning actually improve reasoning quality?
  While SFT boosts final-answer accuracy, does it degrade the quality and informativeness of the reasoning steps that justify those answers? This matters for high-stakes domains requiring auditable decision-making.
  Both identify SFT quality failures; student-compatibility is a missing dimension in the data-quality literature.
- How quickly do errors compound during model self-training?
  When LLMs train on their own outputs without verification, do small mistakes amplify exponentially? This matters because it determines whether unsupervised self-improvement is even feasible.
  Teacher refinement without compatibility filtering is structurally vulnerable to the same drift as unchecked self-training.
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning
- Subliminal Learning: Language models transmit behavioral traits via hidden signals in data
- Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining
- Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?
- AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts
- Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
- Task Contamination: Language Models May Not Be Few-Shot Anymore
- AI Meets the Classroom: When Does ChatGPT Harm Learning?
Original note title
teacher-refined instruction data requires student-model selection because refinement compatibility depends on the student's current distribution