SYNTHESIS NOTE

Does teacher-refined data always improve student model performance?

Explores whether higher-quality training data from teacher models uniformly benefits student models, or if compatibility with the student's current learning state matters for effective instruction.

Synthesis note · 2026-02-22 · sourced from Reasoning by Reflection

Standard pipelines for improving instruction-tuning data assume a simple chain: teacher refines training data → student trains on refined data → student improves. Selective Reflection-Tuning challenges this with a compatibility argument: data quality is relative to the student, not absolute. A response "improved" by a GPT-4 teacher may introduce knowledge complexity or reasoning patterns that conflict with the student's current knowledge state, producing a degraded training signal even though the response is higher quality in absolute terms.

The fix: after teacher refinement, have the student model evaluate each refined sample and decide whether to incorporate it. The student uses its own statistical profile as the selection criterion — what it finds tractable and useful given its current weights. Teacher-refined data the student can't process effectively is filtered out; compatible refinements are retained.
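The selection step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the student's "statistical profile" is summarized as the ratio of its loss on a response with the instruction to its loss on the same response without it, and that a refined sample is kept only when that ratio falls inside a band. The names `Candidate`, `difficulty_ratio`, `select_sample` and the thresholds `low` and `high` are hypothetical, and the per-token log-probabilities are assumed to come from a forward pass of the student.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """One response to an instruction, scored by the student model.

    Both lists are hypothetical inputs: the student's per-token
    log-probabilities of the response, with and without the instruction.
    """
    text: str
    logprobs_given_instruction: List[float]
    logprobs_alone: List[float]


def mean_nll(logprobs: List[float]) -> float:
    """Average negative log-likelihood over the response tokens."""
    if not logprobs:
        raise ValueError("response has no tokens")
    return -sum(logprobs) / len(logprobs)


def difficulty_ratio(c: Candidate) -> float:
    """Conditioned loss divided by unconditioned loss, under the student.

    Near 1 or above: the instruction barely helps the student predict
    the response. Far below 1: the response is easy given the instruction.
    """
    return mean_nll(c.logprobs_given_instruction) / mean_nll(c.logprobs_alone)


def select_sample(original: Candidate, refined: Candidate,
                  low: float = 0.2, high: float = 0.9) -> Candidate:
    """Keep the teacher-refined sample only if the student finds it
    tractable (ratio <= high) and not trivial (ratio >= low);
    otherwise fall back to the original. Thresholds are illustrative."""
    ratio = difficulty_ratio(refined)
    return refined if low <= ratio <= high else original


original = Candidate("short answer", [-1.5, -1.5], [-2.0, -2.0])
compatible = Candidate("refined answer", [-1.0, -1.0], [-2.0, -2.0])
too_hard = Candidate("dense refined answer", [-2.0, -2.0], [-2.0, -2.0])

kept_a = select_sample(original, compatible)  # ratio 0.5, inside the band
kept_b = select_sample(original, too_hard)    # ratio 1.0, outside the band
```

The design point is that the teacher never decides what enters the training set; the same refined response could be kept for one student and rejected for another, because the ratio is computed under each student's own weights.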

The underlying argument is metacognitive: the appropriate training signal for a model at capability level T is not the best possible response in absolute terms but the best response compatible with the model's current learning frontier. Overshoot in data quality creates a mismatch analogous to teaching advanced calculus before arithmetic is solid — the instruction is correct but the student can't absorb it.

This adds a dimension to the SFT quality literature. Correctness of training targets is necessary but not sufficient: compatibility with the specific student's current distribution is also required. A data-quality pipeline that doesn't account for student compatibility will produce inconsistent results across different model sizes, initializations, and training stages.

Connects to "Does supervised fine-tuning actually improve reasoning quality?": both notes identify SFT quality failures; the Selective Reflection-Tuning paper adds that even "better" data in absolute terms can degrade performance if the student-compatibility dimension is ignored.

Teacher benchmark scores don't predict teaching effectiveness (OpenThoughts): In SFT data curation for reasoning models, QwQ-32B outperforms DeepSeek-R1 as a teacher despite scoring lower on target reasoning benchmarks. This extends the student-compatibility argument: even the choice of teacher is not just about absolute quality. A weaker-performing model may produce responses whose reasoning patterns are more compatible with the student's learning frontier. Additional findings: selecting a few high-quality question sources beats source diversity (the top 1-2 question sources outperform the top 8-16), difficulty-based and response-length filtering outperform embedding-based or fastText filters, and sampling 16 answers per question is an effective scaling strategy: growing the dataset 16x through multi-answer sampling drives significant gains.
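Two of these curation steps, response-length filtering and multi-answer sampling, can be sketched in a few lines. This is a rough illustration under stated assumptions, not the OpenThoughts pipeline: it assumes response length is used as a difficulty proxy and that the longest-answer questions are the ones kept, and the functions `filter_by_response_length` and `sample_answers`, the `draft_answer` and `teacher` callables, and `keep_fraction` are all hypothetical names.

```python
from typing import Callable, List, Tuple


def filter_by_response_length(questions: List[str],
                              draft_answer: Callable[[str], str],
                              keep_fraction: float) -> List[str]:
    """Keep the questions whose draft answers are longest.

    Assumption: answer length (in whitespace-separated words) stands in
    for difficulty, and longer is kept.
    """
    ranked = sorted(questions,
                    key=lambda q: len(draft_answer(q).split()),
                    reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]


def sample_answers(questions: List[str],
                   teacher: Callable[[str, int], str],
                   n_samples: int = 16) -> List[Tuple[str, str]]:
    """Ask the teacher for n_samples answers per question.

    Each (question, answer) pair becomes one training row, so the
    dataset grows n_samples-fold.
    """
    return [(q, teacher(q, i)) for q in questions for i in range(n_samples)]


# Toy stand-ins for model calls, for illustration only.
drafts = {"a": "x", "b": "x x x", "c": "x x", "d": "x x x x"}
kept = filter_by_response_length(list(drafts), lambda q: drafts[q], 0.5)
rows = sample_answers(kept, lambda q, i: f"{q}-{i}", n_samples=16)
```

In a real setting the teacher callable would sample with nonzero temperature so that the 16 answers to one question differ from one another.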

Original note title

teacher-refined instruction data requires student-model selection because refinement compatibility depends on the student's current distribution