Why do proprietary models improve with training while open-source models decline?
This explores why the same training recipe — RL, distillation, fine-tuning — can lift one model and quietly damage another, and how much of that gap comes from what a model already contains before training starts rather than the open vs. closed label itself.
The corpus suggests the divide isn't really "proprietary vs. open" as a brand, but what's already baked into a model before you touch it, plus how much of that you can see. The sharpest clue is that RL post-training doesn't teach a model new tricks so much as pick a winner among the formats it already learned in pretraining: it amplifies one dominant distribution within the first epoch and suppresses the rest, and which format wins depends on scale, not necessarily on which one performs best (see "Does RL training collapse format diversity in pretrained models?"). Crucially, that winning format is "largely hidden when starting from proprietary pretrained models", so when a closed lab trains on top of a rich, well-shaped base, the gains look like the training working, when really the base did the heavy lifting.
That reframes the question: a model improves or declines based on whether the new training lands inside what it can actually absorb. Teacher-refined data is the cleanest example: higher-quality data from a stronger teacher actively *degrades* a student when it exceeds the student's learning frontier, so the student has to filter refinements against its own statistical profile and keep only the compatible ones (see "Does teacher-refined data always improve student model performance?"). A weaker open base fed a strong lab's training signal is exactly the mismatch that produces decline rather than gain.
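As a rough illustration of what "filter refinements against its own statistical profile" could look like, here is a minimal sketch. The note does not say how the filter is implemented, so everything here is an assumption: the function name, the use of the student's mean per-token log-probability as the "profile", and the `max_z` threshold are all hypothetical.

```python
from statistics import mean, stdev

def filter_by_student_profile(refined, baseline_logprobs, max_z=1.5):
    """Keep refined samples the student finds roughly as likely as its own outputs.

    refined: list of (text, student_logprob) pairs, where student_logprob is the
        student's mean per-token log-probability of the teacher-refined text
        (hypothetical scoring; not taken from the source).
    baseline_logprobs: the same score measured on the student's own outputs.
    max_z: how many standard deviations below the baseline mean is tolerated.
    """
    mu = mean(baseline_logprobs)
    sigma = stdev(baseline_logprobs)
    kept = []
    for text, student_logprob in refined:
        # Positive z means the student finds this sample less likely than usual.
        z = (mu - student_logprob) / sigma
        if z <= max_z:
            kept.append(text)
    return kept

# Toy scores, invented for illustration only.
baseline = [-1.0, -1.2, -0.8, -1.1, -0.9]
refined = [("a", -1.1), ("b", -2.5), ("c", -0.7)]
kept = filter_by_student_profile(refined, baseline)
```

In this toy run, sample "b" sits far below the student's usual likelihood range and is dropped, while "a" and "c" are kept. A real filter would need a scoring model and a tuned threshold.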
The failure modes compound from there. Train on problems that are too hard and the model learns degenerate shortcuts (answer repetition, skipped computation) that don't just fail to help; they contaminate capabilities the model already had (see "Do overly hard RLVR samples actually harm model capabilities?"). And nearly every adaptation method has a narrow domain-conditional sweet spot where visible performance gains hide quiet degradation in reasoning faithfulness and format flexibility (see "How do domain training techniques actually reshape model behavior?"). So "decline" often isn't a flat drop: it's a headline metric going up while something underneath rots. Labs with deep eval infrastructure catch this; a public leaderboard chasing one number doesn't.
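The source note on hard samples attributes the shortcut problem to group-relative normalization treating rare accidental successes as high-advantage trajectories. The arithmetic can be sketched as follows, assuming one common form of the normalization (reward minus group mean, divided by group standard deviation). The function name, the `eps` term, and the 0/1 rewards are illustrative assumptions, not the note's exact setup.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group.

    Assumed form: A_i = (r_i - mean(r)) / (std(r) + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# A nearly impossible problem: one accidental success out of eight samples.
hard_group = [0, 0, 0, 0, 0, 0, 0, 1]
# A problem inside the model's reach: half the samples succeed.
balanced_group = [0, 0, 0, 0, 1, 1, 1, 1]

hard_adv = group_relative_advantages(hard_group)
balanced_adv = group_relative_advantages(balanced_group)

lucky = hard_adv[-1]         # advantage of the single lucky success
typical = balanced_adv[-1]   # advantage of an ordinary success
```

Under these assumptions the lone success in the hard group receives an advantage of roughly 2.6, against roughly 1.0 for a success in the balanced group. Whatever that lucky trajectory did, including repeating an answer or skipping computation, gets reinforced more strongly than a normal correct solution.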
There's also a mechanical lever that separates improvement from decline: how far training drags the model from its base. Staying close to the base distribution (low KL drift) preserves *plasticity*, the ability to keep learning later tasks; parameter-only RL that drifts hard stalls out the moment the domain changes (see "Does staying close to the base model preserve learning ability?"). Aggressive open fine-tuning that yanks a model far from its base can spend its future learning ability for a one-time bump. And all of this sits under a hard ceiling: a model can't reliably self-improve past the gap between generating an answer and verifying it, so every dependable gain needs something external and trustworthy to validate it (see "What stops large language models from improving themselves?"). The labs that improve with training tend to own that external verifier; the ones that decline are often optimizing against a signal that doesn't actually check the work.
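The "KL drift" mentioned above can be made concrete with a small calculation. This is a minimal sketch that assumes drift is measured as KL(tuned || base) over a single next-token distribution; the three-token distributions are invented for illustration and the note does not specify its exact measurement.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions on the same support.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute nothing.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy next-token distributions over three tokens (illustrative numbers only).
base = [0.50, 0.30, 0.20]
gentle_tune = [0.55, 0.28, 0.17]      # stays close to the base
aggressive_tune = [0.95, 0.04, 0.01]  # collapses onto one token

gentle_drift = kl_divergence(gentle_tune, base)
aggressive_drift = kl_divergence(aggressive_tune, base)
```

With these numbers the gentle update drifts by well under 0.01 nats while the aggressive one drifts by about 0.5 nats. In practice the quantity would be averaged over many prompts and token positions rather than computed for one distribution.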
The thing you didn't know you wanted to know: there's also an upside case that breaks the framing entirely. Walmart's small BERT cross-encoders *beat* their LLM teachers, but only because a large enough augmented dataset exposed the student to a broader input distribution than the teacher ever saw (see "Can smaller models outperform their LLM teachers with enough data?"). So "smaller/open declines" isn't a law. The variable is whether training adds genuinely new, verifiable, in-frontier signal, or just reshuffles and overdrives what was already there.
Sources (7 notes)
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
Research shows every adaptation method—from parameter-efficient tuning to knowledge graph curricula—has optimal conditions tied to specific domains. The key finding: visible benefits like performance gains often come with hidden degradation in reasoning faithfulness, capability transfer, and format flexibility.
FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.
Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.
Walmart's student cross-encoders outperformed their LLM teachers when trained on sufficiently large augmented datasets of teacher-labeled queries. The student's broader input distribution exposure, smoothed by teacher predictions, enabled better generalization than the teacher achieved.