Can smaller models produce skill updates as useful as frontier model updates?
This explores whether smaller models can generate updates to skills, harnesses, or instruction libraries that are as useful as those written by frontier models. The corpus suggests the surprising answer is mostly yes for *producing* the update; the real bottleneck shows up elsewhere.
The most direct finding in the corpus splits the question in two. The capacity to *write* a useful update to a skill or harness is roughly flat across model tiers: even smaller models generate comparable edits. The ability to *use* those updates, however, peaks at mid-tier, with both weak and strong models struggling to activate and follow updated instructions (see: Do stronger models always evolve their own harnesses better?). So the question quietly contains two different abilities, and only the second one is size-sensitive.
Why would generating an update be size-insensitive while reasoning isn't? Because skill-writing is closer to the well-defined, repetitive language work that smaller models already handle competently. This is the same logic behind treating small models as sufficient for most agentic subtasks at a fraction of the cost (see: Can small language models handle most agent tasks?). But the corpus also warns where small models hit a wall: skills decompose unevenly, and while surface-level style saturates early, logical reasoning keeps improving with scale. Distillation tends to copy form, not substance (see: Do all AI skills improve equally as models scale?). A skill update that's really a reasoning artifact in disguise won't transfer the way a formatting or workflow tweak will.
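One way to picture the distinction above is a dispatcher that sends formatting and workflow updates to a small model and reserves a frontier model for updates that are really reasoning artifacts. This is only a sketch: the `UpdateKind` labels, the `UpdateRequest` type, and the model names are hypothetical, and the corpus does not prescribe this rule or say how an update should be classified.

```python
from dataclasses import dataclass
from enum import Enum


class UpdateKind(Enum):
    """Hypothetical labels for what a skill update mostly consists of."""
    FORMATTING = "formatting"
    WORKFLOW = "workflow"
    REASONING = "reasoning"


@dataclass(frozen=True)
class UpdateRequest:
    skill_name: str
    kind: UpdateKind


def choose_writer(request: UpdateRequest,
                  small: str = "small-model",
                  large: str = "frontier-model") -> str:
    """Pick which model drafts the update (assumed policy, not from the corpus).

    Reasoning-heavy updates go to the larger model; everything else
    defaults to the cheaper small model.
    """
    if request.kind is UpdateKind.REASONING:
        return large
    return small


if __name__ == "__main__":
    for kind in UpdateKind:
        req = UpdateRequest(skill_name="example-skill", kind=kind)
        print(kind.value, "->", choose_writer(req))
```

The hard part in practice is the classification step, which this sketch takes as given.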
There's a deeper structural reason small models can keep pace: useful learning seems to live in a small, structured slice of the model. RL updates touch only 5–30% of parameters, and those sparse updates are nearly full-rank and nearly identical across random seeds, so improvement is concentrated, not diffuse (see: Does reinforcement learning update only a small fraction of parameters?). And the heavy lifting can be moved out of the model entirely: a separately trained *curator* paired with a frozen executor learns to evolve a skill repository toward sharp execution logic and cross-task meta-strategies, and that curator generalizes across different executor backbones (see: Can a separate trained curator improve skill libraries better than frozen agents?). If the intelligence lives in the curation loop rather than the base model, the executor's size matters less.
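To make the "5–30% of parameters" figure concrete, here is a minimal sketch of how an update fraction could be measured between two checkpoints. It assumes each checkpoint is a plain mapping from parameter name to a flat list of floats; the `update_fraction` helper, the tolerance, and the toy values are illustrative assumptions, not the measurement procedure used in the cited work.

```python
from typing import Dict, List


def update_fraction(before: Dict[str, List[float]],
                    after: Dict[str, List[float]],
                    tol: float = 1e-8) -> float:
    """Return the share of scalar parameters that changed by more than `tol`."""
    changed = 0
    total = 0
    for name, old_values in before.items():
        new_values = after[name]
        if len(old_values) != len(new_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        total += len(old_values)
        changed += sum(
            1 for old, new in zip(old_values, new_values) if abs(old - new) > tol
        )
    return changed / total if total else 0.0


if __name__ == "__main__":
    # Toy checkpoints (made-up numbers): one of six scalars changes.
    before = {"layer0.weight": [0.10, -0.20, 0.30, 0.40], "layer0.bias": [0.0, 0.0]}
    after = {"layer0.weight": [0.10, -0.25, 0.30, 0.40], "layer0.bias": [0.0, 0.0]}
    print(f"{update_fraction(before, after):.3f}")
```

Running the same comparison on checkpoints trained from different seeds, and checking how much the changed positions overlap, would be one way to probe the "consistent across seeds" claim.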
The broader pattern across the corpus is that selection and compute often beat raw scale. Smaller models with extra inference-time compute match larger ones on hard prompts (see: Can inference compute replace scaling up model size?). DPO-trained small models match large ones on function-calling by learning from a teacher's mistakes (see: Can small models match large models on function calling?). Step-wise expert-similarity rewards let small models learn reasoning that sparse outcome rewards can't teach (see: Can step-wise expert rewards help small models learn hard reasoning?). And a routed fleet of ten 7B models has surpassed GPT-4.1 and 4.5, which suggests selection is a stronger lever than scaling (see: Can routing beat building one better model?). The thing you didn't know you wanted to know: smaller models are sometimes *better* at the generative half. Around 500M parameters they produce more unique outputs per sample, because large models concentrate probability mass and lose diversity (see: Why aren't bigger models better for generating diverse outputs?). So for proposing a varied set of candidate skill updates, small can win; the frontier's edge is in judging and reliably executing them, not in writing them.
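"Unique outputs per sample" can be read as a simple ratio: distinct generations divided by total generations within a fixed sampling budget. The sketch below assumes exact-match deduplication after lowercasing and whitespace normalization, which is a simplification; the `distinct_ratio` helper and the sample strings are made up for illustration and are not results from the cited research.

```python
from typing import Iterable


def distinct_ratio(samples: Iterable[str]) -> float:
    """Distinct outputs divided by total outputs, after light normalization."""
    normalized = [" ".join(sample.lower().split()) for sample in samples]
    if not normalized:
        return 0.0
    return len(set(normalized)) / len(normalized)


if __name__ == "__main__":
    # Hypothetical candidate skill updates drawn from one model.
    candidates = [
        "Add a retry step before calling the API",
        "add a retry step before  calling the API",  # same text, different case and spacing
        "Validate inputs before calling the API",
        "Log the request id on failure",
    ]
    print(distinct_ratio(candidates))
```

A stricter version would deduplicate on meaning rather than surface form, but that requires a similarity model and is outside this sketch.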
Sources 10 notes
Model strength doesn't bottleneck writing useful harness edits: even smaller models generate comparable improvements. But the ability to use those updates is non-monotonic, peaking at mid-tier models, with weak and strong models both struggling to activate and follow updated instructions.
SLMs handle the repetitive, well-defined language tasks that constitute most agent work at 10–30× lower cost than LLMs, making heterogeneous architectures (SLMs by default, LLMs selective) the economically rational design pattern.
FLASK's 12-skill decomposition reveals metacognition saturates at 7B parameters while logical efficiency plateaus at 30B, but reasoning and knowledge skills improve continuously. Open-source models successfully imitate surface-level style but fail at reasoning—confirming that distillation copies form not substance.
Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.
SkillOS shows that separating a trainable curator from a frozen executor, grouped by task streams, causes skill repositories to shift from generic verbose additions toward actionable execution logic and cross-task meta-strategies. The trained curator generalizes across different executor backbones and domains.
Snell et al. (2024) showed that inference-time compute trades off against model parameter scaling, especially on difficult prompts. This reveals pretraining and inference compute are not independent resources.
Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.
Supervised Reinforcement Learning rewards models by measuring alignment with expert actions at each step, providing dense learning signals even when all rollouts fail. This approach bridges the gap between rigid token-by-token imitation (SFT) and sparse outcome-only rewards (RLVR), and works best as a curriculum foundation before outcome-based refinement.
Avengers-Pro achieves 7% higher accuracy than GPT-5-medium by routing queries to optimal models per semantic cluster, or matches its performance at 27% lower cost. Ten 7B models with routing previously surpassed GPT-4.1 and 4.5, suggesting selection is a stronger lever than scaling.
Research shows that for synthetic data generation, models around 500M parameters outperform larger ones in output diversity per sample. Larger models concentrate probability mass on preferred outputs, reducing the variety of distinct samples generated within a fixed budget.