Can input augmentation and rephrasing compensate for smaller model limitations?
This note explores whether reshaping the input (adding context, rephrasing the prompt, padding with examples) can make a small model behave like a bigger one. The corpus says input shaping activates latent ability but cannot supply what the model never learned.
This reads the question as: when a model is too small or too weak, can you fix it from the outside, by augmenting or rephrasing what you feed it, rather than by training a better model? The corpus draws a sharp line. Prompt-side moves can only *reorganize* knowledge the model already has; they cannot *inject* knowledge it lacks. "Can prompt optimization teach models knowledge they lack?" frames this as a hard ceiling: no prompt strategy compensates for missing foundational knowledge; it only activates what is already in the training distribution. So the honest answer is split: augmentation can recover capability the model has but isn't using; it cannot manufacture capability that was never there.
There's a deeper reason rephrasing has limited reach. "Why do language models ignore information in their context?" shows that when a model's parametric priors are strong, it simply ignores what you put in its context, generating answers that contradict the prompt because training-time associations dominate. The paper's blunt conclusion: textual prompting alone can't override strong priors; you'd need to intervene in the model's internal representations. That means the smaller and more opinionated the model, the *less* leverage clever input phrasing buys you on exactly the cases where you'd want it most.
Worse, more input is not free. "Does reasoning ability actually degrade with longer inputs?" found reasoning accuracy dropping from 92% to 68% with just 3,000 tokens of padding, far below the context limit, and even with chain-of-thought prompting. So the instinct to 'augment' by stuffing in more examples or context can actively degrade a small model's reasoning rather than rescue it. Augmentation that adds length works against you unless every added token earns its place.
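To make the length effect concrete, here is a minimal sketch of a padding experiment in the spirit of the finding above. It is not the FLenQA harness: `pad_prompt`, `accuracy_by_length`, and the `answer_fn` callable are hypothetical names introduced for illustration, and token counts are approximated by whitespace splitting rather than a real tokenizer.

```python
import random


def pad_prompt(key_facts, question, filler_sentences, target_tokens, seed=0):
    """Scatter irrelevant filler among the key facts until the prompt reaches
    roughly target_tokens (whitespace-delimited words, an approximation)."""
    if not filler_sentences or not all(s.split() for s in filler_sentences):
        raise ValueError("filler_sentences must contain non-empty sentences")
    rng = random.Random(seed)

    def n_tokens(parts):
        return sum(len(p.split()) for p in parts)

    body = list(key_facts)
    q_len = len(question.split())
    while n_tokens(body) + q_len < target_tokens:
        # Inserting (not shuffling) keeps the key facts in their original order.
        pos = rng.randint(0, len(body))
        body.insert(pos, rng.choice(filler_sentences))
    return " ".join(body) + "\n" + question


def accuracy_by_length(answer_fn, items, filler_sentences, lengths):
    """items: list of (key_facts, question, gold_answer) tuples.
    answer_fn: hypothetical callable mapping a prompt string to an answer string.
    Returns {target_length: accuracy} with the task content held fixed."""
    results = {}
    for n in lengths:
        correct = 0
        for facts, question, gold in items:
            prompt = pad_prompt(facts, question, filler_sentences, n)
            if answer_fn(prompt).strip().lower() == gold.strip().lower():
                correct += 1
        results[n] = correct / len(items)
    return results


if __name__ == "__main__":
    facts = ["A is taller than B.", "B is taller than C."]
    items = [(facts, "Is A taller than C?", "yes")]
    filler = ["The sky was grey.", "A train left the station at noon."]

    # Stand-in for a real model call; replace with your own inference function.
    def stub_model(prompt):
        return "yes"

    print(accuracy_by_length(stub_model, items, filler, [0, 500, 3000]))
```

The design choice that matters is holding the key facts and question fixed while only the filler grows, so any accuracy change can be attributed to length rather than to task difficulty.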
Where the corpus *does* see small models catching up, the lever is training or architecture, not input. "Can small models match large models on function calling?" shows them doing so, but via DPO on a teacher's correct and incorrect examples, which targets the format failures directly. "Does depth matter more than width for tiny language models?" gets gains from architecture (deep-and-thin), and "Do transformers hide reasoning before producing filler tokens?" reveals that the answer is sometimes computed in early layers and then suppressed in the final layers: capability hidden inside the model rather than absent from it. That last point is the most useful for your question: it suggests the real win is *surfacing* latent computation, which is closer to representation-level intervention than to rephrasing the prompt.
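The idea of an answer being computed early and then suppressed can be illustrated with a logit-lens style readout: project each layer's hidden state through the unembedding matrix and track where the answer token ranks. The sketch below uses synthetic toy arrays, not activations from any real model; `logit_lens`, `rank_of_token`, and the token ids are assumptions made for illustration, and a real model would usually also need its final layer norm applied before unembedding.

```python
import numpy as np


def logit_lens(hidden_states, unembed, final_norm=None):
    """hidden_states: (n_layers, d_model), one position's residual stream per layer.
    unembed: (d_model, vocab_size) output projection.
    final_norm: optional callable applied before unembedding (assumed, model-specific).
    Returns (top_token_ids, logits) with shapes (n_layers,) and (n_layers, vocab_size)."""
    h = hidden_states if final_norm is None else final_norm(hidden_states)
    logits = h @ unembed
    return logits.argmax(axis=-1), logits


def rank_of_token(logits, token_id):
    """Rank of token_id at each layer (0 means it is the top prediction)."""
    order = np.argsort(-logits, axis=-1)
    return np.argmax(order == token_id, axis=-1)


# Synthetic example: vocab of 5 tokens, identity unembedding for readability.
FILLER_ID, ANSWER_ID = 0, 3  # hypothetical token ids
unembed = np.eye(5)
hidden = np.array([
    [0.0, 0.0, 0.0, 2.0, 0.0],  # early layer: answer token dominates
    [0.0, 0.0, 0.0, 3.0, 0.0],  # middle layer: answer token still dominates
    [3.0, 0.0, 0.0, 2.0, 0.0],  # final layer: filler wins, answer is runner-up
])

top_ids, logits = logit_lens(hidden, unembed)
answer_ranks = rank_of_token(logits, ANSWER_ID)
print("top token per layer:", top_ids.tolist())
print("answer rank per layer:", answer_ranks.tolist())
```

In this toy setup the answer token is the top prediction in the first two layers and falls to second place in the last one, which mirrors the pattern described above: the information is still present in the lower-ranked predictions even though the emitted token is filler.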
The takeaway a curious reader might not expect: 'compensate' has two meanings, and they come apart. If the small model already knows the thing but isn't expressing it, then yes, input shaping, and especially targeted training like DPO, can unlock it. If the small model genuinely lacks the knowledge or the depth, no amount of rephrasing helps. The generation-verification gap described in "What stops large language models from improving themselves?" explains why: a model can't validate its way past its own ceiling without something external. Augmentation is an activation tool, not a substitute for capacity.
Sources (7 notes)
Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
FLenQA shows reasoning accuracy drops from 92% to 68% at just 3,000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.
Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.
MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.
Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.
Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.