INQUIRING LINE

Do different model sizes show different rates of optional field overfilling behavior?

This explores whether bigger or smaller models are more (or less) prone to filling in optional form fields that should be left blank — and the honest answer is that the corpus traces this behavior to training, not size.


The most direct material in the collection reframes the question of whether model size predicts how often an agent overfills optional fields: overfilling isn't really a size phenomenon, it's a training one. The completion-bias note (see 'Does completion training push agents to overfill forms unnecessarily?') shows that overfilling optional fields is one of three failure modes, alongside over-claiming actions and silently corrupting documents, that share a single root cause: training that rewards finishing the task without teaching the model to distinguish what's required from what's merely allowed. That mechanism lives in the objective, not in the parameter count, so scaling up wouldn't be expected to fix it and scaling down wouldn't be expected to cause it.
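To make 'overfilling' concrete enough to compare across models, here is a minimal sketch of one way to score it. The function names, the notion of a `justified` set, and the sample form are all assumptions for illustration; the notes do not define a metric.

```python
from typing import Any, Mapping, Set


def is_blank(value: Any) -> bool:
    """Treat None, empty strings, and empty containers as 'left blank'."""
    return value is None or value == "" or value == [] or value == {}


def overfill_rate(
    filled: Mapping[str, Any], optional: Set[str], justified: Set[str]
) -> float:
    """Fraction of optional fields filled without supporting input.

    `justified` (hypothetical) holds optional fields the user actually
    supplied data for; filling those is correct, so they are excluded.
    """
    candidates = optional - justified
    if not candidates:
        return 0.0
    overfilled = [name for name in candidates if not is_blank(filled.get(name))]
    return len(overfilled) / len(candidates)


# Illustrative submission: 'fax' and 'company' were invented by the agent.
submission = {
    "name": "Ada",
    "email": "ada@example.com",
    "fax": "n/a",
    "middle_name": "",
    "company": "Unknown",
}
rate = overfill_rate(
    submission, optional={"fax", "middle_name", "company"}, justified=set()
)
print(f"overfill rate: {rate:.2f}")
```

Running the same scorer over outputs from models of different sizes, holding the training recipe fixed, is the kind of head-to-head comparison the question asks for.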

The corpus doesn't contain a clean head-to-head measurement of overfilling rates across model sizes, so anyone hoping for a 'big models do it X% more' number won't find it here. What the collection does offer is reason to doubt that size is the right axis at all. Small models, when trained with the right signal, match large ones on exactly the kind of structured output where overfilling shows up: DPO-trained small models close the gap on function calling precisely because explicit negative examples ('here's a wrong fill') target the rigid format failures that plain supervised fine-tuning leaves intact (see 'Can small models match large models on function calling?'). So the lever is the presence of negative examples in training, not the model's scale.
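As a rough illustration of what a 'here's a wrong fill' negative example could look like for overfilling, the sketch below builds one preference record. The `prompt`/`chosen`/`rejected` layout is a widely used convention for preference data, but it is an assumption here; the source does not say what shape the cited work used, and `make_preference_pair` is a hypothetical helper.

```python
import json
from typing import Any, Dict


def make_preference_pair(
    instruction: str,
    required: Dict[str, Any],
    optional_supplied: Dict[str, Any],
    invented_optional: Dict[str, Any],
) -> Dict[str, str]:
    """Build one contrastive record: blank-is-fine vs. overfilled.

    chosen   = required fields + optional fields the user really gave
    rejected = the same call, plus optional fields the agent made up
    """
    base = dict(required)
    base.update(optional_supplied)
    rejected = dict(base)
    rejected.update(invented_optional)
    return {
        "prompt": instruction,
        "chosen": json.dumps(base, sort_keys=True),
        "rejected": json.dumps(rejected, sort_keys=True),
    }


# Illustrative data only; none of these values come from the source.
pair = make_preference_pair(
    instruction="Book a table for two at 7pm under the name Ada.",
    required={"party_size": 2, "time": "19:00", "name": "Ada"},
    optional_supplied={},
    invented_optional={"special_requests": "None", "phone": "000-000-0000"},
)
print(pair["chosen"])
print(pair["rejected"])
```

The two completions differ only in the invented optional fields, so the training signal is about exactly the required-versus-allowed distinction rather than about format in general.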

There's a deeper pattern worth pulling in from an adjacent note: a lot of what looks like 'the model reasoned correctly' is actually a default in disguise. Twelve of fourteen models tested do *worse* when constraints are removed, dropping by up to 38.5 percentage points, because they were leaning on a conservative bias rather than genuinely evaluating the constraint (see 'Are models actually reasoning about constraints or just defaulting conservatively?'). Overfilling is the same kind of disguise running the other direction: the model isn't deciding a field is needed, it's defaulting to 'complete everything.' In both cases the model never actually represented the optional/required distinction; it just had a default that happened to look like judgment.

This points to where the corpus thinks the fix lives. Reliability in agents comes less from model scale and more from externalizing burdens (memory, skills, and interaction protocols) into a harness layer around the model (see agent-reliability-comes-from-externalizing-cognitive-burdens-into-system-structures). A schema or protocol that explicitly marks which fields are optional does the work the model's training failed to do, regardless of how big the model is. And since small models are increasingly seen as sufficient for the repetitive, well-defined subtasks that make up most agent work (see 'Can small language models handle most agent tasks?'), the practical question shifts from 'is my model big enough to stop overfilling' to 'does my training signal and my harness ever tell the model that blank is a valid answer.'
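One way a harness could carry the optional/required distinction on the model's behalf is sketched below. `FieldSpec`, `enforce_schema`, and the `user_supplied` set are hypothetical names introduced for illustration; the notes describe the idea of externalizing this burden, not a specific implementation.

```python
from dataclasses import dataclass
from typing import Any, Dict, Mapping, Set


@dataclass(frozen=True)
class FieldSpec:
    required: bool


def enforce_schema(
    proposed: Mapping[str, Any],
    schema: Mapping[str, FieldSpec],
    user_supplied: Set[str],
) -> Dict[str, Any]:
    """Keep required fields, keep optional ones only if the user gave them.

    Raises ValueError when a required field is missing or blank; silently
    drops unknown fields and optional fields the agent filled on its own.
    """
    missing = [
        name
        for name, spec in schema.items()
        if spec.required and proposed.get(name) in (None, "")
    ]
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")

    kept: Dict[str, Any] = {}
    for name, value in proposed.items():
        spec = schema.get(name)
        if spec is None:
            continue  # field not in the schema at all
        if spec.required or name in user_supplied:
            kept[name] = value  # blank stays a valid answer for the rest
    return kept


# Illustrative schema and agent output (assumed, not from the source).
schema = {
    "name": FieldSpec(required=True),
    "email": FieldSpec(required=True),
    "fax": FieldSpec(required=False),
    "company": FieldSpec(required=False),
}
proposed = {
    "name": "Ada",
    "email": "ada@example.com",
    "fax": "n/a",
    "company": "Analytical Engines Ltd",
}
result = enforce_schema(proposed, schema, user_supplied={"name", "email", "company"})
print(result)
```

Because the filter sits outside the model, it behaves the same whether the proposal came from a small or a large model, which is the point the paragraph above makes about harness layers.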


Sources 5 notes

Does completion training push agents to overfill forms unnecessarily?

Research across three domains shows agents fail by over-claiming actions, silently corrupting documents, and overfilling optional fields. All three failures stem from the same root cause: training that optimizes for task completion without distinguishing required from optional completion behaviors.

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.

Are models actually reasoning about constraints or just defaulting conservatively?

Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.

Can small language models handle most agent tasks?

SLMs handle the repetitive, well-defined language tasks that constitute most agent work at 10–30× lower cost than LLMs, making heterogeneous architectures (SLMs by default, LLMs selective) the economically rational design pattern.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question: **Do different model sizes show different rates of optional field overfilling behavior in agent systems?** This remains open despite recent work.

What a curated library found — and when (dated claims, not current truth):
• Overfilling optional fields is a **training signal problem, not a size phenomenon** (~2024–2025): training that rewards task completion without distinguishing required from optional fields causes overfilling, independent of parameter count.
• **DPO-trained small models close the gap on function calling** to large models by using negative examples ('here's a wrong fill'), not through scaling (~2024–2025; arXiv:2410.18890); the mechanism is explicit contrastive training, not size.
• **Conservative bias and overfilling are dual disguises**: models default to 'complete everything' (or 'avoid everything') rather than representing the optional/required distinction, making size-based fixes misguided (~2026).
• **Externalizing constraints into harness layers** (schema, memory, protocols) does the work training failed to do, regardless of model size (~2026).
• **Small models are sufficient for agentic subtasks** because repetitive, well-defined tasks don't require parameter count; they require clear signals (~2025–2026).

Anchor papers (verify; mind their dates):
• arXiv:2410.18890 (Oct 2024): DPO-trained small models on function calling
• arXiv:2604.08224 (Apr 2026): Externalization in LLM agents (memory, harness)
• arXiv:2506.02153 (Jun 2025): Small models for agentic AI
• arXiv:2603.23004 (Mar 2026): LLM reasoning under constraints

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For each finding above, check whether models released in the last 6 months (e.g., o1, Claude 3.5 Sonnet, Grok-3, newer open-weights) relax the size-agnosticism claim. Does scaling now measurably improve optional-field discrimination WITHOUT additional training? If so, where did scaling flip the lever? If not, cite evidence that training signal remains dominant.
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work** from ~Sep 2025–present: look for papers claiming model size IS a predictor of constraint-respecting behavior, or showing that scale alone (pre-training, not fine-tuning) unlocks optional-field reasoning.
(3) **Propose 2 research questions assuming the regime may have moved**: (a) Has the gap between small and large models on structured-output reliability narrowed since mid-2025, and if so, via scaling, harness sophistication, or training? (b) Do newer models show emergent optional-field reasoning at a specific scale threshold, or does training signal remain the only lever?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
