INQUIRING LINE

Does bounding textual edits prevent skill degradation better than free rewriting?

This explores whether putting guardrails on how much an AI agent can rewrite its own instructions or notes — limited, validated edits rather than unrestricted self-revision — actually protects it from getting worse over time.


The corpus gives a fairly direct answer: yes, and the mechanism matters more than the intuition. SkillOpt's ablations show that bounded editing outperforms uncontrolled self-revision. The bounds are capped 'learning-rate budgets' on how much text can change, held-out validation gates that test edits before keeping them, and, crucially, a buffer of *rejected* edits the agent remembers. Free rewriting drifts toward overfitting and incoherence; the constraints prevent that drift without killing the agent's ability to adapt (see: Does constraining edits help agents improve their own skills?).
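The three controls named above can be sketched as a small gatekeeper around a skill text. This is a minimal illustration under stated assumptions, not SkillOpt's implementation: the class name `BoundedEditor`, the `max_change` parameter, the use of `difflib` similarity as the change measure, and the `score` callback standing in for a held-out validation set are all hypothetical.

```python
import difflib
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class BoundedEditor:
    """Hypothetical sketch of bounded self-editing; all names are illustrative."""
    score: Callable[[str], float]   # assumed held-out validation score, higher is better
    max_change: float = 0.2         # assumed "learning-rate budget": max fraction of text changed
    rejected: List[str] = field(default_factory=list)  # buffer of rejected edits

    def change_fraction(self, old: str, new: str) -> float:
        # 1 - similarity ratio: 0.0 means identical, 1.0 means nothing in common.
        return 1.0 - difflib.SequenceMatcher(None, old, new).ratio()

    def propose(self, current: str, candidate: str) -> str:
        """Return the text to keep after considering one candidate edit."""
        if candidate in self.rejected:
            return current  # already tried and rejected; do not re-evaluate
        if self.change_fraction(current, candidate) > self.max_change:
            self.rejected.append(candidate)  # over budget: too much text changed
            return current
        if self.score(candidate) <= self.score(current):
            self.rejected.append(candidate)  # failed the validation gate
            return current
        return candidate  # within budget and strictly better on held-out score
```

In this sketch the rejected buffer only blocks exact repeats. A fuller version would presumably feed the rejected edits back to whatever proposes the next candidate, which is closer to the "remembers" role described above.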

Why would unrestricted rewriting decay in the first place? Two other notes explain the failure mode that bounding is fighting against. When models are handed documents across long delegated workflows, frontier systems silently corrupt about 25% of the content, and the errors compound through dozens of round-trips without ever plateauing (see: Do frontier LLMs silently corrupt documents in long workflows?). Free rewriting is exactly this loop pointed at the agent's own skills: each unvalidated pass is another round-trip where small distortions accumulate. Bounding works because it interrupts the compounding: the validation gate forces each edit to earn its place, so corruption can't silently snowball.
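To get a feel for how small per-pass distortions add up, here is a toy calculation. It assumes, purely for illustration, that each round-trip independently preserves a fixed fraction of the content and that the roughly 25% figure is the total loss after 50 round-trips; neither assumption comes from the cited study, and the function names are hypothetical.

```python
def surviving_fraction(per_trip_fidelity: float, trips: int) -> float:
    """Fraction of content left intact after `trips` passes (toy geometric model)."""
    return per_trip_fidelity ** trips


def implied_per_trip_fidelity(total_surviving: float, trips: int) -> float:
    """Per-pass fidelity that would produce `total_surviving` after `trips` passes."""
    return total_surviving ** (1.0 / trips)


# Assumed reading: 75% of the content survives 50 round-trips.
fidelity = implied_per_trip_fidelity(0.75, 50)
loss_per_trip = 1.0 - fidelity  # a little under 0.6% per pass in this toy model
```

Under these assumptions, a loss of well under 1% per pass is enough to reach a quarter of the document over 50 passes, which is why an error rate that looks negligible on any single rewrite still matters in a loop.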

The deeper reason a *held-out* gate is doing the heavy lifting connects to a more fundamental limit. Self-improvement in language models is formally capped by the generation-verification gap: a model cannot reliably fix itself using only its own judgment, because every trustworthy correction needs something external to validate it (see: What stops large language models from improving themselves?; What actually constrains large language models from self-improvement?). Read that way, 'bounded edits' and 'free rewriting' aren't just two settings on a dial. Bounded editing smuggles in an external check (the validation set, the rejected-edit memory); free rewriting is the agent grading its own homework. The bound isn't merely conservative. It's the thing that supplies the external verification the model provably can't generate from metacognition alone.

There's a useful cross-domain echo here. Defending RAG systems from poisoned documents uses the same move under different vocabulary: partition-aware retrieval *bounds* how much any single suspect document can influence the output, rather than trusting the system to self-filter (see: Can we defend RAG systems from corpus poisoning without retraining?). And the value of keeping explicit negative examples, the rejected-edit buffer, rhymes with why DPO beats plain fine-tuning for small models: learning from what *not* to do, not just from good examples, directly targets the failure cases (see: Can small models match large models on function calling?). Across these notes the pattern is consistent: bounded influence plus retained failures beats unconstrained self-trust.

The thing you might not have expected to learn: the win isn't really about editing 'less.' It's that the bound is where the external verification lives. Strip the gates and the rejected-edit memory, and you haven't just loosened the agent — you've removed the only thing standing between it and the generation-verification ceiling that says pure self-revision can't reliably improve at all.


Sources (6 notes)

Does constraining edits help agents improve their own skills?

SkillOpt's ablations show that textual learning-rate budgets, held-out validation gates, and retained failed edits outperform uncontrolled self-revision. Control mechanisms prevent drift toward overfitting and incoherence without sacrificing adaptability.

Do frontier LLMs silently corrupt documents in long workflows?

Testing 19 models across 52 domains shows even advanced systems degrade documents by ~25% over extended relay tasks, with errors compounding silently without plateauing through 50 round-trips.

What stops large language models from improving themselves?

Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.

What actually constrains large language models from self-improvement?

LLMs cannot reliably improve themselves without external verification; metacognition must be externalized rather than learned. Alignment philosophy is shifting from preferentism to normative standards, but coherent values at scale include problematic self-valuation requiring utility engineering beyond output control.

Can we defend RAG systems from corpus poisoning without retraining?

RAGPart and RAGMask provide lightweight, retraining-free defenses that operate at the retrieval layer. RAGPart bounds poisoned-document influence via partitioned retriever learning; RAGMask flags suspicious documents through abnormal similarity collapse under token masking.

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.
