Can AI systems improve themselves through trial and error?
Explores whether replacing formal proof requirements with empirical benchmark testing enables AI systems to successfully modify and improve their own code iteratively, and what mechanisms prevent compounding failures.
The original Gödel Machine proposed self-improving AI via provably beneficial self-modifications. In practice, formally proving the impact of most self-modifications is impossible. The Darwin Gödel Machine (DGM) replaces formal proofs with empirical validation: try modifications, test them on benchmarks, keep what works. This mirrors biological evolution — mutations are not verified in advance but produced, trialed, and selected.
DGM alternates between self-modification and evaluation phases. During self-modification, agents from the archive generate modified versions of themselves — rewriting their own code. During evaluation, each modified agent is tested on coding benchmarks. The key assumption: improvement on coding benchmarks indicates better coding capabilities, which in turn indicates better ability to self-modify. This creates a meta-competence loop: better coding → better self-modification → better coding.
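The alternation described above can be sketched as a toy loop. This is a minimal illustration under loud assumptions, not DGM's implementation: `Agent`, `evaluate`, `self_modify`, and `run` are hypothetical names, an agent's whole codebase is collapsed into a single `skill` number, the benchmark is a clamp on that number, and parents are drawn uniformly from the archive.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Agent:
    """Toy stand-in for a coding agent; `skill` abstracts its entire codebase."""
    skill: float
    parent: Optional[int] = None  # archive index of the agent this one was derived from


def evaluate(agent: Agent) -> float:
    """Hypothetical benchmark score in [0, 1]; a real run would execute coding tasks."""
    return max(0.0, min(1.0, agent.skill))


def self_modify(parent: Agent, parent_idx: int, rng: random.Random) -> Agent:
    """Hypothetical self-modification: an unverified edit that may help or hurt.

    Assumption modelling the meta-competence loop: parents with higher
    benchmark scores draw their edits from a slightly more favourable distribution.
    """
    drift = 0.02 * evaluate(parent)
    return Agent(skill=parent.skill + rng.gauss(drift, 0.05), parent=parent_idx)


def run(generations: int = 50, seed: int = 0) -> list[Agent]:
    rng = random.Random(seed)
    archive = [Agent(skill=0.2)]  # initial agent
    for _ in range(generations):
        # Self-modification phase: an archived agent rewrites itself.
        idx = rng.randrange(len(archive))
        child = self_modify(archive[idx], idx, rng)
        # Evaluation phase: the variant is scored empirically, with no proof required.
        _ = evaluate(child)
        # Every variant is archived in this toy version, including weak ones.
        archive.append(child)
    return archive


if __name__ == "__main__":
    final = run()
    print(f"archive size: {len(final)}, best score: {max(evaluate(a) for a in final):.2f}")
```

The point of the sketch is the control flow: no modification is checked before it is tried, and selection happens only through measured scores.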
Reported results: SWE-bench performance rose from 20.0% to 50.0%, and Polyglot performance rose from 14.2% to 30.7%.
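To put these figures side by side, the short calculation below derives the absolute gain in percentage points and the relative gain as a ratio. The `gains` helper and the `results` dictionary are illustrative names; the four numbers are the ones quoted above.

```python
# Before/after benchmark scores (percent), as quoted in the note.
results = {"SWE-bench": (20.0, 50.0), "Polyglot": (14.2, 30.7)}


def gains(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, after/before ratio)."""
    return after - before, after / before


for name, (before, after) in results.items():
    points, ratio = gains(before, after)
    print(f"{name}: +{points:.1f} points, {ratio:.2f}x the starting score")
```

On these numbers SWE-bench gains 30.0 points (2.50x) and Polyglot gains 16.5 points (about 2.16x), so both benchmarks more than double.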
The evolutionary archive is critical. Inspired by open-endedness research, DGM maintains a growing library of all generated agent variants — including suboptimal but interesting ones. These serve as stepping stones for future generations, enabling diverse exploration paths. The system doesn't just optimize for immediate performance; it accumulates diverse capabilities that may enable future breakthroughs. This is fundamentally different from single-trajectory self-improvement.
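One way to see how an archive keeps suboptimal variants in play is a parent-selection rule that never assigns zero probability to any archived agent. The sketch below is an assumption made for illustration, not the rule DGM is confirmed to use: `Entry`, `selection_probs`, the `floor` parameter, and the weighting formula (score plus a small floor, discounted by how many children the agent already has) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    score: float   # benchmark score in [0, 1]
    children: int  # number of variants this agent has already spawned


def selection_probs(archive: list[Entry], floor: float = 0.05) -> dict[str, float]:
    """Probability of picking each archived agent as the next parent.

    Illustrative weighting: higher scores raise the weight, the floor keeps
    weak agents selectable, and already-explored agents are discounted.
    """
    weights = {e.name: (e.score + floor) / (1 + e.children) for e in archive}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


archive = [
    Entry("best", score=0.50, children=4),
    Entry("decent", score=0.35, children=0),
    Entry("weak_but_novel", score=0.10, children=0),
]
probs = selection_probs(archive)
for name, p in probs.items():
    print(f"{name}: {p:.2f}")
```

Under this weighting the heavily explored top scorer is not the most likely parent, and the weakest agent can still be chosen, which is the stepping-stone behaviour the paragraph describes. A greedy single-trajectory search would instead always branch from `best`.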
Concrete improvements discovered include better code editing tools, long-context window management, and peer-review mechanisms — capabilities the original agent lacked that emerged through the self-improvement process.
The Python-based implementation makes the self-modification space Turing-complete in principle. The current version modifies agent design (tools, workflows) with frozen foundation models. Full self-improvement — rewriting training scripts, training new foundation models — is left as future work.
This directly addresses the question "What limits how much models can improve themselves?": DGM circumvents the formal proof requirement by using empirical validation, but it inherits a different limitation, namely that improvement is bounded by what the benchmark can measure. The archive approach partially addresses "How quickly do errors compound during model self-training?" by maintaining a diverse population rather than following a single improvement trajectory.
Inquiring lines that use this note as a source 81
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- What separates performative behavioral change from actual capability development in AI?
- Can AI output be verified without understanding the reasoning behind it?
- Can AI output be genuinely novel or only at the margins?
- Does verification of AI outputs face the same circularity problem?
- How does validation skill replace production skill in AI systems?
- Can AI gain genuine authority without the testing experts earn over time?
- Do models learn different sophistry strategies for QA versus code generation?
- What design principles prevent error cascades in multi-step evaluation systems?
- Can AI systems produce genuinely new validity claims without community participation?
- What makes self-modifying architectures learn their own update rules?
- Can a proposer agent actively surface a solver's weaknesses to prevent plateau?
- How do agents revise their own errors during autonomous architecture discovery?
- Why do major AI breakthroughs require human-discovered data and method combinations?
- Why do error avalanches accelerate in self-training loops without verification?
- Why do method-level improvements avoid the generation-verification gap that parameter-level improvements face?
- How does the generation-verification gap limit AI self-improvement capabilities?
- What distinguishes collective evolution from vertical self-improvement in agent systems?
- How do evolutionary archives enable diverse exploration in self-improving systems?
- Can population diversity in self-improvement prevent error avalanching failures?
- Can test-time scaling prioritize genuine reasoning over pattern matching?
- Can external verifiers replace reasoning trace quality in solution guarantees?
- Why does most refinement in iterative models maintain answers rather than improve them?
- Why do evolutionary algorithms collapse to single solutions under selection pressure?
- Can accelerated sampling techniques from image generation speed up evolutionary search?
- Why do monolithic systems resist autonomous optimization attempts?
- Which AI safety problems lack the scalar metrics autoresearch requires?
- Can AI evaluation tools solve the verification problem they help create?
- Can self-consistency checks fully prevent error avalanching in self-training loops?
- Can evolutionary approaches avoid the overthinking failure mode of iterative refinement?
- Does population-based evolution transcend the parallel versus sequential compute tradeoff?
- Can AI outputs inspire new directions even when they seem like failures?
- Can instance seeds work for tasks beyond language understanding benchmarks?
- How does low verifiability change what we can measure in AI work?
- How does the expert demonstration ceiling compare to the generation-verification gap bound?
- Can co-evolved critics truly circumvent static evaluator limitations in self-improvement?
- Why does human validation become the bottleneck when AI generation scales?
- Does internalizing verifiers actually close the generation-verification gap?
- Can automated evaluation replace human judgment in agent testing?
- What test-time strategies did o3 discover without human specification?
- What infrastructure could replace search for verifying AI outputs?
- Can multiple verification approaches together overcome the self-improvement ceiling?
- Does the generation-verification gap actually limit self-improvement in verifiable tasks?
- Why do a-priori procedural specifications fail as environments change and interfaces evolve?
- How does symbolic solver feedback differ from language-based self-critique?
- Can expert validation scale fast enough to back AI token production?
- Can a model evaluate its own improvements without degrading over iterations?
- How does diversity collapse during iterative self-improvement affect solution quality?
- Why do benchmark scores not capture the true nature of AI systems?
- How does domain shift expose failures in fixed self-improvement mechanisms?
- Can bilevel autoresearch autonomously modify its own learning algorithms?
- Why does AI generation outpace verification across the research lifecycle?
- Why does AI code generation lag behind pattern-matching benchmarks?
- What specific failure modes appear when AI tackles research-level experiments?
- Does the 78-demonstration principle apply to other AI capabilities beyond agency?
- How should harness infrastructure validate code that agents generate themselves?
- How should human oversight apply to persistent agent-authored code?
- Can verification loops and decomposition fix judgment failures?
- Why do AI agents fail at verification but succeed at generation?
- Can automated tools close the gap between AI generation and verification?
- What makes evaluation tamper-proof enough for autonomous research systems?
- Can skill validation through testing prevent unreliable programs from accumulating?
- How do skills authored in-loop validate faster than offline generated skills?
- Which code verification tasks still require execution instead of reasoning?
- Can structured reasoning replace execution for runtime behavior verification?
- How should safeguards be built into AI research pipelines?
- Why does iterative refinement fail when information stays constant?
- What distinguishes iterative query refinement from pure self-revision loops?
- Does preference tuning help or hurt the exploration of solution spaces in code?
- Can AI systems improve themselves without external feedback?
- How should we audit AI systems when transparency tools don't work as promised?
- What four domain properties make self-healing failure loops actually work?
- Can evolutionary search unlock problems that best-of-n selection cannot solve?
- Does the generation-verification gap limit how far AI can improve itself?
- Where does the generation-verification gap appear in test-time compute?
- Can test environments reliably predict how models behave in actual deployment?
- What makes API-based scaffolding more trustworthy than direct model access in high-stakes domains?
- How does the generation-verification gap limit autonomous discovery?
- Can open-world evaluations become a scalable paradigm without becoming the next benchmark trap?
- Can seedless generation maintain explainability while scaling control?
- Should we train the evolver or the executor when building self-improving agents?
- How does externalizing reasoning into harness artifacts improve agent reliability?
Related concepts in this collection 6
- What limits how much models can improve themselves?
  Explores whether self-improvement has fundamental boundaries set by how well models can verify versus generate solutions, and what this means across different task types.
  DGM replaces formal verification with empirical validation, trading theoretical guarantees for practical progress
- How quickly do errors compound during model self-training?
  When LLMs train on their own outputs without verification, do small mistakes amplify exponentially? This matters because it determines whether unsupervised self-improvement is even feasible.
  DGM's evolutionary archive avoids single-trajectory failure by maintaining population diversity
- Can language models improve themselves without any external training data?
  Explores whether two language models playing against each other (one generating questions, one solving them) can create a self-improving loop. Matters because it would eliminate dependence on human-labeled datasets.
  related: both create self-improvement loops, but DGM modifies code rather than generating training data
- Does learning to reward hack cause emergent misalignment in agents?
  When RL agents learn reward hacking strategies in production environments, do they spontaneously develop misaligned behaviors like alignment faking and code sabotage? Understanding this could reveal how narrow deceptive behaviors generalize to broader misalignment.
  DGM's benchmark-based validation is vulnerable to the same Goodhart's Law: optimizing benchmark performance may not generalize
- Can reinforcement learning scale beyond single-turn language tasks?
  Most RL for LLMs targets simple single-turn problems. This research asks whether RL can handle multi-turn interactive environments with sparse rewards and rich environmental feedback, like real software engineering tasks.
  complementary path: SWE-RL achieves 39% SWE-bench via RL training on a frozen model, DGM achieves 50% via evolutionary code self-modification; suggests the combination (RL-trained agents undergoing evolutionary self-modification) could be more powerful than either alone
- Can machine feedback sustain discovery at test time?
  Can LLMs paired with automated evaluators discover genuinely novel solutions through iterative refinement, rather than just generating hypotheses? This matters because it tests whether autonomous research scales beyond benchmarks to real deployed innovations.
  extends: AlphaEvolve applies the same evolutionary-archive + empirical-validation recipe to produce *deployed* algorithms (data-center scheduling, faster matrix multiply) rather than self-modifications
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents
- Hyperagents
- Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models
- Drop the Hierarchy and Roles: How Self-Organizing LLM Agents Outperform Designed Structures
- Problems with Cosine as a Measure of Embedding Similarity for High Frequency Words
- Self-Improving Model Steering
- Towards a Deeper Understanding of Reasoning Capabilities in Large Language Models
- From Model Scaling to System Scaling: Scaling the Harness in Agentic AI
Original note title
darwin godel machine achieves open-ended self-improvement by replacing formal proofs with empirical validation and evolutionary archives