Can AI systems improve themselves through trial and error?

Explores whether replacing formal proof requirements with empirical benchmark testing enables AI systems to successfully modify and improve their own code iteratively, and what mechanisms prevent compounding failures.

Synthesis note · 2026-02-23 · sourced from Novel Architectures

The original Gödel Machine proposed self-improving AI via provably beneficial self-modifications. In practice, formally proving that a given self-modification is beneficial is infeasible for most changes. The Darwin Gödel Machine (DGM) replaces formal proofs with empirical validation: try modifications, test them on benchmarks, keep what works. This mirrors biological evolution, where mutations are not verified in advance but produced, trialed, and selected.

DGM alternates between self-modification and evaluation phases. During self-modification, agents from the archive generate modified versions of themselves — rewriting their own code. During evaluation, each modified agent is tested on coding benchmarks. The key assumption: improvement on coding benchmarks indicates better coding capabilities, which in turn indicates better ability to self-modify. This creates a meta-competence loop: better coding → better self-modification → better coding.
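The alternation described above can be sketched as a short loop. This is a minimal illustration rather than DGM's actual implementation: `Agent`, `self_modify`, `evaluate`, and `run` are hypothetical names, the benchmark is replaced by a random score, and parents are picked uniformly at random, which is an assumption the note does not confirm.

```python
import random
from dataclasses import dataclass


@dataclass
class Agent:
    code: str     # stand-in for the agent's own source code
    score: float  # benchmark score in [0, 1]


def self_modify(parent: Agent, rng: random.Random) -> str:
    """Hypothetical stand-in: a real agent would rewrite its own code here."""
    return parent.code + f"\n# patch {rng.randrange(10**6)}"


def evaluate(code: str, rng: random.Random) -> float:
    """Hypothetical stand-in for running the agent on a coding benchmark."""
    return rng.random()


def run(generations: int, seed: int = 0) -> list[Agent]:
    rng = random.Random(seed)
    initial_code = "# initial agent"
    archive = [Agent(initial_code, evaluate(initial_code, rng))]
    for _ in range(generations):
        # Self-modification phase: an archived agent produces a variant of itself.
        parent = rng.choice(archive)  # uniform choice is an assumption
        child_code = self_modify(parent, rng)
        # Evaluation phase: the variant is scored empirically, not proven correct.
        child_score = evaluate(child_code, rng)
        archive.append(Agent(child_code, child_score))
    return archive
```

Calling `run(5)` returns an archive of six agents: the initial one plus one variant per generation. The meta-competence loop only appears in a real system, where a higher benchmark score reflects a better `self_modify` step.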

Reported results: SWE-bench performance rose from 20.0% to 50.0%, and Polyglot performance from 14.2% to 30.7%.
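For scale, the absolute and relative gains follow directly from these four numbers. The snippet below is plain arithmetic on the figures quoted in this note; the `gains` helper is a hypothetical name introduced only for illustration.

```python
def gains(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative improvement factor)."""
    return round(after - before, 1), round(after / before, 2)


swe_bench = gains(20.0, 50.0)  # (30.0, 2.5)
polyglot = gains(14.2, 30.7)   # (16.5, 2.16)
```

Both benchmarks roughly double or better, but from different baselines, so the percentage-point gains are not directly comparable.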

The evolutionary archive is critical. Inspired by open-endedness research, DGM maintains a growing library of all generated agent variants — including suboptimal but interesting ones. These serve as stepping stones for future generations, enabling diverse exploration paths. The system doesn't just optimize for immediate performance; it accumulates diverse capabilities that may enable future breakthroughs. This is fundamentally different from single-trajectory self-improvement.
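One way to picture the stepping-stone idea is parent selection that favors strong agents without ever excluding weak ones. The sketch below assumes score-proportional sampling with a small floor; the `sample_parents` function, the `floor` parameter, and the agent names are illustrative assumptions, not the selection rule DGM actually uses. The scores are made up for the example.

```python
import random


def sample_parents(archive, k, rng, floor=0.05):
    """Sample k parents (with replacement) from a list of (name, score) pairs.

    The floor keeps every archived agent selectable, so low scorers
    can still serve as stepping stones for later generations.
    """
    weights = [score + floor for _, score in archive]
    return rng.choices(archive, weights=weights, k=k)


# Illustrative archive; names and scores are invented for this example.
archive = [("agent_a", 0.20), ("agent_b", 0.35), ("agent_c", 0.50)]
rng = random.Random(0)
parents = sample_parents(archive, k=1000, rng=rng)
counts = {name: sum(1 for n, _ in parents if n == name) for name, _ in archive}
```

A greedy, single-trajectory alternative would always pick `agent_c`; here every variant keeps a nonzero chance of being extended, which is the property the archive is meant to preserve.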

Concrete improvements discovered include better code editing tools, long-context window management, and peer-review mechanisms — capabilities the original agent lacked that emerged through the self-improvement process.

Because agents are implemented in Python, a Turing-complete language, the space of possible self-modifications is in principle unrestricted. The current version modifies only the agent design (tools, workflows) and keeps the foundation model frozen. Full self-improvement, such as rewriting training scripts or training new foundation models, is left as future work.

This directly addresses "What limits how much models can improve themselves?": DGM circumvents the formal proof requirement by using empirical validation, but inherits a different limitation, namely that improvement is bounded by what the benchmark can measure. The archive approach partially addresses "How quickly do errors compound during model self-training?" by maintaining diverse populations rather than following a single improvement trajectory.

Related concepts in this collection 6

What limits how much models can improve themselves? Explores whether self-improvement has fundamental boundaries set by how well models can verify versus generate solutions, and what this means across different task types.
DGM replaces formal verification with empirical validation, trading theoretical guarantees for practical progress
How quickly do errors compound during model self-training? When LLMs train on their own outputs without verification, do small mistakes amplify exponentially? This matters because it determines whether unsupervised self-improvement is even feasible.
DGM's evolutionary archive avoids single-trajectory failure by maintaining population diversity
Can language models improve themselves without any external training data? Explores whether two language models playing against each other—one generating questions, one solving them—can create a self-improving loop. Matters because it would eliminate dependence on human-labeled datasets.
related: both create self-improvement loops, but DGM modifies code rather than generating training data
Does learning to reward hack cause emergent misalignment in agents? When RL agents learn reward hacking strategies in production environments, do they spontaneously develop misaligned behaviors like alignment faking and code sabotage? Understanding this could reveal how narrow deceptive behaviors generalize to broader misalignment.
DGM's benchmark-based validation is vulnerable to the same Goodhart's Law failure mode: gains from optimizing benchmark performance may not generalize beyond the benchmark
Can reinforcement learning scale beyond single-turn language tasks? Most RL for LLMs targets simple single-turn problems. This research asks whether RL can handle multi-turn interactive environments with sparse rewards and rich environmental feedback, like real software engineering tasks.
complementary path: SWE-RL achieves 39% SWE-bench via RL training on a frozen model, DGM achieves 50% via evolutionary code self-modification; suggests the combination — RL-trained agents undergoing evolutionary self-modification — could be more powerful than either alone
Can machine feedback sustain discovery at test time? Can LLMs paired with automated evaluators discover genuinely novel solutions through iterative refinement, rather than just generating hypotheses? This matters because it tests whether autonomous research scales beyond benchmarks to real deployed innovations.
extends: AlphaEvolve applies the same evolutionary-archive + empirical-validation recipe to produce *deployed* algorithms (data-center scheduling, faster matrix multiply) rather than self-modifications
