INQUIRING LINE

Does architectural discovery follow an empirical scaling law like neural networks?

This explores whether the *process of discovering new neural network architectures* — not just training existing ones — obeys a predictable scaling law, where more compute reliably buys more breakthroughs.


The question is whether architectural discovery itself scales the way model performance does: whether pointing more GPUs at the search for new designs yields proportionally more good designs, as more parameters and data yield lower loss. The corpus's most direct answer is yes. ASI-ARCH ran 1,773 autonomous experiments and surfaced 106 state-of-the-art architectures, finding that the rate of architectural breakthroughs scaled predictably with compute (see: Can computational power accelerate scientific discovery itself?). The striking claim isn't just that automated search works. It's that *discovery* becomes a computation-bound process rather than a human-creativity-bound one, which is a different kind of scaling law from the classic Kaplan/Chinchilla curves about loss versus parameters.
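One way to make "scales predictably with compute" concrete is to fit a power law, discoveries ≈ a · compute^b, and check whether the exponent b sits near 1 (proportional returns). The sketch below is a minimal illustration in plain Python. The GPU-hour and discovery counts are synthetic placeholders, not ASI-ARCH's measurements, and `fit_power_law` is a hypothetical helper name. Only the 106-of-1,773 hit rate uses figures from the note.

```python
import math

def fit_power_law(compute, discoveries):
    """Least-squares fit of log(discoveries) = log(a) + b * log(compute).

    Returns (a, b). An exponent b near 1 means discoveries grow roughly
    in proportion to compute; b well below 1 means diminishing returns.
    """
    xs = [math.log(c) for c in compute]
    ys = [math.log(d) for d in discoveries]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b

# SYNTHETIC, illustrative numbers only (assumption, not from any paper).
gpu_hours = [1_000, 2_000, 4_000, 8_000, 16_000]
sota_found = [5, 11, 19, 42, 80]

a, b = fit_power_law(gpu_hours, sota_found)
print(f"fitted exponent b = {b:.3f}")

# Figures quoted in the note: 106 state-of-the-art designs from 1,773 runs.
hit_rate = 106 / 1773
print(f"hit rate = {hit_rate:.1%}")
```

A fit like this only says the curve is smooth over the range measured. It says nothing about whether the trend continues beyond it, or whether the designs being counted are robust.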

What makes this more than a one-paper result is that two other systems land in the same territory from different angles. Genesys used multi-agent LLMs with genetic programming to generate 1,062 novel architectures, with top designs outperforming GPT-2 and Mamba-2 on 6 of 9 benchmarks. Crucially, it found that *how* you represent the search space matters enormously: a structured genetic-programming representation lifted design success from 14% to nearly 100% versus letting an LLM freely generate code (see: Can AI systems discover better neural architectures than humans?). Meanwhile AUTORESEARCHCLAW posted a 411% F1 improvement on LoCoMo by reading code and reasoning about system-level interactions, things AutoML categorically cannot do (see: Can autonomous research pipelines discover AI architectures that AutoML cannot?). So the scaling isn't raw brute force; it's compute *plus* a smarter search representation. That qualifier matters, because it's the difference between 'throw more GPUs at it' and 'throw more GPUs at a well-structured search.'

Here's the twist: even as discovery scales, the corpus is full of evidence that *within* architectures, the simple 'bigger is better' scaling story is fraying. MobileLLM shows depth beats width at sub-billion scale, directly contradicting the classic Kaplan prescription (see: Does depth matter more than width for tiny language models?). Recommender research finds that inductive bias and constraint design, such as removing hidden layers and enforcing self-similarity constraints, beat added depth and capacity (see: What architectural choices actually improve recommender system performance?). And a parallel survey argues the scaling frontier has *moved*: returns from restructuring memory now exceed returns from adding parameters (see: Has memory architecture replaced parameter count as the scaling frontier?). Read together, this is the deeper story: as the payoff from naive parameter scaling flattens, the scaling action relocates *up a level*, to the search for clever architectures, which is exactly the thing ASI-ARCH found scales with compute.

There's also a family resemblance worth noticing. The same 'test-time scaling' logic that governs reasoning shows up in search budgets: search steps follow nearly identical scaling curves to reasoning tokens (see: How does search scale like reasoning in agent systems?). So 'discovery scales with compute' isn't an isolated curiosity; it's part of a broader late-2025 pattern in which more and more processes (reasoning, retrieval, and now architecture search) turn out to have a compute axis you can dial.

A note of caution the corpus supplies on its own: scaling laws describe averages, not understanding. Models can hit identical benchmark scores while harboring fundamentally broken internal organization that standard metrics never see (see: Can models be smart without organized internal structure?), and transformers can ace in-distribution compositional tasks by memorizing subgraphs rather than learning rules (see: Do transformers actually learn systematic compositional reasoning?). If autonomous discovery optimizes against benchmarks that mask these failures, a smooth scaling curve could be buying you architectures that are predictably good at the test and quietly fragile everywhere else. The discovery law may hold; the question is what, exactly, it's discovering.
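The worry above can be sketched as a selection problem: a search that ranks candidates by in-distribution score can pick a different winner than one that also checks a held-out or shifted split. The example below is a minimal sketch under that assumption. The candidate names, scores, the `Candidate` type, the `fragile` helper, and the 0.10 gap threshold are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    in_dist: float   # score on the benchmark the search optimized against
    held_out: float  # score on a shifted or novel-composition split

def fragile(cands, max_gap=0.10):
    """Return names of candidates whose score drops by more than max_gap
    when moving from the optimized benchmark to the held-out split."""
    return [c.name for c in cands if c.in_dist - c.held_out > max_gap]

# SYNTHETIC scores (assumption): arch_b wins the benchmark but collapses
# off-distribution, which the benchmark alone would never reveal.
candidates = [
    Candidate("arch_a", in_dist=0.91, held_out=0.88),
    Candidate("arch_b", in_dist=0.93, held_out=0.61),
    Candidate("arch_c", in_dist=0.90, held_out=0.86),
]

best_by_benchmark = max(candidates, key=lambda c: c.in_dist).name
best_by_held_out = max(candidates, key=lambda c: c.held_out).name
flagged = fragile(candidates)

print("benchmark winner:", best_by_benchmark)
print("held-out winner:", best_by_held_out)
print("flagged as fragile:", flagged)
```

Counting only benchmark winners would credit `arch_b` as a discovery. Whether a discovery scaling curve survives a filter like this is the open question the paragraph raises.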


Sources (9 notes)

Can computational power accelerate scientific discovery itself?

ASI-ARCH discovered 106 state-of-the-art architectures through 1,773 autonomous experiments, revealing that architectural breakthroughs scale predictably with GPU compute. This transforms research from human-limited to computation-scalable.

Can AI systems discover better neural architectures than humans?

Genesys, a multi-agent LLM system using genetic programming and a Ladder of Scales verification process, discovered 1,062 novel architectures, with top designs outperforming GPT-2 and Mamba-2 on 6 of 9 benchmarks. Structured GP representation proved critical, improving design success from 14% to nearly 100% versus direct LLM generation.

Can autonomous research pipelines discover AI architectures that AutoML cannot?

AUTORESEARCHCLAW achieved 411% F1 improvement on LoCoMo through bug fixes, architectural changes, and prompt engineering—each individually exceeding all hyperparameter tuning combined. This demonstrates a categorical capability gap: autoresearch can read code and reason about system-level interactions; AutoML cannot.

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

What architectural choices actually improve recommender system performance?

Research shows that architectural choices like removing hidden layers, enforcing constraints on self-similarity, and using appropriate likelihood functions deliver better results than deeper or more complex models. This suggests that problem-specific design decisions matter more than raw representational capacity.

Has memory architecture replaced parameter count as the scaling frontier?

Three converging signals in late-2025 research—taxonomy maturation, memory-aware test-time scaling loops, and hybrid sparsity laws—show that returns from restructuring memory now exceed returns from adding parameters. The design bottleneck has shifted from compute to memory structure.

How does search scale like reasoning in agent systems?

Test-time scaling laws generalize from reasoning to retrieval: search steps follow identical scaling curves to reasoning tokens, making deep research a test-time scaling problem. This insight reframes search as a compute axis comparable to chain-of-thought reasoning.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Do transformers actually learn systematic compositional reasoning?

Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-evaluating whether architectural discovery follows empirical scaling laws. This remains an open question—treat the findings below as dated claims to be re-tested, not current truth.

What a curated library found — and when (findings span 2023–2026, but perishable):
• ASI-ARCH ran 1,773 autonomous experiments and surfaced 106 state-of-the-art architectures; breakthrough rate scaled predictably with compute, suggesting discovery is computation-bound not creativity-bound (~2025).
• Genesys (multi-agent LLM + genetic programming) generated 1,062 novel architectures competitive with GPT-2/Mamba-2; search representation (structured vs. freeform) moved success from 14% to ~100% (~2025).
• MobileLLM shows depth beats width at sub-billion scale, contradicting Kaplan scaling; recommender work shows inductive bias and constraint design beating added depth and capacity; memory architecture now outpaces parameter scaling as the frontier (~2024–2025).
• Autonomous discovery optimizing benchmarks may yield architectures fragile off-distribution: benchmark scores can mask internal failures (broken internal organization, memorized subgraphs in place of compositional rules) that standard metrics never expose (~2023–2025).
• Test-time reasoning and search-budget scaling follow near-identical compute curves, suggesting the discovery law is part of a broader late-2025 pattern of dialing compute across reasoning, retrieval, and search (~2025–2026).

Anchor papers (verify; mind their dates):
• 2301.10884 (2023): Break It Down — structural compositionality in neural networks.
• 2402.14905 (2024): MobileLLM — depth vs. width at sub-billion scale.
• 2507.18074 (2025): AlphaGo Moment for Model Architecture Discovery.
• 2603.23420 (2026): Bilevel Autoresearch — meta-autoresearching itself.

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models, search tooling (e.g., evolutionary APIs, LLM-as-verifier harnesses), orchestration (multi-agent memory/caching), or evaluation (out-of-distribution robustness suites) have since relaxed or overturned the 14%→100% representation gap, the depth-vs-width inversion, or the benchmark-masking fragility. Separate the durable question (does discovery scale with compute?) from perishable limitations (does it scale *usefully*?); cite what resolved it.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months—especially anything showing discovery scaling plateaued, or that search representation matured past the fragility warning.
(3) Propose 2 research questions that ASSUME the regime may have moved: e.g., *Does discovery scaling now predict out-of-distribution robustness, or only in-distribution benchmark gain?* *Has meta-autoresearch (Bilevel 2026) dissolved the need for human-chosen search spaces?*

Cite arXiv IDs; flag anything you cannot ground in a real paper.
