INQUIRING LINE

What makes natural-language APIs particularly suited to LLM-based simulation?

This explores why interfaces defined in natural language — a search engine that returns text, a user who converses, a human who explains a decision — are the sweet spot for LLM simulation, while structured or numeric interfaces are not.


This reads the question as asking where LLM-based simulation actually works, and why the answer keeps coming back to natural language. The pattern across the corpus is striking: LLMs make convincing simulators precisely when the thing being simulated speaks in text. A search engine, viewed as an API, takes a query string and returns documents, and LLMs can fabricate those documents from internal knowledge well enough that a 14B-parameter simulator matches or beats a real engine for training purposes, with no API calls required [Can LLMs replace search engines during agent training?]. A conversational user is likewise a natural-language interface, and conditioning a simulator on a session-level profile and a turn-level intent produces synthetic dialogue that crowdworkers and discriminators can't reliably tell from real conversations [Can controlled latent variables make LLM user simulators realistic?]. Even human decision-making, when expressed as choices and rationales, can be modeled by a finetuned LLM more accurately than purpose-built cognitive theories [Can language models learn to model human decision making?].
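To make the "search engine as an API" framing concrete, here is a minimal Python sketch. It is not the ZeroSearch or SSRL implementation. It assumes a hypothetical `generate` callable standing in for any LLM completion function (prompt in, text out), and the names `SearchAPI`, `SimulatedSearch`, `echo_generator`, and the prompt wording are all illustrative.

```python
from typing import Callable, List, Protocol


class SearchAPI(Protocol):
    """The whole interface: a query string in, text documents out."""

    def search(self, query: str, k: int = 3) -> List[str]: ...


class SimulatedSearch:
    """Fabricates documents with a text generator instead of calling a real engine.

    `generate` is a hypothetical stand-in for an LLM completion function;
    it is an assumption of this sketch, not a real library call.
    """

    def __init__(self, generate: Callable[[str], str]) -> None:
        self.generate = generate

    def search(self, query: str, k: int = 3) -> List[str]:
        docs = []
        for rank in range(1, k + 1):
            prompt = (
                f"Write search result #{rank} for the query: {query}\n"
                "Return a short, self-contained passage."
            )
            docs.append(self.generate(prompt).strip())
        return docs


def echo_generator(prompt: str) -> str:
    """Toy generator so the sketch runs without a model."""
    first_line = prompt.splitlines()[0]
    return f"[simulated passage] {first_line}"


# Any object with a matching `search` method satisfies the interface.
engine: SearchAPI = SimulatedSearch(echo_generator)
results = engine.search("capital of France", k=2)
```

Because the caller only ever sees a list of strings, swapping `echo_generator` for a real model changes nothing about the interface, which is the sense in which the simulator can stand in for the engine.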

The deeper reason shows up when you look at where simulation breaks. The same models that ghost-write search results plateau at 55–60% on genuine constraint satisfaction regardless of size [Do larger language models solve constrained optimization better?], can't actually run iterative numerical procedures (they pattern-match a memorized template and emit plausible-but-wrong numbers) [Do large language models actually perform iterative optimization?], and fail on relational queries that need real joins across structured tables even when the whole table fits in context [Can long-context LLMs replace retrieval-augmented generation systems?]. So the dividing line isn't difficulty; it's the interface. Natural-language APIs are forgiving in exactly the way LLMs need: the output is judged by plausibility, and for free-form text plausibility and correctness nearly coincide. Structured APIs demand executed computation, where a plausible-looking answer is just wrong.
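The gap between a plausible answer and an executed one can be shown with a small Python sketch. The tables, values, and function names (`total_by_customer`, `is_correct`) are invented for illustration and do not come from the LOFT benchmark or any cited paper.

```python
from typing import Dict, List, Tuple

# Toy tables; every value here is invented for illustration.
customers: Dict[int, str] = {1: "Ada", 2: "Grace", 3: "Edsger"}
orders: List[Tuple[int, int, float]] = [  # (order_id, customer_id, amount)
    (100, 1, 25.0),
    (101, 3, 40.0),
    (102, 1, 15.0),
]


def total_by_customer() -> Dict[str, float]:
    """Execute the join: match each order to its customer, then sum amounts."""
    totals: Dict[str, float] = {}
    for _order_id, customer_id, amount in orders:
        name = customers[customer_id]
        totals[name] = totals.get(name, 0.0) + amount
    return totals


def is_correct(answer: Dict[str, float]) -> bool:
    """A structured interface accepts only the executed result."""
    return answer == total_by_customer()


executed = total_by_customer()
# A guess with the right shape and believable numbers, but not the join's result.
plausible_guess = {"Ada": 45.0, "Edsger": 40.0}
```

Here `is_correct(executed)` holds while `is_correct(plausible_guess)` does not: the guess would pass a plausibility check by a reader, yet the structured interface has no partial credit for looking right.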

There's a more fundamental version of this point in the corpus. Treating an LLM as an autoregressive probability machine predicts that tasks succeed when the target response is high-probability and fail when it isn't [Can we predict where language models will fail?]. Natural-language interfaces traffic in exactly the kind of high-probability, distributionally typical text the model was trained on, which is why simulating 'what a user would plausibly say next' lands while simulating 'the exact optimum of this constraint set' doesn't.
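The autoregressive view can be sketched numerically. Under that view the log-probability of a response is the sum of its per-token log-probabilities, so a response containing a few unlikely tokens is far less probable overall. The per-token probabilities below are invented for illustration, not measured from any model, and `sequence_log_prob` is a hypothetical helper name.

```python
import math
from typing import Sequence


def sequence_log_prob(token_probs: Sequence[float]) -> float:
    """Sum of per-token log-probabilities: log p(y) = sum_t log p(y_t | y_<t)."""
    return sum(math.log(p) for p in token_probs)


# Invented per-token probabilities, for illustration only.
typical_reply = [0.6, 0.5, 0.7, 0.6]    # distributionally typical text
exact_answer = [0.6, 0.05, 0.02, 0.1]   # a precise value the model rarely emits

typical_lp = sequence_log_prob(typical_reply)
exact_lp = sequence_log_prob(exact_answer)
```

With these made-up numbers `typical_lp` is larger than `exact_lp`, mirroring the claim that a plausible next utterance is an easy target while one exact value is a hard one.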

The most interesting framing, though, is that natural-language APIs let the simulator and the simulated share a medium. From inside a discourse, humans and LLMs draw on the same symbolic substrate, language itself, which makes the gap between them structural rather than absolute [Do humans and LLMs differ fundamentally or just superficially?]. A natural-language API is essentially a slot where that shared substrate is the whole interface, so the model isn't translating into a foreign representation; it's operating in its native one. One caveat keeps this honest: when you push LLMs toward simulating *actions* in the world rather than text, the surrounding harness (memory, tools, grounding) decides whether the action is real or hallucinated, not the model alone [Can you turn an LLM into an agent by just fine-tuning?]. Natural language is where LLM simulation is cheap and convincing; the moment the API stops being language and starts being execution, the magic stops with it.


Sources (9 notes)

Can LLMs replace search engines during agent training?

ZeroSearch and SSRL demonstrate that LLMs can generate relevant documents and search results from internal knowledge, with 14B simulators matching or exceeding real search engines. Curriculum degradation and test-time scaling optimize this approach for training without API costs.

Can controlled latent variables make LLM user simulators realistic?

RecLLM demonstrates that conditioning an LLM simulator on session-level (user profile) and turn-level (user intent) latent variables produces synthetic conversations whose realism can be measured via crowdsourced discrimination, discriminator models, and classifier-ensemble distribution matching.

Can language models learn to model human decision making?

LLMs finetuned on psychology experiment data predict human behavior more accurately than theory-driven models in decision tasks, capture individual differences in their embeddings, and transfer learning across tasks without task-specific design.

Do larger language models solve constrained optimization better?

Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.

Do large language models actually perform iterative optimization?

Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.

Can long-context LLMs replace retrieval-augmented generation systems?

The LOFT benchmark shows LCLMs match RAG on semantic retrieval without explicit training, but cannot execute relational queries requiring joins across structured tables. Context length alone cannot bridge this gap.

Can we predict where language models will fail?

By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.

Do humans and LLMs differ fundamentally or just superficially?

Applying Habermas's observer/participant distinction to AI shows that from outside, humans and LLMs are utterly different, while from within shared discourse both draw on the same symbolic substrate, making the difference structural rather than absolute.

Can you turn an LLM into an agent by just fine-tuning?

Converting LLMs to action-capable systems requires four distinct stages: curating action-environment-user datasets, training for action grounding, integrating agent infrastructure with memory and tools, and rigorous safety evaluation. The surrounding system and harness determine whether actions are grounded or hallucinated.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about where LLM-based simulation actually works. The precise question: do natural-language APIs remain uniquely suited to LLM simulation, or have advances in model capability, structured reasoning, tool grounding, and evaluation since mid-2025 flattened that distinction?

What a curated library found — and when (dated claims, not current truth): The findings span 2023–2026 and centre on a structural match between LLM capability and interface type:
• 14B-parameter LLM simulators match or exceed real search engines at synthetic document generation; LLMs plateau at 55–60% on constraint satisfaction regardless of scale (~2024–2026).
• Natural-language user simulators fool crowdworkers; LLMs cannot reliably execute iterative numerical procedures or handle relational joins even in long-context windows (~2024–2026).
• The dividing line is interface: plausibility (LLM strength) vs. executed computation (LLM weakness); LLMs and humans share language as native substrate, but harness + tooling (memory, grounding) separate simulated text from simulated action (~2024–2025).
• Self-improvement and strategic reasoning under constraints remain open challenges; recent work explores incentivized search and behavioral game theory framing (~2025–2026).

Anchor papers (verify; mind their dates):
• arXiv:2406.13121 (2024-06) — Long-context limitations on structured queries
• arXiv:2603.23004 (2026-03) — Constraint reasoning ceiling
• arXiv:2502.20432 (2025-02) — Strategic reasoning via game theory
• arXiv:2605.30621 (2026-05) — Harness vs. model capability disentanglement

Your task:
(1) RE-TEST EACH CONSTRAINT. For every claim above (55–60% plateau, numerical failure, join failure, harness dependence), determine whether newer models (o1-class reasoning, o3 scaling laws), structured reasoning layers (proof search, tableau methods), tool orchestration (multi-agent retrieval, cached memory), or evaluation harnesses have since dissolved or inverted these limits. Separate the durable question (does language-native-ness still confer simulation advantage?) from the perishable limitation (can LLMs now execute constraints or joins?). Cite what flattened the gap, plainly state where constraints still hold.

(2) SURFACE THE STRONGEST CONTRADICTING OR SUPERSEDING WORK from the last ~6 months. Look for papers that claim LLMs *can* simulate structured APIs, or where harness design + orchestration make the interface type irrelevant, or where a 2024 plateau claim is re-tested and overturned.

(3) PROPOSE 2 RESEARCH QUESTIONS that assume the regime may have shifted: (a) If constraint satisfaction is no longer a ceiling, what *is* the new binding constraint on LLM simulation fidelity? (b) Has the distinction between 'natural-language API' and 'structured API simulation' become a harness problem rather than a model problem?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
