Can LLMs successfully translate natural language into formal solver specifications?
This explores whether LLMs can take a plain-language problem and faithfully convert it into the structured input a logic engine or optimization solver actually needs — and what breaks when they try.
This explores whether LLMs can take a plain-language problem and faithfully convert it into the structured input a logic engine or optimization solver actually needs. The corpus gives a sharp, two-sided answer: LLMs are good at producing the *form* of a formal spec but unreliable at preserving the *meaning* — and that gap is exactly where systems succeed or fail.
The core finding is an asymmetry. LLMs generate syntactically valid logical expressions that are semantically wrong, with errors clustering at scope ambiguity, quantifier precision, and predicate granularity Can large language models translate natural language to logic faithfully?. Interestingly, they seem to *read* formal language better than they *write* it — recognition outruns generation. That single asymmetry reframes the whole question: the bottleneck isn't whether the model understands logic, it's whether it can commit a messy sentence to a precise formal commitment without quietly dropping a constraint.
The corpus's most constructive response is to stop asking the LLM to do the whole job alone. Logic-LM splits the labor — the LLM drafts a symbolic representation, a deterministic solver runs the actual inference, and the solver's machine-verifiable error messages feed back to catch translation mistakes Can symbolic solvers fix how LLMs reason about logic?. That feedback loop beats LLM self-critique precisely because the failure (bad translation) gets caught by something external, which echoes the broader result that models can't validate their own fixes without an outside checker What stops large language models from improving themselves?. A related design lesson: don't formalize everything. Partial symbolic augmentation — enriching natural language with selective formal structure rather than replacing it wholesale — outperforms both pure language and full formalization, because full formalization throws away semantic information the solver never needed in rigid form Why does partial formalization outperform full symbolic logic?.
There's also a harder ceiling lurking behind the translation question. Even when the spec is correct, LLMs plateau around 55–60% constraint satisfaction on genuine optimization problems, regardless of scale or reasoning training Do larger language models solve constrained optimization better? — and a deeper reason is that they pattern-match memorized solution templates instead of actually executing iterative numerical methods Do large language models actually perform iterative optimization?. This is the strongest argument *for* the solver-integration approach: if the model can't reliably solve, offload solving to the solver and make the LLM responsible only for an accurate translation it can then verify.
The counterweight worth knowing: LLMs genuinely can do structured linguistic and symbolic analysis when forced to reason step by step — building syntactic trees and phonological generalizations, not just performing language Can language models actually analyze language structure?. So the deficit isn't a flat inability to handle formal structure; it's faithfulness under translation. The takeaway you might not have expected: the path to reliable autoformalization isn't a bigger model, it's a tighter loop — let the LLM draft, let the solver judge, formalize only what must be formal, and treat every spec as something to verify rather than trust.
Sources 7 notes
LLMs generate well-formed logical expressions that are semantically incorrect, with errors clustering at scope ambiguity, quantifier precision, and predicate granularity. The asymmetry suggests LLMs understand formal language better than they can generate it.
Logic-LM divides cognitive labor by having LLMs formulate symbolic representations while deterministic solvers execute inference and provide machine-verifiable error messages. This structured feedback loop catches translation errors better than LLM self-critique, improving faithful reasoning without requiring perfect formalization.
Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.
QuaSAR and Logic-of-Thought both achieve 4-8% accuracy gains by enriching natural language with selective symbolic elements rather than replacing it. Full formalization loses semantic information; pure language lacks structure. Augmentation preserves both.
Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.
Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.
OpenAI's o1 model successfully constructs syntactic trees and phonological generalizations through explicit step-by-step reasoning, revealing that LLM linguistic capability extends far beyond behavioral language tasks to genuine language analysis.