INQUIRING LINE

Can LLMs translate between natural language and formal logic faithfully?

This explores whether LLMs can faithfully convert natural language into formal logic (and back) — not just produce well-formed symbols, but preserve the actual meaning — and what the corpus suggests fixes the gap.


This explores whether LLMs can faithfully translate between natural language and formal logic — meaning preserved, not just syntax. The corpus's blunt answer: they produce logic that looks right but means the wrong thing. Can large language models translate natural language to logic faithfully? finds models reliably generate syntactically valid expressions that are semantically incorrect, with errors clustering exactly where translation is hardest — scope ambiguity, quantifier precision, and predicate granularity. Intriguingly, the failure is asymmetric: models read formal language better than they write it.

Why the gap? A deeper note argues LLMs aren't doing symbolic manipulation at all. Do large language models reason symbolically or semantically? shows that when you strip the familiar semantic content out of a reasoning task and leave only the formal rules, performance collapses — models lean on commonsense token associations, not logical structure. The same contamination shows up mechanistically: How do language models perform syllogistic reasoning internally? traces a genuine content-independent reasoning circuit inside the model, but finds attention heads encoding world knowledge that bias conclusions toward what sounds plausible over what's logically valid — and this gets worse at larger scales, not better. Translation requires holding meaning still; these models keep letting meaning leak in.

A related blind spot makes faithful translation even harder: ambiguity. Can language models recognize when text is deliberately ambiguous? shows GPT-4 correctly disambiguates only 32% of cases (humans hit 90%) because it can't hold multiple interpretations at once — and faithful formalization is largely the act of choosing the right reading among several. Why do large language models fail at complex linguistic tasks? adds that errors grow predictably with syntactic depth, so the harder the sentence structure, the less trustworthy the translation.

The more interesting thread is what actually works — and it isn't 'translate harder.' Three notes converge on division of labor over full conversion. Can symbolic solvers fix how LLMs reason about logic? has the LLM formulate symbolic representations while a deterministic solver runs the inference and feeds back machine-verifiable error messages — catching translation mistakes far better than the model critiquing itself. Why does partial formalization outperform full symbolic logic? goes further: partial formalization beats both pure language and full formalization, because full conversion throws away semantic information the symbols can't carry. And Can structured argument prompts make LLM reasoning more rigorous? shows structured prompts that force the model to surface hidden premises catch failures plain chain-of-thought skips.

The surprise worth leaving with: the corpus suggests LLMs can genuinely analyze language structure — Can language models actually analyze language structure? shows step-by-step reasoning building real syntactic trees. So the bottleneck isn't ignorance of logic. It's a deeper split, sharpest in Can LLMs understand concepts they cannot apply?: models can explain a concept correctly, fail to apply it, and recognize the failure — explanation and execution run on disconnected tracks. Faithful translation needs both to fire together, which is exactly why offloading the execution to a symbolic solver outperforms asking the model to be faithful on its own.


Sources 10 notes

Can large language models translate natural language to logic faithfully?

LLMs generate well-formed logical expressions that are semantically incorrect, with errors clustering at scope ambiguity, quantifier precision, and predicate granularity. The asymmetry suggests LLMs understand formal language better than they can generate it.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

How do language models perform syllogistic reasoning internally?

LLMs implement a content-independent three-stage reasoning mechanism—recitation, middle-term suppression, mediation—that works across architectures. However, additional attention heads encoding world knowledge systematically bias conclusions toward semantically plausible rather than logically valid answers, with contamination increasing at larger scales.

Can language models recognize when text is deliberately ambiguous?

AMBIENT benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity—revealing that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.

Why do large language models fail at complex linguistic tasks?

Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.

Can symbolic solvers fix how LLMs reason about logic?

Logic-LM divides cognitive labor by having LLMs formulate symbolic representations while deterministic solvers execute inference and provide machine-verifiable error messages. This structured feedback loop catches translation errors better than LLM self-critique, improving faithful reasoning without requiring perfect formalization.

Why does partial formalization outperform full symbolic logic?

QuaSAR and Logic-of-Thought both achieve 4-8% accuracy gains by enriching natural language with selective symbolic elements rather than replacing it. Full formalization loses semantic information; pure language lacks structure. Augmentation preserves both.

Can structured argument prompts make LLM reasoning more rigorous?

Applying Toulmin's argument model as explicit prompting steps (CQoT) improves LLM reasoning by forcing models to identify warrants and backing rather than skipping implicit premises. The method catches failures that standard chain-of-thought prompting allows.

Can language models actually analyze language structure?

OpenAI's o1 model successfully constructs syntactic trees and phonological generalizations through explicit step-by-step reasoning, revealing that LLM linguistic capability extends far beyond behavioral language tasks to genuine language analysis.

Can LLMs understand concepts they cannot apply?

Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.

Research prompt for your LLMexpand ↓

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst tracking LLM reasoning faithfulness. QUESTION (still open): Can LLMs translate between natural language and formal logic while preserving meaning—or does the gap between explanation and execution doom faithful formalization?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat as perishable constraints:
  • LLMs generate syntactically valid but semantically incorrect logic; errors cluster at scope ambiguity, quantifier precision, predicate granularity (2023–24).
  • Models read formal language better than they write it; they reason via commonsense token associations, not logical structure—performance collapses when semantics are stripped (2023).
  • Attention heads encode world knowledge that biases conclusions toward plausibility over logical validity; this worsens at scale (2024).
  • GPT-4 disambiguates only 32% of cases vs. 90% human accuracy; faithful formalization requires holding multiple interpretations simultaneously (2023).
  • Hybrid approaches (LLM formulation + symbolic solver verification) and partial formalization outperform pure conversion; models can explain correctly but fail execution (disconnected tracks) (2024–2025).

Anchor papers (verify; mind their dates):
  • arXiv:2305.14825 (2023): In-context semantic reasoning vs. symbolic reasoning.
  • arXiv:2408.08590 (2024): Syllogistic reasoning circuits—mechanistic analysis.
  • arXiv:2305.12295 (2023): Logic-LM—symbolic solver integration.
  • arXiv:2502.12616 (2025): Quasi-symbolic abstractions.

Your task:
  (1) RE-TEST EACH CONSTRAINT. For every finding above, determine whether post-2025 advances in model scale, reasoning-time compute (test-time scaling, chain-of-thought variants, process reward models), tool integration (formal verification SDKs), or structured prompting have RELAXED or OVERTURNED these bottlenecks. Separate the durable question (likely: can execution and explanation be unified without external solvers?) from the perishable limitation (likely: raw symbolic reasoning without scaffolding). Cite what resolved it.
  (2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months (late 2025–early 2026) that claims faithful translation IS now feasible, or that the division-of-labor approach was a local maximum.
  (3) Propose 2 research questions that ASSUME the regime may have shifted: (a) given test-time scaling and process rewards, can LLMs learn to verify their own translations without external solvers? (b) does end-to-end training on formalization + execution tasks (rather than separate stages) yield faithful translation where separation failed?

Cite arXiv IDs; flag anything you cannot ground in a real paper.

Next inquiring lines