INQUIRING LINE

Why do LLMs generate logical forms without preserving semantic content?

This explores why LLMs can produce well-formed logical expressions (valid syntax) while getting the actual meaning wrong — and what that gap reveals about how these models 'reason' at all.


The corpus points to a single underlying cause: these models match the *shape* of formal language without operating on what it denotes. The clearest evidence comes from autoformalization work showing that LLMs reliably generate syntactically valid logic that is semantically incorrect, with errors clustering exactly where meaning lives: scope ambiguity, quantifier precision, and predicate granularity (Can large language models translate natural language to logic faithfully?). The form is easy because form is surface pattern; the content is hard because content requires tracking what the symbols are *about*.
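To make the scope-ambiguity failure concrete, here is a minimal sketch in Python. It assumes an illustrative sentence, "Every student read a book", and a tiny invented model; the names `students`, `books`, `read`, `surface_scope`, and `inverse_scope` are hypothetical and do not come from the cited notes. Both readings are equally well-formed, yet they disagree on the same facts, which is the kind of error a syntax check cannot catch.

```python
# Toy model (hypothetical data, for illustration only).
students = {"ana", "ben"}
books = {"emma", "ulysses"}
# Each student read a different book.
read = {("ana", "emma"), ("ben", "ulysses")}


def surface_scope() -> bool:
    """forall x. Student(x) -> exists y. Book(y) and Read(x, y)

    Every student read some book, possibly a different one each.
    """
    return all(any((s, b) in read for b in books) for s in students)


def inverse_scope() -> bool:
    """exists y. Book(y) and forall x. Student(x) -> Read(x, y)

    There is one particular book that every student read.
    """
    return any(all((s, b) in read for s in students) for b in books)


print(surface_scope())  # True: each student read at least one book
print(inverse_scope())  # False: no single book was read by everyone
```

If the intended reading is the first formula and a translation emits the second, the output still parses as valid first-order logic; only evaluation against a model (or a human check of quantifier order) exposes the mismatch.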

Why the split? Because LLMs reason by semantic association, not symbolic manipulation. When researchers strip the familiar real-world meaning out of a reasoning task and leave only the abstract rules, performance collapses, even when the correct rules sit right there in context (Do large language models reason symbolically or semantically?). The model was leaning on token associations and parametric commonsense the whole time, not on the logical structure it appeared to be using. So when you ask it to emit pure logical form, you remove the very crutch it was reasoning with, and the content drifts.
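For contrast, the sketch below shows what purely symbolic manipulation looks like: a small forward-chaining procedure whose conclusions are unchanged when meaningful predicate names are swapped for abstract tokens. This is an illustration under stated assumptions, not the setup of the cited study; the facts, rules, `symbol_map`, and the helpers `forward_chain` and `rename` are all hypothetical.

```python
def forward_chain(facts, rules):
    """Apply (premises -> conclusion) rules until no new fact is derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if set(premises) <= derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return derived


def rename(symbol_map, facts, rules):
    """Replace every symbol with its abstract counterpart."""
    new_facts = {symbol_map[f] for f in facts}
    new_rules = [
        (tuple(symbol_map[p] for p in premises), symbol_map[conclusion])
        for premises, conclusion in rules
    ]
    return new_facts, new_rules


# Hypothetical task with familiar, meaningful names.
facts = {"is_bird", "has_wings"}
rules = [
    (("is_bird", "has_wings"), "can_fly"),
    (("can_fly",), "can_migrate"),
]
# The same task with the real-world meaning stripped out.
symbol_map = {
    "is_bird": "p1",
    "has_wings": "p2",
    "can_fly": "p3",
    "can_migrate": "p4",
}

natural = forward_chain(facts, rules)
abstract = forward_chain(*rename(symbol_map, facts, rules))

# A symbolic procedure derives the same conclusions under renaming.
print({symbol_map[f] for f in natural} == abstract)
```

A procedure like this depends only on the rules, so renaming cannot hurt it. The collapse described above is what you would expect from a system that was relying on the meaning of the names instead.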

There's a deeper mechanism beneath this. Token generation is a smooth probabilistic flow toward the training distribution, not a turbulent exploration of competing claims (Does LLM generation explore competing claims while producing text?). That flow is sequential but atemporal: there is no pause for reflection or revision in which a model could check whether its formula actually means what the sentence meant (Does AI text generation unfold through temporal reflection?). The same pattern-over-meaning bias shows up elsewhere: semantically identical prompts produce different outputs because the model registers corpus *frequency*, not equivalence of meaning (Why do semantically identical prompts produce different LLM outputs?). Whether the task is paraphrasing or formalizing, the model tracks statistical mass over sense.

The most useful surprise here is that the fix isn't 'more formalization' but *less*. Partial symbolic abstraction beats both pure natural language and full formal logic: enriching language with selective symbolic structure preserves the semantic information that complete formalization throws away (Why does partial formalization outperform full symbolic logic?). Full formalization is precisely the regime where semantic content gets stranded. That is consistent with the finding that structured argument prompting, which forces a model to identify warrants and implicit premises, catches errors that clean-looking logical chains hide (Can structured argument prompts make LLM reasoning more rigorous?). The logical form, paradoxically, is where meaning goes to get lost.

If you want to push on the 'why' further, one strand of the corpus argues the real reasoning never happens in the surface symbols at all. It lives in latent hidden-state trajectories, with the visible chain (or logical form) serving as only a partial, sometimes unfaithful interface (Where does LLM reasoning actually happen during generation?). On that view, asking why the generated logic doesn't preserve meaning is asking why a rendering doesn't match the thing it renders: the form was always a downstream projection, not the computation itself.


Sources (8 notes)

Can large language models translate natural language to logic faithfully?

LLMs generate well-formed logical expressions that are semantically incorrect, with errors clustering at scope ambiguity, quantifier precision, and predicate granularity. The asymmetry suggests LLMs understand formal language better than they can generate it.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Does LLM generation explore competing claims while producing text?

Token prediction trains models to continue toward the training distribution, not to explore logically related counterpositions. This smoothness in process produces smooth claims that multiply without generating new perspectives.

Does AI text generation unfold through temporal reflection?

Token ordering in LLMs follows probabilistic selection without intervening reflection or revision. Human discourse gains meaning from temporal structure—time spent thinking changes what comes next—but AI text production lacks this duration-in-reflection despite appearing sequentially composed.

Why do semantically identical prompts produce different LLM outputs?

Cao et al. and Adam's Law show that semantically identical prompts with different sentence-level frequencies produce systematically different output quality. Higher-frequency phrasings win because models register statistical mass from pre-training, not meaning.

Why does partial formalization outperform full symbolic logic?

QuaSAR and Logic-of-Thought both achieve 4-8% accuracy gains by enriching natural language with selective symbolic elements rather than replacing it. Full formalization loses semantic information; pure language lacks structure. Augmentation preserves both.

Can structured argument prompts make LLM reasoning more rigorous?

Applying Toulmin's argument model as explicit prompting steps (CQoT) improves LLM reasoning by forcing models to identify warrants and backing rather than skipping implicit premises. The method catches failures that standard chain-of-thought prompting allows.

Where does LLM reasoning actually happen during generation?

Evidence from CoT faithfulness tests, feature steering, and layer analysis suggests latent-state dynamics drive reasoning, while surface chain-of-thought serves as a partial interface. Hidden reasoning processes should be the default focus of study.
