INQUIRING LINE

How does the distance between natural language and formal notation affect translation accuracy?

This explores whether the bigger the gap between everyday language and rigid formal notation (logic, math, symbolic code), the worse LLMs get at translating between them — and what the corpus says about why.


The corpus suggests the distance matters a lot, but not in the obvious way. When LLMs convert natural language into formal logic, they reliably produce notation that is *syntactically* valid but *semantically* wrong, with errors clustering exactly where natural language is fuzzy and formal notation demands precision: scope ambiguity, quantifier precision, and predicate granularity (see "Can large language models translate natural language to logic faithfully?"). In other words, the failures sit at the seams where the two languages don't line up: the model can mimic the shape of logic without carrying the meaning across.
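The scope-ambiguity failure can be made concrete with a toy example. The sketch below takes the sentence "Every student read a book" and evaluates its two standard readings over a small hand-built model. The sentence, the model, and all names (`students`, `books`, `read`, `surface_scope`, `inverse_scope`) are illustrative assumptions, not material from the cited notes.

```python
# Toy finite model; every name and fact here is an illustrative assumption.
students = {"ana", "ben"}
books = {"dune", "emma"}
# (s, b) is in `read` when student s read book b.
read = {("ana", "dune"), ("ben", "emma")}


def surface_scope(students, books, read):
    """forall s. exists b. Read(s, b): each student read some book."""
    return all(any((s, b) in read for b in books) for s in students)


def inverse_scope(students, books, read):
    """exists b. forall s. Read(s, b): one single book was read by every student."""
    return any(all((s, b) in read for s in students) for b in books)


print(surface_scope(students, books, read))  # True
print(inverse_scope(students, books, read))  # False
```

Both formulas are well-formed, yet they disagree on this model, so a translation that silently picks the wrong reading is syntactically valid and semantically unfaithful at the same time.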

The deeper reason shows up in work on grammatical competence: LLMs handle simple structures well but degrade predictably as syntactic depth, recursion, and embedding increase, which suggests they have learned surface heuristics rather than real structural rules (see "Does LLM grammatical performance decline with structural complexity?" and "Why do large language models fail at complex linguistic tasks?"). Formal notation is *all* structure, so the very thing that makes notation precise is the thing models are weakest at. A related clue: models systematically prefer high-frequency surface phrasings over semantically equivalent rare ones, tracking statistical mass from pretraining rather than meaning (see "Do language models really understand meaning or just surface frequency?"). Formal translation rewards fidelity to meaning and punishes pattern-matching, which is the reverse of how these models seem to work.

The most surprising finding is that closing the distance *completely* makes things worse, not better. Full formalization strips out the semantic richness of the original language, while pure natural language lacks the structure needed to reason cleanly, so the sweet spot is in the middle. Selectively enriching natural language with just a few symbolic elements beats both pure prose and full symbolic logic, with QuaSAR and Logic-of-Thought both reporting accuracy gains of 4-8% (see "Why does partial formalization outperform full symbolic logic?"). The distance between the two languages isn't something to eliminate; it's something to bridge partially.
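To give a rough sense of what "enriching natural language with a few symbolic elements" can look like, here is a minimal sketch. It is not the QuaSAR or Logic-of-Thought procedure; the helper `augment_with_symbols`, the annotation format, and the example premise are all hypothetical and chosen only for illustration.

```python
def augment_with_symbols(sentence: str, annotations: dict[str, str]) -> str:
    """Keep the original sentence intact and append a few symbolic glosses.

    Hypothetical helper: the original wording is preserved (semantic richness),
    and the glosses add a small amount of explicit structure.
    """
    if not annotations:
        return sentence
    glosses = "; ".join(
        f"{phrase} := {symbol}" for phrase, symbol in annotations.items()
    )
    return f"{sentence} [symbols: {glosses}]"


premise = "If it rains, the match is cancelled."
enriched = augment_with_symbols(
    premise,
    {"it rains": "R", "the match is cancelled": "C", "rule": "R -> C"},
)
print(enriched)
```

The design choice worth noticing is that nothing is replaced: the prose stays as written and the symbols ride alongside it, which is the middle position between pure prose and full symbolic logic.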

There's also a geometric counterpoint worth knowing about. Even as models fail at the *task* of translation, their internal activations spontaneously encode syntactic type and direction in a structured, almost symbolic geometry (see "How do language models encode syntactic relations geometrically?"). So the representational machinery for structure is partly there; the gap is in reliably *using* it under pressure. That pressure compounds with two other corpus findings: reasoning accuracy degrades sharply with longer inputs even far below the context limit (see "Does reasoning ability actually degrade with longer inputs?"), and models fail badly at recognizing genuine ambiguity, with GPT-4 correctly disambiguating only 32% of AMBIENT cases where humans reach 90% (see "Can language models recognize when text is deliberately ambiguous?"). Formal translation often requires holding multiple readings of a sentence at once and picking the right one, which is the move models struggle most to make. So the distance between natural language and notation hurts most not because notation is hard, but because crossing it demands resolving the ambiguity that natural language tolerates and notation forbids.


Sources (8 notes)

Can large language models translate natural language to logic faithfully?

LLMs generate well-formed logical expressions that are semantically incorrect, with errors clustering at scope ambiguity, quantifier precision, and predicate granularity. The asymmetry suggests LLMs understand formal language better than they can generate it.

Does LLM grammatical performance decline with structural complexity?

LLMs show systematic performance decline as syntactic depth and embedding increase. Simple sentences are handled well while complex structures with recursion and embedding fail consistently, suggesting LLMs learned surface heuristics rather than structural grammar rules.

Why do large language models fail at complex linguistic tasks?

Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.

Do language models really understand meaning or just surface frequency?

LLMs show consistent preference for higher-frequency surface forms over semantically equivalent rare paraphrases across math, machine translation, commonsense reasoning, and tool calling. This suggests models track statistical mass from pretraining rather than meaning-recognition as their primary mechanism.

Why does partial formalization outperform full symbolic logic?

QuaSAR and Logic-of-Thought both achieve 4-8% accuracy gains by enriching natural language with selective symbolic elements rather than replacing it. Full formalization loses semantic information; pure language lacks structure. Augmentation preserves both.

How do language models encode syntactic relations geometrically?

The Polar Probe shows LLMs represent syntactic type and direction through both distance and angular position between embeddings, nearly doubling accuracy over distance-only methods. This demonstrates neural networks spontaneously learn structured, symbolic-compatible geometry.

Does reasoning ability actually degrade with longer inputs?

FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.

Can language models recognize when text is deliberately ambiguous?

AMBIENT benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity—revealing that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.
