How does structural complexity affect LLM performance differently than inferential complexity?
This explores two different ways a task can get 'harder' for an LLM: structural complexity (how deeply nested and embedded the language or representation is, such as deep syntactic embedding, recursion, and nested clauses) versus inferential complexity (how many reasoning steps must chain together), and whether the model breaks down the same way under each.
The corpus suggests these aren't just two flavors of the same problem: they produce distinct *shapes* of failure. Structural complexity tends to degrade performance smoothly and predictably, while inferential complexity tends to fall off a cliff.
On the structural side, several notes converge on the same pattern: LLMs handle simple sentences well and then decline in a graded, predictable way as syntactic depth and embedding increase (Does LLM grammatical performance decline with structural complexity?; Why do large language models fail at complex linguistic tasks?). The breakdowns are localized and mappable: models do fine with explicit discourse markers but fail on implicit relations, embedded structures, and forward-planning discourse (Where exactly do language models fail at structural language tasks?). The diagnosis is that models learned surface heuristics rather than real grammatical rules, so added complexity exposes the gap incrementally rather than triggering collapse.
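To make "syntactic depth" concrete, here is a minimal sketch of one crude proxy: the maximum bracket nesting in a hand-bracketed sentence. The notes do not say how depth was measured, so the function name `max_nesting_depth`, the bracket convention, and the example sentences are all assumptions made for illustration.

```python
def max_nesting_depth(bracketed: str, open_ch: str = "[", close_ch: str = "]") -> int:
    """Return the deepest level of bracket nesting in a bracketed string.

    Assumption: clause boundaries were marked by hand with square brackets.
    This is an illustrative proxy, not the metric used in the cited notes.
    """
    depth = 0
    deepest = 0
    for ch in bracketed:
        if ch == open_ch:
            depth += 1
            deepest = max(deepest, depth)
        elif ch == close_ch:
            depth -= 1
            if depth < 0:
                raise ValueError("closing bracket without a matching opener")
    if depth != 0:
        raise ValueError("unclosed bracket")
    return deepest


# Hypothetical examples: a flat sentence and a centre-embedded one.
simple = "[The cat slept]"
embedded = "[The cat [that the dog [that the boy owned] chased] slept]"

print(max_nesting_depth(simple))
print(max_nesting_depth(embedded))
```

Under this proxy the second sentence is three levels deep, which is the kind of input on which the notes report graded decline.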
Inferential complexity behaves differently. Here the failure isn't graded; it's exponential. Reasoning models 'wander' without the validity, effectiveness, and necessity that systematic search requires, so success probability drops exponentially with problem depth: medium problems are solvable, deep ones become catastrophically hard (Why do reasoning LLMs fail at deeper problem solving?). And on constrained-optimization tasks, models hit a hard ceiling around 55–60% constraint satisfaction regardless of scale, architecture, or training regime, suggesting a fundamental limit rather than something more compute can fix (Do larger language models solve constrained optimization better?). So structural difficulty erodes performance; inferential difficulty caps or shatters it.
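The exponential drop can be illustrated with a toy model. The sketch below assumes every reasoning step succeeds independently with the same probability, so a chain of `depth` steps succeeds with probability `per_step ** depth`. Both the independence assumption and the per-step value of 0.9 are illustrative choices, not figures from the notes.

```python
def chain_success(per_step: float, depth: int) -> float:
    """Probability that all `depth` steps succeed.

    Assumption: steps are independent and share one success rate.
    """
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be a probability in [0, 1]")
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return per_step ** depth


# 0.9 is an assumed per-step success rate, chosen only for illustration.
for depth in (1, 5, 10, 20, 40):
    print(f"depth={depth:>2}  success={chain_success(0.9, depth):.3f}")
```

Even with a per-step rate that looks comfortable, the printed values shrink geometrically, which is one way to read "medium problems are solvable, deep ones become catastrophically hard".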
The interesting thread underneath both is *why*. Notes on reasoning style argue LLMs lean on semantic associations rather than formal symbolic manipulation: strip the familiar semantics out and reasoning collapses even when the correct rules are sitting right there in context (Do large language models reason symbolically or semantically?). That same disconnect shows up as 'potemkin understanding' and a kind of split-brain syndrome, where a model explains a concept correctly but fails to apply it (87% accuracy on explanations versus 64% on actions) (Can LLMs understand concepts they cannot apply?; Can language models understand without actually executing correctly?). Inferential complexity stresses exactly this execution pathway, which is why it breaks so sharply.
The fixes that work follow from the diagnosis. For inferential complexity, the wins come from imposing external structure the model lacks internally: externalizing reasoning into knowledge-graph triples lets a small model (GPT-4o mini) improve by 29% on GAIA Level 3 tasks (Can structuring reasoning as knowledge graphs help smaller models solve complex tasks?), algorithmic control flow hides irrelevant context and turns one hard chain into debuggable sub-tasks (Can algorithms control LLM reasoning better than LLMs alone?), and partial symbolic augmentation beats both pure language and full formalization (Why does partial formalization outperform full symbolic logic?). In other words, structural complexity reveals what the model never learned, and inferential complexity reveals what it can't execute on its own, and the second is the one you scaffold your way around.
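The scaffolding idea can be sketched in a few lines: an ordinary loop owns the control flow and state, stores intermediate results as (subject, relation, object) triples, and shows each model call only the triples relevant to its step. This is a loose illustration that combines the two ideas, not the actual KGoT or LLM Programs implementation. `run_program`, `fake_llm`, the prompt format, and the relevance rule are all hypothetical, and `fake_llm` returns canned answers in place of a real model call.

```python
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def run_program(steps: List[str], llm: Callable[[str], Triple]) -> List[Triple]:
    """Run each sub-task as its own call, storing results as triples."""
    graph: List[Triple] = []
    for step in steps:
        # Information hiding: pass only triples whose subject or object
        # is literally mentioned in this step's text (an assumed rule).
        relevant = [t for t in graph if t[0] in step or t[2] in step]
        prompt = f"Known facts: {relevant}\nTask: {step}"
        graph.append(llm(prompt))
    return graph


seen_prompts: List[str] = []  # recorded so each call's context can be inspected


def fake_llm(prompt: str) -> Triple:
    """Hypothetical stand-in for a model call; returns canned triples."""
    seen_prompts.append(prompt)
    task = prompt.rsplit("Task: ", 1)[1]
    canned = {
        "Who wrote Dune?": ("Dune", "written_by", "Frank Herbert"),
        "Where was Frank Herbert born?": ("Frank Herbert", "born_in", "Tacoma"),
    }
    return canned[task]


graph = run_program(["Who wrote Dune?", "Where was Frank Herbert born?"], fake_llm)
for triple in graph:
    print(triple)
```

Because every step is a separate call with an inspectable prompt and a stored triple, a wrong intermediate result can be located and corrected without rerunning the whole chain, which is the debuggability the notes describe.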
Sources (11 notes)
LLMs show systematic performance decline as syntactic depth and embedding increase. Simple sentences are handled well while complex structures with recursion and embedding fail consistently, suggesting LLMs learned surface heuristics rather than structural grammar rules.
Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.
Language models excel with explicit discourse markers and simple grammar but fail predictably on implicit relations, embedded structures, and forward-planning discourse. These breakdowns map to failures in discourse intentionality and attention layers, not just linguistic surface structure.
Current reasoning models lack the three properties of systematic exploration: validity, effectiveness, and necessity. This causes success probability to drop exponentially with problem depth, making medium problems solvable but deep problems catastrophically harder.
Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.
When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.
Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.
Large language models can articulate correct principles but systematically fail to apply them due to dissociated instruction and execution pathways. The 87% accuracy in explanations versus 64% in actions reveals this is not knowledge deficit but structural disconnect.
Knowledge Graph of Thoughts (KGoT) achieves 29% improvement on GAIA Level 3 tasks using GPT-4o mini by externalizing reasoning into iteratively constructed KG triples. The approach improves transparency, reduces bias, and enables quality control over reasoning steps.
LLM Programs embed LLMs within explicit algorithms that manage control flow and state, presenting only step-specific context to each LLM call. This information hiding addresses capability and context window limits while treating complex reasoning as modular, debuggable sub-tasks.
QuaSAR and Logic-of-Thought both achieve 4-8% accuracy gains by enriching natural language with selective symbolic elements rather than replacing it. Full formalization loses semantic information; pure language lacks structure. Augmentation preserves both.