Can structural perturbations harm model accuracy more than semantic ones?
This note explores whether changing an input's structure (its syntax, formatting, or surface arrangement) degrades model accuracy more than changing its meaning. The corpus suggests structure is often the sharper failure axis, but with an important twist about *why*.
This reads the question as: do structural disruptions (syntactic depth, surface rephrasing, the arrangement of an input) hurt models more than meaning-level changes? The corpus doesn't run a clean head-to-head experiment, but several notes converge on a striking pattern — models are surprisingly brittle to structure even when meaning is held constant.
The clearest evidence is that LLMs degrade *predictably* as structural complexity climbs. Top-tier models consistently misidentify embedded clauses, verb phrases, and complex nominals, and the error rate rises with syntactic depth, a sign that statistical learning captures surface patterns rather than the grammar underneath (Why do large language models fail at complex linguistic tasks?). Even meaning-preserving structural noise leaves a mark: longer chain-of-thought reasoning *dampens* sensitivity to input perturbations but never eliminates it, because the analysis finds a non-zero robustness floor built into the architecture (Can longer reasoning chains eliminate model sensitivity to input noise?). And surface rephrasing alone (same content, different wording) can swing outputs hard whenever the model isn't confident (Does model confidence predict robustness to prompt changes?). So yes: structure can hurt accuracy even when content is unchanged, although none of these notes measures a matched semantic perturbation for comparison.
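One way to make the rephrasing sensitivity described above concrete is to count how often a model gives the same answer across meaning-preserving variants of a prompt. The sketch below is a minimal illustration, not the method used in any of the cited notes: `rephrase_consistency` and `toy_model` are hypothetical names, and the toy model is deliberately rigged to key on exact wording so the metric has something to show.

```python
from collections import Counter
from typing import Callable, Sequence


def rephrase_consistency(model: Callable[[str], str], variants: Sequence[str]) -> float:
    """Return the fraction of variants whose output equals the most common output.

    1.0 means the model answered identically for every rephrasing;
    lower values mean the wording alone changed the answer.
    """
    if not variants:
        raise ValueError("need at least one variant")
    outputs = [model(v) for v in variants]
    modal_count = Counter(outputs).most_common(1)[0][1]
    return modal_count / len(outputs)


def toy_model(prompt: str) -> str:
    # Hypothetical stand-in for a real model: it only "knows" the answer
    # when the prompt contains one exact surface phrase.
    return "Paris" if "capital of France" in prompt else "unknown"


# Four rephrasings that ask the same question (assumed meaning-preserving).
variants = [
    "What is the capital of France?",
    "Name the capital of France.",
    "France's capital city is called what?",
    "Which city serves as the French capital?",
]

score = rephrase_consistency(toy_model, variants)
print(score)
```

With a real model, `toy_model` would be replaced by a call that returns a normalized answer string. Consistency says nothing about correctness on its own, so it is best read alongside accuracy.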
But here's the twist the reader might not expect: the corpus pushes back on "structural complexity" being the real culprit. One note argues that reasoning breakdowns aren't triggered by complexity thresholds at all, but by *instance-level unfamiliarity*: models fit memorized patterns rather than general algorithms, so a structurally gnarly problem succeeds if it resembles training data, while a simple but novel one fails (Do language models fail at reasoning due to complexity or novelty?). That reframes the whole question: structural perturbation may hurt mostly because it pushes inputs into unfamiliar territory, not because structure is intrinsically hard.
There's also a layer underneath the accuracy number itself. A model can post perfect scores while its internal representations are fractured (linearly decodable but badly organized), which is exactly what makes it fragile to perturbation and distribution shift that standard metrics never reveal (Can models be smart without organized internal structure?). So a structural perturbation isn't necessarily *causing* new weakness; it can be *exposing* a brokenness that was always there. Relatedly, models seem to defend themselves by sparsifying activations under out-of-distribution stress: the sparsification acts as a built-in selective filter that stabilizes performance, rather than being a sign of collapse (Do language models sparsify their activations under difficult tasks?).
The deeper lesson cutting across these notes: under meaning-level pressure, models tend to fail loudly, for example by ignoring their context because training priors override it (Why do language models ignore information in their context?) or by compounding their own earlier mistakes (Do models fail worse when their own errors fill the context?). Under structural pressure they fail *quietly and predictably*, with small, systematic accuracy erosion that scales with depth and unfamiliarity. If you're stress-testing a model, the corpus suggests the surface form of your input is a more revealing lever than you'd guess, precisely because it surfaces weaknesses that semantic tests leave hidden.
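The stress-testing idea above can be sketched as a small harness that reports the accuracy drop caused by a perturbation. Everything here is an illustrative assumption and does not come from the cited notes: the arithmetic task, `toy_model`, and both perturbation functions are hypothetical. The toy model is deliberately brittle to surface form, so the numbers demonstrate the harness, not an empirical finding.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (input text, expected answer)
Model = Callable[[str], str]
Perturbation = Callable[[Example], Example]


def accuracy(model: Model, examples: List[Example]) -> float:
    """Fraction of examples the model answers exactly right."""
    correct = sum(model(text) == label for text, label in examples)
    return correct / len(examples)


def accuracy_drop(model: Model, examples: List[Example], perturb: Perturbation) -> float:
    """Baseline accuracy minus accuracy on the perturbed examples."""
    perturbed = [perturb(ex) for ex in examples]
    return accuracy(model, examples) - accuracy(model, perturbed)


def add_parentheses(ex: Example) -> Example:
    # Structural perturbation: redundant nesting, same meaning, same label.
    text, label = ex
    a, op, b = text.split()
    return f"(({a}) {op} ({b}))", label


def increment_right_operand(ex: Example) -> Example:
    # Semantic perturbation: the meaning changes, so the label changes too.
    text, label = ex
    a, op, b = text.split()
    return f"{a} {op} {int(b) + 1}", str(int(label) + 1)


def toy_model(text: str) -> str:
    # Hypothetical model that only handles the flat pattern "<int> + <int>".
    parts = text.split()
    if len(parts) == 3 and parts[1] == "+" and parts[0].isdigit() and parts[2].isdigit():
        return str(int(parts[0]) + int(parts[2]))
    return "?"


examples = [("2 + 3", "5"), ("10 + 7", "17"), ("0 + 0", "0")]

structural_drop = accuracy_drop(toy_model, examples, add_parentheses)
semantic_drop = accuracy_drop(toy_model, examples, increment_right_operand)
print(structural_drop, semantic_drop)
```

The design choice worth noting is that a perturbation maps a whole `(input, label)` pair: structural perturbations leave the label untouched, while semantic ones must update it, which keeps the two accuracy drops comparable.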
Sources (8 notes)
Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.
Lipschitz continuity analysis proves that while additional reasoning steps reduce perturbation propagation, a non-zero robustness floor exists structurally. Sensitivity decreases with stronger embedding and hidden state norms but never reaches zero.
ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.
Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.
Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.
As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
Error accumulation in context causes non-linear performance degradation in long-horizon tasks. Model scaling does not fix this; only test-time compute through thinking models reduces the effect by preventing error-contaminated context from biasing reasoning.