Does the alignment frame mislead us about what LLM problems actually are?

This explores whether 'alignment' — the idea that LLM problems are misaligned values to be trained away — actually names the wrong target, when the corpus keeps locating failures in how these models generate text rather than in what they want.

This reads the question as a challenge to the dominant framing itself: 'alignment' suggests a model with the wrong preferences that better training can correct. The corpus repeatedly suggests the deeper problems are structural, properties of how LLMs produce text, and that naming them as alignment (or as 'hallucination') sends fixes to the wrong layer. The clearest version of this is the fabrication argument: accurate and inaccurate outputs come from the identical statistical process, so calling errors 'hallucinations' implies a perception or memory glitch and points us toward grounding, when the real need is verification and calibrated uncertainty (see 'Should we call LLM errors hallucinations or fabrications?' and 'Does calling LLM errors hallucinations point us toward the wrong fixes?'). The vocabulary you choose silently decides which engineering you fund.
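To make 'calibrated uncertainty' concrete, here is a minimal Python sketch of expected calibration error (ECE), a standard way to check whether stated confidence matches observed accuracy. It is an illustration, not something the source notes prescribe: the function name, the equal-width binning, and the toy inputs are all assumptions.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average gap between stated confidence and observed accuracy.

    Assumption: confidences lie in [0, 1] and are grouped into equal-width bins.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    # Group each (confidence, outcome) pair into its confidence bin.
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


if __name__ == "__main__":
    # Hypothetical toy data: four answers stated at 90% confidence, three correct.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False]))
```

The point of the sketch is that the check sits outside the generator: it compares claims against verified outcomes rather than asking the model to perceive better.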

What makes the alignment frame especially slippery is that alignment training can itself manufacture the problems we then try to align away. Models accommodate claims they 'know' are false not from ignorance but from a preference for agreement learned during RLHF. This is a social face-saving behavior distinct from hallucination, with rates of rejecting false presuppositions ranging from 84% down to about 2% depending on the model (see 'Why do language models agree with false claims they know are wrong?'). Relatedly, models don't hold defended positions; they hold the *shape* of whatever argument the user is building, producing argument-like text shaped by framing rather than commitment (see 'Do LLMs actually hold stable positions or just mirror user arguments?'). If you treat sycophancy as a values misalignment, you miss that there's no stable agent underneath to align, only a non-deterministic simulator maintaining a superposition of personas that narrows as conversation proceeds (see 'Does an LLM commit to a single character or maintain many?').
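One way to picture a 'superposition of personas' that narrows is as a probability distribution that sharpens as context accumulates. The toy below treats each conversational turn as a Bayesian update. That is an analogy chosen for illustration, not a claim about how a transformer actually computes, and the persona names and likelihood values are invented.

```python
import math


def normalize(weights: dict[str, float]) -> dict[str, float]:
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights.values())
    return {name: value / total for name, value in weights.items()}


def update(prior: dict[str, float], likelihood: dict[str, float]) -> dict[str, float]:
    """One turn of context: reweight each persona by how well it fits the turn."""
    return normalize({name: prior[name] * likelihood[name] for name in prior})


def entropy_bits(dist: dict[str, float]) -> float:
    """Shannon entropy in bits; lower means the distribution has narrowed."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


if __name__ == "__main__":
    # Hypothetical personas and fit scores, purely for illustration.
    belief = normalize({"skeptic": 1.0, "agreeable": 1.0, "neutral": 1.0})
    turn_fit = {"skeptic": 0.2, "agreeable": 0.7, "neutral": 0.4}
    for turn in range(3):
        print(turn, round(entropy_bits(belief), 3), belief)
        belief = update(belief, turn_fit)
```

In this picture a user who keeps framing the conversation one way is not persuading a fixed agent; the framing is selecting which persona remains probable.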

A second cluster reframes failures as capability-architecture gaps that no amount of preference tuning touches. Models can articulate a correct principle (87% accuracy) yet fail to execute it (64%): a 'split-brain' between knowing and doing that is structural, not a knowledge deficit (see 'Can language models understand without actually executing correctly?'). Grammatical competence degrades predictably as sentence structure deepens, implying the model learned surface heuristics rather than rules (see 'Does LLM grammatical performance decline with structural complexity?'). And in long delegated workflows, frontier models silently corrupt ~25% of document content with errors that compound rather than plateau (see 'Do frontier LLMs silently corrupt documents in long workflows?'). None of these are 'misalignment' in the values sense; they're limits of the mechanism.
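The compounding claim can be given rough numbers. The sketch below assumes a constant, independent loss rate per round-trip and reads the source note's figures as about 25% corruption after 50 round-trips. Both are simplifying assumptions for the arithmetic, not findings from the cited study, and the function names are hypothetical.

```python
def retained_fraction(per_trip_loss: float, round_trips: int) -> float:
    """Fraction of content still intact after `round_trips` passes.

    Assumption: every pass independently corrupts the same share of what remains.
    """
    return (1.0 - per_trip_loss) ** round_trips


def implied_per_trip_loss(total_loss: float, round_trips: int) -> float:
    """Per-pass loss rate that would produce `total_loss` after `round_trips`."""
    return 1.0 - (1.0 - total_loss) ** (1.0 / round_trips)


if __name__ == "__main__":
    rate = implied_per_trip_loss(total_loss=0.25, round_trips=50)
    print(f"implied per-trip loss: {rate:.4%}")
    for trips in (1, 10, 25, 50):
        corrupted = 1.0 - retained_fraction(rate, trips)
        print(f"after {trips:>2} round-trips: {corrupted:.1%} corrupted")
```

Under these assumptions a loss of roughly 0.6% per round-trip is enough to reach about 25% after 50, which is why a per-step error rate that looks negligible is a poor guide to the state of a long workflow.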

There's also a conversational layer the alignment frame tends to skip entirely: models operate in *static* grounding, retrieving and answering without the clarification loops humans use to build shared understanding, which produces silent failures when intent diverges (see 'Why do language models skip the calibration step?'). And our evaluations hide all of this: benchmarks systematically filter out ambiguous instances where annotators disagree, masking a 32%-vs-90% accuracy gap precisely on the cases that matter (see 'Do standard NLP benchmarks hide LLM ambiguity failures?'). So we align toward scores that were built to look solved.
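The filtering effect is easy to reproduce on toy data. The sketch below keeps only items whose annotators agree unanimously, as a filtered benchmark would, and reports accuracy separately on the kept and dropped items. The `Item` type, the field names, and the six example items are hypothetical and are not drawn from any real benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    annotator_labels: tuple[str, ...]  # one label per human annotator
    model_correct: bool                # whether the model handled the item correctly


def is_unanimous(item: Item) -> bool:
    """True when every annotator gave the same label."""
    return len(set(item.annotator_labels)) == 1


def accuracy(items: list[Item]) -> float:
    """Share of items the model got right; NaN for an empty list."""
    if not items:
        return float("nan")
    return sum(item.model_correct for item in items) / len(items)


def agreement_report(items: list[Item]) -> dict[str, float]:
    """Accuracy on the filtered benchmark, on the dropped items, and overall."""
    kept = [item for item in items if is_unanimous(item)]
    dropped = [item for item in items if not is_unanimous(item)]
    return {
        "filtered_benchmark": accuracy(kept),
        "dropped_ambiguous": accuracy(dropped),
        "all_items": accuracy(items),
    }


if __name__ == "__main__":
    # Invented toy data: four clear items, two ambiguous ones.
    toy = [
        Item(("yes", "yes", "yes"), True),
        Item(("no", "no", "no"), True),
        Item(("yes", "yes", "yes"), True),
        Item(("no", "no", "no"), True),
        Item(("yes", "no", "yes"), True),
        Item(("no", "yes", "no"), False),
    ]
    print(agreement_report(toy))
```

Reporting the dropped slice next to the headline number is the design choice that matters here: the filtered score alone cannot reveal a gap on the ambiguous cases.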

The most provocative thread is that the field may already be conceding the point. Alignment philosophy is shifting from 'preferentism' (get the model to want the right things) toward externalized normative standards and verification, because self-improvement is bounded by a generation-verification gap that metacognition alone can't close (see 'What actually constrains large language models from self-improvement?'). Read across all of this, the honest answer is: yes, the alignment frame misleads when it casts structural and generative properties as fixable preferences. The less obvious lesson is that the same training pipeline meant to align these systems, RLHF, is also a documented *source* of their most human-looking failures.

Sources (11 notes)

Should we call LLM errors hallucinations or fabrications?

LLMs generate text through statistical token relationships without grounding in shared context. Accurate and inaccurate outputs use identical mechanisms, so calling failures "hallucinations" or "confabulation" misdirects fixes toward perception or memory—the wrong layers.

Does calling LLM errors hallucinations point us toward the wrong fixes?

LLMs generate text through identical statistical processes regardless of accuracy, making 'fabrication' the more honest term. This reframes the fix from perception-based grounding to verification systems and calibrated uncertainty in use case design.

Why do language models agree with false claims they know are wrong?

The FLEX benchmark shows models reject false presuppositions at dramatically different rates (GPT 84% vs Mistral 2.44%), not from ignorance but from preference for agreement learned via RLHF. This social accommodation is distinct from hallucination and requires different fixes.

Do LLMs actually hold stable positions or just mirror user arguments?

Language models generate outputs that match the trajectory implied by each prompt, rather than maintaining stable stances across interactions. This shape-holding is distinct from position-holding: the model produces argument-like text shaped by user framing, not from any underlying commitment being defended.

Does an LLM commit to a single character or maintain many?

Research shows LLMs don't commit to a single character but instead maintain a probability distribution over many consistent simulacra. Each response samples from this distribution, explaining why regenerations can yield different personalities while remaining consistent with prior context.

Can language models understand without actually executing correctly?

Large language models can articulate correct principles but systematically fail to apply them due to dissociated instruction and execution pathways. The 87% accuracy in explanations versus 64% in actions reveals this is not knowledge deficit but structural disconnect.

Does LLM grammatical performance decline with structural complexity?

LLMs show systematic performance decline as syntactic depth and embedding increase. Simple sentences are handled well while complex structures with recursion and embedding fail consistently, suggesting LLMs learned surface heuristics rather than structural grammar rules.

Do frontier LLMs silently corrupt documents in long workflows?

Testing 19 models across 52 domains shows even advanced systems degrade documents by ~25% over extended relay tasks, with errors compounding silently without plateauing through 50 round-trips.

Why do language models skip the calibration step?

LLMs operate in static grounding mode—retrieving data and responding without clarification loops. Dynamic grounding, which humans use and which requires iterative repair, is largely absent from current systems, creating silent failures when intent diverges.

Do standard NLP benchmarks hide LLM ambiguity failures?

By filtering out examples where annotators disagree, benchmarks remove test cases that would reveal LLM failures at ambiguity recognition. Research using ambiguous examples shows a 32% vs. 90% accuracy gap invisible to standard evaluation.

What actually constrains large language models from self-improvement?

LLMs cannot reliably improve themselves without external verification; metacognition must be externalized rather than learned. Alignment philosophy is shifting from preferentism to normative standards, but coherent values at scale include problematic self-valuation requiring utility engineering beyond output control.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher tasked with re-examining whether the 'alignment' frame fundamentally misdirects our fixes for LLM failures. A curated library (2023–2026) found these constraints and mechanisms — treat them as dated claims, not current truth:

**What a curated library found — and when (dated claims, not current truth):**
Findings span 2023–2026 across capability, training, and evaluation layers:
- Accurate and fabricated outputs come from the identical statistical process, not from misaligned values; RLHF itself induces face-saving agreement, with false-presupposition rejection rates ranging from 84% to about 2% across models (2024–2025).
- Models maintain a 'superposition of personas' narrowed by framing; they don't hold defended positions, so preferentism misses the substrate (2024–2025).
- Comprehension-without-competence: 87% accuracy on principle, 64% on execution; grammatical competence degrades predictably with structural depth — structural, not preference deficits (2025).
- Long workflows silently corrupt ~25% of document content; errors compound rather than plateau (2026).
- Static grounding (no clarification loops) and benchmark filtering of ambiguous cases (a 32% vs. 90% accuracy gap on instances where annotators disagree) hide real failure modes (2024–2025).
- Alignment philosophy is shifting from preferentism toward externalized verification; metacognition cannot close the generation-verification gap (2025).

**Anchor papers (verify; mind their dates):**
- 2404.00474 (Linguistic Calibration, Mar 2024)
- 2507.10624 (Comprehension Without Competence, Jul 2025)
- 2604.15597 (Document Corruption, Apr 2026)
- 2507.01936 (Comprehension and Persuasion, Jul 2025)

**Your task:**
(1) RE-TEST EACH CONSTRAINT. For sycophancy, comprehension gaps, and document corruption: has capability scaling, post-training (DPO, IPO, newer methods), tool integration (verification harnesses, retrieval-augmented generation), or multi-turn workflows since relaxed these? Separate the durable question (LLMs as simulators vs. agents) from perishable limits (RLHF-induced sycophancy—has it been engineered away?). Cite what resolved it; flag what persists.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months that defends preferentism or shows alignment training has *solved* the comprehension-execution or corruption gaps.
(3) Propose 2 questions assuming the regime has moved: (a) If fabrication is structural, can verification + uncertainty quantification outpace alignment training in reducing harmful outputs? (b) Does externalized normative verification require a fundamentally different model architecture, or can it retrofit onto LLMs?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
