INQUIRING LINE

Can persona-based approaches capture genuine disagreement in expert annotations?

This explores whether prompting an LLM to role-play different expert annotators can reproduce the real, meaningful disagreement that human experts have — or whether it just generates noise that looks like disagreement.


The corpus is mostly skeptical that persona-prompted LLMs can stand in for panels of human experts and reproduce their genuine disagreement, and the reason is worth understanding. The starting point is that expert disagreement is often a real signal, not a mistake to be averaged away. Work on interpretation modeling shows that when readers disagree about a socially loaded sentence, the spread of answers reflects valid differences in social position, not sloppy labeling (see "Why do readers interpret the same sentence so differently?"). Annotation responses themselves decompose into distinct kinds (genuine preferences, non-attitudes, and preferences constructed on the spot) that only reveal themselves through consistency across repeated measurement (see "Do all annotation responses measure the same underlying thing?"). So 'capturing disagreement' really means capturing *stable, structured* difference, not random scatter.
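As a rough illustration of what "consistency across repeated measurement" could look like in practice, the sketch below scores how often one annotator gives the same answer to the same item when asked several times. The function name, the annotator and item keys, and the labels are all hypothetical, and a single consistency score is only a first cut: it flags unstable responses but does not by itself separate non-attitudes from constructed preferences.

```python
from collections import Counter

def retest_consistency(responses):
    """Share of repeated responses that match the most common answer."""
    counts = Counter(responses)
    return counts.most_common(1)[0][1] / len(responses)

# Hypothetical repeated labels for the same (annotator, item) pair.
repeated = {
    ("annotator_a", "item_1"): ["offensive", "offensive", "offensive", "offensive"],
    ("annotator_b", "item_1"): ["offensive", "benign", "benign", "offensive"],
}

# High scores suggest a stable judgment; low scores suggest the response
# may not reflect a settled preference and deserves a closer look.
scores = {key: retest_consistency(r) for key, r in repeated.items()}
```

A stable annotator scores near 1.0, while an annotator who flips between answers scores closer to chance for the label set.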

That distinction is exactly where persona prompting breaks. When the same persona prompt is run repeatedly, the variation between runs matches or exceeds the variation between different personas, meaning the model's own uncertainty, not any stable social viewpoint, is driving the output (see "Why do LLM persona prompts produce inconsistent outputs across runs?"). A persona panel can manufacture something that statistically resembles disagreement, but it's the wrong kind: run-to-run noise wearing the costume of perspective. Since the whole value of expert disagreement is that it's reproducible and traceable to a position, this is close to a disqualifying failure for the simulate-the-annotators use case.
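The comparison described above can be sketched as a simple variance decomposition: how far apart the personas' average answers sit, versus how much a single persona varies from run to run. This is a minimal sketch, not the method used in the cited work; the persona names and 1-5 ratings are invented for illustration, and it assumes numeric labels on a common scale.

```python
from statistics import mean, pvariance

def variance_decomposition(scores_by_persona):
    """Return (between-persona variance, mean within-persona variance).

    scores_by_persona maps a persona name to the numeric labels it produced
    across repeated runs of the same prompt on the same item.
    """
    persona_means = [mean(runs) for runs in scores_by_persona.values()]
    # Between-persona: spread of the personas' average answers.
    between = pvariance(persona_means)
    # Within-persona: average run-to-run variance for a fixed persona.
    within = mean(pvariance(runs) for runs in scores_by_persona.values())
    return between, within

# Hypothetical ratings on a 1-5 scale, five runs per persona (illustrative only).
ratings = {
    "clinician": [2, 4, 3, 5, 1],
    "ethicist": [3, 2, 4, 3, 3],
    "regulator": [4, 3, 2, 4, 3],
}
between, within = variance_decomposition(ratings)

# If run-to-run variance is at least as large as between-persona variance,
# the apparent "disagreement" is not attributable to the personas.
noise_dominates = within >= between
```

In this toy data the personas' averages are nearly identical while individual runs swing widely, which is the pattern the paragraph describes.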

There's a second, deeper problem specific to *expert* annotation: LLMs strip out the social scaffolding that makes an expert's judgment weigh more than a layperson's. Models process only text, so they lose the reputation, track record, and standing that give an expert claim its force, and they can't reliably tell an expert argument from a commonly held assumption (see "Can language models distinguish expert arguments from common assumptions?"). Worse, when authority *is* signaled, LLMs over-trust it: judge models fall for fake credentials and impressive formatting in zero-shot attacks (see "Can LLM judges be fooled by fake credentials and formatting?"). So a persona labeled 'senior domain expert' doesn't carry expert judgment; it carries the surface markers of expertise, which the model treats as a cue to defer rather than a capacity to reason.

The corpus does show personas working in adjacent, gentler settings, which is what makes the answer 'mostly no' rather than 'flatly no.' Grounding personas in real stakeholder documents rather than invented roles makes multi-agent evaluation reproducible and transferable across tasks (see "Can personas extracted from documents generalize across evaluation tasks?"), and persona simulations can replicate about three-quarters of published experimental main effects (84 of 111 in the cited study). Tellingly, though, success tracks the strength of the original effect and collapses on the marginal cases (see "Can AI personas reliably replicate human experiment results?"). Disagreement among experts lives precisely in those marginal, contested cases. There's even a view that post-training installs genuine, robust personas rather than mere pretense (see "Are LLM personas realized or merely simulated through training?"), and RL training can cut a simulated persona's drift by more than half (see "Can training user simulators reduce persona drift in dialogue?"). But stability and realism are about being a *consistent single voice*, which is almost the opposite of representing a contested field.

The thing you might not have known you wanted to know: the field has started treating disagreement as something to *measure and preserve* rather than collapse — modeling the full distribution of human interpretations as meaningful data. Persona simulation fails this not because it disagrees too little, but because it disagrees in the wrong currency: noise instead of position, performed authority instead of earned expertise. If you want to capture genuine expert disagreement, the more promising path the corpus points to is structuring and decomposing real human annotations, not synthesizing artificial annotators to argue in their place.
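One way to make "modeling the full distribution" concrete is to compare the label distribution from a pool of human annotators against the distribution from a pool of simulated ones. The sketch below uses Jensen-Shannon divergence for that comparison; this metric is an assumption chosen for illustration, not one the sources prescribe, and the labels and counts are invented. Note that a close distributional match on a single pass still says nothing about whether the simulated answers are stable across runs.

```python
from collections import Counter
from math import log2

def label_distribution(labels, categories):
    """Turn a list of labels into probabilities over a fixed category order."""
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in categories]

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability in x contribute nothing.
        return sum(a * log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical labels for one contested NLI item (illustrative only).
categories = ["entailment", "neutral", "contradiction"]
human_labels = ["entailment"] * 5 + ["neutral"] * 3 + ["contradiction"] * 2
persona_labels = ["entailment"] * 9 + ["neutral"] * 1

human_dist = label_distribution(human_labels, categories)
persona_dist = label_distribution(persona_labels, categories)
gap = js_divergence(human_dist, persona_dist)
```

Here the simulated pool collapses toward the majority label, so the divergence is positive even though both pools agree on the most common answer.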


Sources (9 notes)

Why do readers interpret the same sentence so differently?

Interpretation Modeling research shows that disagreement on socially embedded sentences reflects valid differences in reader perspective, not annotation failure. Structured human disagreement in NLI benchmarks confirms that interpretation distributions carry meaningful information.

Do all annotation responses measure the same underlying thing?

Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.

Why do LLM persona prompts produce inconsistent outputs across runs?

When the same persona prompt is run repeatedly, output variance across runs matches or exceeds variance across different personas. This reveals that model uncertainty, not stable social knowledge, drives persona-simulated outputs, making them unsuitable for simulating human annotation disagreement.

Can language models distinguish expert arguments from common assumptions?

LLMs lose the social context that gives expert claims their force—reputation, track record, and standing—because they process only text, not the social world where expertise is built and evaluated.

Can LLM judges be fooled by fake credentials and formatting?

Research identified four evaluation biases in LLM judges, with authority and beauty biases being semantics-agnostic and trivially exploitable through fake references and formatting—zero-shot attacks requiring no model access or optimization.

Can personas extracted from documents generalize across evaluation tasks?

MAJ-EVAL automatically extracts stakeholder personas from domain documents via semantic clustering and orchestrates structured three-phase debate, achieving reproducible evaluation that transfers across tasks like summarization and dialogue without manual redesign. The approach grounds personas in real stakeholder perspectives rather than arbitrary roles.

Can AI personas reliably replicate human experiment results?

Viewpoints AI reproduced 84 of 111 main effects from Journal of Marketing experiments with replication success strongly correlated to original p-value strength. Marginal effects showed unreliable performance with both false positives and negatives.

Are LLM personas realized or merely simulated through training?

Post-training installs robust personas that resist adversarial pressure and persist as substrate-level dispositions, distinguishing realization from pretense. This quasi-realizationist account preserves explanatory power while treating LLMs as possessing genuine quasi-beliefs and quasi-desires.

Can training user simulators reduce persona drift in dialogue?

By inverting standard RL setups to train user simulators for consistency using three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, persona drift decreases by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst tasked with re-evaluating whether persona-based LLM approaches can authentically capture genuine expert disagreement—a question the field treats as mostly unresolved.

What a curated library found — and when (dated claims, not current truth): Findings span 2023–2026; all are perishable constraints to be re-tested.

• Run-to-run instability in persona-prompted outputs often matches or exceeds inter-persona variation, suggesting model uncertainty dominates over stable viewpoint (~2024).
• LLMs lack access to the reputation, track record, and earned standing that give expert judgment its force; they process only text and conflate surface authority markers with reasoned expertise (~2024).
• Persona simulations replicate ~76% of published experimental main effects, but success collapses on marginal, contested cases—precisely where genuine expert disagreement concentrates (~2024).
• Multi-turn RL training can reduce persona drift by >55%, improving consistency as a *single voice* rather than as a representative of a contested field (~2025).
• Grounding personas in real stakeholder documents (rather than invented roles) enables cross-task reproducibility, but this still doesn't solve the deeper problem of capturing *stable disagreement* (~2024).

Anchor papers (verify; mind their dates):
- arXiv:2312.03726 (2023-11): Interpretation modeling and social grounding of sentences
- arXiv:2402.10669 (2024-02): Judgement biases in LLM judges vs. humans
- arXiv:2511.00222 (2025-10): Multi-turn RL for persona consistency
- arXiv:2604.03238 (2026-01): Human preference measurement as a social science problem

Your task:

(1) RE-TEST EACH CONSTRAINT. For every finding above, determine whether advances in model scale, instruction-tuning, in-context learning, multi-turn agentic systems, or new evaluation frameworks (e.g., social-science-informed preference measurement) have since relaxed or overturned the instability, authority-capture, or marginal-case collapse. Distinguish the durable question (can LLMs represent genuine *structured* disagreement?) from the perishable limitation (current methods fail because of noise, not architectural limits). Cite what resolved it; state plainly where the constraint still holds.

(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Look especially for: (a) evidence that frontier models recover expert disagreement without persona prompting (e.g., via uncertainty quantification or mixture-of-experts); (b) new social-science framings that redefine what "capturing disagreement" even means; (c) empirical wins in cross-domain persona transfer or in modeling marginal cases.

(3) Propose 2 research questions that ASSUME the regime may have moved—one targeting the mechanism (What if persona drift is not a noise problem but a feature of honest disagreement-under-uncertainty?), one targeting the measurement (If we treat LLM disagreement as a *distribution* rather than a set of discrete positions, can we measure its fidelity to expert disagreement without assuming stable personas?).

Cite arXiv IDs; flag anything you cannot ground in a real paper.

Next inquiring lines