Can persona-based approaches capture genuine disagreement in expert annotations?
This explores whether prompting an LLM to role-play different expert annotators can reproduce the real, meaningful disagreement that human experts have — or whether it just generates noise that looks like disagreement.
The corpus is mostly skeptical that persona-prompted LLMs can stand in for panels of human experts and reproduce their genuine disagreement, and the reason is worth understanding. The starting point is that expert disagreement is often a real signal, not a mistake to be averaged away. Work on interpretation modeling shows that when readers disagree about a socially loaded sentence, the spread of answers reflects valid differences in social position, not sloppy labeling (see "Why do readers interpret the same sentence so differently?"). Annotation responses themselves decompose into distinct kinds (genuine preferences, non-attitudes, and preferences constructed on the spot) that only reveal themselves through consistency across repeated measurement (see "Do all annotation responses measure the same underlying thing?"). So 'capturing disagreement' really means capturing *stable, structured* difference, not random scatter.
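To make "consistency across repeated measurement" concrete, here is a minimal sketch, assuming the same annotator has labeled the same item several times. The function name `retest_consistency` and the example labels are hypothetical and do not come from the cited notes. The score is only the share of repeats that match the most common answer; it is not a classifier for the three response kinds.

```python
from collections import Counter


def retest_consistency(responses):
    """Share of repeated responses that match the most common response.

    `responses` holds one annotator's labels for a single item, collected
    on separate occasions. 1.0 means the answer never changed.
    """
    if not responses:
        raise ValueError("need at least one response")
    modal_count = Counter(responses).most_common(1)[0][1]
    return modal_count / len(responses)


# Hypothetical repeated labels for one item from two annotators.
stable = retest_consistency(["offensive"] * 4)
unstable = retest_consistency(
    ["offensive", "not offensive", "offensive", "not offensive"]
)
print(stable, unstable)
```

A high score is compatible with a genuine preference, while a score near chance is what a non-attitude would be expected to produce. Any cut-off between the two would be an extra assumption.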
That distinction is exactly where persona prompting breaks. When the same persona prompt is run repeatedly, the variation between runs matches or exceeds the variation between different personas, meaning the model's own uncertainty, not any stable social viewpoint, is driving the output (see "Why do LLM persona prompts produce inconsistent outputs across runs?"). A persona panel can manufacture something that statistically resembles disagreement, but it's the wrong kind: run-to-run noise wearing the costume of perspective. Since the whole value of expert disagreement is that it's reproducible and traceable to a position, this is close to a disqualifying failure for the simulate-the-annotators use case.
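The comparison between run-to-run variation and persona-to-persona variation can be sketched as a simple variance split. This is an illustration, not the method used in the cited note: it assumes numeric ratings and an equal number of runs per persona, and the name `variance_decomposition` and the example ratings are hypothetical.

```python
from statistics import mean, pvariance


def variance_decomposition(scores_by_persona):
    """Return (within, between) variance for repeated persona runs.

    within:  average run-to-run variance inside each persona
    between: variance of the per-persona mean ratings
    Assumes numeric ratings and the same number of runs per persona.
    """
    within = mean([pvariance(runs) for runs in scores_by_persona.values()])
    persona_means = [mean(runs) for runs in scores_by_persona.values()]
    between = pvariance(persona_means)
    return within, between


# Hypothetical 1-5 ratings: the same prompt run four times per persona.
ratings = {
    "clinician": [2, 4, 3, 5],
    "ethicist": [3, 5, 4, 4],
    "statistician": [1, 3, 4, 4],
}
within, between = variance_decomposition(ratings)

# If re-running one persona moves the rating at least as much as switching
# personas does, the apparent "disagreement" is mostly run-to-run noise.
noise_dominated = within >= between
print(within, between, noise_dominated)
```

When `within` is at least as large as `between`, the panel's spread cannot be attributed to the personas themselves, which is the failure described above.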
There's a second, deeper problem specific to *expert* annotation: LLMs strip out the social scaffolding that makes an expert's judgment weigh more than a layperson's. Models process only text, so they lose the reputation, track record, and standing that give an expert claim its force, and they can't reliably tell an expert argument from a commonly held assumption (see "Can language models distinguish expert arguments from common assumptions?"). Worse, when authority *is* signaled, LLMs over-trust it: judge models fall for fake credentials and impressive formatting in zero-shot attacks (see "Can LLM judges be fooled by fake credentials and formatting?"). So a persona labeled 'senior domain expert' doesn't carry expert judgment; it carries the surface markers of expertise, which the model treats as a cue to defer rather than a capacity to reason.
The corpus does show personas working in adjacent, gentler settings, which is what makes the answer 'mostly no' rather than 'flatly no.' Grounding personas in real stakeholder documents rather than invented roles makes multi-agent evaluation reproducible and transferable across tasks (see "Can personas extracted from documents generalize across evaluation tasks?"), and persona simulations can replicate about three-quarters of published experimental main effects. Tellingly, though, success tracks the strength of the original effect and collapses on the marginal cases (see "Can AI personas reliably replicate human experiment results?"). Disagreement among experts lives precisely in those marginal, contested cases. There's even a view that post-training installs genuine, robust personas rather than mere pretense (see "Are LLM personas realized or merely simulated through training?"), and RL training can cut a simulated persona's drift by more than half (see "Can training user simulators reduce persona drift in dialogue?"). But stability and realism are about being a *consistent single voice*, which is almost the opposite of representing a contested field.
The thing you might not have known you wanted to know: the field has started treating disagreement as something to *measure and preserve* rather than collapse — modeling the full distribution of human interpretations as meaningful data. Persona simulation fails this not because it disagrees too little, but because it disagrees in the wrong currency: noise instead of position, performed authority instead of earned expertise. If you want to capture genuine expert disagreement, the more promising path the corpus points to is structuring and decomposing real human annotations, not synthesizing artificial annotators to argue in their place.
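As a rough illustration of what "measure and preserve" can mean in practice, the sketch below keeps the full label distribution for an item instead of collapsing it to a majority vote, and summarizes the spread with Shannon entropy. The function names and the example labels are hypothetical; this is one common way to represent annotator disagreement, not a procedure taken from the cited notes.

```python
from collections import Counter
from math import log2


def label_distribution(labels):
    """Turn one item's human labels into a probability distribution."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}


def entropy_bits(dist):
    """Shannon entropy in bits; 0 means the annotators were unanimous."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)


# Hypothetical labels from four human annotators on one NLI-style item.
annotations = ["entailment", "entailment", "neutral", "contradiction"]

dist = label_distribution(annotations)
majority = max(dist, key=dist.get)  # what a majority vote would keep
spread = entropy_bits(dist)         # what a majority vote throws away

print(dist, majority, spread)
```

The majority label hides the fact that half the annotators read the item differently. The distribution and its entropy keep that information available for later analysis.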
Sources (9 notes)
Interpretation Modeling research shows that disagreement on socially embedded sentences reflects valid differences in reader perspective, not annotation failure. Structured human disagreement in NLI benchmarks confirms that interpretation distributions carry meaningful information.
Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.
When the same persona prompt is run repeatedly, output variance across runs matches or exceeds variance across different personas. This reveals that model uncertainty, not stable social knowledge, drives persona-simulated outputs, making them unsuitable for simulating human annotation disagreement.
LLMs lose the social context that gives expert claims their force—reputation, track record, and standing—because they process only text, not the social world where expertise is built and evaluated.
Research identified four evaluation biases in LLM judges, with authority and beauty biases being semantics-agnostic and trivially exploitable through fake references and formatting—zero-shot attacks requiring no model access or optimization.
MAJ-EVAL automatically extracts stakeholder personas from domain documents via semantic clustering and orchestrates structured three-phase debate, achieving reproducible evaluation that transfers across tasks like summarization and dialogue without manual redesign. The approach grounds personas in real stakeholder perspectives rather than arbitrary roles.
Viewpoints AI reproduced 84 of 111 main effects from Journal of Marketing experiments with replication success strongly correlated to original p-value strength. Marginal effects showed unreliable performance with both false positives and negatives.
Post-training installs robust personas that resist adversarial pressure and persist as substrate-level dispositions, distinguishing realization from pretense. This quasi-realizationist account preserves explanatory power while treating LLMs as possessing genuine quasi-beliefs and quasi-desires.
By inverting standard RL setups to train user simulators for consistency using three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, persona drift decreases by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.