Why do LLM personas struggle with specificity in specialized domains like law?
This reads the question as asking why an LLM playing a domain expert (a lawyer, say) produces fluent but shallow output in specialized fields. The corpus suggests the problem is not missing facts but the missing structure that makes expertise specific.
An LLM asked to act as a specialist, such as a lawyer or a clinician, tends to sound right while staying generic, and the corpus points to several distinct failure layers rather than one. The first is the most concrete: in law specifically, models degrade on exactly the cases where specificity matters most. A Supreme Court overruling benchmark found systematic era sensitivity: models reason worse about historical precedent than about modern cases because the training corpus over-represents recent material, leaving shallow representations of older doctrine (see: Why do language models struggle with historical legal cases?). Specificity in a specialized domain often lives in the long tail (the unusual precedent, the old rule still controlling), which is the part of the distribution the model saw least.
But even with the facts present, there is a deeper crack. A persona can correctly explain a concept, fail to apply it, and then recognize its own failure. Researchers call this pattern Potemkin understanding, and it suggests that the explanation pathway and the execution pathway are functionally disconnected, rather than pointing to a simple knowledge gap (see: Can LLMs understand concepts they cannot apply?). That is the texture of a fake-specialist answer: the doctrine recited correctly in the abstract, then misapplied to the case in front of it. Reasoning research shows the same shape from another angle: models wander rather than search systematically, so success drops exponentially with problem depth, and a real legal question is a deep, multi-step problem, not a shallow one (see: Why do reasoning LLMs fail at deeper problem solving?).
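The exponential drop with depth can be made concrete with a small sketch. It assumes, purely for illustration, that each reasoning step succeeds independently with the same probability; that simplification and the name chain_success are assumptions made here, not the model used in the cited research.

```python
def chain_success(p_step: float, depth: int) -> float:
    """Probability that every one of `depth` steps succeeds.

    Illustrative assumption: steps are independent and each succeeds
    with the same probability `p_step`.
    """
    if not 0.0 <= p_step <= 1.0:
        raise ValueError("p_step must be in [0, 1]")
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return p_step ** depth


if __name__ == "__main__":
    # Even a 95% per-step success rate erodes quickly as depth grows.
    for depth in (1, 5, 10, 20, 40):
        print(f"depth={depth:>2}  success={chain_success(0.95, depth):.3f}")
```

Under these assumptions a chain that is 95% reliable per step succeeds about 60% of the time at depth 10 and about 13% of the time at depth 40, which is why a shallow question can look solved while a deep one fails.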
The most interesting layer is social, not informational. Expert specificity is load-bearing because it comes from a person with standing: reputation, track record, the authority to say 'this argument wins.' A model processes only text, so it loses the social world where expertise is built and cannot distinguish a genuine expert claim from a commonly held assumption that merely sounds expert (see: Can language models distinguish expert arguments from common assumptions?). A persona inherits the vocabulary of a field without inheriting the judgment that tells a practitioner which of two equally fluent positions is actually defensible.
There is also a structural reason personas drift toward the generic. An LLM does not commit to one character; it holds a superposition of plausible simulacra that narrows only as the conversation supplies constraints (see: Does an LLM commit to a single character or maintain many?). A thin prompt ('you are a lawyer') leaves that distribution wide, so the model samples the average lawyer-sounding response. A related study on personalized judgment shows that when persona information is sparse, the model lacks predictive power for specific cases and is more reliable when allowed to abstain than when forced to answer (see: Why do LLM judges fail at predicting sparse user preferences?). Specificity needs constraint; a persona prompt rarely supplies enough.
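The abstain-versus-forced-answer contrast can be sketched as selective prediction: answer only when self-reported certainty clears a threshold, then compare accuracy on the answered subset with accuracy when every case is answered. Everything below is hypothetical, including the Judgment type, the threshold, and the toy values, which are invented for illustration and are not results from the cited study.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Judgment:
    confidence: float  # hypothetical self-reported certainty in [0, 1]
    correct: bool      # whether the judgment matched the true preference


def selective_accuracy(
    judgments: List[Judgment], threshold: float
) -> Tuple[Optional[float], float]:
    """Return (accuracy on answered cases, fraction of cases answered).

    Cases whose confidence falls below `threshold` count as abstentions.
    Accuracy is None when nothing is answered.
    """
    answered = [j for j in judgments if j.confidence >= threshold]
    if not answered:
        return None, 0.0
    accuracy = sum(j.correct for j in answered) / len(answered)
    coverage = len(answered) / len(judgments)
    return accuracy, coverage


# Invented toy data: confidence loosely tracks correctness.
toy = [
    Judgment(0.90, True), Judgment(0.85, True), Judgment(0.80, True),
    Judgment(0.75, False), Judgment(0.60, True), Judgment(0.50, False),
    Judgment(0.40, False), Judgment(0.30, False),
]

if __name__ == "__main__":
    print("forced answer:", selective_accuracy(toy, 0.0))
    print("with abstention:", selective_accuracy(toy, 0.7))
```

On this toy data, forcing an answer on every case gives 50% accuracy, while abstaining below 0.7 confidence gives 75% accuracy on the half of cases still answered. The trade is coverage for reliability, and it only pays off when confidence carries some signal about correctness.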
The thread that ties these together — and the thing worth taking away — is that domain specificity isn't a single quantity the model is short on. It's the convergence of long-tail coverage, the gap between explaining and applying, the social authority that ranks competing claims, and the conversational constraint that collapses a vague persona into a sharp one. Fixing one doesn't fix the others, which is why a confident, fluent specialist persona can still be precisely the kind of expert no one should rely on.
Sources (6 notes)
Supreme Court overruling benchmark (236 pairs) reveals era sensitivity: models perform worse on historical cases than modern ones. Root cause is training corpus over-representation of recent cases, creating shallower representations of older precedent.
Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.
Current reasoning models lack the three properties of systematic exploration: validity, effectiveness, and necessity. This causes success probability to drop exponentially with problem depth, making medium problems solvable but deep problems catastrophically harder.
LLMs lose the social context that gives expert claims their force—reputation, track record, and standing—because they process only text, not the social world where expertise is built and evaluated.
Research shows LLMs don't commit to a single character but instead maintain a probability distribution over many consistent simulacra. Each response samples from this distribution, explaining why regenerations can yield different personalities while remaining consistent with prior context.
Sparse persona information lacks predictive power for specific preferences, causing LLM judges to fail. Verbal uncertainty estimation recovers reliability above 80% on high-certainty samples by allowing abstention rather than forced judgment.