Why does expert character analysis outperform automated narrative summarization?
This explores why hand-built character profiles beat machine-generated plot summaries when the goal is understanding or predicting what a character does — and what that gap reveals about what summarization throws away.
The corpus locates the answer not in summarization being sloppy but in what it structurally discards. The cleanest data point comes from the LIFECHOICE benchmark, where LLMs predicted 1,462 character decisions across 388 novels: feeding the model an expert-written persona profile paired with memories relevant to that character's psychology beat automated summarization by about 5% (Can LLMs predict character choices from narrative context?). The gap is small but telling. Summarization optimizes for compression, the gist of what happened, while character prediction needs the opposite: the durable interior logic of *who this person is*, which is exactly the texture compression strips out.
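A minimal sketch of the profile-plus-memories setup described above, assuming relevance is approximated by simple word overlap. The benchmark's actual retrieval method is not specified here. The character, memories, and the helper names `retrieve_memories` and `build_prompt` are all hypothetical, invented for illustration.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase word set; a crude stand-in for a real relevance model."""
    return set(re.findall(r"[a-z']+", text.lower()))


def retrieve_memories(query: str, memories: list[str], k: int = 2) -> list[str]:
    """Return the k memories that share the most words with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        memories,
        key=lambda m: len(query_words & tokenize(m)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(profile: str, memories: list[str], scenario: str, options: list[str]) -> str:
    """Assemble a decision-prediction prompt from a fixed profile and retrieved memories."""
    memory_lines = "\n".join(f"- {m}" for m in memories)
    option_lines = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(options))
    return (
        f"Character profile:\n{profile}\n\n"
        f"Relevant memories:\n{memory_lines}\n\n"
        f"Scenario:\n{scenario}\n\n"
        f"Which option does the character choose?\n{option_lines}"
    )


# Hypothetical character data, invented for this example.
profile = "Mara is fiercely loyal to her brother and distrusts authority."
memories = [
    "Mara lied to the magistrate to protect her brother.",
    "Mara bought bread at the market.",
    "Mara refused a reward from the magistrate after the flood.",
]
scenario = "The magistrate offers Mara a pardon if she betrays her brother."
options = ["Accept the pardon", "Refuse and warn her brother"]

retrieved = retrieve_memories(scenario, memories, k=2)
prompt = build_prompt(profile, retrieved, scenario, options)
```

The point of the design is that the profile is fixed and hand-written while only the memories vary per decision. A plot summary would likely keep the pardon offer and drop the market trip, but nothing requires it to keep the earlier lie and the refused reward, which are the details that bear on the choice.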
Why does summarization strip it? Two adjacent findings suggest the mechanism. First, when AI writes narrative itself, it systematically over-explains themes, favors tidy single-track plots, and avoids moral ambiguity (Do AI stories explain their themes more than human stories do?). That same flattening instinct plausibly shows up in how it condenses: automated summaries gravitate toward the legible main thread and smooth over the contradictions and ambiguities that actually define a character's psychology. Second, the discourse-level signals that distinguish real narrative (character agency, who drives events, chronological structure) are resistant to surface rewrites precisely because they live in structure, not wording (Can AI stories be detected without analyzing writing style?). A summary that captures plot points can miss these entirely.
There's a deeper reason a fixed expert profile helps, and it's about what the model lacks on its own. The 20-questions regeneration test shows LLMs don't commit to a single character: they hold a superposition of plausible characters and sample one at generation time, producing a different (locally consistent) answer each time you regenerate (Do large language models actually commit to a single character?). An expert persona profile acts as an external anchor that collapses that superposition. It tells the model *which* version of the character to be, something an automated self-summary can't reliably supply because it inherits the same uncommitted wobble. Related work on training user simulators shows the cost of that wobble directly in the form of persona drift, which dedicated consistency training cut by over 55% [[multi-turn-rl-for-persona-consistency-reduces-drift-by-55-percent-by-treating-si].
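One rough way to put a number on that wobble, assuming you have already collected several regenerated answers to the same prompt: measure how often the regenerations agree with the most common answer. This is an illustrative sketch, not a metric taken from the cited work. The function name `regeneration_agreement` and the answer lists below are hypothetical, made up to show the calculation rather than drawn from any experiment.

```python
from collections import Counter


def regeneration_agreement(answers: list[str]) -> float:
    """Share of regenerations matching the most common answer (1.0 = fully committed)."""
    if not answers:
        raise ValueError("need at least one regenerated answer")
    # Normalize case and whitespace so trivially different strings count as equal.
    normalized = [a.strip().lower() for a in answers]
    top_count = Counter(normalized).most_common(1)[0][1]
    return top_count / len(normalized)


# Invented example data: the same question regenerated four times,
# first without a fixed persona profile and then with one.
unanchored = ["Confess", "Flee", "confess", "Stay silent"]
anchored = ["Confess", "confess", "Confess ", "Confess"]

unanchored_score = regeneration_agreement(unanchored)  # 2 of 4 agree
anchored_score = regeneration_agreement(anchored)      # 4 of 4 agree
```

Exact string matching is a deliberate simplification. Free-text answers would need a semantic comparison before counting, and that choice would itself be an assumption.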
The interesting wrinkle, the thing you might not expect, is that automated isn't always worse. LLMs segment narrative events *closer to human consensus* than individual human annotators do, apparently because training on diverse text pre-computes a kind of statistical average (Do language models segment events like human consensus does?). And combining a natural-language summary with raw scores beats either alone for psychological profiling, because the summary surfaces second-order patterns the numbers hide (Can language summaries unlock hidden psychological patterns?). So the lesson isn't "experts good, automation bad." It's that automation excels at *consensus and aggregation* and fails at *committed individuality*, and character analysis is fundamentally a task of committed individuality. If you want to push further, MAJ-EVAL's document-grounded persona extraction is the frontier case: it tries to automate the expert profile itself by grounding personas in real source material rather than arbitrary roles (Can personas extracted from documents generalize across evaluation tasks?), which looks like a promising route to closing the 5% gap.
Sources (8 notes)
The LIFECHOICE benchmark (1,462 decisions across 388 novels) shows LLMs predict character choices better when given expert-written persona profiles paired with retrieved memories relevant to the character's psychology. This persona-based approach outperforms automated summarization by 5%.
Analysis of 304 narrative features reduced to 30 core signals shows AI fiction systematically over-explains themes, uses tidy single-track plots, and avoids moral ambiguity, while human stories employ temporal complexity and nonlinear structure. This pattern holds across all five major LLMs tested.
StoryScope achieved 93.2% accuracy separating AI from human fiction using only discourse-level features like character agency and chronological structure, retaining 97% of performance while eliminating stylistic cues. These structural choices resist humanization because they require rewrites, not surface edits.
Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, indicating that the model makes no fixed commitment.
By inverting standard RL setups to train user simulators for consistency using three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, persona drift decreases by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.
GPT-3's event boundaries correlate more strongly with averaged human annotations than individual human annotators do. This suggests language models may pre-compute statistical consensus through training on diverse text, or that next-token prediction parallels human event cognition.
LLMs generate natural language personality summaries from Big Five scores that encode second-order trait patterns, enabling zero-shot prediction of nine other psychological scales with R² > 0.89 structural alignment. Combined summary-and-score predictions outperform either alone, showing synergistic information.
MAJ-EVAL automatically extracts stakeholder personas from domain documents via semantic clustering and orchestrates structured three-phase debate, achieving reproducible evaluation that transfers across tasks like summarization and dialogue without manual redesign. The approach grounds personas in real stakeholder perspectives rather than arbitrary roles.