Does adding survey data to interviews improve agent accuracy further?
This explores whether layering structured survey responses on top of open-ended interviews makes AI agents better at predicting how real people answer — and the corpus suggests the answer hinges less on stacking data sources than on what kind of information actually drives accuracy.
The most direct evidence in the collection points to a more interesting question underneath this one. The flagship study built generative agents from two-hour voice interviews with 1,052 people and found that the agents replicated participants' own survey answers about 85% as well as those participants replicated themselves on retest (see "Can AI agents learn people better from interviews than surveys?"). The striking detail is *why*: accuracy was driven by factual content, not linguistic style, and even when the rich interview was compressed to summary bullet points, fidelity dropped only to 83%. That suggests the signal lives in substantive personal information, not in conversational texture. So the real question isn't "interview plus survey," it's "does the second source add factual content the first one missed?" If your survey captures attitudes the interview already surfaced, you're adding redundancy, not accuracy.
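To make the "85% as well as themselves" figure concrete, here is a minimal sketch of one way such a normalized score could be computed. It assumes the score is the agent's agreement with a participant's original answers divided by the participant's own agreement with themselves on retest. The function name `normalized_accuracy` and the answer lists are hypothetical illustrations, not the study's code or data.

```python
from typing import Sequence


def normalized_accuracy(
    agent: Sequence[str],
    wave1: Sequence[str],
    wave2: Sequence[str],
) -> float:
    """Agent-vs-person agreement divided by the person's own retest agreement.

    All three sequences hold answers to the same questions in the same order.
    This definition is an assumption made for illustration.
    """
    if not (len(agent) == len(wave1) == len(wave2)):
        raise ValueError("all answer lists must cover the same questions")
    # Questions where the agent matches the participant's original answer.
    agent_matches = sum(a == w for a, w in zip(agent, wave1))
    # Questions where the participant repeats their own answer on retest.
    self_matches = sum(x == y for x, y in zip(wave1, wave2))
    if self_matches == 0:
        raise ValueError("retest consistency is zero; the ratio is undefined")
    # Both accuracies share the same denominator, so it cancels in the ratio.
    return agent_matches / self_matches


# Hypothetical answers to ten multiple-choice questions (not study data).
wave1 = ["a", "b", "c", "a", "b", "c", "a", "b", "c", "a"]
wave2 = ["a", "b", "c", "a", "b", "c", "a", "b", "a", "b"]  # 8 of 10 repeated
agent = ["a", "b", "c", "a", "b", "c", "a", "c", "b", "b"]  # 7 of 10 matched

print(normalized_accuracy(agent, wave1, wave2))  # 7 / 8 = 0.875
```

Dividing by retest consistency matters because people do not answer identically twice, so raw agreement alone would understate how close the agent gets to the practical ceiling.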
That shift changes the whole debate. A separate line of work shows AI personas reproduce about 76% of published experimental main effects, with success strongly correlated with how strong the original effect was, and unreliable performance on marginal effects, where they generate both false positives and false negatives (see "Can AI personas reliably replicate human experiment results?"). The lesson that travels across both studies: agents are good at recovering robust, well-evidenced signal and shaky at the margins. Adding a survey helps only to the extent that it strengthens weak or missing signal; it won't rescue genuinely ambiguous cases.
There's also a quieter cost the collection flags. Persona-driven agents drift over multi-turn interaction, losing consistency within turns, across conversations, and through outright factual contradictions, and reducing that drift took dedicated reinforcement training, not more profile data (see "Can training user simulators reduce persona drift in dialogue?"). More input fields can actually widen the surface for contradiction. Work on grounding personas in real source documents likewise found that *where* the persona comes from (real stakeholder perspectives vs. arbitrary roles) matters more for generalization than how many attributes you pile on (see "Can personas extracted from documents generalize across evaluation tasks?").
If you zoom out, the collection keeps returning to a theme: agent reliability comes from how information is structured and externalized, not from sheer volume of it (see "Where does agent reliability actually come from?"). So the honest answer to your question is that the collection doesn't have a head-to-head test of interview-plus-survey vs. interview-alone, but it strongly predicts the result. Surveys would help only as a vehicle for new *factual* content. The interview already extracts most of that signal (which is why even its bullet-point summary holds up), so the marginal gain shrinks fast while the drift and contradiction risks grow. The thing you didn't know you wanted to know: the interview's edge isn't that it's a conversation; it's that talking gets people to volunteer facts a survey form never thought to ask for.
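The redundancy-versus-new-content argument can be illustrated with a small sketch. It assumes each source has already been reduced to attribute/value facts, which is a simplification introduced here and not a method described in the collection. The function `compare_sources` and the sample facts are hypothetical.

```python
from typing import Dict, Tuple


def compare_sources(
    interview: Dict[str, str],
    survey: Dict[str, str],
) -> Tuple[Dict[str, str], Dict[str, str], Dict[str, Tuple[str, str]]]:
    """Split survey facts into new, redundant, and conflicting groups."""
    # Facts only the survey provides: the one place it can add signal.
    new_facts = {k: v for k, v in survey.items() if k not in interview}
    # Facts the interview already surfaced with the same value.
    redundant = {
        k: v for k, v in survey.items()
        if k in interview and interview[k] == v
    }
    # Facts where the two sources disagree: extra surface for contradiction.
    conflicts = {
        k: (interview[k], v) for k, v in survey.items()
        if k in interview and interview[k] != v
    }
    return new_facts, redundant, conflicts


# Hypothetical facts for one participant (not taken from any study).
interview_facts = {"occupation": "nurse", "commute": "bus", "votes": "always"}
survey_facts = {"occupation": "nurse", "votes": "sometimes", "household": "4"}

new_facts, redundant, conflicts = compare_sources(interview_facts, survey_facts)
print(new_facts)   # {'household': '4'}
print(redundant)   # {'occupation': 'nurse'}
print(conflicts)   # {'votes': ('always', 'sometimes')}
```

On this reading, only `new_facts` could plausibly raise accuracy, `redundant` adds nothing, and `conflicts` is the kind of contradiction risk discussed above.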
Sources (5 notes)
A 1,052-person study found agents built from voice interviews replicated participant responses nearly as well as people replicate their own answers. Factual content, not linguistic style, drove this accuracy—even summary bullet points retained 83% fidelity.
Viewpoints AI reproduced 84 of 111 main effects from Journal of Marketing experiments with replication success strongly correlated to original p-value strength. Marginal effects showed unreliable performance with both false positives and negatives.
Inverting standard RL setups to train user simulators for consistency, with three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, reduces persona drift by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.
MAJ-EVAL automatically extracts stakeholder personas from domain documents via semantic clustering and orchestrates structured three-phase debate, achieving reproducible evaluation that transfers across tasks like summarization and dialogue without manual redesign. The approach grounds personas in real stakeholder perspectives rather than arbitrary roles.
Research shows reliable LLM agents externalize three cognitive burdens into a harness layer rather than relying on model scale alone: memory (state persistence), skills (procedural components), and protocols (structured interaction). The harness unifies these externalized components and eliminates the need for the model to solve the same problems repeatedly.