How should ground truth labels be assigned to simulated user sessions?
This explores where the 'correct answer' for a synthetic user conversation should come from — whether labels are annotated after the fact, baked in at generation time, or derived statistically without any annotation at all.
The corpus's most useful move is to question the premise that labels are something you assign *afterward* at all. When you build a simulator by conditioning it on explicit latent variables (a user profile at the session level and an intent at the turn level), those variables *are* the ground truth. RecLLM shows that you don't label the session, you author it: the profile and intent you injected become the target the rest of the pipeline is measured against, and realism is checked by whether discriminators and classifiers can tell synthetic from real (see: Can controlled latent variables make LLM user simulators realistic?). Layered diversity work pushes the same idea further: subtopic, Big Five persona, and contextual characteristics are dialed in as generation parameters, so the controllable knobs double as the labels (see: Can synthetic dialogues become realistic through layered diversity?).
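The idea of authoring rather than labeling can be sketched in a few lines. This is a minimal illustration, not RecLLM's implementation: the `UserProfile` and `TurnSpec` fields, the prompt wording, and the `fake_generate` stand-in for an LLM call are all assumptions made for the example.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class UserProfile:
    # Session-level latent variable (fields are hypothetical).
    persona: str
    preferred_genre: str


@dataclass(frozen=True)
class TurnSpec:
    # Turn-level latent variable.
    intent: str


def build_prompt(profile, turn):
    """Render the latent variables into a simulator prompt (format is made up)."""
    return (
        f"You are a user. Persona: {profile.persona}. "
        f"Likes: {profile.preferred_genre}. "
        f"In this turn, your intent is: {turn.intent}."
    )


def author_session(profile, intents, generate):
    """Generate a session and store the injected latents as its labels."""
    turns = []
    for intent in intents:
        spec = TurnSpec(intent)
        utterance = generate(build_prompt(profile, spec))
        # The label is copied from the generation parameter, not annotated later.
        turns.append({"utterance": utterance, "label_intent": spec.intent})
    return {"label_profile": asdict(profile), "turns": turns}


def fake_generate(prompt):
    # Placeholder for a real LLM call; returns a deterministic stub.
    return f"[simulated reply to: {prompt[-40:]}]"


session = author_session(
    UserProfile(persona="terse commuter", preferred_genre="podcasts"),
    ["ask_recommendation", "refine", "accept"],
    fake_generate,
)
```

Because the labels travel with the session record from the moment of generation, a downstream intent classifier or realism discriminator can be evaluated against `label_intent` and `label_profile` without a separate annotation pass.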
The second route abandons external annotation entirely. Test-Time RL produces reward signals by majority vote across repeated samples: consensus stands in for ground truth, and it works because agreed-upon answers tend to be right, creating a bootstrapping loop (see: Can models improve themselves using only majority voting?). A related trick reuses a single self-supervised statistic, cross-rollout variance, both to weight tokens and to throw out degenerate queries, which matters precisely on the unverifiable tasks where no clean label exists (see: Can one statistical measure serve dual purposes in RL training?). For simulated *sessions* specifically, the most direct example is inverting the usual RL setup to train the simulator itself: persona consistency becomes the label, scored three ways (prompt-to-line, line-to-line, and Q&A consistency), which together catch local drift, global drift, and factual contradiction as distinct error types (see: Can training user simulators reduce persona drift in dialogue?).
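The majority-vote idea can be illustrated with a small sketch. It assumes answers are already normalized to comparable strings and uses a simple 0/1 reward; neither detail is taken from the Test-Time RL work itself, and the function name is hypothetical.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Return (pseudo_label, rewards, agreement) for repeated samples of one query."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    # The most frequent answer stands in for the missing ground-truth label.
    pseudo_label, top_count = counts.most_common(1)[0]
    # Each sample is rewarded for agreeing with the consensus.
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    agreement = top_count / len(answers)
    return pseudo_label, rewards, agreement


samples = ["42", "42", "41", "42", "17"]
label, rewards, agreement = majority_vote_rewards(samples)
```

The agreement fraction is returned alongside the rewards so that a caller could choose to skip queries with weak consensus. Any such threshold would be a design choice of the caller, not something the source specifies.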
But the corpus also plants a warning sign that should change how you trust any label you assign. When one model secretly controls every participant, simulations look competent, and that competence is an artifact. LLMs collapse the moment agents hold private information, because the omniscient setup lets them skip the grounding work real conversation requires (see: Why do LLMs fail when simulating agents with private information?). So a 'ground truth label' derived from an all-knowing simulator may be labeling a conversation that could never happen under real information asymmetry. The same skepticism applies to surface competence generally: models default to shallow strategies that pass structured tests but fail open-ended perspective-taking, so a label that only checks the structured case will certify the wrong thing (see: Do large language models genuinely simulate mental states?).
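One way to picture the difference between an omniscient setup and one with information asymmetry is to compare the context each participant is given. The sketch below is purely illustrative: the function names, the dictionary layout, and the example facts are assumptions, and it shows only how views are assembled, not how any cited study ran its simulations.

```python
def agent_view(agent, public_transcript, private_facts):
    """Context one agent may see: the shared transcript plus its own facts only."""
    return {
        "transcript": list(public_transcript),
        "private": list(private_facts.get(agent, [])),
    }


def omniscient_view(public_transcript, private_facts):
    """What a single model controlling every participant effectively sees."""
    everything = [fact for facts in private_facts.values() for fact in facts]
    return {"transcript": list(public_transcript), "private": everything}


# Hypothetical private information held by each side of the conversation.
private_facts = {
    "user": ["budget is 300 USD"],
    "agent": ["item B is out of stock"],
}
transcript = ["user: I need headphones.", "agent: Any price range?"]

user_ctx = agent_view("user", transcript, private_facts)
omni_ctx = omniscient_view(transcript, private_facts)
```

A session generated from `omni_ctx` can reference the stock status on the user's side without anyone having said it, which is exactly the kind of conversation a label should not certify as realistic.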
The quieter lesson runs underneath all of this: a label is a draw from a distribution, not a fact. Zero temperature and fixed seeds reproduce the same output every time, but that consistency isn't reliability: you've frozen one sample, not found the truth (see: Does setting temperature to zero actually make LLM outputs reliable?). Taken together, the corpus suggests a layered answer rather than a single method: encode ground truth as the latent variables you generate from, validate it with discriminators or consensus rather than a single annotator, score sessions on consistency across turns, and treat any label from an omniscient simulator, or one with little information asymmetry, as suspect until you've confirmed it survives the grounding work real users force.
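The point that a label is one draw from a distribution can be made concrete by sampling the labeler many times instead of once. This is a toy sketch under stated assumptions: `toy_labeler` is a made-up stochastic stand-in for an LLM labeler, and the empirical frequency shown here is a simpler summary than the McDonald's omega analysis mentioned in the sources.

```python
import random
from collections import Counter


def label_distribution(sample_label, n, seed=0):
    """Estimate the label distribution from n independent draws."""
    rng = random.Random(seed)
    counts = Counter(sample_label(rng) for _ in range(n))
    return {label: count / n for label, count in counts.items()}


def toy_labeler(rng):
    # Hypothetical labeler: says "satisfied" about 70% of the time.
    return "satisfied" if rng.random() < 0.7 else "frustrated"


# A single frozen draw: repeatable with a fixed seed, but still one sample.
single = toy_labeler(random.Random(0))

# Many draws expose the spread that the single draw hides.
dist = label_distribution(toy_labeler, n=1000)
```

Re-running the single draw with the same seed always returns the same label, which looks like stability. The distribution is what shows how often a different label would have come out.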
Sources (8 notes)
RecLLM demonstrates that conditioning an LLM simulator on session-level (user profile) and turn-level (user intent) latent variables produces synthetic conversations measurable as realistic via crowdsource discrimination, discriminator models, and classifier-ensemble distribution matching.
Research shows that realistic synthetic dialogues require three multiplicative layers: subtopic specificity, Big Five persona variation, and 11 contextual characteristics via Chain of Thought reasoning. This structured approach captures 90.48% of in-domain dialogue performance.
Test-Time RL generates reward signals by majority voting across repeated samples, enabling policy improvement without ground-truth labels or trained reward models. This approach works surprisingly well because consensus answers tend to be correct, creating a bootstrapping loop where test-time compute enables training that improves the model.
DRO reuses a single self-supervised statistic at two aggregation levels: token-level weighting in dense rewards and query-level filtering to discard degenerate comparisons. This dual use achieves 2–3× faster training with better stability on unverifiable tasks.
By inverting standard RL setups to train user simulators for consistency using three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, persona drift decreases by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.
Research shows LLMs perform well when one model controls all interlocutors but fail systematically when agents possess private information. This reveals that apparent social competence relies on grounding work that models skip in omniscient settings.
ChangeMyView and FANTOM benchmarks show LLMs fail at authentic perspective-taking in open-ended scenarios, despite succeeding on structured tasks. Hybrid Bayesian architectures that force explicit belief tracking outperform LLM-alone approaches, suggesting the gap is architectural rather than merely training-based.
Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.