How does evaluating interaction trajectories change what we measure beyond correctness?
This explores what becomes measurable once you score the whole arc of an interaction — the moves, recoveries, and shape of a session — rather than just whether the final answer was right.
The corpus suggests the answer is a surprising amount, much of it invisible to correctness alone. The clearest statement of the shift comes from agent evaluation, where scoring expands from a single final response to the full interaction sequence, and the new things you measure are process quality, recoverability (could it dig itself out of a bad state?), coordination, and robustness (How should we evaluate agent behavior beyond final answers?). None of those show up in a right/wrong check on the final answer; a model can land the correct answer through a brittle, lucky, or unrecoverable path, and trajectory evaluation is what exposes the difference.
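To make the contrast concrete, here is a minimal sketch of scoring a trajectory rather than only its endpoint. Everything in it is an assumption for illustration: the `Step` type, the per-step `ok` flag, and the two metric definitions are hypothetical and are not taken from any benchmark cited here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    # Hypothetical per-step label: did this step leave the agent in a good state?
    ok: bool


def final_answer_score(final_correct: bool) -> float:
    """Outcome-only scoring: 1.0 if the last answer is right, else 0.0."""
    return 1.0 if final_correct else 0.0


def process_quality(steps: List[Step]) -> float:
    """Fraction of steps that were in a good state (assumed definition)."""
    return sum(s.ok for s in steps) / len(steps) if steps else 0.0


def recoverability(steps: List[Step]) -> float:
    """Fraction of bad steps that are followed by some later good step.

    A trajectory with no bad steps scores 1.0 (nothing to recover from).
    """
    bad = [i for i, s in enumerate(steps) if not s.ok]
    if not bad:
        return 1.0
    recovered = sum(1 for i in bad if any(s.ok for s in steps[i + 1:]))
    return recovered / len(bad)


# Three trajectories that all end with a correct final answer.
clean = [Step(True), Step(True), Step(True)]
recovers = [Step(False), Step(False), Step(True)]
brittle = [Step(True), Step(False), Step(False)]  # right answer, never recovered

for name, traj in [("clean", clean), ("recovers", recovers), ("brittle", brittle)]:
    print(name, final_answer_score(True),
          round(process_quality(traj), 2), recoverability(traj))
# clean 1.0 1.0 1.0
# recovers 1.0 0.33 1.0
# brittle 1.0 0.33 0.0
```

All three trajectories are indistinguishable under the outcome-only score; only the trajectory-level metrics separate the path that recovered from the one that did not.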
The most striking finding is that the *shape* of an interaction carries real signal even with the words stripped out. A structure-only model that looks at how a conversation unfolds geometrically (turn rhythm, branching, trajectory) predicts user satisfaction with 68% accuracy, nearly matching a full-text classifier at 70%, and combining the two reaches 80% (Can conversation shape predict whether it will work?; Can conversation structure predict dialogue success better than content?). So 'did the user feel this worked?' turns out to be about as predictable from the trajectory as from the content, and the jump to 80% when the two are combined suggests they carry partly different information. That's a quality dimension correctness can't even see.
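As a rough illustration of what "structure-only" can mean, the sketch below computes a few features from a conversation without looking at what the words say. The feature set (`n_turns`, `mean_len`, `alternation`, `len_trend`) is an assumption chosen for readability; the notes do not specify which features the structure-only model actually uses.

```python
from typing import Dict, List, Tuple

Turn = Tuple[str, str]  # (speaker, text)


def structural_features(turns: List[Turn]) -> Dict[str, float]:
    """Content-blind features: only turn counts, lengths, and speaker order."""
    n = len(turns)
    if n == 0:
        return {"n_turns": 0.0, "mean_len": 0.0, "alternation": 0.0, "len_trend": 0.0}

    # Turn length in whitespace-separated tokens; the tokens themselves are ignored.
    lengths = [len(text.split()) for _, text in turns]

    # Share of adjacent turn pairs where the speaker changes.
    switches = sum(1 for a, b in zip(turns, turns[1:]) if a[0] != b[0])
    alternation = switches / (n - 1) if n > 1 else 0.0

    # Mean length of the second half minus mean length of the first half.
    half = n // 2
    first = lengths[:half] or lengths
    second = lengths[half:]
    len_trend = sum(second) / len(second) - sum(first) / len(first)

    return {
        "n_turns": float(n),
        "mean_len": sum(lengths) / n,
        "alternation": alternation,
        "len_trend": len_trend,
    }


convo = [
    ("user", "how do I reset my password"),
    ("agent", "click the reset link on the login page"),
    ("user", "thanks"),
    ("agent", "you are welcome"),
]
print(structural_features(convo))
# {'n_turns': 4.0, 'mean_len': 4.5, 'alternation': 1.0, 'len_trend': -5.0}
```

A vector like this could be fed to any classifier on its own, or concatenated with text features for a hybrid model, which is the general shape of the 68% / 70% / 80% comparison described above.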
Go deeper into the trajectory and more dimensions appear that don't reduce to accuracy at all. Conversational DNA tracks four streams at once (linguistic complexity, emotional arc, topic coherence, and relevance) as things that evolve over time rather than as a single score (Can tracking dialogue dimensions simultaneously reveal hidden conversation patterns?). In a therapy setting, the COMPASS work infers a 36-dimensional working-alliance score *per turn*, and catches things like persistent patient–therapist misalignment in suicidality cases that a session-level outcome would miss entirely (Can we measure therapist-patient alliance from dialogue turns in real time?). And novelty research shows that the social processes that look strong in a single session decline over repeated interactions as novelty wears off, meaning the unit of measurement itself has to become the trajectory, or you'll generalize from a snapshot that lies (Do chatbot relationships lose their appeal as novelty wears off?).
The lateral payoff: trajectory structure isn't only something to *measure*; it's something to *learn from*. The same structural features that let you evaluate a path can be converted into dense training signal. Methods like Tree-GRPO and ToolPO exploit tree topology and tool-call positions to manufacture step-level rewards, replacing hand-annotated process-reward models (Can trajectory structure replace hand-annotated process rewards?). In-context learning shows the same logic from the other side: models need full or partial *trajectories* from the same environment, not isolated correct examples, to generalize sequential decisions, and burstiness in the trajectory is the thing that teaches (Why do trajectories matter more than individual examples for in-context learning?). There's even a corrective lesson here: imitation models that copy ChatGPT's confident style fool human raters on the surface while closing no real capability gap (Can imitating ChatGPT fool evaluators into thinking models improved?), and pure self-improvement stalls without external anchors (Can models reliably improve themselves without external feedback?). Both are warnings that outcome-only or style-only evaluation rewards the wrong thing. The takeaway: 'is this interaction any good?' is often better answered by its geometry and its recoverability than by whether the last answer was correct.
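One way to picture how tree topology can yield step-level rewards is the sketch below: leaf outcomes are averaged up the tree, and each step is credited with the change in expected outcome it caused. This is an assumed, simplified credit-assignment scheme for illustration only; it is not the Tree-GRPO or ToolPO algorithm, and `Node`, `value`, and `step_rewards` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)
    outcome: Optional[float] = None  # sparse outcome reward, set only on leaves


def value(node: Node) -> float:
    """Leaf: its outcome (0.0 if missing). Internal node: mean of child values."""
    if not node.children:
        return float(node.outcome) if node.outcome is not None else 0.0
    return sum(value(c) for c in node.children) / len(node.children)


def step_rewards(root: Node) -> Dict[str, float]:
    """Dense reward for each step: value(child) - value(parent)."""
    rewards: Dict[str, float] = {}
    stack = [root]
    while stack:
        parent = stack.pop()
        parent_value = value(parent)  # recomputed for clarity, not efficiency
        for child in parent.children:
            rewards[child.name] = value(child) - parent_value
            stack.append(child)
    return rewards


# Two branches from a shared prefix; only the leaves carry outcome rewards.
tree = Node("root", [
    Node("a", [Node("a1", outcome=1.0), Node("a2", outcome=1.0)]),
    Node("b", [Node("b1", outcome=1.0), Node("b2", outcome=0.0)]),
])
print(step_rewards(tree))
```

Here branch `a` gets +0.25 and branch `b` gets -0.25 relative to the root value of 0.75, and within `b` the two leaves get +0.5 and -0.5. No step was hand-labelled; the signal comes entirely from where the outcomes sit in the tree.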
Sources (10 notes)
Evaluation expands from single final answers to full interaction sequences, and scoring procedures must assess process quality, recoverability, coordination, and robustness. This pattern appears consistently across agent benchmarks, suggesting a unified design framework for trajectory-level evaluation.
A structure-only model analyzing conversation trajectory achieved 68% accuracy predicting satisfaction, nearly matching full-text LLM analysis at 70%. Combined structural and textual features reached 80%, showing that how conversations unfold geometrically captures interaction quality text-based classifiers miss.
TRACE achieved 68% accuracy predicting dialogue success from structural features alone, nearly matching a 70% content-based baseline. A hybrid combining both reached 80%, suggesting how agents communicate rivals what they say.
Conversational DNA encodes four simultaneous dimensions—linguistic complexity, emotional trajectories, topic coherence, and conversational relevance—as temporal streams. The reverse Turing test finding showed expert assessments of AI diverged sharply, suggesting conversational structure shapes interpretation as much as content.
COMPASS maps dialogue turns onto WAI embeddings to produce 36-dimensional alliance scores per turn. Anxiety and depression show convergence in alliance metrics over time, while suicidality shows persistent misalignment between patient and therapist.
Longitudinal studies with Mitsuku show that social processes driving relationship formation decline as novelty wears off. Single-session study findings cannot be reliably extrapolated to medium- or long-term chatbot design.
Tree-GRPO, Supervised RL, and ToolPO each convert sparse outcome rewards into dense step signals by exploiting different structural features—tree topology, expert-aligned actions, and tool-call positions—eliminating the need for annotated process reward models.
In-context learning for sequential decision-making requires full or partial trajectories from the same environment level, not just isolated examples. This structural property—trajectory burstiness—allows models to generalize across vastly different tasks without weight updates.
Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.
Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.