How might automated evals eventually capture the human judgment designers exercise now?
This explores whether the 'taste' and judgment designers apply by hand today can be turned into machine-runnable evaluations — and what the corpus says about how that formalization happens, how far it reaches, and where it breaks down.
Designer judgment, the eye for what's good that feels irreducibly human, may be reifiable into automated evals, and the corpus treats that less as a far-off possibility than as a process already underway. The clearest framing comes from Will AI automation eventually formalize designer taste?, which argues that automation waves follow a recurring arc: practitioners name some capacity as the part machines can never touch, and then that capacity gets written down as a process and executed. Taste is already being formalized through evaluation rubrics and preference data, which shifts the designer from the person who exercises judgment to the person who *authors the criteria* the machine applies. The interesting move isn't replacement; it's relocation.
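To make the criteria-authoring role concrete, here is a minimal Python sketch, not drawn from the corpus, of a rubric expressed as data. The `Criterion` class, the three example criteria, their weights, and the design fields (`font_px`, `contrast`, `primary_actions`) are all hypothetical illustrations; a real rubric would likely combine checkable rules like these with preference data.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float
    check: Callable[[dict], bool]  # True when the design satisfies the criterion

# Hypothetical criteria a designer might author; names and weights are illustrative.
RUBRIC = [
    Criterion("body text is at least 16px", 2.0, lambda d: d["font_px"] >= 16),
    Criterion("contrast ratio is at least 4.5", 3.0, lambda d: d["contrast"] >= 4.5),
    Criterion("one primary action per screen", 1.0, lambda d: d["primary_actions"] == 1),
]

def score(design: dict, rubric: list[Criterion]) -> tuple[float, list[str]]:
    """Return the weighted pass fraction and the names of failed criteria."""
    total = sum(c.weight for c in rubric)
    failed = [c.name for c in rubric if not c.check(design)]
    passed_weight = sum(c.weight for c in rubric if c.check(design))
    return passed_weight / total, failed

# Hypothetical design description to evaluate.
design = {"font_px": 14, "contrast": 5.2, "primary_actions": 1}
result, failures = score(design, RUBRIC)
```

In this shape the scorer stays fixed while the designer's judgment lives in the rubric: editing a weight or a check changes the verdict without touching `score`.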
The mechanism by which judgment becomes machine-legible is the most active research frontier here. The crude version, a single LLM scoring outputs, is unreliable; Can agents evaluate AI outputs more reliably than language models? reports that an eight-module agentic evaluator that *collects evidence* before judging showed a 0.27% judge shift on complex tasks versus 31% for a plain LLM judge, roughly a hundredfold reduction, although errors cascading from its memory module show that such systems need error isolation to keep the gain. The same instinct shows up in reward modeling: Can reward models benefit from reasoning before scoring? and Can judges that reason about reasoning outperform classifier rewards? both find that judges which *reason* before scoring, producing a chain of thought about why something is good, beat judges that just classify. So the path from human judgment to automated eval isn't 'compress taste into a number,' it's 'teach the evaluator to deliberate.' That looks a lot more like how a designer actually decides.
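The evidence-then-deliberate pattern described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not the architecture of any cited system: `collect_evidence`, `judge`, and the `Verdict` fields are hypothetical names, and the substring checks stand in for whatever tools or model calls a real evaluator would use.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    evidence: list[str]   # observations gathered before scoring
    rationale: str        # written explanation derived from the evidence
    score: float          # fraction of requirements satisfied

def collect_evidence(output: str, requirements: list[str]) -> list[str]:
    """Record one checkable observation per requirement (hypothetical check)."""
    text = output.lower()
    return [
        f"{'found' if req.lower() in text else 'missing'}: {req}"
        for req in requirements
    ]

def judge(output: str, requirements: list[str]) -> Verdict:
    """Produce a score only after evidence and a rationale exist."""
    evidence = collect_evidence(output, requirements)
    missing = [e for e in evidence if e.startswith("missing")]
    rationale = (
        "All requirements are covered."
        if not missing
        else "Unmet requirements: " + "; ".join(m.split(": ", 1)[1] for m in missing)
    )
    score = (len(evidence) - len(missing)) / len(evidence) if evidence else 0.0
    return Verdict(evidence, rationale, score)

# Hypothetical output under review and hypothetical requirements.
verdict = judge(
    "The checkout page shows an error summary and keeps focus on the form.",
    ["error summary", "focus", "undo"],
)
```

The ordering is the design choice worth noticing: the score is computed from the recorded evidence, so every number can be traced back to an observation and a stated reason.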
If judgment is partly about *whose* standard you're applying, Can personas extracted from documents generalize across evaluation tasks? points at how that gets captured too: extracting stakeholder personas from real domain documents and staging a structured debate among them, so the eval reproduces the multiple perspectives a designer holds in their head rather than one flattened rubric. And Should interactive evaluation be designed as a unified paradigm? makes the meta-point — that getting this right is itself a design discipline, with explicit protocols, not a pile of disconnected benchmarks. The designer's judgment doesn't vanish; it migrates up a level into the architecture of the evaluation system.
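One way to keep multiple perspectives visible rather than flattened is to report disagreement alongside the average. The sketch below is an assumption-laden illustration in Python, not MAJ-EVAL's three-phase debate: the `panel_report` function, the persona names, the scores, and the `spread_threshold` default are all hypothetical, and in practice the scores would come from persona-conditioned judges.

```python
from statistics import mean, pstdev

def panel_report(scores: dict[str, float], spread_threshold: float = 0.5) -> dict:
    """Summarize per-persona scores without collapsing them to one number."""
    values = list(scores.values())
    spread = pstdev(values)  # population standard deviation across personas
    return {
        "mean": mean(values),
        "spread": spread,
        "contested": spread > spread_threshold,       # flag for human review
        "most_critical": min(scores, key=scores.get),  # persona with lowest score
    }

# Hypothetical personas and scores on a 1-5 scale.
report = panel_report({"clinician": 4.0, "patient": 2.0, "regulator": 3.0})
```

A contested flag gives the designer a place to step back in: the eval surfaces where stakeholders diverge instead of hiding it inside a mean.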
But the corpus also marks a hard boundary, and this is where the answer gets interesting. Can AI replicate the communicative work experts do? argues that expert judgment is fundamentally *communicative* — it anticipates what an audience will accept as valid, not just what's correct — and that AI has no mechanism for this anticipatory social work. If that's right, evals can capture the verifiable surface of taste while missing the part that's about reading a room. There's a cautionary echo in Can imitating ChatGPT fool evaluators into thinking models improved?: models that imitate a confident style fool human evaluators while closing no real capability gap — meaning a badly-built eval can certify the *appearance* of judgment.
The stakes of getting this wrong are systemic. Can AI generate knowledge faster than humans can evaluate it? warns that when generation outpaces verification, and the verification tools are themselves AI, the whole system loses its footing, which is exactly the trap if automated evals replace rather than extend human judgment. The more constructive direction may be the one in Do reflection questions help people make better decisions with AI?: in a lab study of 80 participants, thinking assistants that combined reflection questions with advice outperformed ones that only advised, only questioned, or did neither, which suggests that evals asking the designer reflection questions alongside a verdict may support better decisions than evals that only hand the verdict down. The honest answer, then, is that automated evals will capture more of designer judgment than most designers expect (the deliberative, multi-perspective, criteria-authoring parts) while the communicative, audience-anticipating core stays stubbornly human, and the real design job becomes deciding which is which.
Sources (10 notes)
Historical automation waves follow a pattern: practitioners identify a core human capacity as irreplaceable, then that capacity gets formalized into processes machines can execute. Taste is already being formalized through evaluation rubrics and preference data that AI applies, shifting the designer's role from executor to eval author.
Eight-module agentic evaluation achieved 0.27% judge shift versus 31% for LLM-as-a-Judge on complex tasks. However, the memory module cascaded errors, revealing that agentic systems need error isolation mechanisms to maintain gains.
Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.
StepWiser demonstrates that training judges to produce reasoning chains about policy reasoning—rather than classify steps—yields better judgment accuracy and data efficiency. Independent confirmation from GenPRM and ThinkPRM shows generative PRMs outperform discriminative ones with orders of magnitude less training data.
MAJ-EVAL automatically extracts stakeholder personas from domain documents via semantic clustering and orchestrates structured three-phase debate, achieving reproducible evaluation that transfers across tasks like summarization and dialogue without manual redesign. The approach grounds personas in real stakeholder perspectives rather than arbitrary roles.
Interactive evaluation should be treated as a principled paradigm with explicit protocols and reporting standards, not adopted as disconnected benchmarks. The distinction matters: designing interactive evaluation as a unified system prevents fragmentation and incomparability, while expanding what counts as evidence beyond final responses.
Expertise requires anticipating audience acceptability and social validity, not just retrieving information. AI lacks the mechanism to perform this communicative work, making its fluent output epistemically misleading despite its confident form.
Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.
AI produces knowledge faster than human judgment can verify it, collapsing epistemic confidence just as monetary hyperinflation collapses purchasing power. The gap self-reinforces because evaluation tools are themselves AI-generated, trapping the system in acceleration.
A lab study of 80 participants found that thinking assistants combining reflection questions with advice significantly outperformed agents that only advised, only questioned, or did neither. Prioritizing Socratic questioning over authoritative answers enhanced cognitive outcomes.