How do minimal-disclosure privacy contracts enable multi-dimensional agent evaluation?
This note explores why evaluating agents on what they *don't* reveal, with private data disclosed only when strictly needed, forces us to grade them on several separate axes at once, not just on whether they finished the task.
This question is really about a shift in how we score agents: instead of asking only 'did it complete the task?', we ask 'did it complete the task *while* withholding what shouldn't be shared *and* reusing what it was allowed to remember?' The sharpest evidence that these are genuinely different things comes from phone-agent testing with MyPhoneBench, where task success, privacy-compliant completion, and reuse of saved preferences turn out to be statistically distinct capabilities: no single model dominates all three, and success-only rankings do not predict how an agent handles private information or saved preferences (see: Do phone agents succeed at all three critical tasks equally?). That distinctness is the whole argument for multi-dimensional evaluation: collapse it to one number and you hide exactly the failures you most need to see.
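One way to make "no single model wins all three" concrete is to keep the three axes as a vector and ask which agents are undominated, rather than ranking by a single number. The sketch below is illustrative only: the `Scorecard` type, the agent names, and every score are made-up assumptions, not MyPhoneBench data or its scoring method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scorecard:
    """Hypothetical per-agent scores on three axes, each in [0, 1]."""
    task_success: float
    privacy_compliant: float
    preference_reuse: float

    def axes(self) -> tuple:
        return (self.task_success, self.privacy_compliant, self.preference_reuse)


def dominates(a: Scorecard, b: Scorecard) -> bool:
    """True if a is at least as good as b on every axis and strictly better on one."""
    pairs = list(zip(a.axes(), b.axes()))
    return all(x >= y for x, y in pairs) and any(x > y for x, y in pairs)


def pareto_front(cards: dict) -> list:
    """Names of agents that no other agent dominates, sorted alphabetically."""
    return sorted(
        name
        for name, card in cards.items()
        if not any(
            dominates(other, card)
            for other_name, other in cards.items()
            if other_name != name
        )
    )


# Invented numbers, chosen so that each of the first three agents leads one axis.
cards = {
    "agent_a": Scorecard(0.90, 0.40, 0.55),
    "agent_b": Scorecard(0.70, 0.85, 0.50),
    "agent_c": Scorecard(0.65, 0.60, 0.80),
    "agent_d": Scorecard(0.60, 0.55, 0.45),
}

# A success-only leaderboard picks one winner and hides the other two axes.
best_by_success = max(cards, key=lambda name: cards[name].task_success)
front = pareto_front(cards)
```

With these invented scores, the success-only winner is `agent_a`, yet `agent_b` and `agent_c` are also undominated because each leads a different axis; only `agent_d` is strictly worse than another agent across the board.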
Why is minimal disclosure the lever that exposes this? Because privacy leakage isn't a surface slip an agent can bolt a filter onto afterward; it's woven into how the model reasons. About three-quarters (74.8%) of privacy leaks in reasoning traces come from the model directly materializing sensitive user data mid-thought, and longer reasoning chains amplify the leakage. Worse, anonymizing the trace after the fact degrades the model's utility, which suggests the private data was functioning as cognitive scaffolding (see: Do reasoning traces actually expose private user data?). So a contract that says 'reveal only what the task requires' can't be checked by reading the final output alone. You have to evaluate the process, which is inherently a second dimension beyond task success.
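A minimal sketch of what "evaluate the process, not just the output" could look like: audit the reasoning trace and the final output separately against the same contract. Everything here is an assumption for illustration, including the `DisclosureContract` type, the field names, and the use of verbatim substring matching, which is a deliberately crude stand-in for a real leak detector.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DisclosureContract:
    """Hypothetical contract: the profile fields a task is allowed to use."""
    allowed_fields: frozenset


def leaked_fields(text: str, profile: dict, contract: DisclosureContract) -> set:
    """Profile fields whose values appear verbatim in text but are not allowed."""
    lowered = text.lower()
    return {
        field_name
        for field_name, value in profile.items()
        if field_name not in contract.allowed_fields and value.lower() in lowered
    }


def audit(trace: str, output: str, profile: dict, contract: DisclosureContract) -> dict:
    """Check the reasoning trace and the final output as two separate surfaces."""
    return {
        "trace": leaked_fields(trace, profile, contract),
        "output": leaked_fields(output, profile, contract),
    }


# Invented example: a restaurant booking that only needs the user's city.
profile = {"city": "Lisbon", "diagnosis": "asthma", "phone": "555-0100"}
contract = DisclosureContract(allowed_fields=frozenset({"city"}))

trace = "The user lives in Lisbon and has asthma, so avoid smoking terraces."
output = "Booked a non-smoking table in Lisbon for 7pm."

report = audit(trace, output, profile, contract)
```

In this invented case the final output is clean while the trace materializes the diagnosis, so an output-only check would pass an agent that a trace-level check flags.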
The tension deepens because the very things that make agents useful also make them leak. Personalization raises trust and privacy risk together: each remembered detail builds rapport and enlarges the exposure surface (see: Does chatbot personalization build trust or expose privacy risks?). MyPhoneBench measures a closely related pair of axes, 'preference reuse' and 'privacy compliance': an agent that aggressively reuses saved preferences can look helpful while disclosing more than the task requires, and you only catch that conflict if you're measuring both at once.
A related blind spot is what evaluation settings quietly assume. When one model controls every party in a simulation, agents look socially competent, but that competence collapses under genuine information asymmetry, because the omniscient setup let the model skip the grounding work of reasoning about what others *don't* know (see: Why do LLMs fail when simulating agents with private information?). Minimal-disclosure contracts force that asymmetry back into the test: the agent must act without seeing everything, which is the only condition under which privacy behavior is even meaningful to measure.
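The difference between an omniscient setup and one with real information asymmetry can be sketched as a question of which view of the world each agent is handed. The `build_view` helper, the negotiation scenario, and all field names below are hypothetical and only illustrate the idea.

```python
def build_view(world: dict, agent: str, public_keys: set) -> dict:
    """Return what `agent` may see: its own facts in full,
    and only the public keys of every other party."""
    view = {}
    for party, facts in world.items():
        if party == agent:
            view[party] = dict(facts)
        else:
            view[party] = {k: v for k, v in facts.items() if k in public_keys}
    return view


# Invented two-party negotiation; each side holds one private fact.
world = {
    "buyer": {"name": "Ana", "budget": "900"},
    "seller": {"name": "Bo", "floor_price": "700"},
}
public_keys = {"name"}

# Omniscient setting: one model effectively sees the whole world state.
omniscient_view = world

# Asymmetric setting: each agent gets only its own private facts.
buyer_view = build_view(world, "buyer", public_keys)
seller_view = build_view(world, "seller", public_keys)
```

Under the asymmetric views the buyer never sees `floor_price` and the seller never sees `budget`, so any test of what an agent discloses or infers is run against information it actually lacks.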
Where does the contract live so the agent actually honors it? The most durable answer here is that governance works when it's embedded in the agent's runtime memory layer rather than appended as an external policy. In one long-running deployment (889 governance events across 96 active days), the agent consulted its encoded safeguards during actual decisions, which made them effective precisely because they were *in the loop* it read from (see: Can governance rules embedded in runtime memory actually protect autonomous agents?). Read together, these notes suggest minimal-disclosure contracts don't just add a privacy checkbox: they unbundle 'a good agent' into success, restraint, and memory use, and they only become testable when the contract is enforced inside the agent's own reasoning and the evaluation refuses to let any one axis stand in for the others.
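As a rough sketch of what "in the loop it read from" might mean in code, the disclosure rules below live in the same memory object the agent reads facts from, so every recall for disclosure passes through a rule and leaves a governance event. The `Memory` class, its default-deny behavior, and the event format are assumptions made for illustration; the source does not describe the actual implementation.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Memory:
    """Hypothetical memory layer holding facts, disclosure rules, and an event log."""
    facts: dict = field(default_factory=dict)
    rules: dict = field(default_factory=dict)   # fact key -> "share" or "withhold"
    events: list = field(default_factory=list)  # (key, purpose, decision) tuples

    def recall_for_disclosure(self, key: str, purpose: str) -> Optional[str]:
        """Return a fact only if its rule says "share"; log every consultation."""
        decision = self.rules.get(key, "withhold")  # assumption: default-deny
        self.events.append((key, purpose, decision))
        return self.facts.get(key) if decision == "share" else None


# Invented example: a saved seat preference may be reused, a passport number may not.
memory = Memory(
    facts={"seat": "aisle", "passport": "X123"},
    rules={"seat": "share"},
)

seat = memory.recall_for_disclosure("seat", "book flight")
passport = memory.recall_for_disclosure("passport", "book flight")
```

Because the rule lookup sits on the read path, the agent cannot retrieve a fact for disclosure without the safeguard being consulted, and the event log gives the evaluator a per-decision record to score restraint and memory use separately from task success.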
Sources (5 notes)
MyPhoneBench demonstrates that task success, privacy-compliant completion, and saved-preference reuse are statistically distinct capabilities with no model dominating all three. Success-only rankings do not predict privacy or preference performance.
74.8% of privacy leaks in language model reasoning traces result from models materializing sensitive user data during thought processes. Longer reasoning chains amplify leakage, and anonymizing traces post-hoc degrades model utility, suggesting private data functions as cognitive scaffolding.
Longitudinal research shows personalization enhances trust and anthropomorphism but also amplifies privacy concerns and escalating user expectations. One-shot studies miss these temporal dynamics—each interaction raises the baseline, making failures more disappointing.
Research shows LLMs perform well when one model controls all interlocutors but fail systematically when agents possess private information. This reveals that apparent social competence relies on grounding work that models skip in omniscient settings.
A persistent agent recorded 889 governance events across 96 active days, with safeguards encoded directly into the memory layer the agent consulted during operation. Runtime-resident governance proved more effective than external policies because the agent actually accessed it during decision-making.