INQUIRING LINE

Why might text-only interfaces underestimate agent preference elicitation capabilities?

This explores why judging an agent's ability to learn what users want by watching it chat in text may sell it short — because preference elicitation happens through channels the text box never sees: observation, richer interfaces, and active questioning.


This reads the question as being about measurement bias: if you only ever watch an agent infer preferences through a text chat window, you may conclude it's weak at preference elicitation — when in fact the text box is the bottleneck, not the agent. The corpus suggests the limitation lives in the interface, not the capability.

Start with the assumption baked into 'text-only': that eliciting a preference means asking for it in words. But agents can learn preferences by watching rather than asking. The M3-Agent work shows that an entity-centric memory graph fed by continuous multimodal observation lets an agent infer and act on what a user wants without ever posing a question (see "Can agents learn preferences by watching rather than asking?"). A text-only setup is blind to exactly this passive, ambient signal, so it can only measure the narrow slice of elicitation that survives being typed out.
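The idea of turning passive observations into preference knowledge can be sketched in a few lines. This is a minimal illustration and not M3-Agent's implementation: the `EntityNode` and `EntityMemory` classes, the `min_support` threshold, and the majority-count consolidation rule are all assumptions made for the example, and real observations would come from video and audio rather than hand-entered tuples.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field


@dataclass
class EntityNode:
    """One node per entity (hypothetical structure, for illustration only)."""
    name: str
    episodic: list = field(default_factory=list)  # raw (timestamp, attribute, value) events
    semantic: dict = field(default_factory=dict)  # consolidated, longer-lived knowledge


class EntityMemory:
    """Entity-centric store that keeps episodic events apart from semantic facts."""

    def __init__(self):
        self.entities = {}

    def observe(self, entity, timestamp, attribute, value):
        # Memorization: log what was seen, without asking the user anything.
        node = self.entities.setdefault(entity, EntityNode(entity))
        node.episodic.append((timestamp, attribute, value))

    def consolidate(self, entity, min_support=2):
        # Assumed rule: promote the most frequent value of an attribute to a
        # semantic preference once it has been observed at least min_support times.
        node = self.entities[entity]
        counts = defaultdict(Counter)
        for _, attribute, value in node.episodic:
            counts[attribute][value] += 1
        for attribute, counter in counts.items():
            value, support = counter.most_common(1)[0]
            if support >= min_support:
                node.semantic[attribute] = value
        return node.semantic


memory = EntityMemory()
memory.observe("alice", 1, "drink", "coffee")
memory.observe("alice", 2, "drink", "tea")
memory.observe("alice", 3, "drink", "coffee")
memory.observe("alice", 4, "seat", "window")
print(memory.consolidate("alice"))  # prints {'drink': 'coffee'}
```

The single "seat" observation stays episodic because it has not yet recurred, which is the kind of distinction a text-only transcript never gets the chance to make.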

Text is also a thin and passive channel in its own right. Conversational agents are structurally passive: their training optimizes for responding, not for leading, so in a pure chat setting they won't proactively probe for what they don't yet know (see "Why can't conversational AI agents take the initiative?"). And when interfaces are compared head-to-head, users prefer generated task-specific UIs over text blocks in more than 70% of cases; structured, interactive surfaces let people express and refine intent that a wall of text muddies (see "Do generated interfaces outperform text-based chat for most tasks?"). A dashboard with sliders, for example, can surface preferences that the same user would never volunteer in prose.

The same loss of signal shows up at the perception layer. Text-based GUI agents that read a page as HTML or an accessibility tree miss what humans actually see; real grounding needs vision, not a flattened text transcript of the screen (see "Do text-based GUI agents actually work in the real world?"). Preference cues, such as what a user lingers on or clicks, live in that perceptual richness, and a text-only evaluation discards them before the agent gets a chance to use them.

Worth noticing is how cheaply elicitation can work once you give it the right channel. PReF shows that just ten well-chosen adaptive questions can pin down a personalized reward function through active learning (see "Can user preferences be learned from just ten questions?"), and conversation-analysis 'insert-expansions' give a formal account of when an agent should pause to ask versus quietly proceed (see "When should AI agents ask users instead of just searching?"). How preferences are stored matters too: abstract semantic preference summaries beat replaying raw past chats (see "Does abstract preference knowledge outperform specific interaction recall?"). The takeaway you might not have expected: the agent's real preference-elicitation ability is a product of the modality it's allowed to use, so a text-only interface doesn't just limit the agent, it quietly hides how good it could be.
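To make the "ten well-chosen questions" idea concrete, here is a toy sketch of active question selection over reward coefficients. It is not the PReF algorithm: it assumes a noise-free simulated user, a coarse grid of candidate coefficient vectors (`LEVELS`), and a simple rule that picks whichever pairwise question splits the remaining candidates most evenly. All function names and parameters are hypothetical.

```python
from itertools import product

LEVELS = (0.0, 0.5, 1.0)  # assumed coarse grid of possible coefficient values


def candidate_coefficients(dim):
    """All nonzero coefficient vectors on the grid (the hypothesis set)."""
    return [w for w in product(LEVELS, repeat=dim) if any(w)]


def candidate_questions(dim):
    """Each question is diff = base_rewards(A) - base_rewards(B) for a pair A, B."""
    return [d for d in product((-1, 0, 1), repeat=dim) if any(d)]


def prefers_a(w, diff):
    """A user with coefficients w prefers A when the weighted difference is positive."""
    return sum(wi * di for wi, di in zip(w, diff)) > 0


def yes_count(hypotheses, diff):
    return sum(prefers_a(w, diff) for w in hypotheses)


def elicit(true_w, questions, budget=10):
    """Ask up to `budget` questions; return surviving hypotheses and questions asked.

    Assumes true_w lies on the grid and that answers are noise-free.
    """
    hypotheses = candidate_coefficients(len(true_w))
    remaining = list(questions)
    asked = []
    while len(asked) < budget and remaining:
        # Most informative question: predicted yes/no split closest to even.
        q = min(remaining,
                key=lambda d: abs(2 * yes_count(hypotheses, d) - len(hypotheses)))
        if yes_count(hypotheses, q) in (0, len(hypotheses)):
            break  # no remaining question separates the surviving hypotheses
        remaining.remove(q)
        answer = prefers_a(true_w, q)  # simulated user response
        hypotheses = [w for w in hypotheses if prefers_a(w, q) == answer]
        asked.append(q)
    return hypotheses, asked


survivors, asked = elicit((1.0, 0.5, 0.0), candidate_questions(3))
print(len(asked), "questions asked;", len(survivors), "coefficient vectors remain")
```

Because pairwise answers only reveal the direction of the coefficient vector, some candidates that differ only by scale can never be told apart; a probabilistic treatment with noisy answers would replace the hard elimination step with a posterior update.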


Sources (7 notes)

Can agents learn preferences by watching rather than asking?

M3-Agent demonstrates that separating episodic events from semantic knowledge in an entity-centric graph, combined with parallel memorization and control processes, allows agents to infer and act on user preferences without asking. This architecture mirrors human cognitive systems that bind disparate information about individuals across sensory modalities.

Why can't conversational AI agents take the initiative?

Research shows LLMs including ChatGPT cannot initiate topics, plan strategically, or lead conversations because their training optimizes for responding to queries, not creating dialogue from agent goals. This passivity is reinforced by alignment objectives and masked by fluent-sounding outputs.

Do generated interfaces outperform text-based chat for most tasks?

Research shows users strongly prefer LLM-generated interactive interfaces—dashboards, tools, animations—over text blocks, especially for structured and information-dense tasks. Structured representation and iterative refinement reduce cognitive load.

Do text-based GUI agents actually work in the real world?

ShowUI demonstrates that GUI agents need end-to-end vision-language-action models with UI-aware token selection and interleaved streaming, not adapted general-purpose MLLMs. Standard multimodal models lack the grounding and action capabilities real interface navigation demands.

Can user preferences be learned from just ten questions?

PReF learns base reward functions from preference data, then uses active learning to select maximally informative questions that reduce coefficient uncertainty. Users can be personalized via inference-time reward alignment without weight modification.

When should AI agents ask users instead of just searching?

Tool-enabled LLMs drift from user intent through silent tool chaining. Conversation analysis reveals insert-expansions—clarifying intent, scoping responses, enhancing appeal—as a formal framework for proactive user consultation that prevents misunderstanding instead of recovering from it.

Does abstract preference knowledge outperform specific interaction recall?

PRIME framework shows semantic memory (preference summaries, parametric encodings) consistently beats episodic memory (retrieved past interactions) across models. Recency-based recall outperforms similarity-based retrieval, and task fine-tuning exceeds preference tuning methods.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about agent preference elicitation in text-only vs. multimodal interfaces. The question: *Do text-only interfaces genuinely underestimate agent preference elicitation capability, or have recent advances in LLM reasoning, orchestration, or evaluation methods flattened that gap?*

What a curated library found — and when (dated claims, not current truth): Findings span 2023–2026; treat as perishable anchors.
• Multimodal observation (entity-centric memory graphs) lets agents infer preferences passively without asking; text-only setups miss ambient signals (~2024).
• Conversational agents are structurally passive (trained to respond, not lead); text chat alone won't elicit what users don't volunteer (~2024).
• Users prefer dynamically generated task-specific UIs over text blocks in >70% of cases; structured interfaces surface preferences that prose obscures (~2025).
• Text-based GUI agents reading HTML/accessibility trees lose perceptual grounding; preference cues (dwell, clicks) live in vision, not transcripts (~2024).
• Adaptive active learning (ten well-chosen questions via reward factorization) can pin down personalized reward functions cheaply (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2307.01644 (2023) — Insert-expansions for conversational agents
• arXiv:2508.19227 (2025) — Generative Interfaces for Language Models
• arXiv:2503.06358 (2025) — Reward Factorization for LM Personalization
• arXiv:2604.08224 (2026) — Externalization in LLM Agents (memory, harness)

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models (o1, Claude 3.7, reasoning-scaled variants), in-context learning, chain-of-thought scaffolding, or novel evaluation harnesses have since RELAXED the text-only bottleneck; that is, can a pure text interface now recover multimodal preference signals through linguistic reasoning alone? Separate the durable claim (multimodal > unimodal for preference elicitation) from the perishable limitation (text chat can't do it). Cite what has been resolved and what still holds.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Have any papers shown that text-only agents with in-context examples, retrieval-augmented thought, or chain-of-thought can match or exceed structured UI performance on preference discovery?
(3) Propose 2 research questions that ASSUME the regime may have moved: e.g., *Does reasoning-scale LLM introspection allow text agents to self-prompt for preference gaps without external UI scaffolding?* *Can semantic memory abstraction in text chat recover 80%+ of the preference signal that vision+click data provides?*

Cite arXiv IDs; flag anything you cannot ground in a real paper.
