← All notes

How should agents split planning from visual grounding?

Patterns in how autonomous agents use tools, GUIs, and functions through planning-grounding factoring and memory design.

Topic Hub · 52 linked notes · 14 sections
View as

Tool Calling and Function-Call Architectures

4 notes

Where do traditional function calling systems actually break down?

Function calling seems simple but fails in ways that aren't obvious. This explores three independent failure points—retrieval, context bloat, and output rigidity—that together explain why even the best models struggle.

Explore related Read →

Can models decide better than retrievers which tools to use?

Traditional retrieval picks tools upfront based on initial queries, but do models themselves make better decisions about tool needs as they reason? This explores whether authority over tool selection should move from external systems to the LLM.

Explore related Read →

Why does random tool sampling produce unrealistic synthetic training data?

Tool-calling datasets generated through random sampling and single-turn framing lack the complexity and coherence of real deployment. This explores what structural choices in data synthesis determine whether models can learn realistic tool composition.

Explore related Read →

Can you turn an LLM into an agent by just fine-tuning?

Explores whether upgrading language models to action-producing systems requires only model retraining or demands a broader pipeline transformation including data collection, grounding, integration, and safety evaluation.

Explore related Read →

GUI Agents and Visual UI Understanding

5 notes

Why do planning and grounding pull against each other in agents?

Planning requires flexibility and error recovery while grounding demands action accuracy. Do these conflicting optimization requirements force a design choice about how to structure agent architectures?

Explore related Read →

Can structured interfaces help language models control GUIs better?

Explores whether separating visual understanding from element grounding through an intermediate interface layer improves how language models interact with graphical interfaces. Matters because current end-to-end approaches ask models to do too much at once.

Explore related Read →

Why do vision-only GUI agents struggle with screen interpretation?

Exploring whether GPT-4V's performance bottleneck in GUI automation stems from the simultaneous cognitive load of parsing icon semantics and predicting actions, and whether factoring these tasks improves reliability.

Explore related Read →

Do text-based GUI agents actually work in the real world?

Can language-only agents that rely on HTML or accessibility trees handle actual user interfaces without structured metadata? This matters because deployed systems face visual screenshots, not oracle data.

Explore related Read →

Can unlabeled UI video teach models what users intend?

Can temporal masking on screen recordings learn task-aware representations without paired text labels? This matters because labeled UI video is scarce and expensive, so self-supervised learning could unlock scaling.

Explore related Read →

Agentic Memory Variants

13 notes

Does agent memory work better at one level of abstraction?

Three competing architectures claim superior agent memory transfer using different abstraction levels. Do they all work, or does one architecture genuinely outperform the others across domains?

Explore related Read →

Can agents learn reusable sub-task routines from past experience?

Do web agents fail at long-horizon tasks because they cannot extract and reuse workflows shared across similar problems? This explores whether sub-task abstraction enables skill accumulation rather than task-by-task problem solving.

Explore related Read →

Can frozen language models continually improve through memory structure alone?

If agents can't update parameters, what form of textual memory lets them keep learning across trials and transfer to new tasks without retraining?

Explore related Read →

Does state-indexed memory outperform high-level workflow memory for web agents?

Should procedural memory for web agents be organized around specific environment states and actions, or abstracted into higher-level workflows? This matters because web automation demands precise, context-sensitive recall that workflows might lose.

Explore related Read →

How can GUI agents adapt when software constantly changes?

Can desktop automation agents stay current by combining real-time web documentation with learned task patterns and concrete execution memories? This explores how to avoid training obsolescence in open-world software environments.

Explore related Read →

Can agents learn better from their failures than successes?

Does storing reasoning strategies extracted from both successful and failed experiences improve agent learning compared to tracking only successes or raw trajectories? This matters because failures offer preventative lessons that successes alone cannot teach.

Explore related Read →

How should agent memory split across time scales?

Explores whether agent working memory should be organized by temporal scope—some components persisting across a conversation, others refreshed each turn. Understanding this distinction could reveal why some memory designs fail.

Explore related Read →

Does agent memory degrade when continuously consolidated?

Can consolidating agent experiences into summaries actually harm long-term performance? Research on ARC-AGI tasks suggests continuous memory updates may reduce capability below the no-memory baseline.

Explore related Read →

Should successful and failed episodes be processed differently?

Explores whether asymmetric treatment of trajectories—preserving successes as full demonstrations while abstracting failures into lessons—could improve both the utility and efficiency of memory in reinforcement learning agents.

Explore related Read →

Can LLMs learn reliably at test time without human oversight?

How can language models adapt to rapidly changing rules and knowledge during inference rather than waiting for retraining? What prevents fully autonomous systems from handling conflicting information?

Explore related Read →

Can semantic capability vectors replace manual agent routing?

Explores whether embedding agent capabilities in high-dimensional space and matching them semantically can eliminate brittle, manually-maintained topic-based routing in multi-agent systems.

Explore related Read →

Can agents adapt without pausing service to users?

Can deployed LLM agents continuously improve their capabilities while serving users without interruption? This explores whether fast behavioral updates and slow policy learning can coexist across different timescales.

Explore related Read →

Can a separate trained curator improve skill libraries better than frozen agents?

Explores whether decoupling skill curation from agent execution enables better long-term learning of what skills to keep, delete, or refine. Matters because manual curation doesn't scale and heuristic approaches lack feedback.

Explore related Read →

Agent Training and Environment Design

1 note

What blocks scaling from language models to autonomous agents?

If large language models excel at next-token prediction, why do they struggle with long-horizon goal-oriented tasks? This explores whether the bottleneck is model capacity or the environments used to train them.

Explore related Read →

Agent Economy and Interaction Scaling

2 notes

Will agents compete for attention just like users do?

As autonomous agents take over user tasks, will the Web's economic competition shift from human clicks to agent invocations? This explores whether existing ad-market mechanisms could scale to agent decision-making.

Explore related Read →

Does agent interaction time scale separately from reasoning depth?

Can agents improve by taking more environment steps rather than thinking harder per step? This matters because partially observable tasks like web navigation may need exploration and backtracking that deeper reasoning alone cannot provide.

Explore related Read →

Agent Efficiency and System-Level Optimization *(2026-05-18 batch C)*

3 notes

Does agent efficiency really break down into three distinct components?

Can we understand agent efficiency as three independent optimization problems—memory, tool use, and planning—each with separate cost drivers? This matters because it could explain why point optimizations keep missing the bigger picture.

Explore related Read →

Why does agent efficiency differ from model size reduction?

Explores why making models smaller doesn't solve agent cost problems. Agents loop recursively, compounding costs multiplicatively, so efficiency requires system-level design, not just parameter reduction.

Explore related Read →

Do efficiency techniques across agent components reveal shared structural constraints?

Despite targeting different parts of agentic systems, efficiency techniques converge on similar principles. This raises a question: are these convergences independent discoveries, or do they reflect deeper architectural constraints that all agent systems face?

Explore related Read →

Long-Horizon Agent Architecture *(2026-05-18 batch C)*

3 notes

Can agents compress their own memory without losing critical details?

Explores whether agents can autonomously consolidate interaction history into structured memory schemas that reduce token overhead while preserving information needed for long-horizon reasoning and strategic reflection.

Explore related Read →

Can agents discover tools dynamically instead of pre-selecting them?

Explore whether agents can find needed tools during execution rather than choosing from a fixed set upfront. This matters for long-horizon tasks where relevant tools cannot be known in advance.

Explore related Read →

Can simulated APIs and token-level credit assignment train better tool-using agents?

Training agents to use real APIs is expensive and unstable, and sparse rewards make it hard to credit the right tool calls. Can combining LLM simulators with fine-grained advantage attribution solve both problems?

Explore related Read →

Privacy Contract for Agent Deployment *(2026-05-18 batch C)*

1 note

Can a two-category privacy boundary actually be auditable?

Most privacy frameworks are either too vague or too complex for agent deployment. Can a minimal binary split—LOW versus HIGH data categories—provide enough clarity for both users and automated compliance auditing?

Explore related Read →

Vibe Coding and Hybrid Workflows

2 notes

Does vibe coding actually keep humans in the loop?

Vibe coding claims to keep developers steering and validating, but do novices actually engage with code and testing the way the tool design assumes? The gap between intended and actual behavior could compound failures.

Explore related Read →

Where do vibe coding students actually spend their debugging time?

When novices use AI coding tools, do they engage with the code itself, or do they primarily test the prototype? Understanding where students focus reveals how AI-assisted coding shapes learning behavior.

Explore related Read →

Code as Agent Harness *(2026-05-28)*

6 notes

Can code become the operational substrate for agent reasoning?

Explores whether code, beyond being an LLM output, functions as the primary medium through which agents reason, act, observe, and verify progress in complex tasks.

Explore related Read →

How do model capabilities differ from harness infrastructure in agents?

What distinct layers make up an agentic system, and how do failures in each layer differ? Understanding this decomposition helps pinpoint whether problems stem from the model, the infrastructure, or the agent's own code.

Explore related Read →

What makes agent-created code artifacts so hard to manage?

Agent-authored code that persists and is shared across systems raises difficult questions about what should be kept versus discarded, and how to maintain consistent state when multiple agents collaborate on the same artifacts.

Explore related Read →

Does creating skills inside the agent loop eliminate mismatches?

Can coupling skill creation directly to the runtime reasoning loop—rather than authoring skills offline—close the gap between when skills are made and when they're used? This matters for whether agents can ground new capabilities in their actual situated context.

Explore related Read →

Does constraining edits help agents improve their own skills?

When agents rewrite their own instructions, does freedom to edit lead to better learning, or do safeguards like edit budgets and memory of failures produce more stable improvement?

Explore related Read →

Can skill documents be optimized like neural network weights?

Can natural-language skill documents be treated as trainable parameters and improved through iterative optimization with validation gating, similar to how model weights are tuned in deep learning?

Explore related Read →

Harness as Locus of Capability — Batch #3 backlog *(2026-06-03)*

5 notes

Can externalizing bookkeeping improve search agent performance?

Does moving routine state management out of the policy and into a stateful environment harness free reinforcement learning to focus on genuine semantic decisions? This explores whether division of labor between environment and model improves search efficiency.

Explore related Read →

Does raw token spending actually predict agent performance?

Standard measures of agent effort—tokens, tool calls, operations—may not capture what makes inference-time scaling work. This explores what actually drives performance gains when agents spend more compute.

Explore related Read →

Can external managers compress context better than frozen agents?

Explores whether offloading context management to a trained external system can adapt compression strategies to individual agent strengths, rather than forcing agents to manage their own context constraints.

Explore related Read →

Can agents fail from weak memory control rather than missing knowledge?

As multi-turn agent workflows grow longer, performance degrades—but is this due to insufficient context or poor memory management? This explores whether memory *control* is the real bottleneck.

Explore related Read →

Why do RL agents exploit before exploring enough?

Standard task-oriented RL rewards immediate task completion over environment discovery. This may systematically under-train the exploration skills needed for unfamiliar environments.

Explore related Read →

Reading-memory and screen understanding — Batch #3 backlog wave 2 *(2026-06-03)*

2 notes

Can LLMs read long documents like humans do?

How might mimicking human reading strategies—storing gist memories and looking up details on demand—help language models handle documents beyond their effective context window?

Explore related Read →

Can one model understand both UIs and infographics equally well?

Screen UIs and infographics share visual structure but have been tackled separately. Can a unified schema and annotation-based pretraining bridge them in a single small model?

Explore related Read →

API-grounded workflow generation — Batch #5 backlog *(2026-06-03)*

1 note

Can LLMs generate workflows without touching proprietary data?

Explores whether LLMs can orchestrate task automation by composing API calls rather than directly accessing confidential information, and whether this approach preserves security while handling unpredictable tasks.

Explore related Read →