How should agents split planning from visual grounding? · Gravity7

Tool Calling and Function-Call Architectures

4 notes

Where do traditional function calling systems actually break down?

Function calling seems simple but fails in ways that aren't obvious. This explores three independent failure points—retrieval, context bloat, and output rigidity—that together explain why even the best models struggle.

Can models decide better than retrievers which tools to use?

Traditional retrieval picks tools upfront based on initial queries, but do models themselves make better decisions about tool needs as they reason? This explores whether authority over tool selection should move from external systems to the LLM.

Why does random tool sampling produce unrealistic synthetic training data?

Tool-calling datasets generated through random sampling and single-turn framing lack the complexity and coherence of real deployment. This explores what structural choices in data synthesis determine whether models can learn realistic tool composition.

Can you turn an LLM into an agent by just fine-tuning?

Explores whether upgrading language models to action-producing systems requires only model retraining or demands a broader pipeline transformation including data collection, grounding, integration, and safety evaluation.

GUI Agents and Visual UI Understanding

5 notes

Why do planning and grounding pull against each other in agents?

Planning requires flexibility and error recovery while grounding demands action accuracy. Do these conflicting optimization requirements force a design choice about how to structure agent architectures?

Can structured interfaces help language models control GUIs better?

Explores whether separating visual understanding from element grounding through an intermediate interface layer improves how language models interact with graphical interfaces. Matters because current end-to-end approaches ask models to do too much at once.

Why do vision-only GUI agents struggle with screen interpretation?

Explores whether GPT-4V's performance bottleneck in GUI automation stems from the simultaneous cognitive load of parsing icon semantics and predicting actions, and whether separating these tasks improves reliability.

Do text-based GUI agents actually work in the real world?

Can language-only agents that rely on HTML or accessibility trees handle actual user interfaces without structured metadata? This matters because deployed systems face visual screenshots, not oracle data.

Can unlabeled UI video teach models what users intend?

Can temporal masking on screen recordings learn task-aware representations without paired text labels? This matters because labeled UI video is scarce and expensive, so self-supervised learning could unlock scaling.

Agentic Memory Variants

13 notes

Does agent memory work better at one level of abstraction?

Three competing architectures claim superior agent memory transfer using different abstraction levels. Do they all work, or does one architecture genuinely outperform the others across domains?

Can agents learn reusable sub-task routines from past experience?

Do web agents fail at long-horizon tasks because they cannot extract and reuse workflows shared across similar problems? This explores whether sub-task abstraction enables skill accumulation rather than task-by-task problem solving.

Can frozen language models continually improve through memory structure alone?

If agents can't update parameters, what form of textual memory lets them keep learning across trials and transfer to new tasks without retraining?

Does state-indexed memory outperform high-level workflow memory for web agents?

Should procedural memory for web agents be organized around specific environment states and actions, or abstracted into higher-level workflows? This matters because web automation demands precise, context-sensitive recall that workflows might lose.

How can GUI agents adapt when software constantly changes?

Can desktop automation agents stay current by combining real-time web documentation with learned task patterns and concrete execution memories? This explores how to avoid training obsolescence in open-world software environments.

Can agents learn better from their failures than successes?

Does storing reasoning strategies extracted from both successful and failed experiences improve agent learning compared to tracking only successes or raw trajectories? This matters because failures offer preventative lessons that successes alone cannot teach.

How should agent memory split across time scales?

Explores whether agent working memory should be organized by temporal scope—some components persisting across a conversation, others refreshed each turn. Understanding this distinction could reveal why some memory designs fail.

Does agent memory degrade when continuously consolidated?

Can consolidating agent experiences into summaries actually harm long-term performance? Research on ARC-AGI tasks suggests continuous memory updates may reduce capability below the no-memory baseline.

Should successful and failed episodes be processed differently?

Explores whether asymmetric treatment of trajectories—preserving successes as full demonstrations while abstracting failures into lessons—could improve both the utility and efficiency of memory in reinforcement learning agents.

Can LLMs learn reliably at test time without human oversight?

How can language models adapt to rapidly changing rules and knowledge during inference rather than waiting for retraining? What prevents fully autonomous systems from handling conflicting information?

Can semantic capability vectors replace manual agent routing?

Explores whether embedding agent capabilities in high-dimensional space and matching them semantically can eliminate brittle, manually-maintained topic-based routing in multi-agent systems.

Can agents adapt without pausing service to users?

Can deployed LLM agents continuously improve their capabilities while serving users without interruption? This explores whether fast behavioral updates and slow policy learning can coexist across different timescales.

Can a separate trained curator improve skill libraries better than frozen agents?

Explores whether decoupling skill curation from agent execution enables better long-term learning of what skills to keep, delete, or refine. Matters because manual curation doesn't scale and heuristic approaches lack feedback.

Agent Training and Environment Design

1 note

What blocks scaling from language models to autonomous agents?

If large language models excel at next-token prediction, why do they struggle with long-horizon goal-oriented tasks? This explores whether the bottleneck is model capacity or the environments used to train them.

Agent Economy and Interaction Scaling

2 notes

Will agents compete for attention just like users do?

As autonomous agents take over user tasks, will the Web's economic competition shift from human clicks to agent invocations? This explores whether existing ad-market mechanisms could scale to agent decision-making.

Does agent interaction time scale separately from reasoning depth?

Can agents improve by taking more environment steps rather than thinking harder per step? This matters because partially observable tasks like web navigation may need exploration and backtracking that deeper reasoning alone cannot provide.

Agent Efficiency and System-Level Optimization (2026-05-18 batch C)

3 notes

Does agent efficiency really break down into three distinct components?

Can we understand agent efficiency as three independent optimization problems—memory, tool use, and planning—each with separate cost drivers? This matters because it could explain why point optimizations keep missing the bigger picture.

Why does agent efficiency differ from model size reduction?

Explores why making models smaller doesn't solve agent cost problems. Agents loop recursively, compounding costs multiplicatively, so efficiency requires system-level design, not just parameter reduction.

Do efficiency techniques across agent components reveal shared structural constraints?

Despite targeting different parts of agentic systems, efficiency techniques converge on similar principles. This raises a question: are these convergences independent discoveries, or do they reflect deeper architectural constraints that all agent systems face?

Long-Horizon Agent Architecture (2026-05-18 batch C)

3 notes

Can agents compress their own memory without losing critical details?

Explores whether agents can autonomously consolidate interaction history into structured memory schemas that reduce token overhead while preserving information needed for long-horizon reasoning and strategic reflection.

Can agents discover tools dynamically instead of pre-selecting them?

Explores whether agents can find needed tools during execution rather than choosing from a fixed set upfront. This matters for long-horizon tasks where relevant tools cannot be known in advance.

Can simulated APIs and token-level credit assignment train better tool-using agents?

Training agents to use real APIs is expensive and unstable, and sparse rewards make it hard to credit the right tool calls. Can combining LLM simulators with fine-grained advantage attribution solve both problems?

Privacy Contract for Agent Deployment (2026-05-18 batch C)

1 note

Can a two-category privacy boundary actually be auditable?

Most privacy frameworks are either too vague or too complex for agent deployment. Can a minimal binary split—LOW versus HIGH data categories—provide enough clarity for both users and automated compliance auditing?

Vibe Coding and Hybrid Workflows

2 notes

Does vibe coding actually keep humans in the loop?

Vibe coding claims to keep developers steering and validating, but do novices actually engage with code and testing the way the tool design assumes? The gap between intended and actual behavior could compound failures.

Where do vibe coding students actually spend their debugging time?

When novices use AI coding tools, do they engage with the code itself, or do they primarily test the prototype? Understanding where students focus reveals how AI-assisted coding shapes learning behavior.

Code as Agent Harness (2026-05-28)

6 notes

Can code become the operational substrate for agent reasoning?

Explores whether code, beyond being an LLM output, functions as the primary medium through which agents reason, act, observe, and verify progress in complex tasks.

How do model capabilities differ from harness infrastructure in agents?

What distinct layers make up an agentic system, and how do failures in each layer differ? Understanding this decomposition helps pinpoint whether problems stem from the model, the infrastructure, or the agent's own code.

What makes agent-created code artifacts so hard to manage?

Agent-authored code that persists and is shared across systems raises difficult questions about what should be kept versus discarded, and how to maintain consistent state when multiple agents collaborate on the same artifacts.

Does creating skills inside the agent loop eliminate mismatches?

Can coupling skill creation directly to the runtime reasoning loop—rather than authoring skills offline—close the gap between when skills are made and when they're used? This matters for whether agents can ground new capabilities in their actual situated context.

Does constraining edits help agents improve their own skills?

When agents rewrite their own instructions, does freedom to edit lead to better learning, or do safeguards like edit budgets and memory of failures produce more stable improvement?

Can skill documents be optimized like neural network weights?

Can natural-language skill documents be treated as trainable parameters and improved through iterative optimization with validation gating, similar to how model weights are tuned in deep learning?

Harness as Locus of Capability — Batch #3 backlog (2026-06-03)

5 notes

Can externalizing bookkeeping improve search agent performance?

Does moving routine state management out of the policy and into a stateful environment harness free reinforcement learning to focus on genuine semantic decisions? This explores whether division of labor between environment and model improves search efficiency.

Does raw token spending actually predict agent performance?

Standard measures of agent effort—tokens, tool calls, operations—may not capture what makes inference-time scaling work. This explores what actually drives performance gains when agents spend more compute.

Can external managers compress context better than frozen agents?

Explores whether offloading context management to a trained external system can adapt compression strategies to individual agent strengths, rather than forcing agents to manage their own context constraints.

Can agents fail from weak memory control rather than missing knowledge?

As multi-turn agent workflows grow longer, performance degrades, but is this due to insufficient context or poor memory management? This explores whether memory control is the real bottleneck.

Why do RL agents exploit before exploring enough?

Standard task-oriented RL rewards immediate task completion over environment discovery. This may systematically under-train the exploration skills needed for unfamiliar environments.

Reading-memory and screen understanding — Batch #3 backlog wave 2 (2026-06-03)

2 notes

Can LLMs read long documents like humans do?

How might mimicking human reading strategies—storing gist memories and looking up details on demand—help language models handle documents beyond their effective context window?

Can one model understand both UIs and infographics equally well?

Screen UIs and infographics share visual structure but have been tackled separately. Can a unified schema and annotation-based pretraining bridge them in a single small model?

API-grounded workflow generation — Batch #5 backlog (2026-06-03)

1 note

Can LLMs generate workflows without touching proprietary data?

Explores whether LLMs can orchestrate task automation by composing API calls rather than directly accessing confidential information, and whether this approach preserves security while handling unpredictable tasks.

Related Areas

4 notes

Why do multi-agent systems fail despite individual capability?

Multi-agent systems show lower performance than individual models despite coordinating multiple reasoning instances. What structural failures emerge when multiple LLMs deliberate together, and what ecosystem conditions are required for effective autonomous cooperation?

What breaks when specialized AI models reach real users?

When domain-specific AI systems move from research to production, deployment patterns, routing decisions, and interface design all shape whether users can actually complete tasks. Understanding these friction points reveals where specialized models fail in practice.

How should reasoning systems actually be architected?

This explores the fundamental design choices for building reasoning into AI systems: when to activate reasoning, how to execute it, and whether reasoning must be verbal or can happen in latent space.
