TOPIC

Tool Use and Computer-Use Agents

16 synthesis notes · 42 source papers

View as

How do model capabilities differ from harness infrastructure in agents?

What distinct layers make up an agentic system, and how do failures in each layer differ? Understanding this decomposition helps pinpoint whether problems stem from the model, the infrastructure, or the agent's own code.

What makes agent-created code artifacts so hard to manage?

Agent-authored code that persists and is shared across systems raises difficult questions about what should be kept versus discarded, and how to maintain consistent state when multiple agents collaborate on the same artifacts.

Can code become the operational substrate for agent reasoning?

Explores whether code, beyond being an LLM output, functions as the primary medium through which agents reason, act, observe, and verify progress in complex tasks.

Can small models match large models on function calling?

Explores whether small language models fine-tuned with the right training method can achieve comparable performance to large models on structured reasoning tasks requiring precise function calls, and what training approach makes this possible.

Can structured reasoning replace code execution for RL rewards?

Can semi-formal templates enable execution-free code verification reliable enough to train RL agents without running code? This matters because execution is expensive and slow in agent training loops.

How can GUI agents adapt when software constantly changes?

Can desktop automation agents stay current by combining real-time web documentation with learned task patterns and concrete execution memories? This explores how to avoid training obsolescence in open-world software environments.

Can breaking function calling into subtasks improve model generalization?

Does training on seven granular function-calling subtasks instead of one umbrella objective close the gap between open-source and proprietary models? This explores whether decomposition surfaces hidden failure modes that unified training misses.

Can structured interfaces help language models control GUIs better?

Explores whether separating visual understanding from element grounding through an intermediate interface layer improves how language models interact with graphical interfaces. Matters because current end-to-end approaches ask models to do too much at once.

Can unlabeled UI video teach models what users intend?

Can temporal masking on screen recordings learn task-aware representations without paired text labels? This matters because labeled UI video is scarce and expensive, so self-supervised learning could unlock scaling.

Can models decide better than retrievers which tools to use?

Traditional retrieval picks tools upfront based on initial queries, but do models themselves make better decisions about tool needs as they reason? This explores whether authority over tool selection should move from external systems to the LLM.

Can structured templates make code reasoning more reliable than free-form thinking?

Unstructured chain-of-thought reasoning lets models skip cases and make unsupported claims. This explores whether semi-formal templates requiring explicit premises, evidence traces, and alternative checks can prevent these failure modes.

Does state-indexed memory outperform high-level workflow memory for web agents?

Should procedural memory for web agents be organized around specific environment states and actions, or abstracted into higher-level workflows? This matters because web automation demands precise, context-sensitive recall that workflows might lose.

Can structured templates replace formal verification for code reasoning?

Formal verification is rigorous but impractical at repository scale. Can natural-language templates with enforced structure provide the same reliability guarantees without the formalization cost? This explores the middle ground between unstructured reasoning and full formalism.

Source papers 42

The Arxiv papers behind this sub-topic. Links may take you off-site to arxiv.org.

A Practical Guide for Designing, Developing, and Deploying Production-Grade Agentic AI Workflows
However, building production-grade agentic AI workflows remains challenging. While prototypes are easy to build with simple scripts or notebooks, scaling them into reliable, governed, and observable s…
AI Agent Traps
As autonomous AI agents increasingly navigate the web, they face a novel challenge: the information environment itself. This gives rise to a critical vulnerability we refer to as "AI Agent Traps" — ad…
Adaptation of Agentic AI
Cutting-edge agentic AI systems are built on foundation models that can be adapted to plan, reason, and interact with external tools to perform increasingly complex and specialized tasks. As these sys…
Agent A/B: Automated and Scalable A/B Testing on Live Websites with Interactive LLM Agents
A/B testing is central to UI/UX design, yet our formative study with six industry practitioners revealed that it is slowed by scarce user traffic, long runtimes, and high operational costs. To address…
Agent S: An Open Agentic Framework that Uses Computers Like a Human
We present Agent S, an open agentic framework that enables autonomous interaction with computers through a Graphical User Interface (GUI), aimed at transforming human-computer interaction by automatin…
Agent Workflow Memory
Despite the potential of language model-based agents to solve real-world tasks such as web navigation, current methods still struggle with long-horizon tasks with complex action trajectories. In contr…
AgentFly: Fine-tuning LLM Agents without Fine-tuning LLMs
In this paper, we introduce a novel learning paradigm for adaptive Large Language Model (LLM) agents that eliminates the need for fine-tuning the underlying LLMs. Existing approaches are often either …
Agentic Code Reasoning
Can LLM agents explore codebases and reason about code semantics without executing the code? We study this capability, which we call agentic code reasoning, and introduce semi-formal reasoning: a stru…
Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research
Agentic Reasoning, a framework1 that enhances large language model (LLM) reasoning by integrating external tool-using agents. Unlike conventional LLM-based reasoning approaches, which rely solely on i…
Agentic Web: Weaving the Next Web with AI Agents
This challenge gives rise to the notion of the Agent Attention Economy. In a manner analogous to the early Web’s competition for user clicks, external services now compete to be selected and invoked b…
AutoGLM: Autonomous Foundation Agents for GUIs
While foundation models excel at acquiring human knowledge, they often struggle with decision-making in dynamic real-world environments, limiting their progress toward artificial general intelligence.…
Benchmarking Floworks against OpenAI & Anthropic: A Novel Framework for Enhanced LLM Function Calling
Large Language Models (LLMs) have shown remarkable capabilities in various domains, yet their economic impact has been limited by challenges in tool use and function calling. This paper introduces Tho…
Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL
Recent advancements in LLM-based agents have demonstrated remarkable capabilities in handling complex, knowledge-intensive tasks by integrating external tools. Among diverse choices of tools, search t…
CLIN: A Continually Learning Language Agent for Rapid Task Adaptation and Generalization
Language agents have shown some ability to interact with an external environment, e.g., a virtual world such as ScienceWorld, to perform complex tasks, e.g., growing a plant, without the startup costs…
Code as Agent Harness
Recent large language models (LLMs) have demonstrated strong capabilities in understanding and generating code, from competitive programming to repository-level software engineering. In emerging agent…
DeepAgent: A General Reasoning Agent with Scalable Toolsets
Large reasoning models have demonstrated strong problem-solving abilities, yet real-world tasks often require external tools and long-horizon interactions. Existing agent frameworks typically follow p…
Do Phone-Use Agents Respect Your Privacy?
We study whether phone-use agents respect privacy while completing benign mobile tasks. This question has remained hard to answer because privacy-compliant behavior is not operationalized for phone-us…
Efficient Tool Use with Chain-of-Abstraction Reasoning
To achieve faithful reasoning that aligns with human expectations, large language models (LLMs) need to ground their reasoning to real-world knowledge (e.g., web facts, math and physical rules). Tools…
Equipping agents for the real world with Agent Skills
Published Oct 16, 2025 Claude is powerful, but real work requires procedural knowledge and organizational context. Introducing Agent Skills, a new way to build specialized agents using files and folde…
Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering
Large language model (LLM) agents are increasingly built less by changing model weights than by reorganizing the runtime around them. Capabilities that earlier systems expected the model to recover in…
Granite-Function Calling Model: Introducing Function Calling Abilities via Multi-task Learning of Granular Tasks
However, to realize the true potential of LLMs as autonomous agents, they must learn to identify, call, and interact with external tools and application program interfaces (APIs) to complete complex t…
Improving Generalization in Task-oriented Dialogues with Workflows and Action Plans
![A screenshot of a computer program](/assets/paper-images/ImprovingGeneralizationInTaskOrientedDialogues.png) Task-oriented dialogue is difficult in part because it involves understanding user inten…
Improving Small-Scale Large Language Models Function Calling for Reasoning Tasks
This study introduces a novel framework for training smaller language models in function calling, focusing on specific logical and mathematical reasoning tasks. The approach aims to improve performanc…
Insert-expansions For Tool-enabled Conversational Agents
“This paper delves into an advanced implementation of Chain-of-Thought-Prompting in Large Language Models, focusing on the use of tools (or "plug-ins") within the explicit reasoning paths generated by…
LLMs Corrupt Your Documents When You Delegate
Large Language Models (LLMs) are poised to disrupt knowledge work, with the emergence of delegated work as a new interaction paradigm (e.g., vibe coding). Delegation requires trust—the expectation tha…
Large Language Model-Brained GUI Agents: A Survey
![A computer screen with text and images with medium confidence](/assets/paper-images/LargeLanguageModelBrainedGUIAgentsASurvey.png) Abstract—Graphical User Interfaces (GUIs) have long been central …
Large Multimodal Agents: A Survey
Large language models (LLMs) have achieved superior performance in powering text-based AI agents, endowing them with decision-making and reasoning abilities akin to humans. Concurrently, there is an e…
MCP-Zero: Proactive Toolchain Construction for LLM Agents from Scratch
Function-calling has enabled large language models (LLMs) to act as tool-using agents, but injecting thousands of tool schemas into the prompt is costly and error-prone. We introduce MCP-Zero, a proac…
Magentic-UI: Towards Human-in-the-loop Agentic Systems
AI agents powered by large language models are increasingly capable of autonomously completing complex, multi-step tasks using external tools. Yet, they still fall short of humanlevel performance in m…
Navigating the State of Cognitive Flow: Context-Aware AI Interventions for Effective Reasoning Support
Flow Theory describes an optimal cognitive state where individuals experience deep focus and intrinsic motivation when a task’s difficulty aligns with their skill level. In AI-augmented reasoning, int…
Nex-N1: Agentic Models Trained via a Unified Ecosystem for Large-Scale Environment Construction
The evolution of Large Language Models (LLMs) from passive responders to autonomous agents necessitates a fundamental shift in learning paradigms—from static imitation to incentive-driven decision mak…
OmniParser for Pure Vision Based GUI Agent
The recent success of large vision language models shows great potential in driving the agent system operating on user interfaces. However, we argue that the power multimodal models like GPT-4V as a g…
Opportunities for large language models and discourse in engineering design
In this paper, we argue that foundation models such as LLMs can be used for creative reasoning tasks in the engineering design process, complementing and integrating existing computational methods suc…
ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models
“There is a trending paradigm[1; 2; 3; 4; 5; 6; 7; 8] to couple large language models (LLMs) with external plugins or tools, enabling LLMs to interact with environment [9; 10] and retrieve up-to-date …
Real-Time Procedural Learning From Experience for AI Agents
AI agents are artificial intelligence systems capable of observing and taking actions in an environment. As adoption spreads across industries, there is a growing need for AI agents to quickly learn d…
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity. While most agents are language-based, relying on closed-source API with text-ric…
Small LLMs Are Weak Tool Learners: A Multi-LLM Agent
Large Language Model (LLM) agents significantly extend the capabilities of standalone LLMs, empowering them to interact with external tools (e.g., APIs, functions) and complete various tasks in a self…
SpreadsheetLLM: Encoding Spreadsheets for Large Language Models
Recently, the rapid development of Large Language Models (LLMs) has opened new frontiers in table processing (Li et al., 2023b) and reasoning (Cheng et al., 2022). However, spreadsheets pose unique ch…
Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction
Abstract: The current paradigm of test-time scaling relies on generating long reasoning traces (“thinking” more) before producing a response. In agent problems that require interaction, this can be do…
ToolFlow: Boosting LLM Tool-Calling Through Natural and Coherent Dialogue Synthesis
Supervised fine-tuning (SFT) is a common method to enhance the tool calling capabilities of Large Language Models (LLMs), with the training data often being synthesized. The current data synthesis pro…
Toward Efficient Agents: A Survey of Memory, Tool Learning, and Planning
Recent years have witnessed increasing interest in extending large language models into agentic systems. While the effectiveness of agents has continued to improve, efficiency — which is crucial for r…
UI-JEPA: Towards Active Perception of User Intent through Onscreen User Activity
Inspired by the success of self-supervised learning (SSL) techniques like Joint Embedding Predictive Architectures (JEPA) [10] and its variants [3, 5], we propose UI-JEPA, a lightweight video-to-text …