Can workflow inspection catch attacks that bias planning signals?
Does inspecting the final workflow catch attacks that contaminate earlier planning stages? This matters because contamination laundered through the planner may look legitimate by the time the workflow exists.
A defense can only catch what it can see, and where it looks determines what it can catch. Because FLOWSTEER biases the planning signals from which the workflow is generated, any defense that inspects only the resulting workflow examines an artifact that is already compromised. The malicious intent has been laundered through the planner into legitimate-looking roles, dependencies, and routing — by the time the workflow exists, the contamination is no longer visibly malicious. This is why the paper introduces FLOWGUARD as an input-side defense: it strengthens the planning boundary by separating task, methodological, and framing intents, then reframes workflow-contaminating cues while preserving the original task objective, reducing malicious success by up to 34 percent without degrading prompt utility.
The general principle is about defense placement, not defense strength. Moving inspection upstream — to the point where intent is parsed but before organization is committed — catches a class of attack that downstream inspection structurally cannot. The counterpoint is that input-side defense risks false positives that suppress legitimate methodological guidance, which is exactly why FLOWGUARD separates intent types rather than filtering wholesale. This matters because it reframes MAS security as a question of where the trust boundary sits: the safest place to intervene is the boundary between instruction and organization, not the organization itself.
Inquiring lines that use this note as a source 6
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- What separates good workflow design from poor workflow design?
- Why does workflow position amplify malicious signals downstream?
- What makes planning-time attacks structurally invisible to downstream inspection?
- How do workflow-inspecting defenses fail when contamination enters at planning time?
- Do legitimate task signals exploit the same position and framing vulnerabilities as attacks?
- Can human inspection of auto-generated workflows catch harmful or incorrect API compositions?
Related concepts in this collection 4
This note in its neighbourhood — explore the map, then jump to a related concept in the list below.
Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph
-
Can we defend RAG systems from corpus poisoning without retraining?
Explores whether retrieval-time defenses can catch and block poisoned documents before they reach the generator, without expensive retraining cycles. Matters because corpus updates outpace model retraining in production RAG systems.
parallel principle that the right defense sits upstream of where the harm becomes visible
-
How do adversarial traps target different layers of AI agents?
As AI agents browse the web, attackers can exploit their perception, reasoning, memory, actions, and coordination in distinct ways. Understanding these attack vectors is crucial for building robust agent defenses.
locating defenses depends on which trap category an attack belongs to
-
Can prompt injection reshape multi-agent workflow without touching infrastructure?
Explores whether an attacker can manipulate how a planner assigns tasks and routes coordination purely through prompt crafting, without modifying agents, tools, or messages. This matters because it identifies a planning-time vulnerability most defenses miss.
same FLOWSTEER work; names the planning-time attack surface that this note argues downstream workflow inspection structurally cannot see
-
How does workflow position shape attack propagation in multi-agent systems?
Explores whether a malicious signal's influence depends on its injection point in a multi-agent graph, and how task-relevant framing makes downstream agents more likely to relay it without scrutiny.
explains the propagation mechanism that makes upstream contamination look legitimate by the time it reaches the workflow this note says is inspected too late
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- FLOWSTEER: Prompt-Only Workflow Steering Exposes Planning-Time Vulnerabilities in Multi-Agent LLM Systems
- Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens
- Measuring the Faithfulness of Thinking Drafts in Large Reasoning Models
- Can Large Language Models Really Improve by Self-critiquing Their Own Plans?
- Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
- Thought Virus: Viral Misalignment via Subliminal Prompting in Multi-Agent Systems
- Agent Workflow Memory
- Reasoning Models Don't Always Say What They Think
Original note title
defenses that inspect only the generated workflow miss attacks that bias the upstream planning signal