Can semantic routing couple similarity matching with resource constraints?
This explores whether routing-by-meaning can do two jobs at once — matching a query to the right resource by semantic similarity, while also respecting budgets, policies, and cost ceilings — rather than treating 'find the closest match' and 'stay within limits' as separate steps.
The corpus says yes, and the most direct evidence treats the coupling as the whole point rather than an afterthought. The note "Can semantic capability vectors replace manual agent routing?" describes embedding versioned capability vectors into an HNSW index so that semantic matching and policy/budget constraints are evaluated together. That makes 'who can do this, affordably, under these rules' a single first-class lookup, one that scales sub-linearly as the pool of agents grows more varied. That's the literal answer to your question: similarity and constraint don't have to be two passes.
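A minimal sketch of that single-lookup idea, under stated assumptions: the `Agent` fields, the three-dimensional vectors, the costs, and the policy tags are all hypothetical, and a brute-force cosine scan stands in for the HNSW index the note describes. The point is only that the budget and policy checks sit inside the same search loop as the similarity score instead of running as a second pass.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    name: str
    capability: tuple[float, ...]      # hypothetical capability embedding
    cost_per_call: float               # hypothetical price, arbitrary units
    allowed_policies: frozenset[str]   # hypothetical policy tags the agent may serve


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def route(query_vec, agents, budget, policy):
    """Return the most similar agent that satisfies budget and policy, or None.

    Brute-force scan for illustration only; a real system would use an
    approximate nearest-neighbor index with filtering.
    """
    best, best_score = None, -1.0
    for agent in agents:
        # Constraints are checked in the same pass as similarity.
        if agent.cost_per_call > budget or policy not in agent.allowed_policies:
            continue
        score = cosine(query_vec, agent.capability)
        if score > best_score:
            best, best_score = agent, score
    return best


# Made-up agent pool for the example.
AGENTS = [
    Agent("legal-expert", (0.9, 0.1, 0.0), 5.0, frozenset({"internal", "pii"})),
    Agent("legal-lite", (0.7, 0.3, 0.0), 1.0, frozenset({"internal"})),
    Agent("code-helper", (0.0, 0.1, 0.9), 1.0, frozenset({"internal", "pii"})),
]

QUERY = (1.0, 0.0, 0.0)  # a query that sits close to the "legal" direction
```

With a generous budget, `route(QUERY, AGENTS, budget=10, policy="internal")` picks `legal-expert`; tightening the budget to 2 shifts the same lookup to `legal-lite`, and a budget below every agent's cost returns `None`.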
The reason this matters becomes clearer when you look at routing done purely for cost. "Can routers select the right model before generation happens?" shows that routers which predict query difficulty *before* generating anything (RouteLLM and Hybrid-LLM) cut cost 40–50% by sending easy queries to a cheap model and hard ones to an expensive one. The resource constraint (don't pay frontier prices for trivial questions) is baked into the routing decision itself. "Can routing beat building one better model?" pushes the same lever from the similarity side: Avengers-Pro clusters queries in embedding space and routes each cluster to its best-fit model, either beating GPT-5-medium by 7% in accuracy or matching it at 27% lower cost. So you can dial the same mechanism toward accuracy or toward thrift; the coupling is a knob, not a fixed tradeoff.
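The 'knob' can be sketched in a few lines. This is an illustration under assumptions, not the method of RouteLLM, Hybrid-LLM, or Avengers-Pro: the model names, costs, accuracies, the `threshold` default, and the linear `accuracy - cost_weight * cost` score are all made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    cost: float      # hypothetical relative cost per query
    accuracy: float  # hypothetical expected accuracy on one query cluster


def route_by_difficulty(predicted_difficulty, threshold=0.5):
    """Pre-generation routing: hard queries go to the large model, easy ones to the small."""
    return "large" if predicted_difficulty >= threshold else "small"


def pick_model(models, cost_weight):
    """Per-cluster routing: maximize accuracy minus a cost penalty.

    cost_weight is the knob: 0.0 routes purely for accuracy, and larger
    values trade accuracy away for lower spend.
    """
    return max(models, key=lambda m: m.accuracy - cost_weight * m.cost)


# Made-up numbers for a single query cluster.
CLUSTER_MODELS = [
    Model("small", cost=1.0, accuracy=0.70),
    Model("large", cost=10.0, accuracy=0.90),
]
```

With `cost_weight=0.0` the cluster goes to `large`; at `cost_weight=0.05` the penalty on the large model outweighs its accuracy edge and the same call returns `small`.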
There's a subtler form of 'resource constraint' worth knowing about: not just dollars, but the work the system is forced to do. "Can routing queries to task-matched structures improve RAG reasoning?" describes StructRAG, which uses a DPO-trained router to send each query to whichever knowledge *structure* fits the task (tables, graphs, algorithms, catalogues, or plain chunks) and grounds this in cognitive-load and cognitive-fit theory: the wrong representation forces wasted work. That reframes 'resource constraint' as 'don't make the system do hard work it doesn't need to,' which is the same instinct as not over-paying for compute.
The quiet warning in the corpus is that similarity matching alone is a weaker signal than it looks. "Where do retrieval systems fail and why?" points out that embeddings measure *association*, not *relevance*, and that embedding dimension mathematically caps how many documents a system can distinguish, so a router leaning only on cosine similarity inherits those blind spots. That is a strong argument for pairing similarity with explicit policy and budget filters, as the capability-vector approach does, rather than trusting nearest-neighbor distance to carry the whole decision. The constraint layer isn't just about saving money; it's a correction for what semantic similarity can't see on its own.
The thread connecting all of these: routing is increasingly framed as a *selection* problem, and selection as a stronger lever than building one bigger model. The systems that work best are the ones where 'what fits' and 'what we can afford' are decided in the same breath.
Sources (5 notes)
Versioned Capability Vectors embedded in HNSW indices couple semantic matching with policy and budget constraints, making capability discovery a first-class operation that scales sub-linearly as agent heterogeneity increases.
RouteLLM and Hybrid-LLM both achieve 40-50% cost reduction by routing to a single model based on query difficulty prediction, not response evaluation. Single-model routing minimizes latency compared to ensemble or cascade alternatives.
Avengers-Pro achieves 7% higher accuracy than GPT-5-medium by routing queries to optimal models per semantic cluster, or matches its performance at 27% lower cost. Ten 7B models with routing previously surpassed GPT-4.1 and 4.5, suggesting selection is a stronger lever than scaling.
StructRAG demonstrates that selecting knowledge structure type based on query demands—via DPO-trained router choosing among tables, graphs, algorithms, catalogues, and chunks—improves knowledge-intensive reasoning over standard retrieval. The approach grounds this in cognitive load and cognitive fit theory from cognitive science.
RAG systems fail at three structural levels: adaptive triggering (fixed intervals waste context), semantic-task mismatch (embeddings measure association, not relevance), and mathematical limits (embedding dimension constrains representable document sets). These require fundamentally different retrieval approaches, not tuning.