What extraction errors most reliably propagate through knowledge graph traversal?
This explores which mistakes made when facts are first pulled into a knowledge graph tend to compound — rather than wash out — as a system walks the graph for multi-hop reasoning.
This reads the question as being about error propagation: not which extraction mistakes are most common, but which ones survive and amplify once you start traversing a graph hop by hop. The corpus doesn't have a paper that names "extraction errors in KG traversal" head-on, but several notes triangulate the answer from different directions, and they converge on a consistent picture: the errors that propagate worst are the ones that are silent and structural, not the ones that are loud and factual.
The clearest signal on compounding comes from work on long delegated workflows, where frontier models silently corrupt about a quarter of document content across extended relay tasks — and crucially, the errors don't plateau through 50 round-trips, they keep accumulating (Do frontier LLMs silently corrupt documents in long workflows?). Graph traversal is structurally the same situation: each hop conditions on the output of the last, so an extraction error that goes undetected at step one becomes the trusted premise of step two. The failure mode that propagates is the one nothing checks.
What makes an error invisible? Two notes suggest the culprit is structural rather than topical similarity. A verification pipeline built specifically to catch "structural near-misses" — things that look topically right but bind the wrong entities — succeeds only because it inspects full token-to-token interaction patterns; compressed-vector similarity (the kind extraction usually relies on) waves these through (Can verification separate structural near-misses from topical matches?). And LLM extraction itself degrades predictably as syntactic depth increases: models reliably misread embedded clauses, complex nominals, and nested verb phrases (Why do large language models fail at complex linguistic tasks?). So the most propagation-prone extraction error is a relational one — attaching a relation to the wrong entity inside a complex sentence — precisely because it produces a graph edge that is locally plausible and only wrong in context.
There's a representational dimension too. When extraction forces multi-entity facts into pairwise edges, the joint constraint binding three-or-more entities is lost at extraction time, and no amount of careful traversal can reconstruct it — which is the whole argument for hyperedges that keep the constraint intact (Can hypergraphs capture multi-hop reasoning better than graphs?). This is the most insidious class: the error isn't a wrong fact, it's a dropped constraint, so traversal happily combines decomposed fragments into conclusions the original evidence never supported. Approaches that align reasoning to explicit graph topology rather than semantic guesswork (Can symbolic rules from knowledge graphs guide complex reasoning?) help here, but only if the topology was extracted correctly in the first place.
Two more notes explain why traversal can't self-rescue. Selective, learned traversal trades certainty about the full graph for tractable navigation — it never sees the whole structure, so it can't notice that a path rests on a corrupt edge (Can learned traversal policies beat exhaustive graph reading?). And there's a behavioral amplifier: models trained toward agreement will accommodate a false premise rather than challenge it (Why do language models agree with false claims they know are wrong?), meaning a wrong extracted edge isn't just passed along — it's actively defended downstream. The unexpected takeaway: the dangerous errors aren't hallucinated facts (those often get caught), but quietly mis-bound relations and silently dropped joint constraints — locally plausible, globally false, and invisible to exactly the similarity-based machinery that built the graph.
Sources 7 notes
Testing 19 models across 52 domains shows even advanced systems degrade documents by ~25% over extended relay tasks, with errors compounding silently without plateauing through 50 round-trips.
A two-stage pipeline—pooled-cosine recall followed by a small Transformer verifier operating on token-token similarity maps—reliably rejects structural near-misses that MaxSim-style late interaction cannot. The verifier succeeds because it operates on full token interaction patterns rather than compressed vectors.
Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.
HGMem organizes retrieved evidence as hyperedges rather than flat lists or binary graphs, allowing three or more entities to bind into single relations without decomposition. This structure accumulates coherent knowledge across retrieval steps, trading representational complexity for constraint expressiveness.
SymAgent derives symbolic rules from KG structure using LLM reasoning to create navigational plans that align natural language with graph topology. This approach captures structural reasoning patterns explicitly, outperforming retrieval methods that rely on semantic similarity alone.
Graph-O1 replaces whole-graph ingestion with step-by-step agentic navigation using Monte Carlo Tree Search and reinforcement learning. This approach fits within LLM context windows while learning domain-specific traversal policies, though it trades certainty about the full graph for decision-making under uncertainty.
The FLEX benchmark shows models reject false presuppositions at dramatically different rates (GPT 84% vs Mistral 2.44%), not from ignorance but from preference for agreement learned via RLHF. This social accommodation is distinct from hallucination and requires different fixes.