Can we distill LLM knowledge into graphs for real-time recommendations?

E-commerce needs recommendations within tens of milliseconds, and request-time LLM calls are too slow for that budget. Can we extract LLM insights offline into a knowledge graph that serves requests in production without sacrificing quality or explainability?

Synthesis note · 2026-05-03 · sourced from Recommenders Architectures

E-commerce recommendation has tight latency constraints, typically tens of milliseconds per request, so calling an LLM at request time is not viable for these systems. But LLMs carry world knowledge that is expensive to extract from interaction data alone. For example, the relation "carnations are the official flower for Mother's Day gifts" is hard to mine from clickstream data because customers don't explicitly say "I'm buying this for my mother." An LLM trained on web text knows this relation directly.

LLM-PKG bridges the latency gap by distilling LLM knowledge offline into a product knowledge graph (PKG). At ingestion time, the LLM is given curated prompts about products, its responses are mapped to products in the enterprise catalog, and the resulting relations populate the graph. At query time, the recommender reads from the graph rather than calling the LLM: sub-millisecond traversal instead of seconds-long generation.
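The two phases can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the LLM-PKG implementation: `CATALOG`, `stub_llm`, `build_graph`, and `recommend` are hypothetical names, the LLM call is replaced by a canned stub, and mapping responses to products is reduced to case-insensitive name matching.

```python
from collections import defaultdict

# Hypothetical enterprise catalog: product id -> display name.
CATALOG = {
    "p1": "Mother's Day card",
    "p2": "carnation bouquet",
    "p3": "gift wrap",
}

def stub_llm(product_name):
    """Stand-in for an offline LLM call (assumption: no real model is used).

    Returns (relation, free-text item mention) pairs for a product.
    """
    canned = {
        "Mother's Day card": [
            ("gifted_with", "Carnation Bouquet"),
            ("gifted_with", "Gift Wrap"),
            ("gifted_with", "scented candle"),  # not sold here, so it is dropped
        ]
    }
    return canned.get(product_name, [])

def build_graph(catalog, llm):
    """Offline phase: prompt per product, map mentions to catalog ids, store edges."""
    name_to_id = {name.lower(): pid for pid, name in catalog.items()}
    graph = defaultdict(list)
    for pid, name in catalog.items():
        for relation, mention in llm(name):
            target = name_to_id.get(mention.lower())
            if target is not None and target != pid:
                graph[pid].append((relation, target))
    return dict(graph)

def recommend(graph, product_id, k=5):
    """Online phase: a dictionary lookup, with no LLM call on the request path."""
    return graph.get(product_id, [])[:k]

PKG = build_graph(CATALOG, stub_llm)
print(recommend(PKG, "p1"))
```

Each returned edge carries its relation label, which is what lets the recommender attach an explanation to the suggestion without consulting the LLM again.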

Hallucination is treated as the central problem: an LLM can invent relations that do not exist. The mitigation is to evaluate candidate relations rigorously and prune the failures before the graph is populated. The graph is the safety boundary: only relations that pass evaluation make it in.
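One way to picture the pruning gate is as a filter between candidate relations and the graph. The note does not say how relations are scored, so the sketch below assumes a hypothetical evaluator that has already assigned each candidate a score in [0, 1] (from human raters or an automated judge, for instance) and a hypothetical acceptance threshold of 0.8.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    source: str
    relation: str
    target: str
    score: float  # hypothetical evaluator score in [0, 1]

def prune(candidates, threshold=0.8):
    """Split candidates into those admitted to the graph and those rejected."""
    kept, rejected = [], []
    for c in candidates:
        (kept if c.score >= threshold else rejected).append(c)
    return kept, rejected

candidates = [
    Candidate("mothers_day_card", "gifted_with", "carnation_bouquet", 0.95),
    Candidate("mothers_day_card", "gifted_with", "snow_shovel", 0.10),
    Candidate("carnation_bouquet", "gifted_with", "gift_wrap", 0.80),
]

kept, rejected = prune(candidates)
print(len(kept), "kept;", len(rejected), "rejected")
```

Keeping the rejected list, rather than discarding it silently, leaves an audit trail for tuning the prompts or the threshold later.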

The architecture pattern generalizes beyond e-commerce: when an LLM has knowledge a downstream system needs but the system can't tolerate LLM latency, distill the knowledge offline into a static structure (graph, table, embedding store). The LLM operates as an offline knowledge extractor; the production system operates on the extracted artifact. This decouples knowledge breadth, which the LLM provides, from low inference latency, which the structure provides. The trade-off is staleness, since the graph reflects the LLM's knowledge at extraction time and nothing later, but for slowly changing domains the trade-off is favorable.
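The staleness trade-off can be made explicit by stamping the extracted artifact with its extraction time and checking its age against a refresh budget. The sketch below is an assumption-laden illustration: `Artifact`, `is_stale`, and the 90-day budget are hypothetical, and a real system would choose the budget from how quickly its domain changes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass(frozen=True)
class Artifact:
    relations: dict         # the distilled structure served in production
    extracted_at: datetime  # when the offline LLM extraction ran

def is_stale(artifact, now, max_age):
    """True when the artifact is older than the allowed refresh budget."""
    return now - artifact.extracted_at > max_age

artifact = Artifact(
    relations={"p1": [("gifted_with", "p2")]},
    extracted_at=datetime(2026, 1, 1, tzinfo=timezone.utc),
)
max_age = timedelta(days=90)  # hypothetical budget for a slowly changing domain

print(is_stale(artifact, datetime(2026, 2, 1, tzinfo=timezone.utc), max_age))
print(is_stale(artifact, datetime(2026, 6, 1, tzinfo=timezone.utc), max_age))
```

A stale artifact can keep serving traffic while a fresh extraction runs offline, so the refresh never touches the request path.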

Related concepts in this collection 4

Can smaller models outperform their LLM teachers with enough data? Explores whether student models trained on expanded teacher-generated labels can exceed teacher performance in production ranking tasks, and what data scale makes this possible.
extends: same offline-LLM-distillation-into-fast-runtime pattern, applied to KG construction rather than ranking

Can graphs unify collaborative filtering and side information? How might merging user-item interactions with item attributes into a single graph structure allow recommendation systems to capture collaborative and attribute-based signals together, rather than separately?
complements: KGAT is a KG-for-recommendation pattern using interaction-derived attributes; LLM-PKG uses LLM-derived attributes, so both belong to the same architectural family

How can real-time recommendations stay responsive and reproducible? In-session signals improve ranking accuracy, but requiring fresh data during sessions forces real-time computation. This creates latency, network sensitivity, and debugging challenges that offset the relevance gains.
exemplifies: offline distillation driven by latency constraints is the production-side response to the freshness-latency tradeoff

Can community detection enable RAG systems to answer global corpus questions? Standard RAG struggles with corpus-wide questions that require understanding overall themes rather than retrieving specific passages. Can graph community detection overcome this limitation at scale?
complements: GraphRAG distills LLM knowledge into a query-time graph; LLM-PKG distills it into a recommend-time graph. Both use the same offline-LLM-into-graph pattern for different downstream tasks

Original note title

LLM-distilled product knowledge graphs offer real-time-feasible explainable recommendations — direct LLM calls are too latency-bound for production e-commerce
