What implicit knowledge about catalogs do LLMs learn from ranking signals alone?
This explores what LLMs absorb about a store's inventory — what's in it, how items relate, what counts as relevant — when the only thing they're trained on is a ranking score, never the catalog itself.
The corpus's most direct answer is surprising: the catalog never has to be shown. In the Rec-R1 experiments (see "Can LLMs recommend products without ever seeing the catalog?"), a model is trained purely on the recommender's own success metrics (did this query surface things people clicked?) and learns to write effective product searches without ever reading the inventory. The reward signal alone teaches it the shape of what's findable. The companion note frames this as treating ranking scores like NDCG and Recall as a black-box RL reward (see "Can recommendation metrics train language models directly?"): the LLM never sees the catalog schema, yet the metric quietly encodes which words match real merchandise and which fall flat.
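To make the "metric as black-box reward" idea concrete, here is a minimal sketch of scoring a generated query with NDCG@k and Recall@k. It is not the Rec-R1 implementation: `toy_retriever`, `_HIDDEN_CATALOG`, the SKU names, and `query_reward` are all hypothetical stand-ins, and the retriever is a naive keyword matcher chosen only so the example runs on its own.

```python
import math
from typing import Callable, List, Sequence, Set

# Hypothetical catalog. The reward function below never reads it directly;
# only the retriever does, which is the point of the black-box setup.
_HIDDEN_CATALOG = {
    "sku_road": "road running shoes",
    "sku_trail": "waterproof trail running shoes",
    "sku_dress": "leather dress shoes",
}

def toy_retriever(query: str) -> List[str]:
    """Stand-in retriever: rank items by keyword overlap with the query."""
    words = set(query.lower().split())
    scored = [(len(words & set(text.split())), sku)
              for sku, text in _HIDDEN_CATALOG.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # ties broken by SKU name
    return [sku for score, sku in scored if score > 0]

def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant items found in the top k (assumes unique IDs)."""
    if not relevant:
        return 0.0
    return sum(1 for item in ranked[:k] if item in relevant) / len(relevant)

def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG@k with the usual log2 position discount."""
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, item in enumerate(ranked[:k]) if item in relevant)
    idcg = sum(1.0 / math.log2(pos + 2)
               for pos in range(min(len(relevant), k)))
    return dcg / idcg if idcg > 0 else 0.0

def query_reward(query: str,
                 retriever: Callable[[str], List[str]],
                 relevant: Set[str],
                 k: int = 10) -> float:
    """Scalar reward for one generated query; the policy sees only this number."""
    return ndcg_at_k(retriever(query), relevant, k)

target = {"sku_trail"}
vague = query_reward("running shoes", toy_retriever, target)
refined = query_reward("trail running shoes waterproof", toy_retriever, target)
```

In this toy setup the refined query earns a higher reward than the vague one, which is the only feedback an RL policy would receive. Nothing about the catalog's contents or schema crosses the boundary.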
What's actually being learned, then, is a kind of negative space — not the items themselves but the contours of relevance around them. The parallel the corpus draws is to how a human shopper searches a site whose full inventory they've never seen: you refine "running shoes" to "trail running shoes waterproof" not because you memorized the warehouse but because the results push back. The ranking signal is that pushback, compressed into a gradient.
But this implicit knowledge has sharp edges, and the rest of the corpus maps them. A ranking metric rewards *what* surfaces, not *when*, so nothing in the signal pushes a model to attend to order. The corpus shows the same blind spot in zero-shot rankers, which systematically ignore the temporal sequence of a user's history unless recency-focused prompts or in-context examples wake that sensitivity up (see "Why do language models ignore temporal order in ranking?"). Ranking signals alone teach relevance, not recency. Similarly, the signal teaches the LLM to *retrieve* well without teaching it to *be* a ranker: several notes find that LLMs are more valuable enriching item text (paraphrases, summaries, and categories fed to a traditional recommender) than making the final call themselves (see "Does LLM input augmentation beat direct LLM recommendation?"). When you do want the ranking objective baked into the language itself, you have to train for it directly, as with summaries optimized against downstream relevance scores rather than for fluent prose (see "Can reinforcement learning align summarization with ranking goals?").
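As an illustration of the recency-focused prompting mentioned above, the sketch below builds a ranking prompt that numbers the history chronologically and names the latest item explicitly. The function name `build_recency_prompt` and the exact prompt wording are assumptions made for this example, not text taken from the cited work.

```python
from typing import Sequence

def build_recency_prompt(history: Sequence[str], candidates: Sequence[str]) -> str:
    """Build a ranking prompt that makes the order of the history explicit.

    `history` is assumed to be sorted oldest-first.
    """
    if not history:
        raise ValueError("history must contain at least one item")
    numbered = "\n".join(f"{i}. {title}" for i, title in enumerate(history, start=1))
    options = "\n".join(f"- {title}" for title in candidates)
    return (
        "The user interacted with these items in chronological order "
        "(1 is the oldest):\n"
        f"{numbered}\n\n"
        # Restating the latest item is the recency cue.
        f"Note that the most recent item is: {history[-1]}\n\n"
        "Rank the following candidates by how likely the user is to "
        "engage with them next, weighting recent interests more heavily:\n"
        f"{options}"
    )

prompt = build_recency_prompt(
    ["tent", "sleeping bag", "trail shoes"],
    ["hiking socks", "office chair"],
)
```

The resulting string can be sent to any chat or completion model. Whether it actually shifts the ranking would need to be measured against a prompt without the recency cue.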
There's a deeper caution worth pulling in from outside the recommendation papers. Knowing how to surface a catalog item is not the same as understanding the catalog. The interpretability work on tiers of understanding shows that LLM competence is a patchwork, with deeper structure layered over useful heuristics rather than replacing them (see "Do language models understand in fundamentally different ways?"), and the Potemkin failure mode shows models that can describe a concept yet fail to apply it (see "Can LLMs understand concepts they cannot apply?"). Catalog knowledge learned from ranking signals is exactly this kind of operational-but-shallow competence: the model behaves as if it knows the inventory without holding any explicit model of it. That's the thing you didn't know you wanted to know: the catalog awareness is real and exploitable, but it lives entirely in the model's behavior, not in anything it could tell you it knows. If you want grounding the model can actually point to, you have to build it in structurally, the way multi-facet identifiers stitch IDs, titles, and attributes together so generation stays tethered to real items (see "Can item identifiers balance uniqueness and semantic meaning?").
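A rough sketch of what "building grounding in structurally" could look like: each item carries an ID, a title, and attributes, and generated text is accepted only if it maps back to a real item. This is not TransRec's mechanism. The `ItemIdentifier` class, the `ground` helper, the facet weights, and the sample SKUs are all hypothetical, and the matching is plain substring lookup.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass(frozen=True)
class ItemIdentifier:
    """Multi-facet identifier: a unique ID plus human-readable facets."""
    item_id: str
    title: str
    attributes: Tuple[str, ...]

    def render(self) -> str:
        """Serialize all facets into one string a model could be trained to emit."""
        return f"[{self.item_id}] {self.title} | {', '.join(self.attributes)}"

def ground(generated: str, catalog: List[ItemIdentifier]) -> Optional[ItemIdentifier]:
    """Map generated text to the best-matching real item, or None if nothing matches.

    Illustrative weights: an ID hit counts most because it is distinctive;
    title and attribute hits add semantic evidence.
    """
    text = generated.lower()
    best, best_score = None, 0
    for item in catalog:
        score = 0
        if item.item_id.lower() in text:
            score += 3
        if item.title.lower() in text:
            score += 2
        score += sum(1 for attr in item.attributes if attr.lower() in text)
        if score > best_score:
            best, best_score = item, score
    return best

catalog = [
    ItemIdentifier("B012", "Trail Runner GTX", ("waterproof", "trail")),
    ItemIdentifier("B077", "Road Racer", ("lightweight",)),
]
match = ground("[B012] Trail Runner GTX | waterproof", catalog)
miss = ground("invisible jetpack", catalog)
```

Here `match` resolves to the first catalog entry and `miss` is `None`. Output that names no real facet is rejected instead of being passed along as if it referred to an item.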
Sources (8 notes)
Can LLMs recommend products without ever seeing the catalog? Rec-R1 experiments show that LLMs trained via RL with recommender metrics as rewards can generate effective product search queries without catalog access. The model learns query refinement indirectly through system feedback, paralleling how humans search without knowing platform inventory.
Can recommendation metrics train language models directly? Rec-R1 demonstrates that LLMs can be trained directly on rule-based recommendation metrics like NDCG and Recall as RL reward signals, eliminating the need for SFT distillation from proprietary models while remaining model-agnostic across different retriever architectures.
Why do language models ignore temporal order in ranking? LLMs can extract preferences from interaction histories but disregard temporal order by default. Recency-focused prompts and in-context examples activate latent order-sensitivity, improving ranking without retraining.
Does LLM input augmentation beat direct LLM recommendation? Using LLMs to augment item descriptions with paraphrases, summaries, and categories, then feeding the enriched text to traditional recommenders, beats asking LLMs to recommend directly. The mechanism: LLMs excel at content understanding but lack specialized ranking bias, so their textual enrichment is more valuable than their predictions.
Can reinforcement learning align summarization with ranking goals? ReLSum trains summarizers using downstream relevance scores as RL rewards, producing dense, attribute-focused summaries instead of fluent prose. This alignment to the actual ranking metric improves recall, NDCG, and user engagement in production e-commerce search.
Do language models understand in fundamentally different ways? Mechanistic interpretability reveals conceptual understanding (features as directions), state-of-world understanding (factual connections), and principled understanding (compact circuits). Crucially, higher tiers coexist with lower-tier heuristics rather than replacing them, creating a patchwork of capabilities.
Can LLMs understand concepts they cannot apply? Models can explain concepts accurately, fail to apply them, and recognize the failure, a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.
Can item identifiers balance uniqueness and semantic meaning? TransRec shows that combining numeric IDs, titles, and attributes into structured identifiers solves three problems simultaneously: distinctiveness from IDs, semantics from text, and generation grounding from structural constraints. Neither pure IDs nor pure text alone achieves all three.