Can LLMs explain recommenders by mimicking their internal states?
Can training a language model to align with both a recommender's outputs and its internal embeddings produce explanations that are both faithful and human-readable? This note explores whether such dual-access interpretation resolves the tension between behavioral accuracy and interpretability.
Conventional explainability for recommenders trains a separate surrogate model to mimic the target's predictions and reads off feature importance from the surrogate. This works at a behavioral level: the surrogate predicts what the target predicts. But it does not probe the target's internal mechanism. It is a black-box explanation of a black box.
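To make "behavioral level" concrete, here is a minimal sketch of how a surrogate's fidelity to its target might be scored. The metric (top-k overlap), the function name `topk_agreement`, and the toy scores are illustrative assumptions, not something this note or RecExplainer specifies.

```python
def topk_agreement(target_scores, surrogate_scores, k):
    """Fraction of the target's top-k items that also appear in the surrogate's top-k."""
    def topk(scores):
        # Sort item ids by score, highest first, and keep the first k.
        return set(sorted(scores, key=scores.get, reverse=True)[:k])

    return len(topk(target_scores) & topk(surrogate_scores)) / k


# Hypothetical scores for one user from the target recommender and a surrogate.
target = {"item_a": 0.9, "item_b": 0.7, "item_c": 0.2, "item_d": 0.1}
surrogate = {"item_a": 0.8, "item_b": 0.1, "item_c": 0.6, "item_d": 0.05}

fidelity = topk_agreement(target, surrogate, k=2)
print(fidelity)
```

A score like this compares outputs only. A surrogate can score well here while arriving at its predictions by a route unrelated to the target's, which is the limitation the paragraph above describes.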
RecExplainer's three-tier alignment scheme bridges this gap. Behavior alignment is the conventional surrogate: feed the LLM user profile text and train it to predict the items the target recommender would suggest. The LLM learns to reproduce target predictions from textual input.
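A sketch of how one behavior-alignment training example could be assembled. The prompt wording, the field names, and the helper `build_behavior_example` are hypothetical. The one point taken from the description above is that the label is the target recommender's output, not the user's true next item.

```python
def build_behavior_example(history_titles, recommender_top_item):
    """Pair a textual user profile with the item the target recommender picked."""
    prompt = (
        "A user has interacted with these items, in order: "
        + "; ".join(history_titles)
        + ". Which item would the recommender suggest next?"
    )
    # The completion is the target model's prediction, so the LLM is trained
    # to imitate the recommender rather than to predict real user behavior.
    return {"prompt": prompt, "completion": recommender_top_item}


example = build_behavior_example(
    ["Alien", "Blade Runner", "The Thing"],
    recommender_top_item="Predator",  # hypothetical output of the target model
)
print(example["prompt"])
print(example["completion"])
```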
Intention alignment goes deeper. Instead of giving the LLM only text, this method inserts the target recommender's neural-layer activations (the embeddings of users and items in the target's latent space) into the LLM's prompt. The LLM is fine-tuned to treat these embeddings as a second input modality alongside text. Predictions now draw on the target's internal representation, not just its outputs.
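One common way to feed foreign embeddings into an LLM is to project each one into the LLM's token-embedding space and place it in the input sequence where a placeholder token sits. The sketch below illustrates that idea in plain Python. The linear projector, the toy dimensions, and the names `project` and `splice` are assumptions for illustration; the projector RecExplainer actually uses may differ.

```python
def project(embedding, weight, bias):
    """Linear map from the recommender's latent space to the LLM's embedding space."""
    return [
        sum(w * x for w, x in zip(row, embedding)) + b
        for row, b in zip(weight, bias)
    ]


def splice(token_embeddings, placeholder_positions, projected_vectors):
    """Overwrite placeholder slots in the token sequence with projected embeddings."""
    sequence = list(token_embeddings)
    for position, vector in zip(placeholder_positions, projected_vectors):
        sequence[position] = vector
    return sequence


# Toy sizes: recommender embeddings have 2 dims, LLM token embeddings have 3.
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # would be trainable in practice
bias = [0.0, 0.0, 0.0]
user_embedding = [0.5, -1.0]  # hypothetical vector from the target recommender

soft_token = project(user_embedding, weight, bias)

# Four token embeddings for a prompt; index 2 is the placeholder slot.
prompt_embeddings = [[0.0, 0.0, 0.0] for _ in range(4)]
llm_input = splice(prompt_embeddings, [2], [soft_token])
print(llm_input[2])
```

In this arrangement the projector's weights are what fine-tuning has to learn: they determine whether the LLM can read anything meaningful out of the recommender's vectors.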
Hybrid alignment combines both, giving the LLM text and embeddings together. The LLM produces explanations that integrate the human-interpretable reasoning that the text supports with the high-fidelity behavior matching that the embeddings provide.
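The three alignment modes can be viewed as three ways of filling the same input slot. The sketch below represents an LLM input as a list of text and embedding segments and switches its contents by mode. The `Text` and `Embedding` types, the `build_input` helper, and the prompt wording are all hypothetical and only illustrate the distinction drawn above.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Text:
    content: str


@dataclass
class Embedding:
    vector: List[float]


Segment = Union[Text, Embedding]


def build_input(mode: str, titles: List[str], vectors: List[List[float]]) -> List[Segment]:
    """Assemble the LLM input for one user under a given alignment mode."""
    if mode not in ("behavior", "intention", "hybrid"):
        raise ValueError(f"unknown mode: {mode}")
    segments: List[Segment] = [Text("The user has interacted with:")]
    if mode in ("behavior", "hybrid"):
        # Text channel: item titles the LLM can read directly.
        segments.append(Text("; ".join(titles)))
    if mode in ("intention", "hybrid"):
        # Embedding channel: one slot per item vector from the target model.
        segments.extend(Embedding(v) for v in vectors)
    segments.append(Text("Which item would the recommender suggest next?"))
    return segments


titles = ["Alien", "Blade Runner"]
vectors = [[0.1, 0.9], [0.4, 0.3]]  # hypothetical item embeddings
for mode in ("behavior", "intention", "hybrid"):
    kinds = [type(s).__name__ for s in build_input(mode, titles, vectors)]
    print(mode, kinds)
```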
The general principle: when you need to interpret a black-box model, behavioral mimicry and internal-state inspection are complementary. Each alone is partial: behavioral mimicry misses the mechanism, and internal inspection misses the human-readable explanation. Combining them aims to produce explanations that are both faithful to the target and intelligible to users. The pattern plausibly generalizes beyond recommendation to other model-interpretation problems where both the outputs and the internal states are accessible.
Inquiring lines that use this note as a source 14
This note is a source for these synthesized inquiries.
- How does explanation fluency mislead users about actual recommendation procedures?
- Can alignment techniques make LLM explainers match their recommendation behavior?
- Can topic embeddings make RL dialogue recommendations interpretable to clinicians?
- How do aspect-aware retrieval and surrogate models compare as explainability approaches?
- Can persona-attention mechanisms explain recommendations better than external surrogate models?
- How can aspect extraction from reviews personalize recommendation explanations?
- How does optimizing model performance decouple from optimizing user interpretability?
- Why do user studies of explanations fail to predict deployed effectiveness?
- What makes a neural network circuit actually interpretable to humans?
- How do large pretrained language models scale the unified recommendation paradigm?
- Can models be trained to explain instead of imitate answers?
- Should explanation quality be measured by user satisfaction or behavior prediction?
- How can we probe LLM representations in channels that training did not target?
- How faithful are natural language explanations from LLMs really?
Related concepts in this collection 4
- Do LLM explanations faithfully describe their recommendation process?
  When LLMs recommend items to groups, do their explanations match how they actually made the choice? This matters because users trust explanations to understand AI decision-making.
  tension with: RecExplainer tries to align the LLM explainer's behavior with the underlying model, which is exactly the alignment that an LLM-as-explainer fails to achieve by default
- Can retrieval enhancement fix explainable recommendations for sparse users?
  When users have few historical interactions, embedded recommendation models struggle to generate personalized explanations. Can augmenting sparse histories with retrieved relevant reviews, selected by aspect, overcome this fundamental data limitation?
  complements: surrogate-model interpretability and aspect-aware retrieval are alternative answers to the explainable-recommendation problem
- Can attention mechanisms reveal which user taste explains each recommendation?
  Single-vector user models collapse diverse tastes into one representation, losing expressiveness. Can weighting multiple personas by item relevance surface the right taste at the right time while making recommendations traceable?
  complements: persona-attention explains via the recommender's own structure, while RecExplainer trains an external LLM to mimic it; these are different routes to interpretability
- Does processing ease mislead users about their own competence?
  When AI generates polished output, do users mistake the fluency of that output as evidence of their own understanding or skill? This matters because it could systematically inflate self-assessment across millions of AI interactions.
  tension with: LLM-generated explanations are fluent regardless of fidelity, so the trust risk is that surrogate output reads as authoritative even when alignment fails
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- RecExplainer: Aligning Large Language Models for Recommendation Model Interpretability
- Exploring the Impact of Large Language Models on Recommender Systems: An Extensive Review
- Consistent Explainers or Unreliable Narrators? Understanding LLM-generated Group Recommendations
- Large Language Models are Zero-Shot Rankers for Recommender Systems
- A Multi-facet Paradigm to Bridge Large Language Model and Recommendation
- Large Language Models for User Interest Journeys
- Leveraging Large Language Models in Conversational Recommender Systems
- CoLLM: Integrating Collaborative Embeddings into Large Language Models for Recommendation
Original note title
RecExplainer uses LLM as surrogate model with three alignment methods — behavior intention and hybrid for recommendation interpretability