How does candidate-conditional activation differ from static embedding-based feature crosses?

This explores a recommendation-systems distinction: computing a representation dynamically in light of the specific candidate being scored (candidate-conditional activation) versus precomputing fixed feature interactions from embeddings that never see the candidate (static feature crosses).

The corpus doesn't address recommendation feature crosses head-on, but it has sharp material on the underlying gap: the difference between representations that are computed once and representations that are computed on demand. The cleanest framing comes from the observation that embeddings measure *semantic association, not task relevance* (see "Do vector embeddings actually measure task relevance?"). A static cross inherits exactly this limitation: two items can sit close in embedding space because they co-occur, even when one is the wrong answer for the current query. Candidate-conditional activation is, in effect, a bet that task relevance can be recovered only by letting the candidate participate in the computation, rather than comparing it against a frozen summary.
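To make the contrast concrete, here is a minimal sketch under stated assumptions. It stands in for a static cross with a mean-pooled user vector that never sees the candidate, and for candidate-conditional activation with attention-style pooling in which the candidate re-weights the user's history. The function names, the toy 2-d embeddings, and the choice of softmax attention are all illustrative assumptions, not a method taken from the corpus.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def static_user_vector(history):
    """Mean-pool the history once; the candidate never participates."""
    n = len(history)
    return [sum(col) / n for col in zip(*history)]

def conditional_user_vector(history, candidate):
    """Re-weight the history by each item's affinity to this candidate."""
    weights = softmax([dot(h, candidate) for h in history])
    dim = len(candidate)
    return [sum(w * h[i] for w, h in zip(weights, history)) for i in range(dim)]

# Hypothetical toy embeddings: axis 0 ~ "cooking", axis 1 ~ "cycling".
history = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
cand_a = [1.0, 0.0]  # a cooking item
cand_b = [0.0, 1.0]  # a cycling item

static_vec = static_user_vector(history)           # identical for every candidate
cond_a = conditional_user_vector(history, cand_a)  # leans toward cooking history
cond_b = conditional_user_vector(history, cand_b)  # leans toward cycling history

static_score_b = dot(static_vec, cand_b)
conditional_score_b = dot(cond_b, cand_b)
```

In this toy setup the static vector is computed once and reused, so the minority interest (cycling) is always diluted by the majority. The conditional vector is rebuilt per candidate, which is the sense in which the candidate "participates in the computation".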

The static side of the ledger isn't empty of meaning, though, and that's the interesting tension. Static embeddings genuinely encode rich content (valence, concreteness, even taboo) before any attention or interaction fires (see "Do transformer static embeddings actually encode semantic meaning?"). So a precomputed cross isn't 'dumb'; it carries real lexical signal. The problem is that the signal is *about the item in isolation*, not about the item in context. This is the same failure mode that appears when strong parametric associations override the context a model is handed (see "Why do language models ignore information in their context?"): a fixed representation will confidently reuse what it already 'knows' about an item instead of reconditioning on the situation in front of it.

Candidate-conditional activation has a clear cousin in the inference-time composition literature. Transformer² shows models composing task-specific expert vectors *at inference*, mixing them dynamically per input rather than committing to one frozen weight configuration (see "Can models dynamically activate expert skills at inference time?"). That's the same move recommendation systems make when they let the candidate gate which features light up: the representation is assembled for this scoring event, not retrieved from a cache. There's even a hint about *why* this matters: representational density is learned, with models producing dense activations for familiar inputs and defaulting to sparse ones for unfamiliar territory (see "Is representational sparsity learned or intrinsic to neural networks?"). On that reading, conditional activation is a way to push a model toward dense, engaged computation for the specific pairing rather than a generic, pre-baked one.
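The per-input mixing idea can be sketched as a toy. The sketch assumes each expert is a vector of multiplicative scales applied to a layer's singular values, and that the mixing logits come from some upstream scorer. Both are simplifications for illustration: `mix_experts`, `rescale`, and the numbers below are hypothetical, and this is not Transformer²'s actual procedure.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def mix_experts(expert_vectors, mix_logits):
    """Blend expert vectors with input-dependent weights (softmax of the logits)."""
    alphas = softmax(mix_logits)
    dim = len(expert_vectors[0])
    return [sum(a * z[i] for a, z in zip(alphas, expert_vectors)) for i in range(dim)]

def rescale(base_singular_values, mixed_vector):
    """Apply the blended expert vector as element-wise scales."""
    return [s * z for s, z in zip(base_singular_values, mixed_vector)]

# Hypothetical values: two experts over a layer with two singular values.
experts = [[2.0, 1.0], [0.5, 1.0]]
base = [3.0, 2.0]

mixed_even = mix_experts(experts, [0.0, 0.0])     # input favors neither expert
mixed_skewed = mix_experts(experts, [10.0, 0.0])  # input strongly favors expert 0

scaled_even = rescale(base, mixed_even)
scaled_skewed = rescale(base, mixed_skewed)
```

The point of the sketch is only that the effective configuration is assembled per input: the same base values yield different scaled values depending on the logits, with nothing cached between calls.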

The most direct recsys-flavored counterpoint is VQ-Rec, which *decouples* item text from the recommender by mapping text into discrete codes that index learned, adaptable embeddings. This deliberately breaks the tight, static coupling between an item's text and its representation, so the lookup tables can adapt without retraining the text encoder (see "Can discretizing text embeddings improve recommendation transfer?"). Read alongside the question, this reframes the whole debate: a static feature cross hard-wires the text-to-relevance mapping, while VQ-Rec's decoupling and candidate-conditional activation are two different escapes from that rigidity. A related instinct shows up in zero-shot recognition, where routing through a natural-language *description* of the candidate beats direct embedding similarity (see "Can describing images in text improve zero-shot recognition?"). Again, conditioning the comparison on a richer, candidate-specific signal outperforms a flat distance in embedding space.
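The decoupling described for VQ-Rec can be sketched roughly as follows. The sketch assumes product quantization in its simplest form: the text embedding is split into equal sub-vectors, each is assigned to its nearest centroid, and the resulting codes index separate embedding tables. The codebooks, tables, and sum pooling are made-up illustrative choices, not the published configuration.

```python
def nearest(centroids, vec):
    """Index of the centroid closest to vec (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((c - v) ** 2 for c, v in zip(centroids[k], vec)),
    )

def encode(text_embedding, codebooks):
    """Split the embedding into len(codebooks) sub-vectors; one discrete code each."""
    sub_dim = len(text_embedding) // len(codebooks)
    codes = []
    for m, centroids in enumerate(codebooks):
        sub = text_embedding[m * sub_dim:(m + 1) * sub_dim]
        codes.append(nearest(centroids, sub))
    return codes

def lookup(codes, tables):
    """Sum the learned embeddings indexed by the codes (pooling choice is an assumption)."""
    dim = len(tables[0][0])
    return [sum(tables[m][code][i] for m, code in enumerate(codes)) for i in range(dim)]

# Hypothetical frozen text embedding and codebooks (two sub-spaces, two centroids each).
text_embedding = [0.9, 0.1, 0.2, 0.8]
codebooks = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
codes = encode(text_embedding, codebooks)

# Learned tables for a source domain, then an adapted copy for a new domain.
source_tables = [[[0.5, 0.0], [0.0, 0.5]], [[0.1, 0.1], [0.3, 0.2]]]
target_tables = [[[0.0, 0.9], [0.0, 0.5]], [[0.1, 0.1], [0.3, 0.2]]]

item_source = lookup(codes, source_tables)
item_target = lookup(codes, target_tables)  # same codes, different representation
```

The item's codes are fixed by its text, but the representation those codes resolve to lives in the tables, so it can change per domain while the text encoder and the codes stay untouched.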

The thing worth walking away with: the static-vs-conditional split isn't really about architecture; it's about *when relevance is decided*. Static crosses decide it at indexing time, on the basis of association; conditional activation defers the decision to scoring time, where the candidate gets to reshape the representation. Several corners of this corpus (task relevance vs association, inference-time expert mixing, context losing to priors) converge on the same lesson: freezing a representation too early trades adaptivity for speed, and the cost shows up precisely on the underspecified, wrong-but-associated cases.

Sources (7 notes)

Do vector embeddings actually measure task relevance?

Embeddings encode co-occurrence patterns, making semantically close but role-distinct concepts highly similar. This works in simple demos but fails in production where underspecified queries have many wrong-but-associated candidates.

Do transformer static embeddings actually encode semantic meaning?

Clustering analysis of RoBERTa embeddings reveals sensitivity to five psycholinguistic measures including valence, concreteness, iconicity, and taboo. This demonstrates that static embeddings function as genuine lexical entries containing semantic content before self-attention operates.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Can models dynamically activate expert skills at inference time?

Transformer² demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.

Is representational sparsity learned or intrinsic to neural networks?

During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.

Can discretizing text embeddings improve recommendation transfer?

VQ-Rec uses product quantization to map item text to discrete codes that index learned embeddings, breaking the tight coupling between text and recommendations. This decoupling prevents text-similarity bias and allows lookup tables to adapt to new domains without retraining the text encoder.

Can describing images in text improve zero-shot recognition?

SignRAG demonstrates that describing an unknown image via a vision-language model, then retrieving known designs from a text-indexed database, eliminates the need for recognition model training. Natural-language description bridges the visual-reference gap better than direct embedding similarity.
