INQUIRING LINE

How do input length constraints reshape personalization system design choices?

This explores how the hard limit on what you can fit into a model's context window (or onto a small device) forces personalization systems to choose what to throw away — and how that compression pressure, rather than being a nuisance, often points toward the design choices that actually work best.


This reads the question as: once you accept you can't feed a model everything you know about a user, what do you keep? The corpus has a surprisingly consistent answer. Compression isn't just a tax you pay for limited input length; the act of deciding what to drop tends to surface the signal that personalization actually runs on. The most striking finding is that abstracting a user into preference summaries beats retrieving their raw past interactions. The PRIME work shows that semantic memory (distilled preference knowledge) consistently outperforms episodic memory that pulls back specific logged exchanges (see: Does abstract preference knowledge outperform specific interaction recall?). So the length constraint doesn't degrade the system; forcing abstraction improves it.

The same theme shows up from a different angle: you can often discard half the data. Profiles built only from a user's *outputs* match or beat profiles built from everything, while input-only profiles actively hurt, because personalization travels through style and preference, not the semantic content of what someone asked (see: Do user outputs outperform inputs for LLM personalization?). That's a direct design lever for a tight budget: when you must cut, cut the queries and keep the outputs. And when you compress, *how* you encode matters: learned text summaries condition reward models more effectively than embedding vectors, capturing dimensions that zero-shot summaries miss while staying short and human-readable (see: Can text summaries beat embeddings for personalized reward models?).
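The "cut the queries, keep the outputs" lever can be sketched as a small budgeting routine. This is a minimal illustration, not the method from the cited work: the `Interaction` type, the `build_output_profile` name, the whitespace token count, and the newest-first ordering are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Interaction:
    user_input: str   # what the user asked; dropped under a tight budget
    user_output: str  # what the user wrote or approved; kept


def build_output_profile(history: List[Interaction], max_tokens: int) -> List[str]:
    """Keep the most recent user outputs that fit within max_tokens.

    Tokens are approximated by whitespace splitting (an assumption; a real
    system would use the model's own tokenizer).
    """
    kept: List[str] = []
    used = 0
    for item in reversed(history):  # walk from newest to oldest
        cost = len(item.user_output.split())
        if used + cost > max_tokens:
            break  # budget exhausted; older outputs are discarded
        kept.append(item.user_output)
        used += cost
    kept.reverse()  # restore chronological order for the prompt
    return kept
```

Inputs never enter the profile at all, so the whole budget goes to the text that carries style and preference.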

Pushed further, the constraint stops being about text length at all and becomes about parameters. PReF reduces a whole user down to a handful of reward coefficients inferred from roughly ten adaptive questions, so personalization is carried in a few numbers rather than a long history (see: Can user preferences be learned from just ten questions?). Lightweight trait adapters take this into the model itself, encoding personality into under 0.1% extra weights spread across transformer layers and sidestepping the prompt entirely (see: Can we control personality in language models without prompting?). These are what 'design choices reshaped by length limits' looks like when you stop trying to fit context in the window and instead bake it into the model.
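To make "a user as a handful of coefficients" concrete, here is a rough sketch that fits a coefficient vector from pairwise answers. It assumes a Bradley-Terry-style logistic model over fixed base reward features, which is a stand-in chosen for illustration and not necessarily PReF's actual estimator. The active selection of questions is omitted, and `fit_user_coefficients` and `personalized_reward` are hypothetical names.

```python
import numpy as np


def fit_user_coefficients(feature_diffs, choices, steps=500, lr=0.5, l2=0.01):
    """Fit w so that sigmoid(w . (phi_a - phi_b)) matches the user's choices.

    feature_diffs: (n_questions, k) array of base-reward differences phi(a) - phi(b)
    choices:       (n_questions,) array, 1 if the user preferred a, else 0
    """
    X = np.asarray(feature_diffs, dtype=float)
    y = np.asarray(choices, dtype=float)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))       # predicted P(user prefers a)
        grad = X.T @ (y - p) / len(y) - l2 * w   # regularized log-likelihood gradient
        w += lr * grad                           # gradient ascent step
    return w


def personalized_reward(w, base_rewards):
    """Score a response as a weighted sum of its base reward values."""
    return float(np.dot(w, np.asarray(base_rewards, dtype=float)))
```

With ten answered comparisons, the entire user model is the vector `w`, which costs nothing in prompt length.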

There's a real cost to over-compressing, though, and the corpus names it: persona sparsity. Squeeze a user down too far and an LLM judge loses predictive power on their specific preferences. The fix isn't more data but letting the model abstain when it's uncertain rather than guess (see: Why do LLM judges fail at predicting sparse user preferences?). The granularity framework maps the whole tradeoff space: user-level is most precise but starves on data, persona-level scales but needs domain knowledge, and global aggregates away the individual (see: How do personalization granularity levels trade precision against scalability?). Choosing where to sit on that ladder *is* the length-constraint decision.
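The abstention idea can be sketched as a confidence gate in front of the judge's verdict. This is an illustrative sketch only: the `Judgment` type, the function names, and the 0.7 threshold are assumptions, and how the judge produces its self-reported confidence is left out.

```python
from typing import List, NamedTuple, Optional, Tuple


class Judgment(NamedTuple):
    preferred: str     # "A" or "B": the option the judge thinks the user prefers
    confidence: float  # judge's self-reported certainty in [0, 1]


def decide_or_abstain(judgment: Judgment, min_confidence: float = 0.7) -> Optional[str]:
    """Return the verdict only when the judge is confident enough, else None."""
    if judgment.confidence >= min_confidence:
        return judgment.preferred
    return None  # abstain instead of forcing a guess


def selective_accuracy(
    judgments: List[Judgment], truths: List[str], min_confidence: float = 0.7
) -> Tuple[float, float]:
    """Return (coverage, accuracy on the items the judge chose to answer)."""
    answered = 0
    correct = 0
    for judgment, truth in zip(judgments, truths):
        decision = decide_or_abstain(judgment, min_confidence)
        if decision is None:
            continue
        answered += 1
        correct += int(decision == truth)
    coverage = answered / len(judgments) if judgments else 0.0
    accuracy = correct / answered if answered else 0.0
    return coverage, accuracy
```

Raising `min_confidence` trades coverage for reliability, which is the same tradeoff the sparse-persona finding describes.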

What the curious reader might not expect: the hardest constraints come from hardware, not prompts. On a phone, DRAM and battery, not quality preferences, force sub-billion-parameter models; a 7B model drains a 50 kJ battery in under two hours while a 350M model runs all day (see: What actually limits language models on mobile phones?). So the deepest version of this question isn't 'how much context fits' but 'how much intelligence fits on the device at all', which is exactly why approaches that move personalization out of the input window and into compact preferences, coefficients, or adapter weights keep winning.
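The battery claim can be checked with back-of-envelope arithmetic. Only the 50 kJ capacity and the two model sizes come from the text; the energy cost of 0.1 J per token per billion parameters and the decoding rate of 10 tokens per second are assumptions chosen for illustration.

```python
BATTERY_JOULES = 50_000                  # 50 kJ, from the text
JOULES_PER_TOKEN_PER_B_PARAMS = 0.1      # assumption: energy scales with model size
TOKENS_PER_SECOND = 10                   # assumption: steady conversational decoding


def battery_hours(params_billions: float) -> float:
    """Hours of continuous decoding before the battery is empty."""
    joules_per_token = JOULES_PER_TOKEN_PER_B_PARAMS * params_billions
    watts = joules_per_token * TOKENS_PER_SECOND  # joules per second
    return BATTERY_JOULES / watts / 3600.0


hours_7b = battery_hours(7.0)      # 7B-parameter model
hours_350m = battery_hours(0.35)   # 350M-parameter model
```

Under these assumptions the 7B model lasts just under two hours and the 350M model lasts more than a day, which matches the figures quoted above.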


Sources 8 notes

Does abstract preference knowledge outperform specific interaction recall?

PRIME framework shows semantic memory (preference summaries, parametric encodings) consistently beats episodic memory (retrieved past interactions) across models. Recency-based recall outperforms similarity-based retrieval, and task fine-tuning exceeds preference tuning methods.

Do user outputs outperform inputs for LLM personalization?

Research shows that user profiles built from outputs alone match or exceed performance of complete profiles across multiple tasks, while input-only profiles degrade performance. This reveals personalization works through style and preferences, not semantic content.

Can text summaries beat embeddings for personalized reward models?

PLUS trains summarizers and reward models jointly, learning that text-based preference summaries capture dimensions zero-shot summaries miss. These summaries transfer to GPT-4 for zero-shot personalization and remain interpretable to users.

Can user preferences be learned from just ten questions?

PReF learns base reward functions from preference data, then uses active learning to select maximally informative questions that reduce coefficient uncertainty. Users can be personalized via inference-time reward alignment without weight modification.

Can we control personality in language models without prompting?

PsychAdapter modifies every transformer layer with <0.1% additional parameters to achieve 87.3% Big Five accuracy and 96.7% depression/life satisfaction accuracy across GPT-2, Gemma, and Llama 3. This architecture-level approach bypasses prompt resistance entirely.

Why do LLM judges fail at predicting sparse user preferences?

Sparse persona information lacks predictive power for specific preferences, causing LLM judges to fail. Verbal uncertainty estimation recovers reliability above 80% on high-certainty samples by allowing abstention rather than forced judgment.

How do personalization granularity levels trade precision against scalability?

User-level personalization maximizes precision but faces data sparsity; persona-level scales better but requires domain knowledge; global preference is broadest but aggregates away individual differences. Four technique categories (RAG, prompting, representation, RLHF) map across these levels.

What actually limits language models on mobile phones?

Smartphones' DRAM budgets and battery capacity make sub-billion-parameter models the only sustainable option for mobile deployment. A 7B model drains a 50 kJ battery in under two hours, while a 350M model can run conversational AI for a full day on the same device.
