TOPIC

Context Engineering

14 synthesis notes · 20 source papers

Can a reasoning model's thinking trace compress context effectively?

Does the raw reasoning trace produced by a thinking model naturally function as a context compressor without specialized training or modules? And how does this compare to dedicated compression methods?

How much should we trust AI-generated data in inference?

Most AI workflows treat synthetic data with implicit full trust, but should there be an explicit parameter controlling how heavily AI outputs influence downstream reasoning and decision-making?

Can language models learn skills without human supervision?

Can a three-role self-play system—Challenger, Reasoner, Judge—bootstrap natural-language skills from raw context alone, without human labels or external reward signals?

Why can language models understand context better than generate it?

Models absorb and process rich input context far more effectively than they produce similarly sophisticated outputs. Understanding this asymmetry could reshape how we design systems to compensate for generative limitations.

Can context playbooks prevent knowledge loss during iteration?

When AI systems iteratively refine their instructions and memories, do structured incremental updates better preserve domain knowledge than traditional rewriting? This matters because context degradation undermines long-term agent performance.

Can external managers compress context better than frozen agents?

Can offloading context management to a trained external system adapt compression strategies to each agent's strengths, rather than forcing agents to manage their own context constraints?

How much does demo position alone affect in-context learning accuracy?

Moving demonstrations from prompt start to end without changing their content produces surprisingly large accuracy swings. Does spatial position in the prompt matter more than what demonstrations actually contain?

Do foundation models actually reduce our need for real data?

As AI systems grow more powerful, does empirical observation become less necessary? This note asks whether foundation models can substitute for ground truth or whether they instead demand stronger empirical anchoring.

Can frozen models learn better by extracting context into skills?

When a model encounters unfamiliar material in its context, can we help it reason more effectively by explicitly extracting rules and procedures from that material rather than changing the model itself?

Can length generalization transfer between different related tasks?

Can a model trained on longer sequences in one task learn to handle longer inputs in a related task without explicit training? This matters for understanding how neural networks reuse computational strategies across problems.

Should we treat LLM outputs as real empirical data?

Can synthetic text generated by language models serve as evidence in the same way observations from the world do? This matters because researchers increasingly rely on AI-generated content without accounting for its fundamentally different epistemic status.

Source papers 20

The arXiv papers behind this sub-topic. Links may take you off-site to arxiv.org.

A Survey of Context Engineering for Large Language Models
The performance of Large Language Models (LLMs) is fundamentally determined by the contextual information provided during inference. This survey introduces Context Engineering, a formal disc…
Activation Steering for Chain-of-Thought Compression
Large language models (LLMs) excel at complex reasoning when they include intermediate steps, known as chains of thought (CoTs). However, these rationales are often overly verbose, even for simple pro…
Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models
Large language model (LLM) applications such as agents and domain-specific reasoning increasingly rely on context adaptation—modifying inputs with instructions, strategies, or evidence, rather than we…
Behavioral Exploration: Learning to Explore via In-Context Adaptation
While humans are able to achieve such fast online exploration and adaptation, often acquiring new information and skills in only a handful of interactions, existing algorithmic approaches tend to rely…
Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning?
The remarkable capability of Transformers to do reasoning and few-shot learning, without any fine-tuning, is widely conjectured to stem from their ability to implicitly simulate a multi-step algorithm…
Context Tuning for Retrieval Augmented Generation
Large language models (LLMs) have the remarkable ability to solve new tasks with just a few examples, but they need access to the right tools. Retrieval Augmented Generation (RAG) addresses this prob…
Extrapolation by Association: Length Generalization Transfer in Transformers
Transformer language models have demonstrated impressive generalization capabilities in natural language domains, yet we lack a fine-grained understanding of how such generalization arises. In this pa…
Foundation Priors
Foundation models, and in particular large language models, can generate highly informative responses, prompting growing interest in using these “synthetic” outputs as data in empirical research and d…
From Context to Skills: Can Language Models Learn from Context Skillfully?
Many real-world tasks require language models (LMs) to reason over complex contexts that exceed their parametric knowledge. This calls for context learning, where LMs directly learn relevant knowledge…
From Model Scaling to System Scaling: Scaling the Harness in Agentic AI
This paper studies the next major bottleneck in agentic AI as system scaling, not only model scaling: the design of auditable, persistent, modular, and verifiable architectures around foundation model…
Language Models Need Sleep
Transformer-based large language models are increasingly used for long-horizon tasks; however, their attention mechanism scales poorly with context length. To handle this, we study a sleep-like consol…
Learning Agent-Compatible Context Management for Long-Horizon Tasks
LLM agents increasingly face long-horizon tasks such as web search and deep research in real-world applications, where accumulated context can cause long-context degradation and reasoning failures. Pr…
Learning, Fast and Slow: Towards LLMs That Adapt Continually
Large language models (LLMs) are trained for downstream tasks by updating their parameters (e.g., via RL). However, updating parameters forces them to absorb task-specific information, which can resul…
Memory in the Age of AI Agents: A Survey — Forms, Functions and Dynamics
Memory has emerged, and will continue to remain, a core capability of foundation model-based agents. It underpins long-horizon reasoning, continual adaptation, and effective interaction with complex e…
Recursive Language Models
We study allowing large language models (LLMs) to process arbitrarily long prompts through the lens of inference-time scaling. We propose Recursive Language Models (RLMs), a general inference strategy…
Test-time Prompt Intervention
Test-time compute has led to remarkable success in the large language model (LLM) community, particularly for complex tasks, where longer chains of thought (CoTs) are generated to enhance reasoning ca…
The AI Hippocampus: How Far are We From Human Memory?
Memory plays a foundational role in augmenting the reasoning, adaptability, and contextual fidelity of modern Large Language Models (LLMs) and Multi-Modal LLMs (MLLMs). As these models transition from…
The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs
Sparse attention offers a promising strategy to extend long-context capabilities in Transformer LLMs, yet its efficiency–accuracy trade-offs remain unclear due to the lack of comprehensive evaluation.…
Thinking as Compression: Your Reasoning Model is Secretly a Context Compressor
Context compression aims to shorten long context inputs with minimal information loss for LLM inference acceleration. While existing methods have shown promise, they typically rely on complex compress…
Where to show Demos in Your Prompt: A Positional Bias of In-Context Learning
In-context learning (ICL) is a critical emerging capability of large language models (LLMs), enabling few-shot learning during inference by including a few demonstrations (demos) in the prompt. Howeve…