TOPIC

LLM Architecture

23 synthesis notes · 74 source papers

View as

Can optimizing attention patterns improve multimodal RL better than optimizing tokens?

Standard RL training optimizes token outputs in multimodal models, but the real bottleneck may be where the model attends to visual information. Does steering attention directly outperform indirect optimization through final outputs?

Why do decoder-only models underperform as text encoders?

Decoder-only LLMs use causal attention, which limits each token to seeing only prior context. This explores whether removing this constraint could make them competitive universal encoders without architectural redesign.

Can we prune training data without hurting model performance?

This explores whether difficulty metrics can identify redundant training examples that can be safely removed. It matters because most datasets contain massive waste — if we can find which examples are truly necessary, we could train better models on far less data.

Can embedding future information in training data improve planning?

This explores whether inserting lookahead tokens containing future goals into training sequences helps models learn long-range planning without changing their architecture. The question matters because it tests whether data-level changes can produce architectural-level reasoning improvements.

Do embedding dimensions fundamentally limit retrievable document combinations?

Can single-vector embeddings represent any top-k document subset a user might need? Research using communication complexity theory suggests there are hard geometric limits independent of training data or model architecture.

Can models learn working memory by attending to their own latents?

Can a feedback loop letting transformers attend to their own internal representations enable them to process indefinitely long sequences without adding extra weights? This explores whether working memory can emerge from self-attention rather than external modules.

Does fixed sparsity work for all sequence lengths?

Production systems often apply the same sparsity budget regardless of input length. Does this one-size-fits-all approach actually work across short and long contexts, or does optimal sparsity vary with sequence length?

Can transformers learn to solve new problems within episodes?

Explores whether transformer models can develop meta-learning abilities through RL training, enabling them to adapt to unseen environments by learning from within-episode experience alone, without updating weights.

Can text-trained models compress images better than specialized tools?

Do general-purpose language models trained only on text outperform domain-specific compressors like PNG and FLAC on their native data? This tests whether compression ability is universal or requires domain specialization.

Does sparse attention trade off quality for speed?

When sparse attention is compared fairly—larger sparse models versus smaller dense ones at the same compute cost—does it still represent a quality-cost trade-off, or does it actually improve performance?

Do language models sparsify their activations under difficult tasks?

When LLMs encounter unfamiliar or difficult inputs, do their internal representations become sparser rather than denser? Understanding this adaptive response could reveal how models stabilize reasoning under uncertainty.

Can LLMs reconstruct censored knowledge from scattered training hints?

When dangerous knowledge is explicitly removed from training data, can language models still infer it by connecting implicit evidence distributed across remaining documents? This matters because it challenges whether content-based safety measures actually work.

Can neural memory modules scale language models beyond attention limits?

Can separating short-term attention from adaptive long-term memory allow models to efficiently handle context windows exceeding 2M tokens while maintaining competitive performance?

Does optimal language model learning maximize data compression?

Can we derive principles for accelerating LM training by framing it as lossless compression? What does the optimal learning process look like when compression is the objective?

Why do accurate predictions lead to poor decisions?

Predictive models are built to fit data, not to optimize decision outcomes. This note explores when and why accurate forecasts fail to produce good choices.

Is representational sparsity learned or intrinsic to neural networks?

Explores whether sparsity in neural network activations is engineered through training or emerges as a default response to unfamiliar inputs. Understanding this distinction could reshape how we design and interpret model behavior.

Can transformers improve exponentially by learning from their own correct solutions?

Can standard transformers achieve extreme length generalization by iteratively filtering and training on their own correct outputs? This explores whether self-correction loops enable unbounded out-of-distribution improvement without architectural changes.

How much sparsity can different reasoning tasks actually tolerate?

Different NLP tasks show vastly different tolerance for sparse attention—from 95% on simple QA to 50-67% on multi-hop reasoning. What structural differences explain this variation, and how should it shape deployment decisions?

Can representation sparsity order few-shot demonstrations effectively?

Does measuring how sparse a model's hidden states are for each example provide a reliable signal for ordering few-shot demonstrations in prompts? This matters because curriculum ordering significantly affects in-context learning performance.

Do strict output formats hurt LLM reasoning ability?

When LLMs must produce structured JSON or XML with specific schemas, does this constrain their capacity for complex reasoning? This matters because production systems often enforce strict formats for parsing convenience.

Why do neural networks fail at compositional generalization?

Exploring whether the binding problem from neuroscience explains neural networks' inability to systematically generalize. The binding problem has three aspects—segregation, representation, and composition—each creating distinct failure modes in how networks handle structured information.

Can training data augmentation match test-time compute scaling benefits?

Can generating thinking trajectories during pretraining unlock the same efficiency gains that test-time scaling provides at inference? This explores whether the compute-allocation principle works across the training-inference boundary.

Does verbose chain-of-thought actually help multimodal perception tasks?

Extending RLHF to MLLMs through longer rationales follows the successful reasoning playbook, but may backfire on perception tasks. This explores when and why the standard CoT-and-RL recipe fails.

Source papers 74

The Arxiv papers behind this sub-topic. Links may take you off-site to arxiv.org.

A Survey on Large Language Models with some Insights on their Capabilities and Limitations
The rapid advancement of artificial intelligence, particularly with the development of Large Language Models (LLMs) built on the transformer architecture, has redefined the capabilities of natural lan…
Agent Learning via Early Experience
A long-term goal of language agents is to learn and improve through their own experience, ultimately outperforming humans in complex, real-world tasks. However, training agents from experience data wi…
Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research
Agentic Reasoning, a framework1 that enhances large language model (LLM) reasoning by integrating external tool-using agents. Unlike conventional LLM-based reasoning approaches, which rely solely on i…
All AI Models are Wrong, but Some are Optimal
Abstract—AI models that predict the future behavior of a system (a.k.a. predictive AI models) are central to intelligent decision-making. However, decision-making using predictive AI models often resu…
Are Emergent Abilities in Large Language Models just In-Context Learning?
We present a novel theory that explains emergent abilities, taking into account their potential confounding factors, and rigorously substantiate this theory through over 1000 experiments. Our findings…
Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics
Do large language models (LLMs) solve reasoning tasks by learning robust generalizable algorithms, or do they memorize training data? To investigate this question, we use arithmetic reasoning as a rep…
Artifacts as Memory Beyond the Agent Boundary
The situated view of cognition holds that intelligent behavior depends not only on internal memory, but on an agent’s active use of environmental resources. Here, we begin formalizing this intuition w…
Ask, and it shall be given: Turing completeness of prompting
Since the success of GPT, large language models (LLMs) have been revolutionizing machine learning and have initiated the so-called LLM prompting paradigm. In the era of LLMs, people train a single gen…
Ask-AC: An Initiative Advisor-in-the-Loop Actor-Critic Framework
Despite the promising results achieved, state-of-the art interactive reinforcement learning schemes rely on passively receiving supervision signals from advisor experts, in the form of either continuo…
Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases
Pretraining language models on formal language can improve their acquisition of natural language. Which features of the formal language impart an inductive bias that leads to effective transfer? Drawi…
Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning
To break the context limits of large language models (LLMs) that bottleneck reasoning accuracy and efficiency, we propose the Thread Inference Model (TIM1), a family of LLMs trained for recursive and …
Beyond neural scaling laws: beating power law scaling via data pruning
Widely observed neural scaling laws, in which error falls off as a power of the training set size, model size, or both, have driven substantial performance improvements in deep learning. However, thes…
Can Language Models Serve as Text-Based World Simulators?
Broadly speaking, there are two ways to leverage LLMs in the context of world modeling and simulation. The first is neurosymbolic: a number of efforts use language models to generate code in a symboli…
Compositional Reasoning with Transformers, RNNs, and Chain of Thought
Large language models [Touvron et al., 2023, Anil et al., 2023, Achiam et al., 2023] are increasingly used to perform logical reasoning and other problems that require algorithmic thinking. To underst…
Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models
While Mixture-of-Experts (MoE) scales capacity via conditional computation, Transformers lack a native primitive for knowledge lookup, forcing them to inefficiently simulate retrieval through computat…
Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data
![A screenshot of a computer](/assets/paper-images/ConnectingTheDots.png) One way to address safety risks from large language models (LLMs) is to censor dangerous knowledge from their training data. …
DataComp-LM: In search of the next generation of training sets for language models
A key challenge in this emerging research area is a lack of controlled comparisons. While the aforementioned proposals generally use the same evaluation datasets, researchers often compare models that…
DeepNet: Scaling Transformers to 1,000 Layers
In this paper, we propose a simple yet effective method to stabilize extremely deep Transformers. Specifically, we introduce a new normalization function (DEEPNORM) to modify the residual connection i…
Discovering Latent Concepts Learned in BERT
A large number of studies that analyze deep neural network models and their ability to encode various linguistic and non-linguistic concepts provide an interpretation of the inner mechanics of these m…
Eliciting Reasoning in Language Models with Cognitive Tools
The recent advent of reasoning models like OpenAI’s o1 was met with excited speculation by the AI community about the mechanisms underlying these capabilities in closed models, followed by a rush of r…
Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs
In this work, we investigate how Large Language Models (LLMs) adapt their internal representations when encountering inputs of increasing difficulty, quantified as the degree of out-of-distribution (O…
Foundations of Large Language Models
The main part of BERT models is a multi-layer Transformer network. A Transformer layer consists of a self-attention sub-layer and an FFN sub-layer. Both of them follow the post-norm architecture: outp…
From Trial-and-Error to Improvement: A Systematic Analysis of LLM Exploration Mechanisms in RLVR
Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs). Unlike traditional RL approaches, RLV…
Has the Creativity of Large-Language Models peaked? —an analysis of inter- and intra-LLM variability —
from creative writing and survey responses to research idea generation (Doshi and Hauser, 2024; Anderson et al., 2024; Moon et al., 2024). For instance, stories written with ChatGPT assistance were mo…
Hogwild! Inference: Parallel LLM Generation via Concurrent Attention
Large Language Models (LLMs) have demonstrated the ability to tackle increasingly complex tasks through advanced reasoning, long-form content generation, and tool use. Solving these tasks often involv…
Holy Grail 2.0: From Natural Language to Constraint Models
Twenty-seven years ago, E. Freuder highlighted that "Constraint programming represents one of the closest approaches computer science has yet made to the Holy Grail of programming: the user states the…
Hybrid Code Networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning
End-to-end learning of recurrent neural networks (RNNs) is an attractive solution for dialog systems; however, current techniques are data-intensive and require thousands of dialogs to learn simple be…
LIMI: Less is More for Agency
We define “Agency” as the emergent capacity of AI systems to function as autonomous agents—actively discovering problems, formulating hypotheses, and executing solutions through self-directed engageme…
LLM Post-Training: A Deep Dive into Reasoning Large Language Models
Some policy gradient approaches are explained below: Policy Gradient (REINFORCE). The REINFORCE algorithm [114, 115] is a method used to improve decision-making by adjusting the model’s strategy (poli…
LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders
Large decoder-only language models (LLMs) are the state-of-the-art models on most of today’s NLP tasks and benchmarks. Yet, the community is only slowly adopting these models for text embedding tasks,…
Language Modeling by Language Models
Can we leverage LLMs to model the process of discovering novel language model (LM) architectures? Inspired by real research, we propose a multi-agent LLM approach that simulates the conventional stage…
Language Modeling is Compression
Information theory and machine learning are inextricably linked and have even been referred to as “two sides of the same coin” (MacKay, 2003). One particularly elegant connection is the essential equi…
Language Models Need Sleep
Transformer-based large language models are increasingly used for long-horizon tasks; however, their attention mechanism scales poorly with context length. To handle this, we study a sleep-like consol…
Language Models are Pragmatic Speakers
How do language models “think”? This paper formulates a probabilistic cognitive model called bounded pragmatic speaker, which can characterize the operation of different variants of language models. I…
Large Concept Models: Language Modeling in a Sentence Representation Space
This is in sharp contrast to humans who operate at multiple levels of abstraction, well beyond single words, to analyze information and to generate creative content. In this paper, we present an attem…
Latent Collaboration in Multi-Agent Systems
Multi-agent systems (MAS) extend large language models (LLMs) from independent single-model reasoning to coordinative system-level intelligence. While existing LLM agents depend on text-based mediatio…
Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models
Structured generation, the process of producing content in standardized formats like JSON and XML, is widely utilized in real-world applications to extract key output information from large language m…
Leveraging Approximate Symbolic Models for Reinforcement Learning via Skill Diversity
Creating reinforcement learning (RL) agents that are capable of accepting and leveraging task specific knowledge from humans has been long identified as a possible strategy for developing scalable app…
Looking beyond the next token
The structure of causal language model training assumes that each token can be accurately predicted from the previous context. This contrasts with humans’ natural writing and reasoning process, where …
MasRouter: Learning to Route LLMs for Multi-Agent Systems
Multi-agent systems (MAS) powered by Large Language Models (LLMs) have been demonstrated to push the boundaries of LLM capabilities, yet they often incur significant costs and face challenges in dynam…
MatFormer: Nested Transformer for Elastic Inference
Foundation models are applied in a broad spectrum of settings with different inference constraints, from massive multi-accelerator clusters to resource-constrained standalone mobile devices. However, …
Multi-Token Attention
Soft attention is a critical mechanism powering LLMs to locate relevant parts within a given context. However, individual attention weights are determined by the similarity of only a single query and …
Nested Learning: The Illusion of Deep Learning Architectures
Over the last decades, developing more powerful neural architectures and simultaneously designing optimization algorithms to effectively train them have been the core of research efforts to enhance th…
Neural Assistant: Joint Action Prediction, Response Generation, and Latent Knowledge Reasoning
Task-oriented dialog presents a difficult challenge encompassing multiple problems including multi-turn language understanding and generation, knowledge retrieval and reasoning, and action prediction.…
Neuro-Symbolic AI in 2024: A Systematic Review
Abstract Background: The field of Artificial Intelligence has undergone cyclical periods of growth and decline, known as AI summers and winters. Currently, we are in the third AI summer, characterized…
Not All Parameters Are Created Equal: Smart Isolation Boosts Fine-Tuning Performance
Supervised fine-tuning (SFT) is a pivotal approach to adapting large language models (LLMs) for downstream tasks; however, performance often suffers from the “seesaw phenomenon”, where indiscriminate …
On the Binding Problem in Artificial Neural Networks
In this work, we argue that this underlying cause is the binding problem: The inability of existing neural networks to dynamically and flexibly bind information that is distributed throughout the netw…
On the Theoretical Limitations of Embedding-Based Retrieval
Vector embeddings have been tasked with an ever-increasing set of retrieval tasks over the years, with a nascent rise in using them for reasoning, instruction-following, coding, andmore. These new ben…
Probing Structured Semantics Understanding and Generation of Language Models via Question Answering
As John McCarthy (McCarthy, 1990, 1959) points out, in order to a better understanding of natural language, it is necessary for an intelligence system to understand the “deep structure” (Chomsky, 2011…
Problems with Cosine as a Measure of Embedding Similarity for High Frequency Words
we uncover systematic ways in which word similarities estimated by cosine over BERT embeddings are understated and trace this effect to training data frequency. We find that relative to human judgemen…
Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models
The capabilities and limitations of Large Language Models (LLMs) have been sketched out in great detail in recent years, providing an intriguing yet conflicting picture. On the one hand, LLMs demonstr…
QoS-Efficient Serving of Multiple Mixture-of-Expert LLMs Using Partial Runtime Reconfiguration
The deployment of mixture-of-experts (MoE) large language models (LLMs) presents significant challenges due to their high memory demands. These challenges become even more pronounced in multi-tenant e…
RL + Transformer = A General-Purpose Problem Solver
What if artificial intelligence could not only solve problems for which it was trained but also learn to teach itself to solve new problems (i.e., metalearn)? In this study, we demonstrate that a pre-…
Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination
To obtain trustworthy evaluation signals, we introduce a generator that creates fully synthetic arithmetic problems of arbitrary length and difficulty, yielding clean datasets we call RandomCalculatio…
Reinforced Attention Learning
Post-training with Reinforcement Learning (RL) has substantially improved reasoning in Large Language Models (LLMs) via test-time scaling. However, extending this paradigm to Multimodal LLMs (MLLMs) t…
SParC: Cross-Domain Semantic Parsing in Context
We present SParC, a dataset for cross-domain Semantic Parsing in Context. It consists of 4,298 coherent question sequences (12k+ individual questions annotated with SQL queries), obtained from control…
Scaling Laws for Neural Language Models
We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, wit…
Self-Improving Transformers Overcome Easy-to-Hard and Length Generalization Challenges
Large language models often struggle with length generalization and solving complex problem instances beyond their training distribution. We present a selfimprovement approach where models iteratively…
Self-reinforcing cascades: A spreading model for beliefs or products of varying intensity or quality
Models of how things spread often assume that transmission mechanisms are fixed over time. However, social contagions–the spread of ideas, beliefs, innovations–can lose or gain in momentum as they spr…
Sleep-time Compute: Beyond Inference Scaling at Test-time
Scaling test-time compute has emerged as a key ingredient for enabling large language models (LLMs) to solve difficult problems, but comes with high latency and inference cost. We introduce sleep-time…
System 1 vs. System 2 Thinking
Abstract: This paper explores the dual-processing hypothesis of the mind, Systems 1 and 2, by examining debates between cognitive and evolutionary psychologists. I structure the discussion in a back-a…
Textgrad: Automatic “Differentiation” via Text
![A poster of a scientific experiment with medium confidence](/assets/paper-images/TextGradAutomatic-Differentiation-.png) ![A screenshot of a test](/assets/paper-images/TextGradAutomatic-Differentia…
The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs
Sparse attention offers a promising strategy to extend long-context capabilities in Transformer LLMs, yet its efficiency–accuracy trade-offs remain unclear due to the lack of comprehensive evaluation.…
The Unreasonable Ineffectiveness of the Deeper Layers
We empirically study a simple layer-pruning strategy for popular families of openweight pretrained LLMs, finding minimal degradation of performance on different question-answering benchmarks until aft…
The Vanishing Gradient Problem for Stiff Neural Differential Equations
Neural differential equations have become a transformative tool in machine learning and scientific computing, enabling data-driven modeling of complex, time-dependent phenomena in fields ranging from …
There Will Be a Scientific Theory of Deep Learning
In this paper, we make the case that a scientific theory of deep learning is emerging. By this we mean a theory which characterizes important properties and statistics of the training process, hidden …
Thinking Augmented Pre-training
This paper introduces a simple and scalable approach to improve the data efficiency of large language model (LLM) training by augmenting existing text data with thinking trajectories. The compute for …
Titans: Learning to Memorize at Test Time
Over more than a decade there has been an extensive research effort of how effectively utilize recurrent models and attentions. While recurrent models aim to compress the data into a fixed-size memory…
Towards Optimal Learning of Language Models
This work studies the general principles of improving the learning of language models (LMs), which aims at reducing the necessary training steps for achieving superior performance. Specifically, we pr…
Training language models to follow instructions with human feedback
Making language models bigger does not inherently make them better at following a user’s intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpf…
TransformerFAM: Feedback attention is working memory
Abstract While Transformers have revolutionized deep learning, their quadratic attention complexity hinders their ability to process infinitely long inputs. We propose Feedback Attention Memory (FAM),…
Understanding LLMs: A Comprehensive Overview from Training to Inference
The introduction of ChatGPT has led to a significant increase in the utilization of Large Language Models (LLMs) for addressing downstream tasks. There’s an increasing focus on cost-efficient training…
Unifying Large Language Models and Knowledge Graphs: A Roadmap
Abstract—Large language models (LLMs), such as ChatGPT and GPT4, are making new waves in the field of natural language processing and artificial intelligence, due to their emergent ability and general…
What are the Goals of Distributional Semantics?
Distributional semantic models have become a mainstay in NLP, providing useful features for downstream tasks. However, assessing long-term progress requires explicit long-term goals. In this paper, I …