TOPIC

Novel LLM Architectures

23 synthesis notes · 85 source papers

View as

Are neural network optimizers actually memory systems?

Do gradient-based optimizers like Adam function as associative memory modules that compress context, just like network layers? This reframes the relationship between training and learning.

Can computational power accelerate scientific discovery itself?

Does the pace of research breakthroughs scale with computing resources, like model performance does? ASI-ARCH tested this by running thousands of autonomous experiments to discover neural architectures.

Can byte-level models match tokenized performance with better efficiency?

Tokenized models use fixed vocabularies and allocate equal compute per token, but what if we dynamically group bytes based on prediction difficulty instead? Could this approach achieve competitive performance while using fewer FLOPs?

Can AI systems improve themselves through trial and error?

Explores whether replacing formal proof requirements with empirical benchmark testing enables AI systems to successfully modify and improve their own code iteratively, and what mechanisms prevent compounding failures.

Can energy minimization unlock reasoning without domain-specific training?

Can a gradient descent-based architecture achieve system 2 thinking across any modality or problem type using only unsupervised learning, without verifiers or reasoning-specific rewards?

Can evolutionary search beat sampling and revision at inference time?

Can LLMs evolve populations of solutions through recombination and selection to outperform simpler inference strategies? This matters because it could reveal whether biological-inspired search improves planning without formal problem definitions.

Can extreme task decomposition enable reliable execution at million-step scale?

Can breaking tasks into maximally atomic subtasks with voting-based error correction solve the fundamental reliability problem in long-horizon tasks? This challenges whether better models or better decomposition is the path to high-reliability AI systems.

Can recurrent hierarchies achieve reasoning that transformers cannot?

Can a dual-timescale recurrent architecture escape the computational limitations of standard transformers and solve complex reasoning tasks without explicit chain-of-thought? This explores whether architectural design, not scale, enables true algorithmic reasoning.

Can indirect reasoning methods solve problems direct chain-of-thought cannot?

Most language model reasoning follows direct forward steps from premises. But some logical problems may require contrapositive or proof-by-contradiction approaches. Can these indirect methods overcome limitations of direct reasoning alone?

Can algorithms control LLM reasoning better than LLMs alone?

Explores whether embedding LLMs within algorithmic control flow—where programs manage state and context filtering—enables complex task decomposition beyond what LLMs achieve through self-managed reasoning chains.

Can a coordination layer turn LLM patterns into genuine reasoning?

LLMs excel at pattern retrieval but lack external constraint binding. Can a System 2 coordination layer—anchoring outputs to goals and evidence—transform statistical associations into goal-directed reasoning?

Does long chain of thought reasoning follow molecular bond patterns?

Can we understand extended reasoning as organized like molecular structures with distinct interaction types? This matters because it explains why mixing reasoning traces from different sources often fails despite similar statistics.

Can machine feedback sustain discovery at test time?

Can LLMs paired with automated evaluators discover genuinely novel solutions through iterative refinement, rather than just generating hypotheses? This matters because it tests whether autonomous research scales beyond benchmarks to real deployed innovations.

Source papers 85

The Arxiv papers behind this sub-topic. Links may take you off-site to arxiv.org.

A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems
To address this limitation, recent research has explored agent evolution techniques that aim to automatically enhance agent systems based on interaction data and environmental feedback. This emerging …
A Survey on Diffusion Language Models
Abstract—Diffusion Language Models (DLMs) are rapidly emerging as a powerful and promising alternative to the dominant autoregressive (AR) paradigm. By generating tokens in parallel through an iterati…
AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning
1 Introduction Reinforcement learning (RL) has emerged as a new scaling paradigm for enhancing the capabilities of large language models (LLMs) by enabling thinking abilities [52]. Given a prompt, RL…
Agent-Centric Projection of Prompting Techniques and Implications for Synthetic Training Data for Large Language Models
Abstract: Recent advances in prompting techniques and multi-agent systems for Large Language Models (LLMs) have produced increasingly complex approaches. However, we lack a framework for characterizin…
AgentFly: Fine-tuning LLM Agents without Fine-tuning LLMs
In this paper, we introduce a novel learning paradigm for adaptive Large Language Model (LLM) agents that eliminates the need for fine-tuning the underlying LLMs. Existing approaches are often either …
AlphaEvolve: A coding agent for scientific and algorithmic discovery
In this white paper, we present AlphaEvolve, an evolutionary coding agent that substantially enhances capabilities of state-of-the-art LLMs on highly challenging tasks such as tackling open scientific…
AlphaGo Moment for Model Architecture Discovery
While AI systems demonstrate exponentially improving capabilities, the pace of AI research itself remains linearly bounded by human cognitive capacity, creating an increasingly severe development bott…
Base Models Know How to Reason, Thinking Models Learn When
Why do thinking language models like DeepSeek R1 outperform their base counterparts? Despite consistent performance gains, it remains unclear to what extent thinking models learn entirely new reasonin…
Beyond Turing: Memory-Amortized Inference as a Foundation for Cognitive Computation
Abstract—Intelligence is fundamentally non-ergodic: it emerges not from uniform sampling or optimization from scratch, but from the structured reuse of prior inference trajectories. We introduce Memor…
Bottom-up Domain-specific Superintelligence: A Reliable Knowledge Graph is What We Need
Language models traditionally utilized for cross-domain generalization in natural language understanding and generation have recently demonstrated task-specific reasoning through inference-time scalin…
Byte Latent Transformer: Patches Scale Better Than Tokens
We introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performance at scale with significant improvements in inferen…
Cognitive Architectures for Language Agents
Recent efforts have incorporated large language models (LLMs) with external resources (e.g., the Internet) or internal control flows (e.g., prompt chaining) for tasks requiring grounding or reasoning.…
Conversational Graph Grounded Policy Learning for Open-Domain Conversation Generation
To address the challenge of policy learning in open-domain multi-turn conversation, we propose to represent prior information about dialog transitions as a graph and learn a graph grounded dialog poli…
Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents
Most of today’s AI systems are constrained by human-designed, fixed architectures and cannot autonomously and continuously improve themselves. The scientific method, on the other hand, provides a cumu…
Deep Researcher with Test-Time Diffusion
Deep research agents, powered by Large Language Models (LLMs), are rapidly advancing; yet, their performance often plateaus when generating complex, long-form research reports using generic test-time …
DeepNet: Scaling Transformers to 1,000 Layers
In this paper, we propose a simple yet effective method to stabilize extremely deep Transformers. Specifically, we introduce a new normalization function (DEEPNORM) to modify the residual connection i…
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models
We introduce DeepSeek-V3.2, a model that harmonizes high computational efficiency with superior reasoning and agent performance. The key technical breakthroughs of DeepSeek-V3.2 are as follows: (1) De…
Dialogue State Tracking with a Language Model using Schema-Driven Prompting
Task-oriented conversational systems often use dialogue state tracking to represent the user’s intentions, which involves filling in values of pre-defined slots. Many approaches have been proposed, of…
Diffusion-LM Improves Controllable Text Generation
Controlling the behavior of language models (LMs) without re-training is a major open problem in natural language generation. While recent works have demonstrated successes on controlling simple sente…
Efficient Streaming Language Models with Attention Sinks
Deploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue, where long interactions are expected, is urgently needed but poses two major challenges. Firstly, during …
Emergent Hierarchical Reasoning In LLMs Through Reinforcement Learning
Reinforcement Learning (RL) has proven highly effective at enhancing the complex reasoning abilities of Large Language Models (LLMs), yet underlying mechanisms driving this success remain largely opaq…
End-to-End Test-Time Training for Long Context
On the other hand, Transformers with self-attention still struggle to efficiently process long context equivalent to years of human experience, in part because they are designed for nearly lossless re…
Energy-Based Transformers are Scalable Learners and Thinkers
Inference-time computation techniques, analogous to human System 2 Thinking, have recently become popular for improving model performances. However, most existing approaches suffer from several limita…
Evolving Deeper LLM Thinking
We explore an evolutionary search strategy for scaling inference time compute in Large Language Models. The proposed approach, Mind Evolution, uses a language model to generate, recombine and refine c…
Extrapolation by Association: Length Generalization Transfer in Transformers
Transformer language models have demonstrated impressive generalization capabilities in natural language domains, yet we lack a fine-grained understanding of how such generalization arises. In this pa…
Fast, Slow, and Tool-augmented Thinking for LLMs: A Review
Large Language Models (LLMs) have demonstrated remarkable progress in reasoning across diverse domains. However, effective reasoning in real-world tasks requires adapting the reasoning strategy to the…
From Language to Logic: A Bi-Level Framework for Structured Reasoning
Structured reasoning over natural language inputs remains a core challenge in artificial intelligence, as it requires bridging the gap between unstructured linguistic expressions and formal logical re…
Generalization to New Sequential Decision Making Tasks with In-Context Learning
However, the sequential decision making setting poses additional challenges having a lower tolerance for errors since the environment’s stochasticity or the agent’s actions can lead to unseen, and som…
Generative Recursive Reasoning
How should future neural reasoning systems implement extended computation? Recursive Reasoning Models (RRMs) offer a promising alternative to autoregressive sequence extension by performing iterative …
Guidance is All You Need: Temperature-Guided Reasoning in Large Language Models
We present Quasar-1, a novel architecture that introduces temperature-guided reasoning to large language models through the Token Temperature Mechanism (TTM) and Guided Sequence of Thought (GSoT). Our…
HiTKG: Towards Goal-Oriented Conversations via Multi-Hierarchy Learning
![A screenshot of a diagram](/assets/paper-images/HiTKGTowardsGoalOrientedConversations.png) Human conversations are guided by short-term and longterm goals. We study how to plan short-term goal sequ…
HierSearch: A Hierarchical Enterprise Deep Search Framework Integrating Local and Web Searches
Recently, large reasoning models have demonstrated strong mathematical and coding abilities, and deep search leverages their reasoning capabilities in challenging information retrieval tasks. Existing…
Hierarchical Reasoning Model
Reasoning, the process of devising and executing complex goal-oriented action sequences, remains a critical challenge in AI. Current large language models (LLMs) primarily employ Chain-of-Thought (CoT…
Hyperagents
Self-improving AI systems aim to reduce reliance on human engineering by learning to improve their own learning and problem-solving processes. Existing approaches to recursive self-improvement typical…
Insert-expansions For Tool-enabled Conversational Agents
“This paper delves into an advanced implementation of Chain-of-Thought-Prompting in Large Language Models, focusing on the use of tools (or "plug-ins") within the explicit reasoning paths generated by…
It’s All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization
![[Pasted image 20251216072354.png]] Designing efficient and effective architectural backbones has been in the core of research efforts to enhance the capability of foundation models. Inspired by the…
Jamba: A Hybrid Transformer-Mamba Language Model
We present Jamba, a new base large language model based on a novel hybrid Transformer-Mamba mixture-of-experts (MoE) architecture. Specifically, Jamba interleaves blocks of Transformer and Mamba layer…
LLMatic: Neural Architecture Search via Large Language Models and Quality Diversity Optimization
Large language models (LLMs) have emerged as powerful tools capable of accomplishing a broad spectrum of tasks. Their abilities span numerous areas, and one area where they have made a significant imp…
Language Modeling by Language Models
Can we leverage LLMs to model the process of discovering novel language model (LM) architectures? Inspired by real research, we propose a multi-agent LLM approach that simulates the conventional stage…
Language Models Need Sleep
Transformer-based large language models are increasingly used for long-horizon tasks; however, their attention mechanism scales poorly with context length. To handle this, we study a sleep-like consol…
Large Causal Models From Large Language Models
We introduce a new paradigm for building large causal models (LCMs) that exploits the enormous potential latent in today’s large language models (LLMs). We describe our ongoing experiments with an imp…
Large Language Diffusion Models
Autoregressive models (ARMs) are widely regarded as the cornerstone of large language models (LLMs). We challenge this notion by introducing LLaDA, a diffusion model trained from scratch under the pre…
Large Language Model Programs
In recent years, large pre-trained language models (LLMs) have demonstrated the ability to follow instructions and perform novel tasks from a few examples. The possibility to parameterise an LLM throu…
Large Language Models as an Indirect Reasoner: Contrapositive and Contradiction for Automated Reasoning
Recently, increasing attention has been drawn to improve the ability of Large Language Models (LLMs) to perform complex reasoning. However, previous methods, such as Chain-of-Thought and Self- Consist…
Latent Collaboration in Multi-Agent Systems
Multi-agent systems (MAS) extend large language models (LLMs) from independent single-model reasoning to coordinative system-level intelligence. While existing LLM agents depend on text-based mediatio…
Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models
Abstract Linear attention is an efficient attention mechanism that has recently emerged as a promising alternative to conventional softmax attention. With its ability to process tokens in linear compu…
Looking beyond the next token
The structure of causal language model training assumes that each token can be accurately predicted from the previous context. This contrasts with humans’ natural writing and reasoning process, where …
Loop, Think, & Generalize: Implicit Reasoning in Recurrent-Depth Transformers
Large language models (LLMs) (Brown et al., 2020) are known to acquire substantial factual knowledge during pretraining, storing it in their parameters (Geva et al., 2023). However, how effectively th…
MatFormer: Nested Transformer for Elastic Inference
Foundation models are applied in a broad spectrum of settings with different inference constraints, from massive multi-accelerator clusters to resource-constrained standalone mobile devices. However, …
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Large Language Models (LLMs) employ autoregressive decoding that requires sequential computation, with each step reliant on the previous one’s output. This creates a bottleneck as each step necessitat…
Memory Sandbox: Transparent and Interactive Memory Management for Conversational Agents
“Large Language Models (LLMs) are currently capable of generating human-like responses in open-domain tasks [4]. This has led to a new generation of conversational agents, such as chatGPT, which are n…
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
We introduce MiniMax-M1, the world’s first open-weight, large-scale hybrid-attention reasoning model. MiniMax-M1 is powered by a hybrid Mixture-of-Experts (MoE) architecture combined with a lightning …
Mixture of Thoughts: Learning to Aggregate What Experts Think, Not Just What They Say
Open-source Large Language Models (LLMs) increasingly specialize by domain (e.g., math, code, general reasoning), motivating systems that leverage complementary strengths across models. Prior multi-LL…
Multi-Token Attention
Soft attention is a critical mechanism powering LLMs to locate relevant parts within a given context. However, individual attention weights are determined by the similarity of only a single query and …
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
Long-context modeling is crucial for next-generation language models, yet the high computational cost of standard attention mechanisms poses significant computational challenges. Sparse attention offe…
Nested Learning: The Illusion of Deep Learning Architecture Expanded
In the previous sections, we discussed the concept of nested learning and how existing well-known components of neural networks such as popular optimizers and architectures fall under the NL paradigm.…
Nested Learning: The Illusion of Deep Learning Architectures
Over the last decades, developing more powerful neural architectures and simultaneously designing optimization algorithms to effectively train them have been the core of research efforts to enhance th…
Neurosymbolic AI- Why, What, and How
![A diagram of a scientific method with medium confidence](/assets/paper-images/NeurosymbolicAIWhy-What-AndHow.png) Abstract—Humans interact with the environment using a combination of perception - …
Post-Completion Learning for Language Models
Current language model training paradigms typically terminate learning upon reaching the end-of-sequence (<eos) token, overlooking the potential learning opportunities in the post-completion space. We…
Pushdown Layers: Encoding Recursive Structure in Transformer Language Models
Recursion is a prominent feature of human language, and fundamentally challenging for self-attention due to the lack of an explicit recursive-state tracking mechanism. Consequently, Transformer langua…
Reasoning Language Models: A Blueprint
such as OpenAI’s o1 and o3, DeepSeek-V3, and Alibaba’s QwQ, have redefined AI’s problem-solving capabilities by extending large language models (LLMs) with advanced reasoning mechanisms. Yet, their hi…
Recursive Language Models
We study allowing large language models (LLMs) to process arbitrarily long prompts through the lens of inference-time scaling. We propose Recursive Language Models (RLMs), a general inference strategy…
Reinforced Attention Learning
Post-training with Reinforcement Learning (RL) has substantially improved reasoning in Large Language Models (LLMs) via test-time scaling. However, extending this paradigm to Multimodal LLMs (MLLMs) t…
Reinforcement Pre-Training
In this work, we introduce Reinforcement Pre-Training (RPT) as a new scaling paradigm for large language models and reinforcement learning (RL). Specifically, we reframe next-token prediction as a rea…
Repeat After Me: Transformers are Better than State Space Models at Copying
Transformers are the dominant architecture for sequence modeling, but there is growing interest in models that use a fixed-size latent state that does not depend on the sequence length, which we refer…
Representation biases: will we achieve complete understanding by analyzing representations?
A common approach in neuroscience is to study neural representations as a means to understand a system—increasingly, by relating the neural representations to the internal representations learned by c…
Rethinking Thinking Tokens: LLMs as Improvement Operators
Reasoning training incentivizes LLMs to produce long chains of thought (long CoT), which among other things, allows them to explore solution strategies with self-checking. This results in higher accur…
Reversal of Thought: Enhancing Large Language Models with Preference-Guided Reverse Reasoning Warm-up
we propose Reversal of Thought (RoT), a novel framework aimed at enhancing the logical reasoning abilities of LLMs. RoT utilizes a Preference-Guided Reverse Reasoning warm-up strategy, which integrate…
SPICE: Self-Play In Corpus Environments Improves Reasoning
Self-improving systems require environmental interaction for continuous adaptation. We introduce SPICE (Self-Play In Corpus Environments), a reinforcement learning framework where a single model acts …
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
We study a novel language model architecture that is capable of scaling test-time computation by implicitly reasoning in latent space. Our model works by iterating a recurrent block, thereby unrolling…
Self-Discover: Large Language Models Self-Compose Reasoning Structures
*Table 2. All 39 reasoning modules consisting of high-level cognitive heuristics for problem-solving. We adopt them from Fernando et al.* (_2023_). Reasoning Modules 1 How could I devise an experim…
Solving a Million-Step LLM Task with Zero Errors
LLMs have achieved remarkable breakthroughs in reasoning, insights, and tool use, but chaining these abilities into extended processes at the scale of those routinely executed by humans, organizations…
SpikingBrain: Spiking Brain-inspired Large Models
Mainstream Transformer-based large language models (LLMs) face significant efficiency bottlenecks: training computation scales quadratically with sequence length, and inference memory grows linearly. …
The Consensus Game: Language Model Generation via Equilibrium Search
When applied to question answering and other text generation tasks, language models (LMs) may be queried generatively (by sampling answers from their output distribution) or discriminatively (by using…
The Future of AI: Exploring the Potential of Large Concept Models
these models are inherently limited by their token-level processing, which restricts their ability to perform abstract reasoning, conceptual understanding, and efficient generation of long-form conten…
The Missing Layer of AGI: From Pattern Alchemy to Coordination Physics
Influential critiques argue that Large Language Models (LLMs) are a dead end for AGI: “mere pattern matchers” structurally incapable of reasoning or planning. We argue this conclusion misidentifies th…
The Molecular Structure of Thought: Mapping the Topology of Long Chain-of-Thought Reasoning
Large language models (LLMs) often fail to learn effective long chain-of-thought (Long CoT) reasoning from human or non-Long-CoT LLMs imitation. To understand this, we propose that effective and learn…
The Serial Scaling Hypothesis
While machine learning has advanced through massive parallelization, we identify a critical blind spot: some problems are fundamentally sequential. These "inherently serial" problems—from mathematical…
The Vanishing Gradient Problem for Stiff Neural Differential Equations
Neural differential equations have become a transformative tool in machine learning and scientific computing, enabling data-driven modeling of complex, time-dependent phenomena in fields ranging from …
Thinking Augmented Pre-training
This paper introduces a simple and scalable approach to improve the data efficiency of large language model (LLM) training by augmenting existing text data with thinking trajectories. The compute for …
Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction
Abstract: The current paradigm of test-time scaling relies on generating long reasoning traces (“thinking” more) before producing a response. In agent problems that require interaction, this can be do…
TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters
Transformers have become the predominant architecture in foundation models due to their excellent performance across various domains. However, the substantial cost of scaling these models remains a si…
Training Nonlinear Transformers for Chain-of-Thought Inference: A Theoretical Generalization Analysis
Chain-of-Thought (CoT) is an efficient prompting method that enables the reasoning ability of large language models by augmenting the query using multiple examples with multiple intermediate steps. De…
Transformer2: Self-adaptive LLMs
Self-adaptive large language models (LLMs) aim to solve the challenges posed by traditional fine-tuning methods, which are often computationally intensive and static in their ability to handle diverse…
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
While Transformers have been the main architecture behind deep learning’s success in language modeling, state-space models (SSMs) such as Mamba have recently been shown to match or outperform Transfor…

Novel LLM Architectures

Are neural network optimizers actually memory systems?

Can computational power accelerate scientific discovery itself?

Can byte-level models match tokenized performance with better efficiency?

Can AI systems improve themselves through trial and error?

Can energy minimization unlock reasoning without domain-specific training?

Can evolutionary search beat sampling and revision at inference time?

Can extreme task decomposition enable reliable execution at million-step scale?

Can recurrent hierarchies achieve reasoning that transformers cannot?

Can indirect reasoning methods solve problems direct chain-of-thought cannot?

Can algorithms control LLM reasoning better than LLMs alone?

Can a coordination layer turn LLM patterns into genuine reasoning?

Does long chain of thought reasoning follow molecular bond patterns?

Can machine feedback sustain discovery at test time?

Can cognition work by reusing memory instead of recomputing?

Can AI systems discover better neural architectures than humans?

Can models learn to evaluate their own work during training?

Can recurrence consolidate memory without predicting tokens?

Can looped transformers generalize to unseen knowledge combinations?

Can models dynamically activate expert skills at inference time?

Can spiking neurons make transformers efficient on any hardware?

Is long-context bottleneck really about memory or compute?

Can parallel architectures solve inherently sequential problems?

Can state-space models match transformers at copying and retrieval?

Source papers 85