Speed Always Wins: A Survey on Efficient Architectures for Large Language Models

Paper · arXiv 2508.09834 · Published August 13, 2025
Reinforcement Learning

Large Language Models (LLMs) have delivered impressive results in language understanding, generation, and reasoning, and have pushed the capability boundary of multimodal models. Transformer models, as the foundation of modern LLMs, offer a strong baseline with excellent scaling properties. However, the traditional transformer architecture requires substantial computation and poses significant obstacles for large-scale training and practical deployment. In this survey, we offer a systematic examination of innovative LLM architectures that address the inherent limitations of transformers and improve efficiency. Starting from language modeling, this survey covers the background and technical details of linear and sparse sequence modeling methods, efficient full attention variants, sparse mixture-of-experts, hybrid model architectures incorporating the above techniques, and emerging diffusion LLMs. Additionally, we discuss applications of these techniques to other modalities and consider their wider implications for developing scalable, resource-aware foundation models. By grouping recent studies into the above categories, this survey presents a blueprint of modern efficient LLM architectures, and we hope this can help motivate future research toward more efficient, versatile AI systems.

Introduction. In recent years, Large Language Models (LLMs) have demonstrated extraordinary capabilities in understanding and generating natural language, driving substantial progress across a wide range of tasks, including text generation [1, 2, 3], code generation [4, 5, 6], question answering [7, 8], and machine translation [3, 9]. Prominent LLM families such as ChatGPT [2, 10, 11, 12, 13, 14, 15, 16, 17], Claude [18, 19, 20, 21, 22], Gemini [23, 24, 25], DeepSeek [26, 27, 28, 29], Qwen [30, 31, 32, 33], LLaMA [34, 35, 36, 37], GLM [38], Minimax-Text [39], InternLM [40, 41], and Hunyuan [42, 43] have continuously pushed the boundaries of performance while also reshaping how people interact with machines in daily life. Beyond their initial role in language tasks, LLMs are increasingly being applied in two demanding areas: multimodality and complex reasoning. In multimodal applications, LLMs now play a key role in systems that integrate and generate information across multiple data types.

Discussion / Conclusion. In this survey, we have reviewed the key architectural innovations and optimization strategies developed to overcome the efficiency bottlenecks of Transformer-based models. We highlighted how the quadratic cost of self-attention and the growth of FFN layers drive up both computation and memory demands, especially in long-sequence, multimodal, and multi-step reasoning scenarios. We categorized recent solutions into seven main areas: linear sequence modeling, sparse sequence modeling, efficient full attention, sparse mixture of experts, hybrid architectures, diffusion LLMs, and cross-modal applications. For each category, we examined the core ideas and underlying technical details, summarized representative works, and analyzed their strengths and limitations. By organizing these approaches systematically, we aim to provide a clear picture of the current landscape and the common challenges they address. Efficient Architectures Design.
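
To make the quadratic cost of self-attention mentioned above concrete, the following is a minimal sketch, not a method taken from the survey. It assumes a simplified, non-causal attention with the softmax removed, so the product can be reordered from (Q K^T) V to Q (K^T V). With softmax in place this reordering is not valid, which is one reason linear sequence modeling methods replace softmax with other formulations. All function names and the toy matrices are hypothetical, chosen only for illustration.

```python
def matmul(a, b):
    """Plain-Python matrix product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def quadratic_order_flops(n, d):
    """FLOPs for (Q K^T) V with sequence length n and head dimension d.

    Q K^T costs about 2*n*n*d FLOPs, and multiplying the n x n result
    by V costs about 2*n*n*d more.
    """
    return 4 * n * n * d


def linear_order_flops(n, d):
    """FLOPs for Q (K^T V) with sequence length n and head dimension d.

    K^T V costs about 2*n*d*d FLOPs, and multiplying Q by the d x d
    result costs about 2*n*d*d more.
    """
    return 4 * n * d * d


# Toy example (assumed values): n = 3 tokens, d = 2 channels, no softmax.
Q = [[1, 2], [0, 1], [3, 1]]
K = [[2, 0], [1, 1], [0, 2]]
V = [[1, 0], [2, 1], [0, 3]]

quadratic_out = matmul(matmul(Q, transpose(K)), V)  # builds a 3 x 3 intermediate
linear_out = matmul(Q, matmul(transpose(K), V))     # builds a 2 x 2 intermediate

# Matrix multiplication is associative, so both orderings agree exactly.
assert quadratic_out == linear_out

# The cost ratio between the two orderings is n / d, so it grows with sequence length.
for n in (1_024, 32_768):
    ratio = quadratic_order_flops(n, 128) / linear_order_flops(n, 128)
    print(f"n={n}: quadratic ordering costs {ratio:.0f}x the linear ordering")
```

Under these assumptions the first ordering scales with n squared while the second scales with n, so the gap widens as sequences get longer. The FLOP counts cover only the two matrix products and ignore softmax, masking, and memory traffic.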