TiMoE: Time-Aware Mixture of Language Experts
Large language models (LLMs) are typically trained on fixed snapshots of the web, which means that their knowledge becomes stale and their predictions risk temporal leakage: relying on information that lies in the future relative to a query. We tackle this problem by pre-training a set of GPT-style experts from scratch on disjoint two-year slices of a 2013–2024 corpus and combining them through TiMoE, a Time-aware Mixture of Language Experts. At inference time, TiMoE masks all experts whose training window ends after the query timestamp and merges the remaining log-probabilities in a shared space, guaranteeing strict causal validity while retaining the breadth of multi-period knowledge. We also release TSQA, a 10k-question benchmark whose alternatives are explicitly labelled as past, future or irrelevant, allowing fine-grained measurement of temporal hallucinations. Experiments on eight standard NLP tasks plus TSQA show that a co-adapted TiMoE variant matches or exceeds the best single-period expert and cuts future-knowledge errors by up to 15%. Our results demonstrate that modular, time-segmented pre-training paired with causal routing is a simple yet effective path toward LLMs that stay chronologically grounded without sacrificing much general performance.
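The inference-time rule described above (mask experts whose training window ends after the query timestamp, then merge what remains) can be illustrated with a minimal sketch. The `Expert` class, the function names, and in particular the merging rule are assumptions made for illustration: the text only says that log-probabilities are merged in a shared space, so the sketch uses a uniform average in probability space as a stand-in, not the method actually used by TiMoE.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Expert:
    """Hypothetical record: an expert and the last year of its training slice."""
    name: str
    end_year: int


def valid_experts(experts: Sequence[Expert], query_year: int) -> List[Expert]:
    """Keep only experts whose training window ends at or before the query year."""
    return [e for e in experts if e.end_year <= query_year]


def log_mean_exp(values: Sequence[float]) -> float:
    """Numerically stable log of the arithmetic mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def merge_log_probs(
    experts: Sequence[Expert],
    log_probs: Dict[str, List[float]],
    query_year: int,
) -> List[float]:
    """Merge per-token log-probabilities of the causally valid experts.

    Assumption: a uniform mixture in probability space over a shared
    vocabulary; the source does not specify the merging rule.
    """
    active = valid_experts(experts, query_year)
    if not active:
        raise ValueError("no expert is causally valid for this query year")
    vocab_size = len(log_probs[active[0].name])
    return [
        log_mean_exp([log_probs[e.name][t] for e in active])
        for t in range(vocab_size)
    ]
```

With experts ending in 2014, 2016 and 2018 and a query dated 2016, only the first two contribute to the merged distribution; the 2018 expert is masked regardless of how confident it is.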
Introduction. Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks and domains. However, they still lack a robust understanding of time (Ge et al., 2024; Mousavi et al., 2024; Zhu et al., 2025). For instance, on the FRESHQA benchmark (Vu et al., 2024), GPT-4 achieves only 14% accuracy on the “fast-changing” subset. This temporal unawareness undermines the reliability of LLMs—particularly in critical fields like medicine, where clinical guidelines, drug approvals, and frontline therapies evolve rapidly. In such contexts, relying on outdated information can mislead patients, resulting in inappropriate treatments or the misuse of antibiotics. While post-hoc verification with external tools or retrieval can mitigate some issues, it is preferable to embed time-awareness directly in the model’s parameters. Standard pretraining, however, typically merges data from all time periods into a single model indiscriminately.
Discussion / Conclusion. TiMoE demonstrates that partitioning pre-training data into strict time slices and blending the resulting GPT-2 experts through a causal, timestamp-aware router yields language models that stay chronologically grounded without a heavy accuracy penalty. By masking out any expert trained on data newer than the query year, TiMoE eliminates future-knowledge leakage while letting earlier specialists cooperate, cutting temporally inconsistent answers on the new 10k-question TSQA benchmark by roughly 15% and delivering steadier accuracy across years. On eight standard NLP benchmarks, our models slightly underperform a conventionally trained GPT-2 model, highlighting a manageable “cost of time-awareness” rather than a fundamental barrier. Overall, TiMoE offers a simple, modular recipe for building LLMs that respect temporal causality and can serve as a foundation for scalable, ever-evolving models and for diachronic analyses of how knowledge shifts over time.
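To make the TSQA-style measurement concrete, the sketch below tallies how often a model picks an alternative of each type. The function name, the input format, and the extra "correct" label are assumptions for illustration; the text only states that TSQA alternatives are labelled as past, future or irrelevant, and does not describe the benchmark's actual scoring code.

```python
from collections import Counter
from typing import Dict, Iterable

# "past", "future" and "irrelevant" come from the text; "correct" is an
# assumed label for the right answer.
LABELS = ("correct", "past", "future", "irrelevant")


def tsqa_rates(picked_labels: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of questions whose chosen alternative has each label.

    picked_labels holds, for every question, the label of the alternative
    the model selected.
    """
    counts = Counter(picked_labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no predictions to score")
    return {label: counts[label] / total for label in LABELS}


# Usage: the "future" rate is the share of future-knowledge errors.
rates = tsqa_rates(["correct", "future", "past", "correct"])
print(rates["future"])
```

Reporting the "future" rate separately from the "past" and "irrelevant" rates is what allows future-knowledge leakage to be distinguished from merely outdated or off-topic answers.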