Real-time News Story Identification

Paper · arXiv 2508.08272 · Published July 30, 2025
Social Media and AI · NLP and Linguistics · Natural Language Inference

To improve the reading experience, many news sites organize news into topical collections called stories. In this work, we present an approach to real-time story identification for a news monitoring system that automatically collects news articles as they appear online and processes them in various ways. Story identification aims to assign each news article to the specific story it covers. The task is similar to text clustering and topic modeling, but requires that articles be grouped by particular events, places, and people rather than by general text similarity (as in clustering) or general, predefined topics (as in topic modeling). The proposed approach functions in real time, assigning articles to stories as they are published online. It combines text representation techniques, clustering algorithms, and online topic modeling methods: several text representation methods are combined to extract the specific events and named entities needed for story identification, and we show that a mixture of online topic-modeling approaches such as BERTopic, DBStream, and TextClust can be adapted for story discovery.
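To make the idea of real-time assignment concrete, the following is a minimal sketch, not the method from the paper. It assumes each incoming article already has an embedding vector and a set of named entities (both produced elsewhere), and it scores an article against each existing story by mixing cosine similarity of embeddings with Jaccard overlap of entities. The class names `Story` and `OnlineStoryAssigner`, and the parameters `threshold` and `alpha`, are hypothetical and chosen only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def jaccard(a, b):
    """Jaccard overlap of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class Story:
    """Hypothetical story summary: running mean embedding plus all entities seen."""

    def __init__(self, vector, entities):
        self.centroid = list(vector)
        self.entities = set(entities)
        self.size = 1

    def add(self, vector, entities):
        self.size += 1
        # Incremental mean update, so no past articles need to be stored.
        self.centroid = [c + (v - c) / self.size
                         for c, v in zip(self.centroid, vector)]
        self.entities |= set(entities)


class OnlineStoryAssigner:
    """Assigns each article to the best-matching story or opens a new one."""

    def __init__(self, threshold=0.6, alpha=0.5):
        self.threshold = threshold  # assumed minimum score to join a story
        self.alpha = alpha          # assumed weight of embedding vs. entity overlap
        self.stories = []

    def score(self, story, vector, entities):
        return (self.alpha * cosine(story.centroid, vector)
                + (1 - self.alpha) * jaccard(story.entities, set(entities)))

    def assign(self, vector, entities):
        best_id, best_score = None, -1.0
        for i, story in enumerate(self.stories):
            s = self.score(story, vector, entities)
            if s > best_score:
                best_id, best_score = i, s
        if best_id is not None and best_score >= self.threshold:
            self.stories[best_id].add(vector, entities)
            return best_id
        # No story is similar enough: the article starts a new story.
        self.stories.append(Story(vector, entities))
        return len(self.stories) - 1


assigner = OnlineStoryAssigner(threshold=0.6, alpha=0.5)
first = assigner.assign([1.0, 0.0], {"Ljubljana", "flood"})
second = assigner.assign([0.9, 0.1], {"Ljubljana", "flood"})
third = assigner.assign([0.0, 1.0], {"Belgrade", "election"})
```

In this toy run the first two articles share entities and have nearly parallel embeddings, so they land in the same story, while the third opens a new one. The entity term is what separates this from plain similarity clustering: two articles with similar wording but different people and places score lower than the embedding alone would suggest.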

Introduction. The ever-increasing amount of global news presents an overwhelming challenge for individuals and organizations striving to keep abreast of relevant information. The sheer volume of unstructured text data makes it nearly impossible for users to consume and comprehend all the information pertinent to their interests. News monitoring and analysis systems can help by employing advanced algorithms to address the problems of volume, opinionated views, and misinformation. Sensible grouping of news, such as identifying commonly reported events or clustering related articles, allows users to quickly glean essential information from a set of articles, saving time and reducing cognitive load. The archive of the targeted news monitoring system comprises over 85 million articles from eight countries (Serbia, Slovenia, Bosnia and Herzegovina, Croatia, Macedonia, Montenegro, Kosovo, and Albania) in more than seven languages (see Figure 1), spanning over twenty years.

Discussion / Conclusion. In our work, we presented an online clustering system fine-tuned for story identification. We investigated several text embedding methods and suggested improvements to online clustering, showing how existing approaches can be adapted to better suit story identification. Our evaluations show improvements when using named entity recognition, explicit outlier detection, and optimized micro-cluster merging. While we evaluated our approach on a relatively large dataset, further evaluation in a practical setting is needed to determine how well the approach scales in real-world use, particularly in systems that process large amounts of data over long periods of time. As natural language processing technologies are advancing rapidly, periodic reassessment of new embedding approaches might bring further benefits.
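As a rough illustration of what micro-cluster merging and explicit outlier detection can look like in a stream clustering setting, consider the sketch below. It is a simplified, distance-based stand-in, not the optimization evaluated in the paper: micro-clusters are assumed to be (centroid, weight) pairs with positive weights, pairs whose centroids lie within a hypothetical `merge_radius` are merged into a weighted mean, and clusters lighter than a hypothetical `min_weight` are set aside as outliers.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def merge_micro_clusters(clusters, merge_radius):
    """Greedily merge (centroid, weight) pairs whose centroids are close.

    Repeats until no pair of centroids is within merge_radius.
    """
    merged = [(list(c), w) for c, w in clusters]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                (ci, wi), (cj, wj) = merged[i], merged[j]
                if distance(ci, cj) <= merge_radius:
                    total = wi + wj
                    # Weighted mean keeps the heavier cluster's position dominant.
                    centroid = [(x * wi + y * wj) / total
                                for x, y in zip(ci, cj)]
                    merged[i] = (centroid, total)
                    del merged[j]
                    changed = True
                    break
            if changed:
                break  # restart the scan, since indices have shifted
    return merged


def split_outliers(clusters, min_weight):
    """Separate clusters with enough support from low-weight outliers."""
    kept = [c for c in clusters if c[1] >= min_weight]
    outliers = [c for c in clusters if c[1] < min_weight]
    return kept, outliers


micro = [([0.0, 0.0], 3), ([0.2, 0.0], 1), ([5.0, 5.0], 1)]
merged = merge_micro_clusters(micro, merge_radius=0.5)
stories, outliers = split_outliers(merged, min_weight=2)
```

Here the two nearby micro-clusters collapse into one story candidate, and the isolated single-article cluster is flagged as an outlier rather than forced into a story. A production system would likely also decay weights over time so that stale stories fade, which this sketch omits.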