Web-Browsing LLMs Can Access Social Media Profiles and Infer User Demographics

Paper · arXiv 2507.12372 · Published July 16, 2025

Large language models (LLMs) have traditionally relied on static training data, limiting their knowledge to fixed snapshots. Recent advancements, however, have equipped LLMs with web-browsing capabilities, enabling real-time information retrieval and multi-step reasoning over live web content. While prior studies have demonstrated LLMs’ ability to access and analyze websites, their capacity to directly retrieve and analyze social media data remains unexplored. Here, we evaluate whether web-browsing LLMs can infer demographic attributes of social media users given only their usernames. Using a synthetic dataset of 48 X (Twitter) accounts and a survey dataset of 1,384 international participants, we show that these models can access social media content and predict user demographics with reasonable accuracy. Analysis of the synthetic dataset further reveals how LLMs parse and interpret social media profiles, which may introduce gender and political biases against accounts with minimal activity. While this capability holds promise for computational social science in the post-API era, it also raises risks of misuse—particularly in information operations and targeted advertising—underscoring the need for safeguards.

Introduction. Social media platforms host vast amounts of user-generated text, which social science researchers analyze to understand public opinion, online behavior, and the spread of misinformation [3, 6]. Traditionally, computational methods for social media analysis relied on specialized classifiers or machine learning models trained on manually crafted features [2, 33, 36]. Over the past few years, however, Large Language Models (LLMs) have begun to transform how these tasks are approached. These models offer new capabilities for automating content analysis and have been proposed as tools for enhancing survey research and online experiments in the social sciences [6]. At the same time, their use raises important concerns related to research ethics and the protection of human subjects, particularly regarding transparency, consent, and data privacy [29, 32]. Empirical studies have shown that LLMs excel at a wide range of social media data analysis tasks.

Discussion / Conclusion. We demonstrated that recent web-browsing large language models (LLMs) are capable of directly accessing social media account data and analyzing its content. In our evaluation, GPT-4o, GPT-o3, and Llama-3-8B-Web consistently performed the task without refusal. In contrast, Mistral initially appeared to execute the task but, when prompted for justification, acknowledged that it could not access social media content. To systematically assess the capabilities of web-browsing LLMs in analyzing social media data, we conducted two experiments: one using a dataset of 48 synthetic X (formerly Twitter) accounts, and another based on a 2018 survey containing participants’ self-reported Twitter handles and demographics. Results from both experiments confirm that these models can retrieve and process content from social media profiles. However, their accuracy in inferring user demographics varies markedly across models, attributes, and datasets.
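
One way to read the comparison across models and attributes is as a grouped accuracy score. The sketch below is illustrative only and is not the authors' evaluation code: the record keys (`model`, `attribute`, `predicted`, `actual`), the placeholder model names, and the toy records are all assumptions, as is the choice to count a missing prediction as incorrect.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return {(model, attribute): accuracy} for a list of prediction records.

    Each record is a dict with hypothetical keys: model, attribute,
    predicted, actual. A record whose prediction is None (the model gave
    no answer) is counted as incorrect; this is an assumption of the sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        key = (record["model"], record["attribute"])
        total[key] += 1
        if record["predicted"] is not None and record["predicted"] == record["actual"]:
            correct[key] += 1
    # Every key in `total` has at least one record, so no division by zero.
    return {key: correct[key] / total[key] for key in total}


# Made-up records for illustration; these are not data from the paper.
records = [
    {"model": "model_a", "attribute": "gender", "predicted": "female", "actual": "female"},
    {"model": "model_a", "attribute": "gender", "predicted": "male", "actual": "female"},
    {"model": "model_a", "attribute": "age_group", "predicted": "18-29", "actual": "18-29"},
    {"model": "model_b", "attribute": "gender", "predicted": None, "actual": "male"},
]

scores = accuracy_by_group(records)
for (model, attribute), accuracy in sorted(scores.items()):
    print(f"{model:8s} {attribute:10s} {accuracy:.2f}")
```

Grouping by the `(model, attribute)` pair keeps the scores separate, so a model that does well on one attribute and poorly on another is not averaged into a single number. A dataset field could be added to the key in the same way.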
