Tube2Vec: Social and Semantic Embeddings of YouTube Channels
Research using YouTube data often explores social and semantic dimensions of channels and videos. Typically, analyses rely on laborious manual annotation of content and content creators, often found by low-recall methods such as keyword search. Here, we explore an alternative approach, using latent representations (embeddings) obtained via machine learning. Using a large dataset of YouTube links shared on Reddit, we create embeddings that capture social sharing behavior, video metadata (title, description, etc.), and YouTube’s video recommendations. We evaluate these embeddings using crowdsourcing and existing datasets, finding that recommendation embeddings excel at capturing both social and semantic dimensions, although social-sharing embeddings better correlate with existing partisan scores. We share embeddings capturing the social and semantic dimensions of 44,000 YouTube channels for the benefit of future research on YouTube: https://github.com/epfldlab/youtube-embeddings.
Introduction. Consider the following three YouTube channels: Semantically, A is more similar to B than to C, as both A and B talk about guns. However, considering the Left–Right spectrum (in the US context), B is more similar to C than to A, as both positions they support – veganism and gun control – are more prevalent within the political left, whereas limited restrictions on gun ownership, the position supported by A, are associated with the political right. The semantics and social dimensions of content (social constructs projected onto a linear scale) are central concerns of research using (or about) online platforms. Researchers are sometimes interested in studying a specific topic and use heuristics to find semantically similar videos, tweets, or posts. For instance, on YouTube, a vast body of research has assessed the quality of medical information available on the platform (Madathil et al. 2015), where relevant videos and channels are usually found using the platform’s own search engine.
Discussion / Conclusion. We propose and systematically evaluate a variety of latent representations of YouTube channels. Using these embeddings, future work could construct datasets of channels through weak supervision or evaluate the prevalence of YouTube content through one or several of the social dimensions considered here. Further, our analyses suggest that embeddings created through social sharing and recommendation data meaningfully encode social dimensions like partisanship, age, and gender. We expect that the embeddings shared with this paper, and the methodology to create and evaluate them, will help computational social scientists study video- and image-centric social media like YouTube. We advise authors using our embeddings to use the recommendation embeddings provided, except if they are specifically interested in studying partisanship, in which case the social sharing embeddings may be preferred. Different embeddings each have their limitations. Social sharing embeddings rely on Reddit data, and thus we can only embed channels that are shared on another social platform (whose users reside primarily within English-speaking countries). Recommendation embeddings, while the most performant in our analyses, rely on the social sharing dimensions.
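To make the notion of a social dimension (a social construct projected onto a linear scale) more concrete, the following is a minimal sketch of one common way to score channels along such a dimension from embeddings. It is an illustration under stated assumptions, not necessarily the procedure used in this paper: the seed-pair construction, the function names (`unit`, `social_dimension`, `score`), and the toy two-dimensional vectors are all hypothetical, and the released embeddings would first need to be loaded from the repository in whatever format they are distributed.

```python
import numpy as np

def unit(v):
    """Scale a vector (or rows of a matrix) to unit length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def social_dimension(embeddings, seed_pairs):
    """Build an axis from (low_end, high_end) seed pairs of channel names.

    Assumption: the axis is the normalized mean of the difference
    vectors between the two channels of each hand-picked seed pair.
    """
    diffs = [unit(embeddings[high]) - unit(embeddings[low])
             for low, high in seed_pairs]
    return unit(np.mean(diffs, axis=0))

def score(embeddings, axis):
    """Project every channel onto the axis via cosine similarity."""
    return {name: float(unit(vec) @ axis) for name, vec in embeddings.items()}

# Toy, made-up embeddings; real channel vectors have many more dimensions.
toy = {
    "ch_1": np.array([1.0, 0.0]),
    "ch_2": np.array([-1.0, 0.0]),
    "ch_3": np.array([-0.6, 0.8]),
    "ch_4": np.array([0.0, 1.0]),
}

# Hypothetical seed pair: ch_2 marks the low end, ch_1 the high end.
axis = social_dimension(toy, [("ch_2", "ch_1")])
scores = score(toy, axis)

# Channels ordered from the low end to the high end of the dimension.
ranking = sorted(scores, key=scores.get)
print(ranking)
```

In a weak-supervision setting, scores of this kind could serve as noisy labels: channels at either extreme of the ranking become candidates for manual review rather than being treated as ground truth.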