- Motivation
- Problem Statement
- Datasets
- Modeling Methodology
- Results
- Conclusions
- Directory Tree
- References
Twitter is one of the most popular social media platforms, with over 396.5 million users. It is a microblogging and social networking service where users post tweets, short text messages of up to 280 characters, to interact with their followers. Users can post about any topic they wish, provided they do not violate the Twitter rules. Given its open-platform nature, Twitter attracts a wide variety of users and topics, and users treat it as an outlet to discuss and share their viewpoints. This makes Twitter a valuable text mining data pool that can be leveraged to discover the underlying themes of conversations happening at any given time. For example, an organization may want to understand the discussions around its products. However, manually reading all the tweets to identify the themes and topics of discussion is rarely practical. A text mining approach can instead provide a big-picture view of the general discussion topics, and topic modeling can help distill that information into a more usable and actionable format.
I analyze the Twitter activity of users from India over a two-year period (Sep 2019 to Sep 2021). Topic models are used to discover and identify trends in the underlying themes being discussed on the platform. COVID played a big part in these two years and greatly influenced how users interacted on social media. Topic models can provide insights into specific trends that correlate with the rise and fall in the number of COVID cases in India. Since topic modeling is an unsupervised machine learning technique, the analysis and results are driven by how well the topic models capture the relevant topics.
As a data science engineer at Twitter-India, I am tasked with improving Twitter's search algorithm to show more context-based results rather than purely query-based results. I use topic models to discover latent themes in posts originating specifically from India and use them to improve the search engine's relevance. Helping users find relevant information will lead to more engagement on the platform and drive user monetization.
I use the snscrape API to collect the data, which in this case consists of tweets posted by users in India. Snscrape is a scraper for social networking services (SNS). It supports several social media services such as Twitter, Facebook, and Instagram. The full list of supported services and their associated functionalities can be found on its GitHub page.
Tweet Scraper
Snscrape has a Python wrapper for Twitter with support for users, user profiles, hashtags, searches, threads, and list posts. It has two distinct advantages over Tweepy:
- You do not need an API key to scrape tweets.
- Snscrape's search function works the same way as Twitter's search.
I used the `near:` geo operator to filter tweets by location around the top 25 cities by population in India. A typical search query to scrape tweets around a given city looks like this: `f'near:{city} within:200km since:{since} until:{until} lang:en -filter:retweets -filter:replies filter:verified'`. The `city`, `since`, and `until` parameters represent the city name, the start date, and the end date of the search query, respectively. I applied additional filters to limit the number of search results.
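A minimal sketch of how such a query can be run with snscrape's Python module (the city, date window, and 1,000-tweet cap below are illustrative, and tweet attribute names such as `content` can differ between snscrape versions):

```python
import snscrape.modules.twitter as sntwitter

# Illustrative values; the project iterates over the top 25 cities and a two-year date range
city, since, until = "Mumbai", "2020-04-01", "2020-05-01"
query = (f"near:{city} within:200km since:{since} until:{until} "
         "lang:en -filter:retweets -filter:replies filter:verified")

tweets = []
for i, tweet in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
    if i >= 1000:  # cap the number of tweets per query for this sketch
        break
    tweets.append({"date": tweet.date, "content": tweet.content, "city": city})
```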
I ended up collecting more than a million tweets for the two-year time period. However, after accounting for duplicates, there were approximately 650k tweets in the dataset. These duplicates arose from the overlapping search radii around the cities used in the search queries.
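A hedged sketch of the deduplication step, assuming the per-city results are saved as CSVs and that each row carries a tweet `id` column (the paths and column name are assumptions):

```python
import glob
import pandas as pd

# Combine the per-city CSVs, then drop tweets captured by more than one city's search radius
frames = [pd.read_csv(path) for path in glob.glob("tweets_data/*.csv")]
all_tweets = pd.concat(frames, ignore_index=True)
deduped = all_tweets.drop_duplicates(subset="id")  # assumed tweet id column
print(f"{len(all_tweets)} scraped -> {len(deduped)} unique tweets")
```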
LDA Models
Gensim is designed to process raw, unstructured digital texts ("plain text") using unsupervised machine learning algorithms. In this project, I use the LDA implementation from the gensim library. The primary steps in building an LDA model with gensim are:
- Build a dictionary, which assigns a unique id to each token. To do so, convert the texts to lists of tokens and pass them to gensim's `Dictionary` object.
- Build the document-term matrix, which is the gensim corpus object. It contains each word id and its frequency in each document.
- Build the model by passing in the corpus, the dictionary, and the number of topics. Other hyperparameters can be used to fine-tune the model results.
- `LdaMulticore` is the parallelized implementation of the LDA model and is used to speed up training.
- Investigate the topics and the words associated with each topic.
- Calculate model metrics.
- Visualize the topics using pyLDAvis, a Python library for interactive topic model visualization (see the sketch after this list).
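A minimal sketch of these steps, assuming `texts` is a list of tokenized tweets; the topic count, passes, and filtering thresholds are illustrative, and the pyLDAvis submodule is named `pyLDAvis.gensim` in older releases:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaMulticore
import pyLDAvis.gensim_models

# texts: list of tokenized documents, e.g. [["covid", "lockdown", ...], ...]
dictionary = Dictionary(texts)
dictionary.filter_extremes(no_below=5, no_above=0.5)  # optional pruning of rare/common tokens
corpus = [dictionary.doc2bow(doc) for doc in texts]   # document-term matrix (bag of words)

lda = LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=25,
                   passes=10, random_state=42)
print(lda.print_topics(num_topics=5, num_words=10))   # inspect topics and their top words

# interactive topic visualization
vis = pyLDAvis.gensim_models.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(vis, "lda_topics.html")
```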
A three-step modeling approach is adopted for the LDA models:
- A base LDA model for the entire corpus
- Hyperparameter tuning to improve over the base model
- A separate model for each month, fitted on that month's corpus with its own hyperparameter tuning
The first two steps provide models that reveal topics of interest over the entire two-year period. However, since topics evolve over time, a monthly model should be better able to capture the relevant topics for a given month instead of generalizing over a longer timeframe.
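A hedged sketch of the monthly modeling step, assuming the cleaned tweets live in a pandas DataFrame with a datetime `date` column and a `tokens` column (both column names are assumptions); the topic count is fixed here for brevity, whereas in the project each monthly model has its own tuned hyperparameters:

```python
import pandas as pd
from gensim.corpora import Dictionary
from gensim.models import LdaMulticore

monthly_models = {}
for month, group in tweets_df.groupby(pd.Grouper(key="date", freq="M")):
    texts = group["tokens"].tolist()  # tokenized tweets for this month
    if not texts:
        continue
    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(doc) for doc in texts]
    monthly_models[month] = LdaMulticore(corpus=corpus, id2word=dictionary,
                                         num_topics=15, passes=10)
```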
Short Text Topic Models-GSDMM
LDA models do not work very well with short texts such as tweets: LDA identifies multiple topics in a given document, whereas short texts such as tweets are usually focused on a single topic. GSDMM works on the premise of one topic per document. To evaluate its performance, a single GSDMM model was fitted to the corpus. This model took about 6 hours to run 50 iterations on Google Colab. Hence, even though it revealed much more coherent topics than the LDA model, it could not be used as extensively due to its slow run time.
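A minimal sketch of fitting a GSDMM model with the gsdmm package's `MovieGroupProcess` (the cluster count, `alpha`, `beta`, and iteration count below are illustrative):

```python
from gsdmm import MovieGroupProcess  # https://github.com/rwalk/gsdmm

# texts: list of tokenized tweets, as in the LDA sketches
vocab_size = len({token for doc in texts for token in doc})
mgp = MovieGroupProcess(K=50, alpha=0.1, beta=0.1, n_iters=30)
labels = mgp.fit(texts, vocab_size)  # returns one cluster label per document

print(mgp.cluster_doc_count)         # number of documents assigned to each cluster
```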
Evaluation of Models
Coherence - It measures how semantically meaningful the topics are; gensim's implementation follows the four-stage topic coherence pipeline. There are four different coherence measures in gensim: `u_mass`, `c_v`, `c_uci`, and `c_npmi`. I have used the `c_v` measure, which is based on a sliding window, a one-set segmentation of the top words, and an indirect confirmation measure that uses normalized pointwise mutual information (NPMI) and cosine similarity. A higher `c_v` score is better, and its value lies between 0 and 1.
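A minimal example of computing the `c_v` coherence for a fitted gensim LDA model, reusing the `lda`, `texts`, and `dictionary` names from the earlier sketches:

```python
from gensim.models.coherencemodel import CoherenceModel

cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary, coherence="c_v")
print("c_v coherence:", cm.get_coherence())
```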
LDA Models
- The coherence score for the base LDA model with 25 topics is 0.45
- After tuning the hyperparameters, the best LDA model as ranked by coherence score has 15 topics and a decay of 0.5
- The distribution of word counts across all documents shows that the documents are too short, due to the removal of the most frequently occurring tokens
- The top two topics account for almost 50% of the documents
- The topics that can be identified by creating one model for every month are as follows:
Mon-Year | Topic Label |
---|---|
Sep - 2019 | very small dataset so not a coherent topic |
Oct - 2019 | Diwali wishes |
Nov - 2019 | Thanking for Birthday wishes |
Dec - 2019 | Student protests |
Jan - 2020 | Film promotion |
Feb - 2020 | Film promotion |
Mar - 2020 | Stay home covid messages |
Apr - 2020 | Lockdown |
May - 2020 | Lockdown and Migrant worker crisis |
Jun - 2020 | China-India border skirmishes |
Jul - 2020 | Family+Student, possibly related to school shutdowns due to covid |
Aug - 2020 | Family+Student, possibly related to school shutdowns due to covid |
Sep - 2020 | Thank you tweets |
Oct - 2020 | Thank you tweets |
Nov - 2020 | Diwali wishes |
Dec - 2020 | Farmers protest |
Jan - 2021 | Farmers protest |
Feb - 2021 | Birthday wishes to guru |
Mar - 2021 | Woman+Thank you messages |
Apr - 2021 | Covid 2nd wave, Oxygen and Hospital help |
May - 2021 | Covid 2nd wave, Oxygen and Hospital help |
Jun - 2021 | End of 2nd Covid wave, Happy messages |
Jul - 2021 | Birthday wishes |
Aug - 2021 | India England Cricket Series |
Short Text Topic Modeling
The input number of clusters was 50, and after 30 iterations the model had assigned documents to all 50 clusters. However, roughly 15 of those clusters received very few documents; with more iterations the model would probably have settled on fewer than 50 clusters. Cluster 28 is the most dominant cluster, with almost 35,000 documents from the corpus.
Cluster # | Topic Label |
---|---|
Cluster 28 | Wishing the best |
Cluster 1 | Social Media and Fake News |
Cluster 24 | Birthday wishes and congratulations |
Cluster 6 | Invitations to join live stream |
Cluster 26 | Religion based Politics discussion |
Cluster 0 | Cricket and sports |
Cluster 49 | Covid messages specifically wishing well |
Cluster 32 | Woman, education and students |
Cluster 15 | Covid affecting students due to lockdown |
Cluster 39 | Bengal Elections and two national parties |
Cluster 30 | Stocks and companies |
Cluster 37 | Film promotions |
Cluster 44 | Audience review |
Cluster 11 | Covid lockdown and case fatalities |
Cluster 17 | Ruling party |
Cluster 2 | Visiting places |
Cluster 4 | Law enforcement |
Cluster 8 | Public protests |
Cluster 9 | Covid deaths and condolences |
Cluster 27 | Anniversary and possibly about Indian independence day |
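Cluster labels like those above are typically assigned by inspecting each cluster's most frequent words. A hedged sketch of that inspection, using the fitted `MovieGroupProcess` model from the earlier sketch (attribute names follow the gsdmm package; the top-k values are illustrative):

```python
import numpy as np

doc_counts = np.array(mgp.cluster_doc_count)
for c in doc_counts.argsort()[::-1][:5]:             # five most populated clusters
    word_counts = mgp.cluster_word_distribution[c]   # dict: token -> count within this cluster
    top_words = sorted(word_counts, key=word_counts.get, reverse=True)[:10]
    print(f"Cluster {c} ({doc_counts[c]} docs):", top_words)
```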
- The snscrape API uses public Twitter search results, which severely limits the number and quality of tweets it can scrape. Most of the scraped tweets had overlapping themes.
- The GSDMM model assigned more relevant topics, as judged from a human interpretation point of view.
- The GSDMM model's performance could be improved further by tuning the `alpha` and `beta` hyperparameters.
- The LDA models identified distinct topics with overlapping themes.
- The `LdaMulticore` implementation makes running the models and fine-tuning the hyperparameters a much faster process, which is a definite advantage over GSDMM.
CSV files for `output` and `tweets_data` can be downloaded using the links below. The folders in this repo are just placeholders to save the downloaded files.
Directory | Contents |
---|---|
code | jupyter-notebooks |
models | trained models |
output | csv files for cleaned tweets dataset and model evaluation https://1drv.ms/u/s!Ar52d2HxEkbnioYmVXXLJcWwBEXvUA?e=jPj5lX |
presentations | presentation |
tweets_data | raw tweets data for each city and India https://1drv.ms/u/s!Ar52d2HxEkbnkK0-WKfS6ZgDqi7gyA?e=OhInfQ |
Advanced Search Twitter Cheatsheet
Topic Modeling Visualization
GSDMM Modeling
LDA Model Evaluation
Parallelizing Spacy Pipeline