SentPred is a project that aims to understand the shifts in public sentiment across the United States during the COVID-19 pandemic. Utilizing a large dataset of tweets, we applied Natural Language Processing (NLP) techniques, specifically the BERT model, to analyze the sentiment embedded in each tweet. Our approach offers a detailed, time-sensitive and location-specific snapshot of sentiment changes across the country throughout the pandemic.
Our primary dataset consists of tweets collected from January 2020 to February 2021, covering roughly the first year of the COVID-19 pandemic. Each tweet includes its text content, a timestamp, and an associated US state. This dataset of over 1 million tweets allows us to study how public sentiment changed during this period across both time and location.
We utilized BERT (Bidirectional Encoder Representations from Transformers), a state-of-the-art NLP model developed by Google, to convert each tweet into a vector representation. BERT is particularly effective at understanding the context of a word in a sentence, which is crucial for accurate sentiment analysis. The vectors generated by BERT capture the semantic meaning of the tweet, enabling a more nuanced and detailed analysis than traditional text analysis techniques.
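As a rough illustration of the embedding step, the sketch below turns a batch of tweets into one vector per tweet using the Hugging Face `transformers` library. The checkpoint name `bert-base-uncased`, the `max_length` of 128, and the choice of mean pooling over tokens are assumptions made for this example; the project's notebooks may use a different checkpoint or pooling strategy (for instance the `[CLS]` token). The function names `mean_pool` and `embed_tweets` are hypothetical.

```python
import numpy as np


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors of each tweet, ignoring padding positions.

    token_vectors: array-like of shape (batch, tokens, hidden)
    attention_mask: array-like of shape (batch, tokens); 1 = real token, 0 = padding
    """
    vectors = np.asarray(token_vectors, dtype=float)
    mask = np.asarray(attention_mask, dtype=float)[..., None]  # (batch, tokens, 1)
    summed = (vectors * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # guard against division by zero
    return summed / counts


def embed_tweets(texts, model_name="bert-base-uncased"):
    """Return one vector per tweet (assumed checkpoint and pooling, see lead-in)."""
    # Imported lazily so mean_pool can be used without these heavier dependencies.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    batch = tokenizer(
        list(texts),
        padding=True,
        truncation=True,
        max_length=128,  # assumed limit; tweets are short
        return_tensors="pt",
    )
    with torch.no_grad():
        output = model(**batch)

    return mean_pool(
        output.last_hidden_state.numpy(),
        batch["attention_mask"].numpy(),
    )
```

Calling `embed_tweets(["Stay safe everyone", "Lockdown again..."])` would download the checkpoint on first use and return an array with one row per tweet. Masking before averaging keeps padding tokens from diluting the vectors of shorter tweets.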
The vector representations from BERT were fed into a classifier for sentiment analysis. Trained on our dataset, the classifier assigns each tweet to one of three sentiment classes: -1 (negative), 0 (neutral), or 1 (positive). By aggregating these sentiment scores at the state level and over time, we created an interactive map displaying the overall sentiment in each state at different points during the pandemic. We applied a Gaussian kernel to smooth the sentiment scores, reducing noise and highlighting general trends.
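The aggregation and smoothing steps can be sketched as follows. This is a minimal example under stated assumptions, not the project's actual pipeline: it assumes labels are averaged per state and ISO week, and that the Gaussian kernel is applied along the time axis with a bandwidth `sigma` measured in weeks. The function names `weekly_state_means` and `gaussian_smooth` and the default `sigma=2.0` are hypothetical.

```python
from collections import defaultdict
from datetime import date

import numpy as np


def weekly_state_means(records):
    """Average sentiment labels per (state, ISO year, ISO week).

    records: iterable of (state, day, label) with label in {-1, 0, 1}
             and day a datetime.date.
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [label sum, tweet count]
    for state, day, label in records:
        iso_year, iso_week = day.isocalendar()[:2]
        bucket = totals[(state, iso_year, iso_week)]
        bucket[0] += label
        bucket[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


def gaussian_smooth(values, sigma=2.0):
    """Smooth an evenly spaced series with a normalized Gaussian kernel.

    Each output point is a weighted average of all input points, with
    weights that decay with distance in time. Rows are normalized so the
    weights sum to 1, which also handles the edges of the series.
    """
    series = np.asarray(values, dtype=float)
    positions = np.arange(len(series))
    distances = positions[:, None] - positions[None, :]
    weights = np.exp(-0.5 * (distances / sigma) ** 2)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ series


# Toy usage with made-up labels (not project data).
example = [
    ("NY", date(2020, 3, 2), -1),
    ("NY", date(2020, 3, 3), 1),
    ("NY", date(2020, 3, 4), -1),
    ("CA", date(2020, 3, 2), 1),
]
means = weekly_state_means(example)
smoothed = gaussian_smooth([0.0, 0.0, 1.0, 0.0, 0.0], sigma=1.0)
```

A larger `sigma` averages over more weeks and yields a flatter curve, while a smaller one stays closer to the raw weekly means.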
Our visualizations provide a unique perspective on the public response to the pandemic. The sentiment map reveals significant geographic differences in sentiment, and the time-series analysis shows key moments when public sentiment shifted noticeably. We found overall negative sentiment in March 2020 and a sharp drop in June 2020, both of which correspond to key events during the pandemic.
Our results align with other studies that analyze Twitter data for sentiment changes over time and across locations. Notably, our findings echo those of the study by Aello et al., whose "Anger phase" coincides with the drop in sentiment we observed around the beginning of March 2020, when COVID-19 cases began to rise in the US. Additionally, the study by Yao et al., which focused on sentiment in New York and Los Angeles, is consistent with our finding that sentiment was more negative in New York than in Los Angeles.
SentPred exemplifies the power of NLP and sentiment analysis in understanding public sentiment during major global events such as the COVID-19 pandemic. The insights gained from our project can help policymakers, researchers, and public health officials to better understand public sentiment, inform decision-making processes, and develop more effective communication strategies.
For more detailed information on data preprocessing, model training, and visualization steps, please refer to the notebooks and scripts in our repository.