Tweet Like a Politician

Using Tweets to Predict Identity Characteristics of State-Level Political Figures in the United States

Alexander Adams

PPOL628 Text as Data

Final Project


  • U.S. politics is increasingly nationalized
    • Tweets by members of Congress likely convey little beyond partisanship
  • Politics at the state level may be less polarized
    • If partisanship is lower, other identity traits should be easier to detect
  • State-level tweets may also focus less on the national culture war and more on substantive issues


  • 48,249 tweets, scraped from official, campaign, and personal accounts
  • Political offices: Governor, Lieutenant Governor, Secretary of State, Attorney General, Treasurer
  • Tweet data includes the date each tweet was posted
  • Metadata: politician's name, state, office, and political party
  • Majority of tweets are from 2018 or later

Question 1

What topics do state-level politicians tweet about?

Task: Topic Modeling

Method: BERTopic


DVC YAML Stage: topic_model

#!dvc pull
from bertopic import BERTopic
import matplotlib.pyplot as plt
%matplotlib inline 
import numpy as np
import pandas as pd
import re
import seaborn as sn
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
import yaml
pd.options.display.max_columns = None
pd.options.display.max_colwidth = None
pd.options.display.max_seq_items = None
tweets = pd.read_csv('data/tweets.csv')
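#Drop tweets not in English
tweets = tweets.loc[tweets['language'] == 'en']
#Strip URLs; regex=True is needed because pandas >= 2.0 treats the pattern as a literal string by default
tweets['tweet'] = tweets['tweet'].str.replace(r'http\S+', '', regex=True)
#Drop tweets left empty once URLs are removed
tweets = tweets.loc[tweets['tweet'] != '']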
tweets = tweets.reset_index(drop=True)
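
The cleaning step above can be checked on a few strings without loading the data. The sketch below applies the same `http\S+` pattern using the standard-library `re` module; the helper name `strip_urls` and the sample tweets are made up for illustration and are not part of the project code.

```python
import re

# Same pattern as the pandas cleaning step: "http" followed by non-whitespace
URL_PATTERN = re.compile(r'http\S+')

def strip_urls(texts):
    """Remove URLs from each text, then drop texts that end up exactly empty."""
    cleaned = [URL_PATTERN.sub('', text) for text in texts]
    return [text for text in cleaned if text != '']

# Invented sample tweets
sample = [
    'Vote early! https://example.com/vote',
    'https://example.com/only-a-link',
    'Happy Veterans Day',
]
cleaned_sample = strip_urls(sample)
print(cleaned_sample)
```

Only exactly empty strings are dropped, so a tweet made up of a URL plus surrounding whitespace would survive the filter. Stripping whitespace before the comparison would be one way to tighten it.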

Topic Modeling: What Do State-Level Elected Officials Tweet About?

topic_model = BERTopic.load('project_BERTopic')
topics_list = topic_model.get_topics()
#Static image in case real plot doesn't load
from IPython.display import Image
probs = topic_model.hdbscan_model.probabilities_
topics = topic_model._map_predictions(topic_model.hdbscan_model.labels_)
new_topics, new_probs = topic_model.reduce_topics(tweets['tweet'], topics, probs, nr_topics = 10)


dynamic_topics = topic_model.topics_over_time(tweets['tweet'],
#Static image in case full plot does not load
                                       width = 950)


  • Most topics spike in winter
    • Ex. Topic 0 (vote/ballot = elections), Topic 8 (veterans), Topic 1 (Christmas)
  • Topics related to COVID-19 also spiked around the same time as major waves (esp. Omicron)
  • Some iterations of this graph generated during testing included a Ukraine topic
    • Basically nonexistent until Feb 2022, then a huge spike

Question 2: Can tweets be used to predict the state an official leads?

Task: Multiclass Classification (State)

Method: Linear Support Vector Classifier

Number of Classes: 50 (U.S. States)


DVC YAML Stage: multiclass_state
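
The trained pipeline is loaded from disk below, so its exact configuration is not shown here. As a rough idea of what a text classifier of this kind looks like, the following is a minimal sketch that assumes a TF-IDF vectorizer feeding a `LinearSVC` (both appear in the imports earlier). The toy corpus, the labels, and the step names `tfidf` and `svc` are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Invented toy corpus: two "states" with distinctive vocabulary
texts = [
    'proud of our maple syrup farmers',
    'maple season is here in the green mountains',
    'visit the green mountains this fall',
    'hurricane season preparedness starts now',
    'our beaches are open after the hurricane',
    'beaches and citrus growers need support',
]
labels = [1, 1, 1, 2, 2, 2]

# Assumed structure: TF-IDF features followed by a linear SVM
toy_pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('svc', LinearSVC()),
])
toy_pipe.fit(texts, labels)

toy_pred = toy_pipe.predict(['maple syrup festival',
                             'hurricane warning for the beaches'])
print(toy_pred)
```

With 50 classes, `LinearSVC` fits one-vs-rest classifiers by default, so the same pipeline shape carries over unchanged from this two-class toy case.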

import joblib
import numpy as np
from sklearn.metrics import (confusion_matrix, precision_recall_fscore_support, classification_report)
#Load the trained multiclass pipeline
pipe = joblib.load('outputs/mc_state_pipe.pkl')
#Perform necessary data processing
states = pd.read_csv('data/elected_officials.csv')

states = states.melt(id_vars = ['State',
                    value_vars = ['officialTwitter',
                    var_name = 'account_type',
                    value_name = 'twitter')

states['twitter'] = states['twitter'].str.lower()

tweets = tweets.merge(states, left_on = 'username', right_on = 'twitter')
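
The reshape-and-merge step above turns a table with one row per official into one row per Twitter account, so that each tweet can be matched to its author's metadata by username. A small sketch of the idea follows; only `State` and `officialTwitter` appear in the source, and the `campaignTwitter` column and all handles are hypothetical.

```python
import pandas as pd

# Hypothetical wide table: one row per official, one column per account type
officials = pd.DataFrame({
    'State': ['Vermont', 'Florida'],
    'officialTwitter': ['GovExampleVT', 'GovExampleFL'],
    'campaignTwitter': ['ExampleForVT', None],
})

# Wide to long: one row per (official, account type) pair
accounts = officials.melt(id_vars=['State'],
                          value_vars=['officialTwitter', 'campaignTwitter'],
                          var_name='account_type',
                          value_name='twitter')

# Drop missing accounts, then lowercase handles so they match scraped usernames
accounts = (accounts.dropna(subset=['twitter'])
                    .assign(twitter=lambda d: d['twitter'].str.lower()))

# Invented tweets keyed by lowercase username
toy_tweets = pd.DataFrame({
    'username': ['govexamplevt', 'exampleforvt', 'govexamplefl'],
    'tweet': ['a', 'b', 'c'],
})
merged = toy_tweets.merge(accounts, left_on='username', right_on='twitter')
print(merged[['username', 'State', 'account_type']])
```

Because `merge` defaults to an inner join, tweets from usernames missing in the officials table are silently dropped. Comparing row counts before and after the merge is a cheap sanity check.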

#Create numeric labels based on state names
labels = pd.DataFrame(tweets['State'].unique()).reset_index()
#Add one because zero indexed
labels['index'] = labels['index']+1
labels.columns = ['state_label', 'State']
#Merge labels into tweets data frame
tweets = tweets.merge(labels, on = 'State')
#Select labels as targets
y = tweets['state_label']

#Select text columns as features
X = tweets["tweet"]
y_pred = pipe.predict(X)

Rather than print out a 50x50 confusion matrix, I'm going to simplify the matrix to just a few columns:

-state: the abbreviation for the state
-correct: the number of correctly classified tweets for that state
-incorrect: the number of incorrectly classified tweets for that state
-total_tweets: the total number of tweets from that state
-precision: true positives/(true positives + false positives)
-recall: true positives/(true positives + false negatives)
-errors: the state labels incorrectly applied to that state's tweets (its false negatives)
cm = confusion_matrix(y,y_pred)
state_cm = pd.DataFrame.from_dict({'state': pd.unique(tweets['StateAbbr']),
                                   'correct': np.diag(cm),
                                   'incorrect': cm.sum(1)-np.diag(cm),
                                   'total_tweets': cm.sum(1),
                                   'precision': np.diag(cm)/cm.sum(0),
                                   'recall': np.diag(cm)/cm.sum(1)})
cm = pd.DataFrame(cm)
cm.columns = pd.unique(tweets['StateAbbr'])
cm.index = pd.unique(tweets['StateAbbr'])
cols = cm.columns.values
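#Boolean mask of nonzero cells (assumed definition); the diagonal is cleared on the next line
mask = cm.values > 0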
np.fill_diagonal(mask, False)
out = [cols[x].tolist() for x in mask]
state_cm['errors'] = out
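
To make the per-state summary concrete, here is a small worked example on an invented 3x3 confusion matrix. It follows scikit-learn's convention that rows are true labels and columns are predicted labels, so column sums give true positives plus false positives and row sums give true positives plus false negatives. The names prefixed with `toy_` and the counts are illustrative only.

```python
import numpy as np

# Invented confusion matrix: rows are true labels, columns are predictions
toy_cm = np.array([[8, 1, 1],
                   [0, 9, 1],
                   [3, 0, 7]])
toy_names = np.array(['AL', 'CA', 'CO'])

true_pos = np.diag(toy_cm)
toy_precision = true_pos / toy_cm.sum(axis=0)  # column sums = TP + FP
toy_recall = true_pos / toy_cm.sum(axis=1)     # row sums = TP + FN

# For each true class, list the other classes its tweets were assigned to
off_diag = toy_cm > 0
np.fill_diagonal(off_diag, False)
toy_errors = [toy_names[row].tolist() for row in off_diag]

print(toy_precision)
print(toy_recall)
print(toy_errors)
```

In this toy case, AL tweets were mislabeled as both CA and CO, while CO tweets were only ever mislabeled as AL. That per-row view is what the `errors` column captures.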


  • No apparent regional trends in errors
    • i.e. Southern states (like AL) were no more likely to be misclassified as other southern states than as states in other parts of the country
    • Possible that creating region labels would not improve performance
  • Consistently strong performance across states
    • All precision and recall scores > 0.9, most are 0.98 or greater
    • Lowest scores are recall for California and Colorado

Question 3: Can I predict the office a politician holds?

Task: Multiclass Classification (Political Office)

Method: Linear Support Vector Classifier

Number of Classes: 5 (Governor, Lieutenant Governor, Attorney General, Secretary of State, Treasurer)


DVC YAML Stage: multiclass_office

#Load the trained multiclass pipeline
pipe = joblib.load('outputs/mc_office_pipe.pkl')
labels = pd.DataFrame(tweets['office'].unique()).reset_index()
#Add one because zero indexed
labels['index'] = labels['index']+1
labels.columns = ['office_label', 'office']
tweets = tweets.merge(labels, on = 'office')
#Select labels as targets
y = tweets['office_label']

#Select text columns as features
X = tweets["tweet"]
y_pred = pipe.predict(X)
ConfusionMatrixDisplay.from_predictions(y, y_pred, display_labels = pd.unique(tweets['office']))
cm = confusion_matrix(y,y_pred)
office_cm = pd.DataFrame.from_dict({'office': pd.unique(tweets['office']),
                                   'correct': np.diag(cm),
                                   'incorrect': cm.sum(1)-np.diag(cm),
                                   'total_tweets': cm.sum(1),
                                   'precision': np.diag(cm)/cm.sum(0),
                                   'recall': np.diag(cm)/cm.sum(1)})


  • Fewer classes, but overall a less effective classifier
    • Esp. Lt. Governors (precision = 0.888)
    • Maybe Lt. Governors have less distinctive tweets than other state-level officials?
  • Mean recall is slightly higher than mean precision
    • Classifier is better at avoiding false negatives than false positives
  • Classes are imbalanced: governors have roughly 1.5 to 2 times as many tweets as each of the other offices

Question 4: Can I predict the political party of a state-level political official?

Task: Binary Classification (Political Party)

Method: Linear Support Vector Classifier

Number of Classes: 2 (Democrat, Republican)


DVC YAML Stage: twoclass_party

Note: Two officials are Independents and were excluded from this model. In Minnesota, the Democratic party is called the Democratic-Farmer-Labor Party (DFL); politicians in that party were recoded as Democrats.

#Load the trained binary classification pipeline
pipe = joblib.load('outputs/bc_party_pipe.pkl')
labels = pd.DataFrame(tweets['Party'].unique()).reset_index()
#Add one because zero indexed
labels['index'] = labels['index']+1
labels.columns = ['party_label', 'Party']
tweets = tweets.merge(labels, on = 'Party')
partyclass = tweets.loc[tweets['Party'] != 'Independent']
#Select labels as targets
y = partyclass['party_label']

#Select text columns as features
X = partyclass["tweet"]
y_pred = pipe.predict(X)
ConfusionMatrixDisplay.from_predictions(y, y_pred, display_labels = pd.unique(partyclass['Party']))
print(classification_report(y, y_pred, target_names=pd.unique(partyclass['Party'])))


  • Classifier can predict whether a tweet was written by a Republican or a Democrat with 97% accuracy
  • Strong evidence that the two parties do tweet differently
    • Suggests initial hypothesis (state-level politics is not as polarized/nationalized as federal politics) is not true
      • At least not on Twitter

Question 5: Can I predict how partisan an elected official is, based on their tweets?

Task: Ideal Point Generation

Method: Wordfish (via R packages quanteda and quanteda.textmodels)

Output: Value indicating ideological position on left-right scale (further right = more conservative)

Script: ideal_points.R

Note: I was only able to find ideal points for governors and state treasurers. For Lt. Governors, Secretaries of State, and Attorneys General, the algorithm did not converge.
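
For reference, Wordfish as it is usually stated (Slapin and Proksch, 2008) treats word counts as Poisson draws whose rate depends on a one-dimensional position for each document. The notation below is the conventional one and is not taken from ideal_points.R. Treating each official's pooled tweets as one document is an assumption about how the script sets up the model.

```latex
y_{ij} \sim \mathrm{Poisson}(\lambda_{ij}), \qquad
\log \lambda_{ij} = \alpha_i + \psi_j + \beta_j \theta_i
```

Here $y_{ij}$ is the count of word $j$ in document $i$, $\alpha_i$ is a document fixed effect that absorbs differences in length, $\psi_j$ is a word fixed effect that absorbs overall word frequency, $\beta_j$ measures how strongly word $j$ discriminates between positions, and $\theta_i$ is the estimated ideal point of document $i$.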

Ideal points of governors:


Ideal points of state treasurers:



  • Governors are more polarized than treasurers
    • Even then, governors are not completely polarized (Ex. Charlie Baker (MA), Jim Justice (WV))
  • Polarization could be linked to visibility
    • Officials from TX, FL tend to be at extremes
      • Do tweets make them more polarizing, or are tweets a byproduct of polarization?

Conclusions and avenues for further exploration:

  • Incorporate additional data (margin of victory in most recent election, partisanship of state)
  • Consider length of incumbency
    • Wanted to test pre-/post-inauguration, but ran out of time

Thank you for listening! Any questions?