Classification of news articles from various sources as left- or right-leaning, using text from the article headline and article content. Final project for JHU 605.744 - Information Retrieval (Fall 2021).
Commonly used machine learning algorithms are applied to build classification models that predict the political leaning of news articles from publishers across the bias and reliability spectrum, as defined by Ad Fontes Media. The algorithms include Decision Trees, Random Forests, Logistic Regression, and Support Vector Machines. The articles were collected by scraping the websites of selected publishers over a period of two weeks, which yielded 1,494 unique news articles. Three approaches to prediction were investigated: (1) using only derived features such as headline length, article content length, predicted news topic, and polarity; (2) using only the headline text; and (3) using only the article content text. All text is vectorized with TF-IDF. Results show that the Random Forest classifier is the highest performing of the tested models in each of the three approaches. The results also show that the highest-quality predictions are obtained when the article content text is used to predict political leaning.
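As a rough illustration of the text-based approaches (2) and (3), the following is a minimal sketch assuming scikit-learn, not the project's actual code. The toy sentences, the labels, and the hyperparameters (`n_estimators`, `random_state`, `stop_words`) are invented for illustration only; the real data is the set of scraped articles described above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for headlines or article bodies.
# These strings and labels are made up and are not taken from the dataset.
texts = [
    "senator proposes expanded public healthcare funding",
    "union leaders rally for higher minimum wage",
    "activists demand stronger climate regulations",
    "governor calls for lower taxes and smaller government",
    "lawmakers push to cut business regulations",
    "candidate pledges tougher border enforcement",
]
labels = ["left", "left", "left", "right", "right", "right"]

# TF-IDF turns each document into a sparse weighted term vector;
# the Random Forest is then trained on those vectors.
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(texts, labels)

# Predict the leaning of unseen (also made-up) text.
new_texts = [
    "mayor backs new climate regulations",
    "senate votes to lower taxes",
]
preds = list(model.predict(new_texts))
print(preds)
```

Swapping `RandomForestClassifier` for another estimator (for example `LogisticRegression` or `LinearSVC`) inside the same pipeline is one way the different algorithms could be compared on identical TF-IDF features.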