This repository contains my personal data science and machine learning projects. The key projects included are:
- Audio data classification: I used the Spotify API to build a dataset of c. 5,000 songs, then implemented machine learning models that classify the songs by genre from their audio features.
- NIPS publications analysis: This project uses data on published academic papers to investigate trends in machine learning research over three decades (1987-2017). The data is an archive of 7,241 publications from the Neural Information Processing Systems (NIPS) conferences. I use visualisation to show how the number of publications changes over time, and Latent Dirichlet Allocation to identify the most prevalent machine learning topics and how they vary from year to year.
- Real vs fake news prediction: I take a dataset of real and fake news articles and, with feature engineering and dimensionality reduction, I develop a model to predict whether an article is real or fake, with an accuracy score of 99%.
- Credit card fraud detection: Using a set of financial transactions and associated data, I develop a classification model which can determine whether a credit card transaction is fraudulent or not, with an AUC ROC score of 99.7%.
- Consumer complaints interpretation: Using a large dataset of complaints written by consumers, I apply natural language processing techniques, including lemmatization, tf-idf and count vectorization, together with scikit-learn classification models to categorise the complaints by the issue to which they relate. The model currently achieves 53% accuracy across 90 possible classes.
- Machine learning algorithms from scratch: In this project, I develop machine learning algorithms from scratch using only low-level Python, NumPy, and pandas functionality. The algorithms developed to date are linear regression and an artificial neural network. The architecture and hyperparameters of the neural network are flexible: the number of hidden layers, the number of nodes in each hidden layer, the learning rate, and the activation function can all be varied. The neural network is trained using stochastic gradient descent.
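
To give a flavour of the from-scratch approach described in the last item, here is a minimal NumPy sketch of a small feed-forward network trained with stochastic gradient descent. It is an illustration only, not the code from the notebooks: the class name `TinyNetwork`, the `ACTIVATIONS` table, the sigmoid output with cross-entropy loss, and the toy logical-OR data are all assumptions made for this example.

```python
import numpy as np

# Hypothetical activation table: each entry is (function, derivative written
# in terms of the activation value a). Names are illustrative only.
ACTIVATIONS = {
    "sigmoid": (lambda z: 1 / (1 + np.exp(-z)), lambda a: a * (1 - a)),
    "tanh": (np.tanh, lambda a: 1 - a ** 2),
}


class TinyNetwork:
    """Fully connected network with a sigmoid output, trained by SGD."""

    def __init__(self, layer_sizes, activation="tanh", learning_rate=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.act, self.act_grad = ACTIVATIONS[activation]
        self.lr = learning_rate
        # One weight matrix and bias vector per pair of consecutive layers.
        self.weights = [rng.normal(0, 1 / np.sqrt(n_in), (n_in, n_out))
                        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
        self.biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

    def forward(self, x):
        """Return the activations of every layer, input included."""
        activations = [x]
        last = len(self.weights) - 1
        for i, (w, b) in enumerate(zip(self.weights, self.biases)):
            z = activations[-1] @ w + b
            a = ACTIVATIONS["sigmoid"][0](z) if i == last else self.act(z)
            activations.append(a)
        return activations

    def train_step(self, x, y):
        """Apply one SGD update using a single training example."""
        activations = self.forward(x)
        # Assumed loss: cross-entropy on a sigmoid output, so the error
        # signal at the output pre-activation is simply (prediction - target).
        delta = activations[-1] - y
        for i in reversed(range(len(self.weights))):
            grad_w = np.outer(activations[i], delta)
            grad_b = delta
            if i > 0:
                # Backpropagate through the weights before they are updated.
                delta = (self.weights[i] @ delta) * self.act_grad(activations[i])
            self.weights[i] -= self.lr * grad_w
            self.biases[i] -= self.lr * grad_b

    def fit(self, X, y, epochs=500, seed=0):
        rng = np.random.default_rng(seed)
        for _ in range(epochs):
            for idx in rng.permutation(len(X)):  # visit examples in random order
                self.train_step(X[idx], y[idx])

    def predict(self, X):
        return np.array([self.forward(x)[-1] for x in X])


# Toy usage: learn logical OR with one hidden layer of four nodes.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 1.0])
net = TinyNetwork([2, 4, 1], activation="tanh", learning_rate=0.1)
loss_before = np.mean((net.predict(X).ravel() - y) ** 2)
net.fit(X, y, epochs=500)
loss_after = np.mean((net.predict(X).ravel() - y) ** 2)
print(f"MSE before: {loss_before:.3f}, after: {loss_after:.3f}")
```

Passing a longer `layer_sizes` list (for example `[2, 8, 8, 1]`) adds hidden layers, while `activation` and `learning_rate` cover the other hyperparameters mentioned above.
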
The code in this repository is written in Python, using Jupyter Notebooks. I use a number of well-known packages and APIs including NumPy, pandas, Matplotlib, Seaborn, and scikit-learn.
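
As an example of how these packages fit together, the consumer complaints project combines tf-idf vectorization with a scikit-learn classifier. The sketch below shows one way to wire the two into a single pipeline. The complaint texts and labels are made up, `LogisticRegression` is an assumed choice of classifier, and the lemmatization step is omitted for brevity, so treat it as a rough outline rather than the notebook's actual code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up complaints and issue labels, for illustration only.
texts = [
    "I was charged a late fee even though I paid my credit card on time",
    "The bank added an unexpected fee to my credit card statement",
    "A debt collector keeps calling me about a loan I never took out",
    "I receive repeated calls from a collector about a debt that is not mine",
]
labels = ["fees", "fees", "debt collection", "debt collection"]

# The vectorizer turns each text into tf-idf weights; the classifier then
# learns a mapping from those weights to the issue label.
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

predicted = model.predict(["Why is there a new fee on my credit card?"])
print(predicted[0])
```

Wrapping both steps in one pipeline means the vectorizer is fitted only on the training texts, which avoids leaking vocabulary statistics from a held-out set.
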