Cyberbullying is a serious problem that can have detrimental effects on individuals' mental health and well-being. Given the large volume of tweets generated daily, manually identifying cyberbullying is time-consuming and inefficient. This tool has been developed to address this challenge by efficiently flagging potentially harmful tweets, aiming to create a safer online environment for all.
- Used Kaggle's Cyberbullying Dataset with under-sampling.
- Cleaned and transformed data for machine learning.
- Trained various models using cross-validation.
- Built a user-friendly API with Flask.
Python Version: 3.10
Packages: numpy, pandas, nltk, scikit-learn, xgboost, flask, json, pickle
Original Dataset: https://www.kaggle.com/datasets/andrewmvd/cyberbullying-classification
Setting Up Environment:
conda create -p venv python=3.10 -y
pip install -r requirements.txt
The dataset used is the Cyberbullying Classification Dataset from Kaggle. It is tailored for binary classification, distinguishing potentially harmful tweets from non-cyberbullying ones.
A Python script performs comprehensive text cleaning, including the following operations (a minimal sketch follows the list):
- Punctuation removal
- Numeric character removal
- Lowercasing
- Stop word elimination
- Lemmatization/Stemming
- URL removal
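The cleaning script itself is not reproduced here; the snippet below is a minimal sketch of such a step using NLTK, where the function name `clean_tweet` and the exact order of operations are illustrative assumptions rather than the project's actual implementation.

```python
# Minimal text-cleaning sketch (illustrative, not the project's exact script).
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_tweet(text: str) -> str:
    text = re.sub(r"http\S+|www\.\S+", " ", text)                       # URL removal
    text = text.lower()                                                 # lowercasing
    text = text.translate(str.maketrans("", "", string.punctuation))    # punctuation removal
    text = re.sub(r"\d+", " ", text)                                    # numeric character removal
    tokens = [t for t in text.split() if t not in STOP_WORDS]           # stop word elimination
    tokens = [lemmatizer.lemmatize(t) for t in tokens]                  # lemmatization
    return " ".join(tokens)
```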
- Data Splitting: The data is split into training and testing sets with a 70-30 ratio.
- Feature Engineering: A transformation pipeline combines CountVectorizer and TfidfTransformer (see the pipeline sketch after this list).
- Model Training and Evaluation: Multiple models are trained with cross-validation and compared on accuracy and training efficiency; XGBClassifier is selected for its superior performance.
- Fine-tuning: The XGBClassifier model is tuned for optimal performance.
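As a rough illustration of these steps, the sketch below wires together the 70-30 split, the CountVectorizer + TfidfTransformer pipeline, and a cross-validated XGBClassifier. The CSV file name, the column names (`tweet_text` and `label`, assumed to hold 0/1 labels), and the hyperparameters are assumptions, not the project's exact configuration.

```python
# Hedged sketch of the modeling pipeline; file/column names are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from xgboost import XGBClassifier

df = pd.read_csv("cyberbullying_tweets.csv")  # assumed file name
X_train, X_test, y_train, y_test = train_test_split(
    df["tweet_text"], df["label"],            # assumed column names, labels encoded as 0/1
    test_size=0.30, random_state=42, stratify=df["label"]
)

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", XGBClassifier(eval_metric="logloss")),
])

# 5-fold cross-validation on the training set, then refit on all training data.
scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

pipeline.fit(X_train, y_train)
print(f"Test accuracy: {pipeline.score(X_test, y_test):.3f}")
```

Wrapping the vectorizer and transformer in a single Pipeline keeps the TF-IDF statistics fitted only on the training folds during cross-validation, avoiding leakage into the evaluation folds.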
After cross-validation, the models show the following performance:
The project culminates in a Flask-based user interface and an API endpoint for real-time cyberbullying detection.
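A minimal sketch of such an endpoint is shown below, assuming the fitted pipeline was serialized with pickle to a file named `model.pkl`; the `/predict` route and the JSON schema are illustrative choices, not necessarily those of the project.

```python
# Minimal Flask API sketch; model path, route, and JSON schema are assumptions.
import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)

with open("model.pkl", "rb") as f:
    model = pickle.load(f)  # the CountVectorizer + TF-IDF + XGBClassifier pipeline

@app.route("/predict", methods=["POST"])
def predict():
    data = request.get_json(force=True)
    tweet = data.get("tweet", "")
    label = int(model.predict([tweet])[0])
    return jsonify({"tweet": tweet, "cyberbullying": bool(label)})

if __name__ == "__main__":
    app.run(debug=True)
```

A client would then POST a JSON body such as `{"tweet": "<text>"}` to `/predict` and receive the predicted label in the response.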