1.6M-tweets-Sentiment-Analysis

Project 2: Sentiment Analysis on 1.6 Million Tweets

I recently completed a project on sentiment analysis using a dataset of 1.6 million tweets. Here's a breakdown of the code and the project:

Step 1: Setting Up Kaggle Credentials

! mkdir ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json

These commands copy the Kaggle API token (kaggle.json) into ~/.kaggle and restrict its file permissions so the Kaggle CLI can authenticate.
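This assumes the kaggle command-line tool is available in the environment (Google Colab usually ships with it); if it is not, it can be installed first:

! pip install kaggle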

Step 2: Importing Twitter Sentiment Dataset

!kaggle datasets download -d kazanova/sentiment140

This code downloads the Sentiment140 dataset using the Kaggle API.

Step 3: Extracting Compressed Dataset

from zipfile import ZipFile
dataset = '/content/sentiment140.zip'

with ZipFile(dataset, 'r') as zip_file:
  zip_file.extractall()
  print('The dataset is extracted')

The dataset is extracted from the downloaded zip file.

Step 4: Importing Dependencies

import pandas as pd
import numpy as np
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

import nltk
nltk.download('stopwords')

This code imports necessary libraries and downloads NLTK stopwords.

Step 5: Data Processing

# The raw CSV has no header row, so column names are supplied explicitly
column_names = ['target', 'id', 'date', 'flag', 'user', 'text']
data = pd.read_csv('/content/training.1600000.processed.noemoticon.csv', names=column_names, encoding='ISO-8859-1')

# Sentiment140 labels positive tweets as 4; remap to 1 so the target is 0 (negative) / 1 (positive)
data.replace({'target': {4: 1}}, inplace=True)

The data is loaded with explicit column names and the positive label 4 is remapped to 1, so the target is 0 for negative and 1 for positive tweets. It is also worth confirming there are no missing values and that the classes are balanced, as sketched below.
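A quick optional check (a sketch; the actual counts come from the raw file, not from this write-up):

print(data.isnull().sum())            # missing values per column
print(data['target'].value_counts())  # class distribution of 0 vs. 1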

Step 6: Stemming

port_stem = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build the stopword set once instead of per tweet

def stemming(content):
  # Keep letters only, lowercase, and split into words
  stemmed_content = re.sub('[^a-zA-Z]', ' ', content)
  stemmed_content = stemmed_content.lower()
  stemmed_content = stemmed_content.split()
  # Drop English stopwords and reduce each remaining word to its stem
  stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in stop_words]
  return ' '.join(stemmed_content)

data['stemmed_content'] = data['text'].apply(stemming)

The stemming function keeps only letters, lowercases the text, removes English stopwords, and stems each remaining word; applying it to the 'text' column produces the new 'stemmed_content' column. On 1.6 million tweets this step takes a while.
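Before running it over the full column, a quick sanity check on a single made-up tweet (illustrative only, not taken from the dataset) shows what the function produces:

print(stemming("I'm loving the new update!!! :) #happy"))  # prints the cleaned, stemmed text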

Step 7: Data Splitting and Conversion

x = data['stemmed_content'].values
y = data['target'].values

# 80/20 split, stratified so both classes keep the same proportions in train and test
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, stratify=y, random_state=2)

# Fit TF-IDF on the training text only, then transform both sets with the same vocabulary
vectorizer = TfidfVectorizer()
x_train = vectorizer.fit_transform(x_train)
x_test = vectorizer.transform(x_test)

The data is split 80/20 into training and test sets, and the tweets are converted to numerical features with TF-IDF; the vectorizer is fitted on the training text only, so the test set does not leak into the vocabulary.
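An optional shape check (a sketch; the number of TF-IDF features depends on the vocabulary found in the training text):

print(x_train.shape)  # (1280000, n_features) sparse matrix
print(x_test.shape)   # (320000, n_features)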

Step 8: Training the Machine Learning Model

model = LogisticRegression(max_iter=1000)
model.fit(x_train, y_train)

A logistic regression model is trained on the TF-IDF features; max_iter is raised to 1000 to give the solver enough iterations to converge on this large, sparse dataset.

Step 9: Model Evaluation

x_train_prediction = model.predict(x_train)
training_data_accuracy = accuracy_score(y_train, x_train_prediction)

x_test_prediction = model.predict(x_test)
testing_data_accuracy = accuracy_score(y_test, x_test_prediction)

Accuracy is computed on both the training and test sets; comparing the two gives a rough check for overfitting.
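To see the two scores side by side (a sketch; the actual numbers depend on the preprocessing and random seed, and the test accuracy is the one that matters):

print('Training accuracy:', training_data_accuracy)
print('Test accuracy:', testing_data_accuracy)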

Step 10: Saving the Trained Model

import pickle

filename = 'trained_model.sav'
pickle.dump(model, open(filename, 'wb'))

The trained model is serialized to disk with pickle so it can be reused without retraining. Note that the fitted vectorizer should be saved too, as sketched below.
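To score raw tweets later, the fitted TF-IDF vectorizer has to be saved alongside the model; a minimal sketch (the filename 'vectorizer.sav' is my own choice, not part of the original code):

pickle.dump(vectorizer, open('vectorizer.sav', 'wb'))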

Step 11: Using the Saved Model

loaded_model = pickle.load(open('/content/trained_model.sav', 'rb'))

# Predict the sentiment of a single vectorized tweet from the test set
x_new = x_test[521]
prediction = loaded_model.predict(x_new)
print(prediction)

The saved model is reloaded and used to predict the sentiment of a single vectorized tweet from the test set (0 = negative, 1 = positive).
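A small readability tweak (a sketch, assuming the 0/1 label encoding set up in Step 5):

label = 'Positive tweet' if prediction[0] == 1 else 'Negative tweet'
print(label)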

#Python #DataScience #SentimentAnalysis #MachineLearning #NLP #Kaggle
