This project aims to improve the Naive Bayes algorithm with feature-correlation-based weighting and to improve the performance of the Bayes classifier with AdaBoost.
These instructions will help you get the project demo running on your machine for development and testing purposes.
The following tools and packages are required to run this project demo.
Python 3
numpy
random
re
nltk
textblob
time
tweepy
os
sys
- Open https://www.python.org/downloads/windows/
- Download the correct installation package (32-bit or 64-bit) for Python 3 (the latest Python 3 release is fine)
- Run the installer and complete the installation (remember to select the option "Add Python x.x to PATH")
Because re and random are included with Python, only NumPy, NLTK and TextBlob need to be installed.
- Open http://www.lfd.uci.edu/~gohlke/pythonlibs/#numpy
- Download the correct NumPy wheel for your Python version
- Go to the folder containing the wheel in cmd and perform the installation
>pip3 install numpy-x.xx.x+mkl-cpxx-cpxxm-win(32 or 64 bits).whl
>pip3 install nltk
>pip3 install textblob
If Homebrew is not installed on your Mac, install it from the terminal.
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
You can use Homebrew to search for and install Python 3 from the terminal.
$ brew search python
$ brew install python3
Because re and random are included with Python, only NumPy, NLTK and TextBlob need to be installed.
$ pip3 install numpy
$ pip3 install nltk
$ pip3 install textblob
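After installation on either platform, a quick check such as the following sketch can confirm that the packages from the requirements list are importable. The helper name `missing_packages` is hypothetical and not part of the project; the package names are taken from the list above.

```python
import importlib.util

# Package names from the requirements list; the helper below is illustrative only.
REQUIRED = ["numpy", "nltk", "textblob", "tweepy", "re", "random", "time", "os", "sys"]


def missing_packages(names):
    """Return the names that cannot be located by the current Python interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are available.")
```

Any name reported as missing can then be installed with pip3 as shown above.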
For more details about the Twitter data crawling API, please refer to the Twitter Developer Docs.
>python get_tweets [user] [datatype] [nums_line] [nums_file]
e.g. >python get_tweets Tom hist 100 3
This uses Tom's auth key to fetch historical data, then writes the data to 3 files with 100 lines per file.
$ python3 get_tweets [user] [datatype] [nums_line] [nums_file]
e.g. $ python3 get_tweets Tom hist 100 3
This uses Tom's auth key to fetch historical data, then writes the data to 3 files with 100 lines per file.
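To make the meaning of [nums_line] and [nums_file] concrete, here is a minimal sketch of splitting a list of lines into a fixed number of files. It does not call the Twitter API and is not the code of get_tweets, whose internals are not shown here; the function name `write_in_chunks`, the file naming scheme, and the placeholder lines are all assumptions for illustration.

```python
import os
import tempfile


def write_in_chunks(lines, nums_line, nums_file, out_dir, prefix="tweets"):
    """Write up to nums_file files with at most nums_line lines each; return the paths."""
    paths = []
    for i in range(nums_file):
        chunk = lines[i * nums_line:(i + 1) * nums_line]
        if not chunk:
            break  # fewer lines than requested files
        path = os.path.join(out_dir, f"{prefix}_{i + 1}.txt")  # hypothetical naming
        with open(path, "w", encoding="utf-8") as f:
            f.write("\n".join(chunk) + "\n")
        paths.append(path)
    return paths


# Placeholder data standing in for downloaded tweets.
demo_dir = tempfile.mkdtemp()
demo_lines = [f"tweet {n}" for n in range(300)]
demo_paths = write_in_chunks(demo_lines, nums_line=100, nums_file=3, out_dir=demo_dir)
print(len(demo_paths), "files written to", demo_dir)
```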
Because the data has already been prepared in the folder 'data', this step can be skipped in this demo.
>python data_transfer.py
>python give_label.py
$ python3 data_transfer.py
$ python3 give_label.py
Please check the file names and paths in data_transfer.py (shown here for Mac).
f_old = open('../data/TweetData_Original.txt','r')
f_new = open('../data/TweetData_Transfered.txt', 'w+')
Please check the file names and paths in give_label.py (shown here for Mac).
original_fileName = '../data/TweetData_Transfered.txt'
new_fileName = '../data/CleanedTweetData.txt'
changeData(original_fileName,new_fileName)
- The project uses 4 spaces per indent level instead of tabs.
- Pay attention to the labels, which must match your training set and testing set.
For CleanedTweetData.txt, the labels are 'pos' and 'neg'.
For SMS.txt, they are 'ham' and 'spam'.
import process_data  # project module providing cleanData (assumed to be in the same folder)

def loadContentsLabels(fileName):
    labels = []  # class labels: 1 for negative content, 0 for positive content
    contents = []
    with open(fileName) as f:
        for line in f:
            linedatas = line.strip().split('\t')
            if linedatas[0] == 'pos':  # use 'ham' for SMS.txt
                labels.append(0)
            elif linedatas[0] == 'neg':  # use 'spam' for SMS.txt
                labels.append(1)
            else:
                continue  # skip lines with other labels so contents and labels stay aligned
            # process the original data and filter out useless strings
            words = process_data.cleanData(linedatas[1])
            contents.append(words)
    return contents, labels
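The loader expects one sample per line in the form label, tab, text. The following self-contained sketch illustrates that format on a tiny temporary file. The function `clean_text` is a hypothetical stand-in for process_data.cleanData, whose actual behavior is not shown here, and the `positive`/`negative` parameters are an illustrative way of switching label names rather than part of the project.

```python
import os
import re
import tempfile


def clean_text(text):
    """Hypothetical stand-in for process_data.cleanData: lowercase alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def load_contents_labels(file_name, positive="pos", negative="neg"):
    """Read 'label<TAB>text' lines and return (token lists, 0/1 labels)."""
    contents, labels = [], []
    with open(file_name, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t", 1)
            if len(parts) != 2:
                continue  # skip malformed lines
            label, text = parts
            if label == positive:
                labels.append(0)
            elif label == negative:
                labels.append(1)
            else:
                continue  # skip unknown labels so both lists stay aligned
            contents.append(clean_text(text))
    return contents, labels


# Write a tiny sample file and load it back.
sample = "pos\tGreat day!\nneg\tAwful service.\nunknown\tignored line\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write(sample)
demo_contents, demo_labels = load_contents_labels(tmp.name)
os.remove(tmp.name)
print(demo_contents, demo_labels)
```

For a file labeled like SMS.txt, the same sketch would be called with positive="ham" and negative="spam".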
To run the model, change to the SimpleBayes_TFIDF folder in a terminal and run main.py.
>cd X:\xxx\xxx\SimpleBayes_TFIDF
>python main.py
$ cd /Users/xxx/xxx/SimpleBayes_TFIDF
$ python3 main.py
To change the size of the test set, edit the variable testCount in the trainningErrorRate function in main.py (shown here for Mac).
def trainningErrorRate():
    """
    Test the error rate of classification.
    :return: errorCount and errorRate
    """
    filename = '../data/CleanedTweetData.txt'
    contents, labels = load_save_data.loadContentsLabels(filename)
    # Cross-validation
    testWords = []
    testWordsType = []
    testCount = 1000
    ......
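The rest of the function is elided above, so the following is only a sketch of one common way a value like testCount can be used: randomly holding out that many samples for testing and training on the remainder. The function name `split_holdout`, the `seed` parameter, and the toy data are assumptions, not the project's code.

```python
import random


def split_holdout(contents, labels, test_count, seed=None):
    """Randomly move test_count samples into a test set; the rest form the training set."""
    if test_count > len(contents):
        raise ValueError("test_count exceeds the number of samples")
    rng = random.Random(seed)
    test_indices = set(rng.sample(range(len(contents)), test_count))
    train_words, train_types, test_words, test_types = [], [], [], []
    for i, (words, label) in enumerate(zip(contents, labels)):
        if i in test_indices:
            test_words.append(words)
            test_types.append(label)
        else:
            train_words.append(words)
            train_types.append(label)
    return train_words, train_types, test_words, test_types


# Toy data: 10 samples, 3 held out for testing.
toy_contents = [[f"word{i}"] for i in range(10)]
toy_labels = [i % 2 for i in range(10)]
train_w, train_t, test_w, test_t = split_holdout(toy_contents, toy_labels, test_count=3, seed=0)
print(len(train_w), len(test_w))
```

A larger testCount gives a more stable error estimate but leaves fewer samples for training, so it should stay well below the size of the data set.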
To change the data source, edit the variable filename in main.py. The following example uses SMS.txt (shown here for Mac).
def training():
    filename = '../data/SMS.txt'
    ...
def trainningErrorRate():
    filename = '../data/SMS.txt'
    ...
To run the model, change to the Adaboost_SimpleBayes_TFIDF folder in a terminal and run main.py.
>cd X:\xxx\xxx\Adaboost_SimpleBayes_TFIDF
>python main.py
$ cd /Users/xxx/xxx/Adaboost_SimpleBayes_TFIDF
$ python3 main.py
To change the size of the test set, edit the variable testCount in the AdaboostTrainingWithDS function in main.py (shown here for Mac).
def AdaboostTrainingWithDS(iterateNum):
    """
    Test the error rate of classification.
    :param iterateNum: number of AdaBoost iterations
    :return:
    """
    filename = '../data/CleanedTweetData.txt'
    contents, labels = load_save_data.loadContentsLabels(filename)
    # Cross-validation
    testWords = []
    testWordsType = []
    testCount = 1000
    ......
To change the number of iterations of the model, edit the argument iterateNum passed in the main block of main.py.
if __name__ == '__main__':
    AdaboostTrainingWithDS(iterateNum=40)
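To show what the iteration count controls, here is a minimal sketch of the generic AdaBoost loop. It is not the project's implementation: it uses simple threshold stumps on a one-dimensional toy feature as the weak learner instead of the Bayes classifier, and all function names and the toy data are assumptions. Each round trains a weak learner on the weighted samples, gives it a vote alpha based on its weighted error, and increases the weights of the samples it misclassified.

```python
import math


def stump_predict(x, threshold, polarity):
    """Predict polarity for x >= threshold and the opposite sign otherwise."""
    return polarity if x >= threshold else -polarity


def train_stump(xs, ys, weights):
    """Return (weighted error, threshold, polarity) of the best threshold stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for w, x, y in zip(weights, xs, ys)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best


def adaboost_train(xs, ys, iterate_num):
    """Run up to iterate_num boosting rounds; labels in ys must be +1 or -1."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(iterate_num):
        err, threshold, polarity = train_stump(xs, ys, weights)
        if err >= 0.5:
            break  # weak learner is no better than chance
        err = max(err, 1e-10)  # avoid division by zero for a perfect stump
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, threshold, polarity))
        # Raise the weight of misclassified samples, lower the rest, then normalize.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def adaboost_predict(ensemble, x):
    """Weighted vote of all stumps in the ensemble."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1


def training_errors(ensemble, xs, ys):
    return sum(adaboost_predict(ensemble, x) != y for x, y in zip(xs, ys))


# Toy data that no single stump can separate.
toy_xs = list(range(10))
toy_ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
one_round = adaboost_train(toy_xs, toy_ys, iterate_num=1)
three_rounds = adaboost_train(toy_xs, toy_ys, iterate_num=3)
print(training_errors(one_round, toy_xs, toy_ys), training_errors(three_rounds, toy_xs, toy_ys))
```

On this toy set a single stump misclassifies three samples, while three rounds classify all ten correctly. More iterations generally lower the training error at the cost of longer training time.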
To change the data source, edit the variable filename in main.py. The following example uses SMS.txt (shown here for Mac).
def AdaboostTrainingWithDS(iterateNum):
    filename = '../data/SMS.txt'