Skip to content

Latest commit

 

History

History
130 lines (91 loc) · 5.37 KB

weka.org

File metadata and controls

130 lines (91 loc) · 5.37 KB

Weka

Download from the Weka website, click “Download” on the left, and select one of the “Stable book 3rd ed.” versions for your machine.

You may need to increase the memory available to Weka. On Windows, check out Weka’s page on the topic. On a Mac, use this command in the Terminal:

java -Xmx4096m -jar /Applications/weka-3-6-8.app/Contents/Resources/Java/weka.jar

k-means clustering

./images/weka-cluster-1.png

./images/weka-cluster-2.png

./images/weka-cluster-3.png

./images/weka-cluster-4.png

./images/weka-cluster-5.png

k-nearest neighbor classification

./images/weka-classify-1.png

./images/weka-classify-2.png

./images/weka-classify-3.png

./images/weka-classify-4.png

./images/weka-classify-5.png

Preprocessing text

As discussed in the text classification notes, text files typically need to be converted into “feature vectors” before machine learning algorithms can be applied. The most common feature vector is a vector where each dimension is a different word, and the value in that dimension is either 0/1 binary value (yes or no the word is in that document), an integer count (the frequency of the word in that document), or a real value (often the tf-idf score of that word in that document).

The ARFF files we will be working with will always look like this:

@relation some-description-of-the-data
@attribute contents string
@attribute class {ham,spam}

@data
'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...',ham
'Ok lar... Joking wif u oni...',ham
'Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&Cs apply 08452810075over18s',spam
'U dun say so early hor... U c already then say...',ham
'Nah I dont think he goes to usf, he lives around here though',ham
...

Of course, the classes (ham/spam) may change. These files are very easy to generate, should you actually wish to use text classification for your own purposes.

We’ll practice with the sms-spam.arff file, which has examples of SMS text msg ham and spam. This collection was produced by Tiago A. Almeida from the Federal University of Sao Carlos and José María Gómez Hidalgo from the R&D Department of Optenet. More information available here. You may find their paper interesting; it examines the performance of various machine learning algorithms on this data set.

Strings to binary feature vectors

./images/weka-preprocess-1.png

./images/weka-preprocess-2.png

./images/weka-preprocess-3.png

./images/weka-preprocess-4.png

./images/weka-preprocess-5.png

./images/weka-preprocess-6.png

./images/weka-preprocess-7.png

./images/weka-preprocess-8.png

Strings to frequency vectors

Follow the steps above (“Strings to binary feature vectors”) except use the following changes in the StringToWordVector filter:

./images/weka-preprocess-9.png

Strings to tf-idf vectors

Follow the steps above (“Strings to binary feature vectors”) except use the following changes in the StringToWordVector filter:

./images/weka-preprocess-10.png