This project uses the Phishing Websites dataset from the UCI Machine Learning Repository. The objective is to identify whether a website is a phishing website or not.
The dataset has 31 columns: 30 features and 1 target. It contains 2456 observations in total. I used 75% of the observations (1843) as the training set and the remaining 613 as the test set. Here is a list of all the attributes in the dataset, along with their possible values and the column names used:
Attributes | Values | Column Name |
---|---|---|
Having IP Address | { 1,0 } | has_ip |
Having long url | { 1,0,-1 } | long_url |
Uses ShortningService | { 0,1 } | short_service |
Having '@' Symbol | { 0,1 } | has_at |
Double slash redirecting | { 0,1 } | double_slash_redirect |
Having Prefix Suffix | { -1,0,1 } | pref_suf |
Having Sub Domain | { -1,0,1 } | has_sub_domain |
SSLfinal State | { -1,1,0 } | ssl_state |
Domain registeration length | { 0,1,-1 } | long_domain |
Favicon | { 0,1 } | favicon |
Is standard Port | { 0,1 } | port |
Uses HTTPS token | { 0,1 } | https_token |
Request_URL | { 1,-1 } | req_url |
Abnormal URL anchor | { -1,0,1 } | url_of_anchor |
Links_in_tags | { 1,-1,0 } | tag_links |
SFH | { -1,1 } | SFH |
Submitting to email | { 1,0 } | submit_to_email |
Abnormal URL | { 1,0 } | abnormal_url |
Redirect | { 0,1 } | redirect |
on mouseover | { 0,1 } | mouseover |
Right Click | { 0,1 } | right_click |
popUp Window | { 0,1 } | popup |
Iframe | { 0,1 } | iframe |
Age of domain | { -1,0,1 } | domain_age |
DNS Record | { 1,0 } | dns_record |
Web traffic | { -1,0,1 } | traffic |
Page Rank | { -1,0,1 } | page_rank |
Google Index | { 0,1 } | google_index |
Links pointing to page | { 1,0,-1 } | links_to_page |
Statistical report | { 1,0 } | stats_report |
Result | { 1,-1 } | target |
Attributes with a binary value space generally denote the absence or presence of the respective attribute. Attributes with three possible values generally represent a strength (low, medium, high).
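The codebook can also serve as a sanity check on loaded rows. The following is a minimal Python sketch, not part of the repository: the `CODEBOOK` dictionary copies only four rows of the table above, and the `invalid_columns` helper and the sample rows are hypothetical names and data introduced for illustration.

```python
# Allowed values per column, copied from four rows of the codebook above.
# (Only a subset is shown; the full table has 31 columns.)
CODEBOOK = {
    "has_ip": {1, 0},
    "long_url": {1, 0, -1},
    "ssl_state": {-1, 1, 0},
    "target": {1, -1},
}


def invalid_columns(row, codebook=CODEBOOK):
    """Return the sorted names of columns whose value is missing
    or falls outside the set allowed by the codebook."""
    return sorted(
        column
        for column, allowed in codebook.items()
        if row.get(column) not in allowed
    )


# Hypothetical rows, made up for this example.
good_row = {"has_ip": 1, "long_url": -1, "ssl_state": 0, "target": -1}
bad_row = {"has_ip": 2, "long_url": 0, "ssl_state": 1, "target": 1}

print(invalid_columns(good_row))
print(invalid_columns(bad_row))
```

A check like this would catch a mis-parsed CSV or a shifted column before any model is trained.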
Identification of possible phishing websites is done in R with `caret`.
- The R script phishing.R first loads the required libraries and the dataset from the phishing.csv file
- Column names are set using the `names` array (as shown in the codebook above)
- The dataset is then split into training and test sets using caret's `createDataPartition` method
- Three different models are then trained on the training set: boosted logistic regression, SVM with RBF kernel, and tree bag
- For each model, we compute the `confusionMatrix` after predicting the samples from the test set
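To illustrate the split and evaluation steps above outside of R, here is a minimal, standard-library Python sketch. It is not the code in phishing.R: `stratified_split`, `confusion_matrix`, and `accuracy` are hypothetical helpers, the labels are toy data, and the "model" is just a majority-class guess. It assumes a class-wise (stratified) 75/25 split, which is how `createDataPartition` behaves for a categorical outcome.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, train_fraction=0.75, seed=42):
    """Return (train_idx, test_idx), sampling each class separately so
    both sets keep roughly the original class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(actual, predicted))


def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels."""
    matches = sum(a == p for a, p in zip(actual, predicted))
    return matches / len(actual)


# Toy labels using the same encoding as the `target` column (1 / -1).
labels = [1] * 12 + [-1] * 8
train_idx, test_idx = stratified_split(labels)

# Stand-in "model": always predict the most common training label.
majority = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
actual = [labels[i] for i in test_idx]
predicted = [majority] * len(actual)

matrix = confusion_matrix(actual, predicted)
print(len(train_idx), len(test_idx))
print(dict(matrix))
print(accuracy(actual, predicted))
```

The confusion matrix here is keyed by `(actual, predicted)` pairs. A real run would replace the majority-class guess with predictions from each trained model.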
- Ipython Notebooks\ - contains the IPython notebooks used with BigML and to partition the train and test sets
- Datasets\ - contains the CSV data files used in BigML and the R script
- attributes.txt - contains information about the attributes in the dataset
- phishing.R - R script that applies the treebag model (similar to the BigML ensemble)
- Conclusion.pdf - answer to the question "do you think these predictions are good?"
- BigML_classification.py - Python script for calling and running the ensemble model on the BigML API
- BigML_summary.txt - summary of the BigML model
I was able to get 96.4% accuracy with the `treebag` model. Here is a plot of the variable importance in the tree bag model.