Malicious web pages are designed to install malware on your system, disrupt computer operations, and, in many cases, steal personal information. Classifying these web pages is crucial for enhancing user safety and providing a secure browsing experience.
This project aims to classify web pages as Malicious [Bad] or Benign [Good]. Through extensive Exploratory Data Analysis (EDA) and Geospatial Data Analysis, valuable insights were derived to understand the data better. The dataset underwent feature engineering and preprocessing to ensure optimal model performance. Three machine learning models and one deep learning model are trained: XGBoost, Logistic Regression, Decision Tree, and a Deep Neural Network. The Deep Neural Network is implemented in PyTorch; the others are implemented with scikit-learn.
The dataset is taken from Mendeley Data. It contains features such as the raw webpage content, the geographical location, the JavaScript length, and the obfuscated JavaScript code of the webpage, and covers around 1.5 million web pages. A description of the whole dataset is provided at the linked page.
├── config
│ └── config.yaml
├── data
│ ├── dataset.txt
│ └── tableconvert_csv_pkcsig.csv
├── deployment
│ ├── config_loader.py
│ └── deployment.py
├── notebooks
│ ├── Exploratory Data Analysis.ipynb
│ └── Modelling.ipynb
├── output
│ ├── encoders
│ ├── models
│ └── scalers
├── scripts
│ ├── __init__.py
│ ├── config_loader.py
│ ├── model_dispatcher.py
│ ├── predict.py
│ ├── preprocessing.py
│ └── train.py
├── src
│ ├── __init__.py
│ ├── config_loader.py
│ ├── cross_val.py
│ ├── dataset.py
│ ├── default_accuracy.py
│ ├── domain_functions.py
│ ├── dumper.js
│ ├── eval_metrics.py
│   └── jsado.py
├── requirement.txt
└── LICENSE
- config: Configuration files for the whole project
  - config.yaml - Main configuration file
- data: Contains the input files for the project
  - tableconvert_csv_pkcsig.csv - Contains the ISO alpha-3 codes for the countries
  - dataset.txt - Link to download the dataset; download it and place the data in this folder
- deployment: Contains the deployment code for the project
  - config_loader.py - Code to load the configs from the config files
  - deployment.py - Deployment code for localhost using PyWebIO and Flask
- output/encoders and output/scalers: Contain all the saved LabelEncoder and StandardScaler files for preprocessing
  - content_len_ss.pkl - StandardScaler for the content length
  - geo_loc_encoder.pkl - LabelEncoder for the geographical location
  - https_encoder.pkl - LabelEncoder for the HTTPS feature
  - net_type_encoder.pkl - LabelEncoder for the network type
  - special_char_ss.pkl - StandardScaler for the special-character count
  - tld_encoder.pkl - LabelEncoder for the top-level domain
  - who_is_encoder.pkl - LabelEncoder for the who_is status
- output/models: Contains all the trained models [DNN, LR, DT, XG]
- notebooks: Contains all the notebooks (also available as a Kaggle Notebook)
- scripts: Contains the main scripts
  - config_loader.py - Code to load the configs from the config files
  - model_dispatcher.py - Contains the ML and DL models
  - predict.py - Python file for making predictions
  - preprocessing.py - Python file for preprocessing the dataset for training and testing
  - train.py - Main run file
- src: Contains the supporting code used by the main scripts
  - config_loader.py - Code to load the configs from the config files
  - eval_metrics.py - Evaluation metrics for training and testing
  - cross_val.py - Cross-validation code [StratifiedKFold]
  - dataset.py - Code for the custom dataset used by the PyTorch DNN model (see the sketch after this list)
  - domain_functions.py - Several functions to extract features of the dataset when a particular feature is not given
  - jsado.py, dumper.js - Code to find obfuscated JS code when the feature is not given (GitHub)
- requirement.txt: Packages required to run the project
- LICENSE: License file
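As a rough illustration of what src/dataset.py provides, a minimal custom PyTorch `Dataset` over the engineered features might look like the sketch below; the exact feature and label handling is an assumption, not the repository's code.

```python
# Hedged sketch of a custom PyTorch Dataset like the one in src/dataset.py;
# the exact feature/label handling is an assumption.
import torch
from torch.utils.data import Dataset

class WebpageDataset(Dataset):
    def __init__(self, features, targets):
        # features: 2-D array of engineered features; targets: 1-D labels (0/1)
        self.features = torch.tensor(features, dtype=torch.float32)
        self.targets = torch.tensor(targets, dtype=torch.float32)

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        return self.features[idx], self.targets[idx]
```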
EDA and GDA are done on the dataset to extract as much insight as possible and to engineer features accordingly, such as the distribution of malicious web pages around the world on a choropleth map, kernel density estimation of the JavaScript length, and more. The exploration notebook is in the notebooks folder.
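For illustration, a choropleth like the one in the notebook can be drawn with Plotly Express. The column names (geo_loc, label) and the column layout of the alpha-3 mapping file are assumptions based on the repository contents, not the notebook's exact code.

```python
# Hedged sketch: distribution of malicious pages per country on a choropleth.
# Column names (geo_loc, label, Country, Alpha-3 code) are assumptions.
import pandas as pd
import plotly.express as px

df = pd.read_csv("data/dataset.csv")                   # assumed file name
iso = pd.read_csv("data/tableconvert_csv_pkcsig.csv")  # country -> alpha-3 codes

# Count malicious pages per country and attach ISO alpha-3 codes for plotting.
counts = (df[df["label"] == "bad"]
          .groupby("geo_loc").size()
          .reset_index(name="malicious_count"))
counts = counts.merge(iso, left_on="geo_loc", right_on="Country", how="left")

fig = px.choropleth(counts, locations="Alpha-3 code",
                    color="malicious_count",
                    hover_name="geo_loc",
                    title="Distribution of malicious web pages by country")
fig.show()
```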
The data is preprocessed and feature-engineered to make it ready for modelling. First, several features are added to the dataset, such as the length of the content, the count of special characters in the raw content, and the network type (A, B, or C) derived from the IP address.
The categorical features are converted into numeric values, and the content length and the special-character count are normalized with scikit-learn's StandardScaler. Some features are removed and not used in training.
The preprocessing functions and code are in the src folder.
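A minimal sketch of the preprocessing described above is shown below; the column names (content, ip_add, geo_loc, ...) are assumptions rather than the repository's exact schema, and the actual implementation lives in scripts/preprocessing.py and src/.

```python
# Hedged sketch of the preprocessing steps described above; column names
# are assumptions, not the repo's exact schema.
import re
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

def network_class(ip: str) -> str:
    """Classify an IPv4 address as class A, B, or C by its first octet."""
    first = int(ip.split(".")[0])
    if first < 128:
        return "A"
    if first < 192:
        return "B"
    return "C"

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Derived features: content length, special-character count, network type.
    df["content_len"] = df["content"].str.len()
    df["special_char"] = df["content"].apply(
        lambda s: len(re.findall(r"[^A-Za-z0-9\s]", s)))
    df["net_type"] = df["ip_add"].apply(network_class)

    # Label-encode categorical features (encoders are persisted in output/encoders).
    for col in ["geo_loc", "tld", "who_is", "https", "net_type"]:
        df[col] = LabelEncoder().fit_transform(df[col])

    # Standard-scale the two numeric features (scalers persisted in output/scalers).
    for col in ["content_len", "special_char"]:
        df[col] = StandardScaler().fit_transform(df[[col]])
    return df
```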
A total of four models are used in the project: XGBoost, Logistic Regression, Decision Tree, and a Deep Neural Network. The models are trained, validated, and tested using 5-fold cross-validation (stratified k-fold). The structure of the models is given in the scripts/model_dispatcher.py file. The best-performing model was the XGBoost classifier, followed by the Deep Neural Network. The trained models are saved in the output/models folder and can be used for predictions or retrained with new features.
The notebook containing the modelling and the results is in the notebooks folder.
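The fold assignment is presumably along these lines (the project's own version is in src/cross_val.py); a minimal sketch assuming a label column:

```python
# Hedged sketch of stratified 5-fold assignment, similar in spirit to
# src/cross_val.py; the 'label' column name is an assumption.
import pandas as pd
from sklearn.model_selection import StratifiedKFold

def create_folds(df: pd.DataFrame, n_splits: int = 5) -> pd.DataFrame:
    df = df.sample(frac=1, random_state=42).reset_index(drop=True)  # shuffle
    df["kfold"] = -1
    skf = StratifiedKFold(n_splits=n_splits)
    for fold, (_, valid_idx) in enumerate(skf.split(X=df, y=df["label"])):
        df.loc[valid_idx, "kfold"] = fold  # rows in valid_idx belong to this fold
    return df
```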
Install all the required libraries listed in the requirement.txt file.
Download the dataset from the link in dataset.txt and place the data in the data/ folder.
Run the following commands from the main directory of the project to train the models and make predictions with deployment.
New models can be trained with train.py using different folds.
```
python3 scripts/train.py --folds [fold] --model [model name]
```
- folds - fold number in the range [0, 4]
- model - one of [xg, dt, lr, dnn]
  - dnn - Deep Neural Network
  - xg - XGBoost
  - dt - Decision Tree
  - lr - Logistic Regression
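Internally, the --model flag presumably maps to an entry in scripts/model_dispatcher.py; a sketch of such a dispatcher follows (the hyperparameters are illustrative assumptions, not the repository's values).

```python
# Hedged sketch of a model dispatcher like scripts/model_dispatcher.py;
# hyperparameters are illustrative assumptions.
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

models = {
    "lr": LogisticRegression(max_iter=1000),
    "dt": DecisionTreeClassifier(criterion="gini"),
    "xg": XGBClassifier(n_estimators=200, max_depth=6),
    # "dnn" would return the PyTorch network defined for the project
}
```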
Predictions can be made with the trained models using predict.py:
```
python3 scripts/predict.py --path [path] --model [model]
```
- path - path to the testing data
- model - the trained model in the output/models folder [xg, dt, dnn, lr]
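Under the hood, prediction amounts to loading a saved model together with the fitted encoders/scalers and applying them to the new data. A minimal sketch follows; the file names track the output/ layout above, but the serialization format is an assumption.

```python
# Hedged sketch of prediction with a persisted model; file names follow the
# output/ layout above, but the exact serialization format is an assumption.
import joblib
import pandas as pd

model = joblib.load("output/models/xg.pkl")               # assumed file name
scaler = joblib.load("output/scalers/content_len_ss.pkl")

df = pd.read_csv("path/to/test.csv")                      # preprocessed test data
df["content_len"] = scaler.transform(df[["content_len"]])
predictions = model.predict(df.drop(columns=["label"], errors="ignore"))
```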
Run deployment/deployment.py. The default browser will open the locally deployed app at 'http://localhost:5000/maliciousWPC'. The app takes the following inputs:
- URL - URL of the web page
- Geographical Location - geolocation of the web page (choose 'other' if unknown; the code will extract it)
- IP Address - IP address of the web page (leave as is if unknown; the code will extract it)
- Top Level Domain - TLD of the web page (choose 'other' if unknown; the code will extract it)
- Prediction model - the model to be used for prediction
The output page contains the WHOIS information of the web page and the prediction: Malicious or Benign.
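For reference, the PyWebIO-on-Flask pattern used by deployment/deployment.py generally looks like the sketch below; the form fields and prediction logic are simplified placeholders, and the real app also runs the preprocessing and models described above.

```python
# Hedged sketch of serving a PyWebIO app on Flask at /maliciousWPC, as
# deployment/deployment.py does; inputs and prediction logic are simplified.
from flask import Flask
from pywebio.input import input, select
from pywebio.output import put_text
from pywebio.platform.flask import webio_view

app = Flask(__name__)

def classify_page():
    # Collect the inputs described above (URL, model choice, ...).
    url = input("URL of the web page")
    model_name = select("Prediction model", ["xg", "dt", "lr", "dnn"])
    # ... extract features, load the chosen model, and predict here ...
    put_text(f"Prediction for {url} using {model_name}: Benign")  # placeholder

# Mount the PyWebIO app on the Flask route used by the project.
app.add_url_rule("/maliciousWPC", "webio_view", webio_view(classify_page),
                 methods=["GET", "POST", "OPTIONS"])

if __name__ == "__main__":
    app.run(host="localhost", port=5000)
```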