This repository contains a pipeline for processing the Gotham network traffic dataset, including feature extraction, feature cleaning, and data labelling. The pipeline is designed for extensibility and reproducibility.
- Before running the pipeline, ensure you have Python 3.8+ and TShark 4.2.2 installed globally on your computer. If not, you can get Python here and TShark here. (A quick check that both are available is sketched after the setup steps.)
- Then, clone the repository to your PC:
$ git clone https://github.com/othmbela/gotham-network-packet-labeller.git
- cd into your cloned repository:
$ cd gotham-network-packet-labeller
- Finally, initialise the project:
$ make init
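If you want to confirm that the prerequisites are visible before running the pipeline, the following is a minimal sketch of such a check. It assumes tshark is on your PATH; the exact version string printed may differ between platforms.

```python
import shutil
import subprocess
import sys

# Check that Python 3.8+ is being used.
if sys.version_info < (3, 8):
    raise SystemExit("Python 3.8 or newer is required.")

# Check that tshark is installed and on the PATH.
tshark = shutil.which("tshark")
if tshark is None:
    raise SystemExit("tshark not found; install TShark 4.2.2 and add it to your PATH.")

# Print the installed tshark version for a quick manual comparison against 4.2.2.
version = subprocess.run([tshark, "--version"], capture_output=True, text=True)
print(version.stdout.splitlines()[0])
```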
The pipeline is divided into the following stages:
- Feature Extraction: Converts raw network traffic data (e.g., pcap files) into feature datasets.
- Feature Cleaning: Cleans and processes extracted features to ensure consistency.
- Data Labelling: Labels the cleaned datasets with attack and benign traffic labels.
- Full Pipeline: Executes all steps sequentially.
You can run each stage of the pipeline individually using the Makefile. This allows you to perform specific steps as needed:
Feature Extraction:
make extract_features
This will extract features from the raw network traffic data in data/raw/ and write the resulting datasets to data/extracted_features/.
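Under the hood, this target presumably invokes scripts/run_extraction.py. Packet-level feature extraction from pcap files is typically done by driving tshark; the sketch below is only an illustration of that idea, and the field list, separator, and output naming are assumptions rather than the repository's actual configuration.

```python
import subprocess
from pathlib import Path

# Hypothetical subset of tshark fields; the real pipeline may extract many more.
FIELDS = ["frame.time_epoch", "ip.src", "ip.dst", "ip.proto", "frame.len"]

def extract_features(pcap_path: Path, out_csv: Path) -> None:
    """Run tshark on a single pcap and write one CSV row per packet."""
    cmd = ["tshark", "-r", str(pcap_path), "-T", "fields",
           "-E", "header=y", "-E", "separator=,"]
    for field in FIELDS:
        cmd += ["-e", field]
    with out_csv.open("w") as fh:
        subprocess.run(cmd, stdout=fh, check=True)

if __name__ == "__main__":
    out_dir = Path("data/extracted_features")
    out_dir.mkdir(parents=True, exist_ok=True)
    for pcap in Path("data/raw").glob("*.pcap"):
        extract_features(pcap, out_dir / f"{pcap.stem}.csv")
```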
Feature Cleaning:
make clean_features
This will clean and preprocess the extracted feature datasets, writing the results to data/cleaned_features/.
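As an illustration only, a cleaning pass of this kind usually normalises column names, drops duplicate or empty rows, and coerces types. The column names below are assumptions tied to the extraction sketch above, not the repository's actual cleaning rules.

```python
import pandas as pd
from pathlib import Path

def clean_features(in_csv: Path, out_csv: Path) -> None:
    """Illustrative cleaning pass: tidy column names, drop unusable rows, fix dtypes."""
    df = pd.read_csv(in_csv)
    # e.g. "ip.src" -> "ip_src"
    df.columns = [c.strip().lower().replace(".", "_") for c in df.columns]
    # Remove exact duplicates and fully empty rows.
    df = df.drop_duplicates().dropna(how="all")
    # Enforce a numeric type on the packet length column, if present.
    if "frame_len" in df.columns:
        df["frame_len"] = pd.to_numeric(df["frame_len"], errors="coerce")
    df.to_csv(out_csv, index=False)

if __name__ == "__main__":
    out_dir = Path("data/cleaned_features")
    out_dir.mkdir(parents=True, exist_ok=True)
    for csv in Path("data/extracted_features").glob("*.csv"):
        clean_features(csv, out_dir / csv.name)
```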
Data Labelling:
make label_data
This will label the cleaned datasets with the appropriate attack/benign classifications and write them to data/labeled_data/.
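Labelling in datasets of this kind is commonly done by matching packets against known attacker IP addresses and attack time windows recorded in the metadata. The sketch below assumes a hypothetical metadata/attacks.csv with attacker_ip, start, end, and label columns, which may not match the repository's actual metadata format.

```python
import pandas as pd
from pathlib import Path

def label_packets(features_csv: Path, attacks_csv: Path, out_csv: Path) -> None:
    """Mark packets whose source IP and timestamp fall inside a known attack window."""
    df = pd.read_csv(features_csv)
    attacks = pd.read_csv(attacks_csv)  # hypothetical columns: attacker_ip, start, end, label
    df["label"] = "benign"
    for _, atk in attacks.iterrows():
        in_window = (df["ip_src"] == atk["attacker_ip"]) & \
                    df["frame_time_epoch"].between(atk["start"], atk["end"])
        df.loc[in_window, "label"] = atk["label"]
    df.to_csv(out_csv, index=False)

if __name__ == "__main__":
    out_dir = Path("data/labeled_data")
    out_dir.mkdir(parents=True, exist_ok=True)
    for csv in Path("data/cleaned_features").glob("*.csv"):
        label_packets(csv, Path("metadata/attacks.csv"), out_dir / csv.name)
```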
To run all stages in sequence, execute the following command:
make run_pipeline
This will run feature extraction, feature cleaning, and data labelling one after the other, automating the entire pipeline.
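If you prefer not to go through make, the same sequence can be reproduced by invoking the three stage scripts from the scripts/ directory one after another, as sketched below. Any command-line arguments the scripts may require are assumptions not shown here.

```python
import subprocess
import sys

# Run the three pipeline stages in order, stopping at the first failure.
for script in ("scripts/run_extraction.py",
               "scripts/run_cleaning.py",
               "scripts/run_labelling.py"):
    print(f"Running {script} ...")
    subprocess.run([sys.executable, script], check=True)
```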
The pipeline expects the following directory structure:
├── bash_scripts/
│
├── data/
│   ├── raw/                  # Raw network traffic data (input)
│   ├── extracted_features/   # Extracted features (output from feature extraction)
│   ├── cleaned_features/     # Cleaned features (output from feature cleaning)
│   └── labeled_data/         # Labeled data (output from labelling)
│
├── features/
├── images/
├── metadata/
├── notebooks/
│
├── scripts/
│   ├── run_cleaning.py
│   ├── run_extraction.py
│   └── run_labelling.py
│
├── src/
│   ├── __init__.py
│   ├── feature_cleaner.py
│   ├── feature_extractor.py
│   ├── labeller.py
│   └── utils.py
│
├── venv/
├── .dockerignore
├── .gitignore
├── Dockerfile
├── Makefile
├── README.md
└── requirements.txt
All the experiments were conducted on a 64-bit Intel(R) Core(TM) i7-7500U CPU with 16 GB of RAM in a Windows 10 environment.
This project is released under the Apache 2.0 license.
Othmane Belarbi
If you find this code useful in your research, please cite this article as:
@misc{belarbi2025gothamdataset2025reproducible,
  title={Gotham Dataset 2025: A Reproducible Large-Scale IoT Network Dataset for Intrusion Detection and Security Research},
  author={Othmane Belarbi and Theodoros Spyridopoulos and Eirini Anthi and Omer Rana and Pietro Carnelli and Aftab Khan},
  year={2025},
  eprint={2502.03134},
  archivePrefix={arXiv},
  primaryClass={cs.CR},
  url={https://arxiv.org/abs/2502.03134},
}