This repository contains a pipeline for processing the Gotham network traffic dataset, including feature extraction, feature cleaning, and data labelling. The pipeline is designed for extensibility and reproducibility.
- Before running the pipeline, ensure you have Python 3.8+ and TShark 4.2.2 installed globally on your computer. If not, you can get Python here and TShark here. (A quick check that both are available is sketched after the setup steps.)
- Then, clone the repository to your PC:
$ git clone https://github.com/othmbela/gotham-network-packet-labeller.git
- cd into your cloned repository:
$ cd gotham-network-packet-labeller
- Finally, initialise the project:
$ make init
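If you want to confirm that the prerequisites are visible before running the pipeline, the following is a minimal sketch of such a check. It assumes tshark is on your PATH; the exact version string printed may differ between platforms.

```python
import shutil
import subprocess
import sys

# Check that Python 3.8+ is being used.
if sys.version_info < (3, 8):
    raise SystemExit("Python 3.8 or newer is required.")

# Check that tshark is installed and on the PATH.
tshark = shutil.which("tshark")
if tshark is None:
    raise SystemExit("tshark not found; install TShark 4.2.2 and add it to your PATH.")

# Print the installed tshark version for a quick manual comparison against 4.2.2.
version = subprocess.run([tshark, "--version"], capture_output=True, text=True)
print(version.stdout.splitlines()[0])
```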
The pipeline is divided into the following stages:
- Feature Extraction: Converts raw network traffic data (e.g., pcap files) into feature datasets.
- Feature Cleaning: Cleans and processes extracted features to ensure consistency.
- Data Labelling: Labels the cleaned datasets with attack and benign traffic labels.
- Full Pipeline: Executes all steps sequentially.
You can run each stage of the pipeline individually using the Makefile. This allows you to perform specific steps as needed:
Feature Extraction:
make extract_features
This will extract features from the raw network traffic data in data/raw/ and write the resulting datasets to data/extracted_features/.
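Under the hood, this target presumably invokes scripts/run_extraction.py. Packet-level feature extraction from pcap files is typically done by driving tshark; the sketch below is only an illustration of that idea, and the field list, separator, and output naming are assumptions rather than the repository's actual configuration.

```python
import subprocess
from pathlib import Path

# Hypothetical subset of tshark fields; the real pipeline may extract many more.
FIELDS = ["frame.time_epoch", "ip.src", "ip.dst", "ip.proto", "frame.len"]

def extract_features(pcap_path: Path, out_csv: Path) -> None:
    """Run tshark on a single pcap and write one CSV row per packet."""
    cmd = ["tshark", "-r", str(pcap_path), "-T", "fields",
           "-E", "header=y", "-E", "separator=,"]
    for field in FIELDS:
        cmd += ["-e", field]
    with out_csv.open("w") as fh:
        subprocess.run(cmd, stdout=fh, check=True)

if __name__ == "__main__":
    out_dir = Path("data/extracted_features")
    out_dir.mkdir(parents=True, exist_ok=True)
    for pcap in Path("data/raw").glob("*.pcap"):
        extract_features(pcap, out_dir / f"{pcap.stem}.csv")
```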
Feature Cleaning:
make clean_features
This will clean and preprocess the extracted feature datasets, writing the results to data/cleaned_features/.
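As an illustration only, a cleaning pass of this kind usually normalises column names, drops duplicate or empty rows, and coerces types. The column names below are assumptions tied to the extraction sketch above, not the repository's actual cleaning rules.

```python
import pandas as pd
from pathlib import Path

def clean_features(in_csv: Path, out_csv: Path) -> None:
    """Illustrative cleaning pass: tidy column names, drop unusable rows, fix dtypes."""
    df = pd.read_csv(in_csv)
    # e.g. "ip.src" -> "ip_src"
    df.columns = [c.strip().lower().replace(".", "_") for c in df.columns]
    # Remove exact duplicates and fully empty rows.
    df = df.drop_duplicates().dropna(how="all")
    # Enforce a numeric type on the packet length column, if present.
    if "frame_len" in df.columns:
        df["frame_len"] = pd.to_numeric(df["frame_len"], errors="coerce")
    df.to_csv(out_csv, index=False)

if __name__ == "__main__":
    out_dir = Path("data/cleaned_features")
    out_dir.mkdir(parents=True, exist_ok=True)
    for csv in Path("data/extracted_features").glob("*.csv"):
        clean_features(csv, out_dir / csv.name)
```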
Data Labelling:
make label_data
This will label the cleaned datasets with the appropriate attack/benign classifications and write them to data/labeled_data/.
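Labelling in datasets of this kind is commonly done by matching packets against known attacker IP addresses and attack time windows recorded in the metadata. The sketch below assumes a hypothetical metadata/attacks.csv with attacker_ip, start, end, and label columns, which may not match the repository's actual metadata format.

```python
import pandas as pd
from pathlib import Path

def label_packets(features_csv: Path, attacks_csv: Path, out_csv: Path) -> None:
    """Mark packets whose source IP and timestamp fall inside a known attack window."""
    df = pd.read_csv(features_csv)
    attacks = pd.read_csv(attacks_csv)  # hypothetical columns: attacker_ip, start, end, label
    df["label"] = "benign"
    for _, atk in attacks.iterrows():
        in_window = (df["ip_src"] == atk["attacker_ip"]) & \
                    df["frame_time_epoch"].between(atk["start"], atk["end"])
        df.loc[in_window, "label"] = atk["label"]
    df.to_csv(out_csv, index=False)

if __name__ == "__main__":
    out_dir = Path("data/labeled_data")
    out_dir.mkdir(parents=True, exist_ok=True)
    for csv in Path("data/cleaned_features").glob("*.csv"):
        label_packets(csv, Path("metadata/attacks.csv"), out_dir / csv.name)
```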
To run all stages in sequence, execute the following command:
make run_pipeline
This will run feature extraction, feature cleaning, and data labelling one after the other, automating the entire pipeline.
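If you prefer not to go through make, the same sequence can be reproduced by invoking the three stage scripts from the scripts/ directory one after another, as sketched below. Any command-line arguments the scripts may require are assumptions not shown here.

```python
import subprocess
import sys

# Run the three pipeline stages in order, stopping at the first failure.
for script in ("scripts/run_extraction.py",
               "scripts/run_cleaning.py",
               "scripts/run_labelling.py"):
    print(f"Running {script} ...")
    subprocess.run([sys.executable, script], check=True)
```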
The pipeline expects the following directory structure:
├── bash_scripts/
│
├── data/
│   ├── raw/                  # Raw network traffic data (input)
│   ├── extracted_features/   # Extracted features (output from feature extraction)
│   ├── cleaned_features/     # Cleaned features (output from feature cleaning)
│   └── labeled_data/         # Labeled data (output from labelling)
│
├── features/
├── images/
├── metadata/
├── notebooks/
│
├── scripts/
│   ├── run_cleaning.py
│   ├── run_extraction.py
│   └── run_labelling.py
│
├── src/
│   ├── __init__.py
│   ├── feature_cleaner.py
│   ├── feature_extractor.py
│   ├── labeller.py
│   └── utils.py
│
├── venv/
├── .dockerignore
├── .gitignore
├── Dockerfile
├── Makefile
├── README.md
└── requirements.txt
All the experiments were conducted on a 64-bit Intel(R) Core(TM) i7-7500U CPU with 16 GB of RAM in a Windows 10 environment.
This project is released under the Apache 2.0 license.
Othmane Belarbi
If you find this code useful in your research, please cite this article as:
@misc{belarbi2025gothamdataset2025reproducible,
  title={Gotham Dataset 2025: A Reproducible Large-Scale IoT Network Dataset for Intrusion Detection and Security Research},
  author={Othmane Belarbi and Theodoros Spyridopoulos and Eirini Anthi and Omer Rana and Pietro Carnelli and Aftab Khan},
  year={2025},
  eprint={2502.03134},
  archivePrefix={arXiv},
  primaryClass={cs.CR},
  url={https://arxiv.org/abs/2502.03134},
}