MuDBSCAN

A fast, exact, and scalable algorithm for DBSCAN clustering. This repository contains the implementation for the distributed spatial clustering algorithm proposed in the paper μDBSCAN: An Exact Scalable DBSCAN Algorithm for Big Data Exploiting Spatial Locality Link to paper - https://adityaas.github.io/documents/MuDBSCAN_CLUSTER19.pdf

We propose an extremely efficient way to compute neighbourhood queries that not only improves the average time complexity but exhibits super-linear speed up on large astronomical datasets. Using the distributed variant of our algorithm, we were able to cluster 1 billion 3D points in under 42 minutes

To cite our work please use

A. Sarma et al., "μDBSCAN: An Exact Scalable DBSCAN Algorithm for Big Data Exploiting Spatial Locality," 2019 IEEE International Conference on Cluster Computing (CLUSTER), Albuquerque, NM, USA, 2019, pp. 1-11, doi: 10.1109/CLUSTER.2019.8891020.

Setup

Clone the repository
Install dependencies (gcc/g++, open-mpi)
To run the distributed variant of the algorithm, a MPI cluster has to be setup

Running the algorithm

Re-format your input according to the template below and store the file in a folder datasets

<number_of_data_points>
<dimension>
<data_1_dim_1> <data_1_dim_2> ... <data_1_dim_dimension>
.
.
.
...

For example

Running the sequential algorithm

./runs.sh <dataset> <epsilon> <minpts> <MinDegree Rtree> <MaxDegree Rtree>

where
- <dataset> is the name of the file formatted according to 1.
- <epsilon> represents the neighbourhood parameter (Anything within epislon distance from given point is considered a neighbour)
- <minpts> represents the density parameter. It defines the minimum number of neighbours required for a point to be classified as `dense`
- <MinDegree Rtree> and <MaxDegree Rtree> hyper-parameters and correspond to the minimum and maximum degree of the custom defined μC-RTree

Running the distributed algorithm

./rund.sh <dataset> <epsilon> <minpts> <nodes> <hostfile> <MinDegree Rtree> <MaxDegree Rtree>

where <dataset>, <epsilon>, <minpts>, <MinDegree, Rtree>, <MaxDegree Rtree> are same as before and
- <nodes> number of nodes to use within the cluster
- <hostfile> list of nodes configured in the server (hostnames)

Overview

Overview of the Algorithm

Time complexity

Results

Proposed sequential algorithm compared with existing sequential clustering algorithms
Proposed algorithm compared with existing clustering algorithms on 32 nodes
Run-time split up across various steps in μDBSCAN Sequential algorithm split up

Distributed algorithm split up

Speed up across various steps in μDBSCAN
Peak memory consumption of μDBSCAN
Scalability of μDBSCAN

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
images		images
.gitignore		.gitignore
Data.cpp		Data.cpp
Data.h		Data.h
Def.h		Def.h
GList.cpp		GList.cpp
GList.h		GList.h
LICENSE		LICENSE
Makefile		Makefile
MuC.cpp		MuC.cpp
MuC.h		MuC.h
MuC_RTree.cpp		MuC_RTree.cpp
MuC_RTree.h		MuC_RTree.h
README.md		README.md
RList.cpp		RList.cpp
RList.h		RList.h
RTree.cpp		RTree.cpp
RTree.h		RTree.h
clustering.cpp		clustering.cpp
clustering.h		clustering.h
dataParallel.h		dataParallel.h
main.cpp		main.cpp
malloc_count.cpp		malloc_count.cpp
partition.cpp		partition.cpp
partition.h		partition.h
rund.sh		rund.sh
runs.sh		runs.sh
vectorc.cpp		vectorc.cpp
vectorc.h		vectorc.h

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MuDBSCAN

Setup

Running the algorithm

Overview

Overview of the Algorithm

Time complexity

Results

About

Releases

Packages

Languages

License

AdityaAS/MuDBSCAN

Folders and files

Latest commit

History

Repository files navigation

MuDBSCAN

Setup

Running the algorithm

Overview

Overview of the Algorithm

Time complexity

Results

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages