This repository provides the code needed to produce a graphical analysis of the best sampling methods for imbalanced classification. It builds a synthetic binary data set with a chess board distribution, where the number of instances and the imbalance ratio are set through the program parameters. This data set is then preprocessed with the most relevant sampling methods available in Python and in R packages on CRAN. The results are plotted together with the classification surfaces inferred by scikit-learn's decision tree.
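As a rough illustration of the last step, a classification surface can be obtained by fitting a decision tree and evaluating it on a dense grid. The snippet below is a minimal sketch, not the repository's plotting code: the toy balanced 4x4 board, the grid resolution, and the variable names are all assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy balanced chess board on [0, 1]^2 (assumed layout: 4x4 squares,
# class given by the parity of the square's row + column index).
rng = np.random.default_rng(0)
X = rng.random((1000, 2))
y = np.floor(X * 4).astype(int).sum(axis=1) % 2

# Fit a fully grown decision tree on the points.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Evaluate the tree on a 200x200 grid to obtain its classification surface.
xx, yy = np.meshgrid(np.linspace(0, 1, 200), np.linspace(0, 1, 200))
grid = np.column_stack([xx.ravel(), yy.ravel()])
surface = tree.predict(grid).reshape(xx.shape)

# `surface` can then be drawn, e.g. with matplotlib.pyplot.contourf(xx, yy, surface).
print(surface.shape)
```

Running the same steps on data resampled by each method gives one surface per method, which is what makes the methods visually comparable.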
The repository contains the following files:
- plot_synthetic.py generates the synthetic data and executes all the sampling methods of the imblearn package. Its parameters are as follows:
  - 1st parameter (div): shape of the chess board.
  - 2nd parameter (N): number of instances for the balanced dataset (N/2 for each class).
  - 3rd parameter (per): proportion of instances that make up the imbalanced data set (value in [0,1]).
- plot_syntheticMWMOTE.py does the same as plot_synthetic.py, but with the MWMOTE method.
- MWMOTE.py implements the MWMOTE method provided in its GitHub repo.
- smotesData.R executes other important over-sampling methods implemented in the R packages smotefamily and ROSE.
- plot_file.py plots the results obtained with smotesData.R; the generated files are passed as parameters.
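To make the data construction more concrete, here is a minimal sketch of how an imbalanced chess board data set could be generated. It is not the code of plot_synthetic.py: `make_chessboard` is a hypothetical helper, `cells` is the number of squares per side (how the script's `div` maps to it is not assumed here), and `per` is interpreted as the fraction of the minority class that is kept, which is only one possible reading of the parameter.

```python
import numpy as np

def make_chessboard(cells=4, n=1000, per=0.1, seed=0):
    """Hypothetical helper: imbalanced chess board data on [0, 1]^2."""
    rng = np.random.default_rng(seed)
    # Enumerate the squares and colour them by the parity of row + column.
    rows, cols = np.divmod(np.arange(cells * cells), cells)
    colour = (rows + cols) % 2
    X_parts, y_parts = [], []
    for label in (0, 1):
        # Assumption: class 0 keeps its n/2 points, class 1 (minority)
        # keeps only a fraction `per` of its n/2 points.
        size = n // 2 if label == 0 else int(round(per * n / 2))
        # Pick random squares of this colour, then a uniform point in each.
        idx = rng.choice(np.flatnonzero(colour == label), size=size)
        corner = np.column_stack([rows[idx], cols[idx]])
        X_parts.append((corner + rng.random((size, 2))) / cells)
        y_parts.append(np.full(size, label))
    return np.vstack(X_parts), np.concatenate(y_parts)

X, y = make_chessboard(cells=4, n=1000, per=0.1)
print(X.shape, np.bincount(y))
```

Sampling each class from its own squares guarantees the requested class sizes exactly, which a plain uniform draw over the whole board would not.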
Starting from a 4x4 chess board data set with 1000 instances and 10% of the minority class (div=5; N=1000; per=0.1), the following sampling methods are applied:
- ADASYN (imblearn package, default parameters)
- BLSMOTE (smotefamily R package, default parameters)
- DBSMOTE (smotefamily R package, default parameters)
- MWMOTE (MWMOTE GitHub repo, #Synthetic(N)=400)
- ROSE (ROSE R package, hmult.majo=0.1, hmult.mino=0.1)
- RSLS (smotefamily R package, default parameters)
- SLS (smotefamily R package, default parameters)
- SMOTE (imblearn package, default parameters)
- SMOTEENN (imblearn package, default parameters)
- SMOTETomek (imblearn package, default parameters)
- IHT (imblearn package, default parameters)
- NCL (imblearn package, n_neighbors=20)
- OSS (imblearn package, k=1, n_seeds_S=100)