Question: Combining these with Undersampling #62

BradKML · 2022-09-15T01:49:47Z

BradKML
Sep 15, 2022

SMOTE variants can be used with Undersamplers to speed up classification of imbalanced datasets. However oversampling normally precedes undersampling. Is it possible to generate minority samples that are less than the majority?
scikit-learn-contrib/imbalanced-learn#925

gykovacs · 2022-09-15T05:56:50Z

gykovacs
Sep 15, 2022
Maintainer

For numerous oversampling techniques it is certainly possible. There are a bunch of oversampling algorithms, e.g. SMOTE_PSO, etc., which do optimize the number of samples being generated. With these techniques it is up to the algorithm how many minority samples will be generated in the end. However, in many cases one can set the number of samples to be generated through the proportion parameter of the oversampling class.

Namely, let N_min and N_maj denote the number of minority and majority samples, thus, the difference is N_maj - N_min. The proportion parameter specifies the number of samples to be generated in terms of this difference. Particularly, proportion * (N_maj - N_min) samples will be generated. For example, if proportion is set to 1, then the class label distribution will be equalized as the number of minority samples will match the number of majority samples after oversampling. If proportion is set to less than 1, then less samples are generated.

If you want to generate a certain number of samples, for example, 10 additional minority samples are desired, then you can set the proportion parameter to 10 / (N_maj - N_min).

The proportion parameter is supported by more than 60 oversampling techniques in the smote-variants package.

0 replies

BradKML · 2022-09-16T04:58:47Z

BradKML
Sep 16, 2022
Author

Sorry for asking, but in the API doc it is not noted which of the 80+ oversampling techniques supports proportion.

Only noted it can be used https://smote-variants.readthedocs.io/en/latest/examples.html
Does not note what parameter is supported https://smote-variants.readthedocs.io/en/latest/oversamplers.html
Lack of notes for combining oversamplers and filters https://smote-variants.readthedocs.io/en/latest/noise_filters.html

If you want to generate a certain number of samples, for example, 10 additional minority samples are desired, then you can set the proportion parameter to 10 / (N_maj - N_min).

What if I want to 5x or 10x minority samples before using an undersampler?

0 replies

gykovacs · 2022-09-16T07:20:09Z

gykovacs
Sep 16, 2022
Maintainer

That's a good point. I have just created a release (0.7.1) with an additional query function 'get_proportion_oversamplers' to get all oversampler classes with proportion parameters:

import smote_variants as sv

prop_oversamplers = sv.get_proportion_oversamplers() # list of all oversampler classes with proportion parameters

Also please note that despite having a proportion parameter, it might be inaccurate as some oversampling techniques change the number of majority samples (e.g. by noise filtering). Those which use proportion accurately (do not change the majority samples) are exactly the ones which are suitable for multiclass oversampling. You can query these by

import smote_variants as sv

extensive_oversamplers = sv.get_multiclass_oversamplers() # the list of all oversamplers having a proportion parameter and only extending the set of minority samples (leaving the majority samples intact)

Regarding the combination of oversamplers and filters, it is completely up to the user how he combines them. There are some oversampling techniques which inherently contain some noise filter (like SMOTE_TomekLinks). As these noise filters are used in multiple oversampling techniques, they have been put into a separate module for the ease of reuse. However, one can use these prior to oversampling or on the result of oversampling without any restriction, any pipeline of noise filters and oversampling techniques can be constructed.

To generate minority samples, say, M times the original N_min, one needs to set the proportion parameter to M*N_min/(N_maj - N_min).

4 replies

BradKML Sep 16, 2022
Author

@gykovacs in that case the three major forms: naive oversamplers, proportion oversamplers, and noise filtered oversamplers, should get different treatment, as I have considered converting the proportion oversampler into noise filtered oversamplers by combining two different algorothms. e.g. SMOTE_TomekLinks as SMOTE (with proportion control) with TomekLinks.
The reason to do this is to not explode the size of the minority class to the majority class but somewhere in between to prevent extending classification time.

gykovacs Sep 16, 2022
Maintainer

Well, the landscape is a little bit more complicated than that. The concept of noise filtering (removal of some training vectors) is an integral part of many oversampling techniques. Some of these noise filtering concepts are present in the literature, e.g. TomekLinkRemoval, these appear as separate classes in the smote_variants package. But still, other noise filtering algorithms are integral parts of many oversamplers, e.g. SYMPROD has its own concept and algorithm for noise removal. However, as these noise removal algorithms has not been published in the literature separately and are a little bit ad-hoc in many cases, they do not appear as separate classes.

The oversampling techniques are categorized in many ways, including whether or not they contain any form of noise removal. These categories are available in the categories member of each oversampling class. For exmaple,

> import smote_variants as sv

> sv.SMOTE.categories

['SO', 'Ex', 'CD']

The resolution of these categories is available in the OverSamplingBase base class, for the sake of completeness:

    cat_noise_removal = 'NR'
    cat_dim_reduction = 'DR'
    cat_uses_classifier = 'Clas'
    cat_sample_componentwise = 'SCmp'
    cat_sample_ordinary = 'SO'
    cat_sample_copy = 'SCpy'
    cat_memetic = 'M'
    cat_density_estimation = 'DE'
    cat_density_based = 'DB'
    cat_extensive = 'Ex'
    cat_changes_majority = 'CM'
    cat_uses_clustering = 'Clus'
    cat_borderline = 'BL'
    cat_application = 'A'
    cat_metric_learning = 'CD'

So, if you are interested in all oversampling techniques applying any form of noise filtering, you can query them like

import smote_variants as sv

oversamplers = sv.get_all_oversamplers()

noise_removal_os = [o for o in oversamplers if sv.base.OverSamplingBase.cat_noise_removal in o.categories]

Please note that most of the oversampling techniques are actually some sort of oversampling pipelines, they might contain clustering, noise removal prior to oversampling, noise removal after oversampling, dimensionality reduction, etc., and all of these can be embedded into some memetic optimization. Therefore, it is very hard to categorize oversampling techniques as "naive oversamplers, proportion oversamplers and noise filtered oversamplers".

BradKML Sep 16, 2022
Author

I guess cat_changes_majority = 'CM' and cat_noise_removal = 'NR' for warping the majority class (and lacking control in undersampling), and cat_sample_copy = 'SCpy' for quality control, but otherwise the others looks okay. Thanks for the clarification

gykovacs Sep 16, 2022
Maintainer

Actually, the majority class is changed in many ways: there are methods which literally shift the majority points, others relabel minority vectors as majority, etc. Those methods which dont have the 'CM' flag keep the majority set intact, but might change the minority set, e.g. noise removal. If both 'CM' and 'NR' are excluded it means that the original dataset is kept intact and newly generated samples are only appended to the original dataset. These are the ones which I refer to as 'extensive' methods.

The 'SCpy' category encodes another property: most of the techniques use the original SMOTE scheme, that is, at some certain point they will generate samples on edges between neighboring minority samples. However, there are some more conservative techniques, which only add copies of some certain minority samples to the minority set. Effectively, these ones are equivalent to assign higher weights to some minority samples. Long story short, those techniques with 'SCpy' wont generate new samples, they simply add replicas of existing samples to the training set.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Question: Combining these with Undersampling #62

{{title}}

Replies: 3 comments 4 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Question: Combining these with Undersampling #62

BradKML Sep 15, 2022

Replies: 3 comments · 4 replies

gykovacs Sep 15, 2022 Maintainer

BradKML Sep 16, 2022 Author

gykovacs Sep 16, 2022 Maintainer

BradKML Sep 16, 2022 Author

gykovacs Sep 16, 2022 Maintainer

BradKML Sep 16, 2022 Author

gykovacs Sep 16, 2022 Maintainer

BradKML
Sep 15, 2022

Replies: 3 comments 4 replies

gykovacs
Sep 15, 2022
Maintainer

BradKML
Sep 16, 2022
Author

gykovacs
Sep 16, 2022
Maintainer

BradKML Sep 16, 2022
Author

gykovacs Sep 16, 2022
Maintainer

BradKML Sep 16, 2022
Author

gykovacs Sep 16, 2022
Maintainer