SMOGN is creating a new class for target #38

Open
purp172 opened this issue Nov 28, 2022 · 2 comments

purp172 commented Nov 28, 2022

Hey!
Any idea why the algorithm is creating a new class (value) for my target? I'm analyzing the Room_Occupancy_Dataset from Kaggle. In this dataset the target has only four occupancy values (0, 1, 2, or 3 people in the room), but the model is expected to predict cases with more than 3 people as well. SMOGN is not balancing the data correctly: the majority class (0) is left unchanged, and the minority classes (1, 2, 3) are not over-sampled. On top of that, it creates an extra value (4). I don't know if this is a bug, but I hope you can help me fix it. This is my 2D array:

rg_mtrx = [
    [0, 0, 0],  ## under-sample ("majority")
    [1, 1, 0],  ## over-sample ("minority")
    [2, 1, 0],  ## over-sample ("minority")
    [3, 1, 0],  ## over-sample ("minority")
]

## conduct smogn
balanced_smogn = smogn.smoter(
    
    ## main arguments
    data = df,            ## pandas dataframe
    y = 'Room_Occupancy_Count', ## string ('header name')
    k = 5,                    ## positive integer (k < n)
    pert = 0.02,              ## real number (0 < R < 1)
    samp_method = 'extreme',  ## string ('balance' or 'extreme')
    drop_na_col = False,       ## boolean (True or False)
    drop_na_row = False,       ## boolean (True or False)
    replace = True,          ## boolean (True or False)

    ## phi relevance arguments
    rel_thres = 0.50,         ## real number (0 < R < 1)
    rel_method = 'manual',    ## string ('auto' or 'manual')
    # rel_xtrm_type = 'both', ## unused (rel_method = 'manual')
    # rel_coef = 1.50,        ## unused (rel_method = 'manual')
    rel_ctrl_pts_rg = rg_mtrx ## 2d array (format: [x, y])
)
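
One likely reason values outside {0, 1, 2, 3} can appear: regression over-samplers in the SMOTER family build synthetic rows by interpolating between two existing rows, target included, and SMOGN can additionally add Gaussian noise. The sketch below is a simplified, standard-library illustration of that idea, not the smogn package's own code. The helper name `synth_regression_sample`, the feature values, and the use of a single interpolation fraction for both features and target are assumptions made for illustration. It does not reproduce the exact value 4 from this report, and how the package rounds or casts synthetic targets for integer columns is not covered here.

```python
import random


def synth_regression_sample(a, b, rng, noise_sd=0.0):
    """Create one synthetic (features, target) pair between seeds a and b.

    a and b are (features, target) tuples. Simplification: the same random
    fraction is used to interpolate the features and the target.
    """
    frac = rng.random()
    feats = [xa + frac * (xb - xa) for xa, xb in zip(a[0], b[0])]
    target = a[1] + frac * (b[1] - a[1])
    if noise_sd > 0:
        # Optional Gaussian perturbation; this can push the target
        # outside the range spanned by the two seeds.
        target += rng.gauss(0.0, noise_sd)
    return feats, target


rng = random.Random(0)
seed_a = ([20.0, 400.0], 1)  # hypothetical row with 1 person in the room
seed_b = ([24.0, 900.0], 3)  # hypothetical row with 3 people in the room

samples = [synth_regression_sample(seed_a, seed_b, rng) for _ in range(5)]
targets = [t for _, t in samples]

original_values = {0, 1, 2, 3}
new_values = [t for t in targets if t not in original_values]
print(targets)
print(len(new_values), "synthetic targets are not in the original value set")
```

Without noise, every synthetic target lies between the two seed targets, so it is usually a non-integer. Any rounding applied afterwards decides which integer values end up in the resampled column.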
nickkunz (Owner) commented

Hello @Diogo-da-Silva-Rebelo, SMOGN was developed for regression. It seems like your problem is a classification one? If that is the case, then SMOGN would not be useful. You may want to see if SMOTE is more appropriate. Thank you.
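
For contrast, here is a minimal sketch of why a SMOTE-style over-sampler cannot invent a new class: only the features are interpolated between two rows of the same minority class, and the label is copied unchanged. This is a simplified illustration using the standard library, not the code of any particular SMOTE implementation. The helper name `smote_like_sample` and the feature values are hypothetical.

```python
import random


def smote_like_sample(feats_a, feats_b, label, rng):
    """Interpolate features between two same-class rows; keep the label."""
    frac = rng.random()
    feats = [xa + frac * (xb - xa) for xa, xb in zip(feats_a, feats_b)]
    return feats, label


rng = random.Random(1)
# Two hypothetical rows that both belong to class 2.
row_a = [21.0, 500.0]
row_b = [23.0, 700.0]

synthetic = [smote_like_sample(row_a, row_b, 2, rng) for _ in range(4)]
labels = {label for _, label in synthetic}
print(labels)  # only the original class label appears
```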


purp172 commented Nov 29, 2022

> Hello @Diogo-da-Silva-Rebelo, SMOGN was developed for regression. It seems like your problem is a classification one? If that is the case, then SMOGN would not be useful. You may want to see if SMOTE is more appropriate. Thank you.

Hello, @nickkunz! Thank you for responding. I don't think that's the case: I want to predict the number of people in the room, not a specific class (such as whether or not the room is occupied). In fact, the target can take many values, not just a restricted set. However, the target values must be integers, because we can't have 1.2 people in the room :) So it is a regression problem. When I said that the dataset has only four values, I did not mean that other values can't appear, for instance in my test dataset. The algorithm keeps every row with target = 0, even though that is the majority value, and it is not balancing the data, since the other values remain untouched. What are your thoughts?
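
Since the target here is an integer count, one possible workaround (an assumption for illustration, not a smogn feature) is to post-process the resampled target by rounding to the nearest integer and clipping at zero, then compare the value distribution before and after resampling to check whether anything was actually balanced. The helper names `to_valid_counts` and `distribution` and the sample values below are hypothetical.

```python
from collections import Counter


def to_valid_counts(targets, lower=0):
    """Round targets to the nearest integer and clip them at `lower`.

    Note: Python's round() rounds exact halves to the nearest even
    integer (round(2.5) == 2).
    """
    return [max(lower, int(round(t))) for t in targets]


def distribution(targets):
    """Return {value: number of rows}, sorted by value."""
    return dict(sorted(Counter(targets).items()))


# Hypothetical original target column: heavily skewed towards 0.
original = [0] * 8 + [1, 2, 3]
# Hypothetical synthetic targets produced by a regression over-sampler.
synthetic = [0.6, 1.4, 2.49, 2.51, 3.2, -0.3]

rounded = to_valid_counts(synthetic)
before = distribution(original)
after = distribution(original + rounded)
print(before)
print(after)
```

If the two printed distributions are identical for the real data, the resampler added or removed no rows, which would match the behaviour described in this issue.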
