
Ss test mlpci #39

Closed
wants to merge 15 commits into from

Conversation

@ssorou1 (Collaborator) commented Jan 29, 2025

The bagging method for calculating confidence intervals (CIs) is implemented in fs_algo_train_eval for the rf and mlp models.

Additions

Removals

Changes

Testing

Screenshots

Notes

Todos

Checklist

  • PR has an informative and human-readable title
  • Changes are limited to a single goal (no scope creep)
  • Code can be automatically merged (no conflicts)
  • Code follows project standards (link if applicable)
  • Passes all existing automated tests
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future todos are captured in comments
  • Visually tested in supported browsers and devices (see checklist below 👇)
  • Project documentation has been updated (including the "Unreleased" section of the CHANGELOG)
  • Reviewers requested with the Reviewers tool ➡️

Testing checklist

Target Environment support

  • Windows
  • Linux
  • Browser

Accessibility

  • Keyboard friendly
  • Screen reader friendly

Other

  • Is usable without CSS
  • Is usable without JS
  • Flexible from small to large screens
  • No linting errors or warnings
  • JavaScript tests are passing

@ssorou1 ssorou1 requested a review from glitt13 January 29, 2025 22:06
@glitt13 glitt13 changed the base branch from main to dev February 3, 2025 23:32
@glitt13 (Collaborator) left a comment

Nice work Soroush! I liked how you were very thorough with changing fs_pred_algo.py after making changes to fs_algo_train_eval.py that would impact the downstream processing. Please review the suggestions, and once you've addressed those I'll give it another review. If I don't have immediate feedback from reading the code, I'll try running it.

ci = self.calculate_rf_uncertainty(rf, self.X_train, self.X_test)

# Calculating mlp uncertainty using Bootstrap Aggregating (Bagging)
n_models_rf = 10 # Number of bootstrap models
glitt13 (Collaborator):

It's best to make this a parameter that a user can assign rather than hard-code it into a function. We should add a parameter in the algo config yaml, with a default value set as 10 when reading in that yaml file using the AttrConfigAndVars() class. For example,
self.algo_config['rf'].get('n_models_rf_bootstrap',10)
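The suggested pattern can be sketched as follows; `algo_config` here is a hypothetical dict as if parsed from the algo config yaml, and the key names follow the comment above rather than the project's actual schema:

```python
# Hypothetical config dict, as if loaded from the algo config yaml
# (key names are assumed from the suggestion above)
algo_config = {"rf": {"max_depth": 5}}  # user did not set n_models_rf_bootstrap

# dict.get() supplies the default of 10 whenever the key is absent
n_models_rf = algo_config["rf"].get("n_models_rf_bootstrap", 10)

# When the user does set the key in the yaml, their value wins
algo_config_with_key = {"rf": {"n_models_rf_bootstrap": 25}}
n_models_rf_user = algo_config_with_key["rf"].get("n_models_rf_bootstrap", 10)
```

This keeps the hard-coded 10 as a fallback only, so a user can tune the number of bootstrap models without touching the function body.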

@ssorou1 (Collaborator, author) commented Feb 7, 2025:

Thank you very much Guy for the comments and advice! All the updates are applied to the ssorou1: ss_test_mapie branch:
  • xssa_algo_config.yaml is updated to include n_models_rf_bootstrap and n_models_mlp_bootstrap for the rf and mlp models, respectively.
  • fs_algo_train_eval.py is updated to read the aforementioned variables from the algo config yaml. Please refer to the rf_Bagging_ci and mlp_Bagging_ci functions.

rf.fit(X_train_resampled, y_train_resampled)

# Store predictions for the test set
rf_predictions.append(rf.predict(self.X_test))
glitt13 (Collaborator):

If you were to run train_algos() in a loop, does the rf_predictions object continue to grow, or does it reset within each loop? We'd want the latter behavior.

@ssorou1 (Collaborator, author) commented Feb 7, 2025:

The rf_predictions object in the newly added rf_Bagging_ci() is a local variable within the function, meaning it gets reinitialized as an empty list (rf_predictions = []) each time the function is called. Since train_algos() calls rf_Bagging_ci(), rf_predictions is reset in each loop iteration if train_algos() is run in a loop.
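The reset behavior described above can be illustrated with a minimal sketch; the function body below is a stand-in for rf_Bagging_ci(), not the actual project code:

```python
def rf_bagging_ci_sketch():
    # rf_predictions is local, so it is re-created as an empty list on
    # every call; repeated calls from a loop never accumulate predictions
    rf_predictions = []
    rf_predictions.append("pred")
    return len(rf_predictions)

# Calling the function three times, as train_algos() would in a loop
counts = [rf_bagging_ci_sketch() for _ in range(3)]
```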

rf_predictions = np.array(rf_predictions)
mean_pred = rf_predictions.mean(axis=0)
std_pred = rf_predictions.std(axis=0)
lower_bound = mean_pred - 1.96 * std_pred
glitt13 (Collaborator):

We should clarify that this is a 95% confidence interval (a good default!), and perhaps add in options for different confidence intervals (e.g. 90% and 99%).
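One way to support multiple intervals is to derive the z multiplier from the requested confidence level instead of hard-coding 1.96. This sketch uses the standard library's NormalDist and assumes the same normal approximation as the code above; the function name is illustrative:

```python
from statistics import NormalDist

def z_from_confidence(confidence_level: float = 0.95) -> float:
    """Two-sided z multiplier under the normal assumption (0.95 -> ~1.96)."""
    return NormalDist().inv_cdf((1.0 + confidence_level) / 2.0)

z90 = z_from_confidence(0.90)  # ~1.645
z95 = z_from_confidence(0.95)  # ~1.960
z99 = z_from_confidence(0.99)  # ~2.576
```

The bounds would then be `mean_pred ± z * std_pred`, with `z` computed from the user's configured confidence level.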

ssorou1 (Collaborator, author):

A great point, Guy! The 90%, 95%, and 99% CIs are applied in both the rf_Bagging_ci() and mlp_Bagging_ci() functions.

glitt13 (Collaborator):

Thanks Soroush! The specific intervals of interest could vary by user needs. Let's set a default of 95%, but can you also modify the config file so that a user may specify their desired confidence interval or intervals? If not already a feature, we'll have to be careful about how we save file output so that one can know which interval they're looking at.

@ssorou1 (Collaborator, author) commented Feb 7, 2025:

Thanks Guy! The yaml file is updated to include the confidence_level for each model. rf_Bagging_ci() and mlp_Bagging_ci() are updated to read the respective confidence levels from the yaml file and calculate the confidence interval based on the normal distribution assumption. The default is 95% as per your advice.

https://github.com/ssorou1/formulation-selector/blob/3bc23eacc6d54b692147f8cfded578fd8657c17e/pkg/fs_algo/fs_algo/fs_algo_train_eval.py#L1047-L1052
https://github.com/ssorou1/formulation-selector/blob/3bc23eacc6d54b692147f8cfded578fd8657c17e/scripts/eval_ingest/xssa/xssa_algo_config.yaml#L3-L6

std_pred = predictions.std(axis=0)
lower_bound = mean_pred - 1.96 * std_pred
upper_bound = mean_pred + 1.96 * std_pred

glitt13 (Collaborator):

The fact that these steps are exactly the same as the random forest makes me think we should instead put this all into a new function, and you pass in the model object as a generic Regressor object (e.g. MLPRegressor, RandomForestRegressor) as an argument in the new function. That means we could re-use all the same processing steps.
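The refactor suggested above could look roughly like the sketch below. The function name, signature, and the trivial stand-in regressor are illustrative assumptions; any estimator exposing fit()/predict() (e.g. MLPRegressor, RandomForestRegressor) could be passed in:

```python
import copy

import numpy as np

def bagging_ci(model, X_train, y_train, X_test, n_models=10, z=1.96, seed=0):
    """Bootstrap-aggregating CI bounds for any regressor with fit()/predict()."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_models):
        # Resample the training rows with replacement
        idx = rng.integers(0, len(X_train), size=len(X_train))
        m = copy.deepcopy(model)  # fresh, unfitted copy per bootstrap round
        m.fit(X_train[idx], y_train[idx])
        preds.append(m.predict(X_test))
    preds = np.asarray(preds)
    mean_pred = preds.mean(axis=0)
    half_width = z * preds.std(axis=0)
    return mean_pred - half_width, mean_pred + half_width

# Demo with a trivial stand-in regressor (only fit/predict are required)
class _MeanModel:
    def fit(self, X, y):
        self.mu_ = float(np.mean(y))
    def predict(self, X):
        return np.full(len(X), self.mu_)

X = np.arange(20, dtype=float).reshape(-1, 1)
y = np.arange(20, dtype=float)
lower, upper = bagging_ci(_MeanModel(), X, y, X[:5], n_models=25)
```

With scikit-learn estimators, `sklearn.base.clone` would be the more idiomatic way to obtain a fresh unfitted copy than `copy.deepcopy`.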

@ssorou1 (Collaborator, author) commented Feb 7, 2025:

rf_Bagging_ci() and mlp_Bagging_ci() are developed in "fs_algo_train_eval.py".

glitt13 (Collaborator):

Add pkg/proc.attr.hydfab/.RData to your own .gitignore file and this won't be a recurring problem.

ssorou1 (Collaborator, author):

Applied. Thanks!

# pipe = joblib.load(path_algo)
pipeline_with_ci = joblib.load(path_algo)
pipe = pipeline_with_ci['pipe'] # Assign the actual pipeline (pipe) to 'pipe'
rf_model = pipe.named_steps['randomforestregressor'] # Use the correct step name
glitt13 (Collaborator):

What happens when we haven't trained a randomforestregressor? This seems like it's too specific to a particular type of model. Ideally this should work with any model. It looks like since this relates to the forestci package, we should add in an 'if' statement specific to random forest while looping over all possible algorithm possibilities.
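The suggested 'if' statement could be implemented by inspecting the pipeline's steps before applying a forest-specific CI method. This is a hedged sketch under assumed step and helper names, not the project's actual code:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def find_rf_step(pipe):
    """Return the name of the RandomForestRegressor step, or None if absent."""
    for name, step in pipe.named_steps.items():
        if isinstance(step, RandomForestRegressor):
            return name
    return None

# Two hypothetical pipelines: one with a random forest, one without
rf_pipe = Pipeline([("scaler", StandardScaler()), ("rf", RandomForestRegressor())])
mlp_pipe = Pipeline([("scaler", StandardScaler()), ("mlp", MLPRegressor())])

rf_name = find_rf_step(rf_pipe)    # forestci-style CI applies only here
mlp_name = find_rf_step(mlp_pipe)  # None: fall back to a generic CI or skip
```

Looping over all loaded algorithms, the forestci branch would run only when `find_rf_step()` returns a name, with other model types taking a generic (e.g. bagging-based) path.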

ssorou1 (Collaborator, author):

That is a valid point. Thanks! Both fs_pred_algo.py and fs_algo_train_eval.py were updated to address this issue. Please refer to commit 5c9e14b.

@ssorou1 ssorou1 requested a review from glitt13 February 7, 2025 19:45
@glitt13 glitt13 closed this Feb 7, 2025
@glitt13 (Collaborator) commented Feb 7, 2025

Shifting over to PR #41
