Ss test mlpci #39
Conversation
Nice work Soroush! I appreciated how thorough you were in updating fs_pred_algo.py after making changes to fs_algo_train_eval.py that impact the downstream processing. Please review the suggestions, and once you've addressed those I'll give it another review. If I don't have immediate feedback from reading the code, I'll then try to run the code.
```python
ci = self.calculate_rf_uncertainty(rf, self.X_train, self.X_test)

# Calculating mlp uncertainty using Bootstrap Aggregating (Bagging)
n_models_rf = 10  # Number of bootstrap models
```
It's best to make this a parameter that a user can assign rather than hard-code it into a function. We should add a parameter in the algo config yaml, with a default value of 10 set when reading in that yaml file using the `AttrConfigAndVars()` class. For example:

```python
self.algo_config['rf'].get('n_models_rf_bootstrap', 10)
```
Thank you very much Guy for the comments and advice! All the updates are applied to the ssorou1:ss_test_mapie branch:

- `xssa_algo_config.yaml` is updated to include `n_models_rf_bootstrap` and `n_models_mlp_bootstrap` for the rf and mlp models, respectively.
- `fs_algo_train_eval.py` is updated to read the aforementioned variables from the algo config yaml. Please refer to the `rf_Bagging_ci` and `mlp_Bagging_ci` functions.
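A minimal sketch of that lookup pattern, using a plain dict as a stand-in for the parsed yaml (the key names follow the ones described above, but the real `xssa_algo_config.yaml` structure may differ):

```python
# Stand-in for the dict parsed from the algo config yaml.
algo_config = {
    "rf": {"n_models_rf_bootstrap": 25},  # user-specified value
    "mlp": {},                            # key omitted -> default applies
}

# .get() supplies the default (10) whenever the key is absent from the yaml
n_models_rf = algo_config["rf"].get("n_models_rf_bootstrap", 10)
n_models_mlp = algo_config["mlp"].get("n_models_mlp_bootstrap", 10)

print(n_models_rf)   # 25 (from the config)
print(n_models_mlp)  # 10 (the default)
```

This way a missing key never raises, and the fallback value lives in one place in the reading code rather than hard-coded inside the training function.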
```python
rf.fit(X_train_resampled, y_train_resampled)

# Store predictions for the test set
rf_predictions.append(rf.predict(self.X_test))
```
If you were to run `train_algos()` in a loop, does the `rf_predictions` object continue to grow, or does it reset within each loop? We'd want the latter behavior.
The `rf_predictions` object in the newly added `rf_Bagging_ci()` is a local variable within the function, meaning it gets reinitialized as an empty list (`rf_predictions = []`) each time the function is called. Since `train_algos()` calls `rf_Bagging_ci()`, `rf_predictions` is reset in each loop iteration if `train_algos()` is run in a loop.
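A stripped-down, numpy-only sketch of why repeated calls cannot accumulate predictions (a least-squares fit stands in for the random forest, and `bagging_predictions` is a hypothetical name, not the project's function):

```python
import numpy as np

def bagging_predictions(X_train, y_train, X_test, n_models=10, seed=0):
    """Bootstrap-aggregated predictions; the accumulator is a local variable."""
    rng = np.random.default_rng(seed)
    preds = []  # re-created on every call, so repeated calls never grow it
    for _ in range(n_models):
        # resample training rows with replacement (the bootstrap step)
        idx = rng.integers(0, len(X_train), size=len(X_train))
        # trivial stand-in model: ordinary least squares via lstsq
        coef, *_ = np.linalg.lstsq(X_train[idx], y_train[idx], rcond=None)
        preds.append(X_test @ coef)
    return np.array(preds)  # shape: (n_models, n_test)

X_tr = np.column_stack([np.ones(50), np.linspace(0, 1, 50)])
y_tr = 2.0 + 3.0 * X_tr[:, 1]
X_te = np.column_stack([np.ones(5), np.linspace(0, 1, 5)])

a = bagging_predictions(X_tr, y_tr, X_te, n_models=10)
b = bagging_predictions(X_tr, y_tr, X_te, n_models=10)
print(a.shape, b.shape)  # (10, 5) (10, 5): the second call started fresh
```

If `rf_predictions` were instead an instance attribute (`self.rf_predictions`), the second call would stack to shape (20, 5), which is the growth the reviewer is asking about.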
```python
rf_predictions = np.array(rf_predictions)
mean_pred = rf_predictions.mean(axis=0)
std_pred = rf_predictions.std(axis=0)
lower_bound = mean_pred - 1.96 * std_pred
```
We should clarify that this is a 95% confidence interval (a good default!), and perhaps add in options for different confidence intervals (e.g. 90% and 99%).
Thanks Soroush! The specific intervals of interest could vary by user needs. Let's set a default of 95%, but can you also modify the config file so that a user may specify their desired confidence interval or intervals? If not already a feature, we'll have to be careful about how we save file output so that one can know which interval they're looking at.
Thanks Guy! The yaml file is updated to include the `confidence_level` for each model. `rf_Bagging_ci()` and `mlp_Bagging_ci()` are updated to read the respective confidence levels from the yaml file and calculate the confidence interval based on the normal distribution assumption. The default is 95% as per your advice.
https://github.com/ssorou1/formulation-selector/blob/3bc23eacc6d54b692147f8cfded578fd8657c17e/pkg/fs_algo/fs_algo/fs_algo_train_eval.py#L1047-L1052
https://github.com/ssorou1/formulation-selector/blob/3bc23eacc6d54b692147f8cfded578fd8657c17e/scripts/eval_ingest/xssa/xssa_algo_config.yaml#L3-L6
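The hard-coded 1.96 is just the two-sided normal quantile for a 95% level, so supporting a configurable `confidence_level` amounts to computing that multiplier. A standard-library sketch (scipy's `norm.ppf` gives the same values; the helper name is illustrative):

```python
from statistics import NormalDist

def z_from_confidence(level: float) -> float:
    """Two-sided z multiplier for a given confidence level under the
    normal-distribution assumption, e.g. 0.95 -> ~1.96."""
    return NormalDist().inv_cdf(0.5 + level / 2.0)

for level in (0.90, 0.95, 0.99):
    print(level, round(z_from_confidence(level), 3))
# 0.9 -> 1.645, 0.95 -> 1.96, 0.99 -> 2.576
```

The bounds then become `mean_pred ± z * std_pred`, with `z` derived from the yaml value instead of the literal 1.96.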
```python
std_pred = predictions.std(axis=0)
lower_bound = mean_pred - 1.96 * std_pred
upper_bound = mean_pred + 1.96 * std_pred
```
The fact that these steps are exactly the same as the random forest makes me think we should instead put this all into a new function, and you pass in the model object as a generic Regressor object (e.g. MLPRegressor, RandomForestRegressor) as an argument in the new function. That means we could re-use all the same processing steps.
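A sketch of that refactor, assuming sklearn-style estimators; the function name `bagging_ci` and its signature are hypothetical, not the project's API:

```python
import numpy as np
from statistics import NormalDist
from sklearn.base import clone

def bagging_ci(estimator, X_train, y_train, X_test,
               n_models=10, confidence_level=0.95, seed=0):
    """Bagging CI for any sklearn-style regressor (RandomForestRegressor,
    MLPRegressor, ...), under a normal-distribution assumption."""
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2.0)
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X_train), size=len(X_train))
        model = clone(estimator)  # fresh, unfitted copy per bootstrap sample
        model.fit(X_train[idx], y_train[idx])
        preds.append(model.predict(X_test))
    preds = np.array(preds)
    mean_pred = preds.mean(axis=0)
    std_pred = preds.std(axis=0)
    return mean_pred - z * std_pred, mean_pred + z * std_pred
```

With this shape, `bagging_ci(RandomForestRegressor(...), ...)` and `bagging_ci(MLPRegressor(...), ...)` share every processing step, and the rf/mlp wrappers reduce to choosing the estimator and config keys.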
`rf_Bagging_ci()` and `mlp_Bagging_ci()` are developed in `fs_algo_train_eval.py`.
Add `pkg/proc.attr.hydfab/.RData` to your own `.gitignore` file and this won't be a recurring problem.
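The entry in the repo-root `.gitignore` would look like:

```
# Locally generated R workspace data; not meant to be committed
pkg/proc.attr.hydfab/.RData
```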
Applied. Thanks!
```python
# pipe = joblib.load(path_algo)
pipeline_with_ci = joblib.load(path_algo)
pipe = pipeline_with_ci['pipe']  # Assign the actual pipeline (pipe) to 'pipe'
rf_model = pipe.named_steps['randomforestregressor']  # Use the correct step name
```
What happens when we haven't trained a randomforestregressor? This seems too specific to a particular type of model; ideally this should work with any model. Since this relates to the forestci package, it looks like we should add an 'if' statement specific to random forest while looping over all possible algorithm possibilities.
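One way that 'if' could be sketched (the dispatch function and return values are hypothetical; the only assumption is a `named_steps` dict like sklearn's `Pipeline` provides):

```python
def select_uncertainty_method(pipe):
    """Return which CI approach applies to the fitted pipeline.

    forestci's random-forest variance estimate is specific to random
    forests; any other regressor falls back to the bagging CI.
    """
    for step_name, step in pipe.named_steps.items():
        # String comparison keeps this sketch free of an sklearn import;
        # real code would use isinstance(step, RandomForestRegressor).
        if type(step).__name__ == "RandomForestRegressor":
            return "forestci", step_name
    return "bagging", None
```

With this guard in the prediction loop, a pipeline holding an MLPRegressor (or any future algorithm) never hits the random-forest-only `named_steps['randomforestregressor']` lookup.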
That is a valid point. Thanks! Both fs_pred_algo.py and fs_algo_train_eval.py were updated to address this issue. Please refer to commit 5c9e14b.
Shifting over to PR #41.
The Bagging method to calculate confidence intervals is implemented in fs_algo_train_eval for the rf and mlp models.