
Tune schedule #974

Open · wants to merge 36 commits into main

Conversation

@topepo (Member) commented Dec 18, 2024

This draft PR is for testing a new data structure that governs how tune will efficiently process grids. Two new functions are currently exported for debugging but will not be exported in the final version:

  • get_tune_schedule() produces a tibble with nested levels that describes which tuning parameter values should be used at each stage of execution (see the sketch below). This PR is focused on making sure that this structure accounts for all of the tuning scenarios.
  • loopy() is a temporary function that emulates tune_grid_loop_iter() (without the bells and whistles); it is used to test get_tune_schedule() and to explore how we will incorporate postprocessors into the mix.
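
For intuition, here is a minimal sketch (not code from this PR) of the kind of nested structure the schedule is meant to capture; the column names, parameters, and nesting are assumptions for illustration only:

```r
# A hand-built illustration of a nested schedule: one row per preprocessing
# candidate, a nested tibble of model-stage candidates inside each row, and a
# nested tibble of prediction-stage (submodel) candidates inside each of those.
library(dplyr)
library(tidyr)

grid <- expand_grid(
  num_comp = c(2, 4),            # preprocessing parameter (e.g., PCA components)
  mixture  = c(0.5, 1.0),        # model parameter
  penalty  = c(0.001, 0.01, 0.1) # submodel parameter handled at prediction time
)

schedule <- grid |>
  nest(model_stage = c(mixture, penalty)) |>
  mutate(
    model_stage = lapply(model_stage, \(x) nest(x, predict_stage = penalty))
  )

schedule
#> Two rows (one per num_comp value); each model_stage tibble nests a
#> predict_stage tibble per unique mixture value.
```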

@topepo marked this pull request as ready for review on January 9, 2025
@topepo requested review from hfrick and simonpcouch on January 9, 2025
@hfrick (Member) commented Jan 13, 2025

My high-level thoughts on get_schedule()

  • Introducing pre_stage
    • We don't strictly need another nesting level, since the preprocessing stage is the "top-level" stage in this context, but I wonder if it would be a good idea nonetheless: currently, all the other stages have a named (nested) tibble associated with them, even if that tibble is empty (i.e., has zero rows). That makes the object structure easier to understand, which is my main motivation for bringing this up. Bonus: it makes all the stages feel the same / have a corresponding structure.
  • Breaking up get_schedule()
    • Into schedule_preproc(), schedule_model_stage(), and schedule_predict_stage(), which would each schedule "their" stage and push the remaining candidate values into a tibble for the next stage. We'd call them recursively (see the sketch after this list).
    • This should make it easier to test the scheduling components separately, and then to test how they nest within each other.
    • It would also make it easier to add a schedule_postproc() if we ever want to get more complex about how to schedule that stage. (I'm thinking about whether or not we allow more flexibility in the order of post-processing adjustments and/or add more adjustments.) Or to tinker with how to schedule the preprocessing in a future where we have more information on the order of recipe steps.
  • Keeping the scheduling direction linear
    • This is something that already struck me when reading the corresponding write-up.
    • I'd prefer to start "outside" with preprocessing and end with post-stage, not go from model-stage to post-stage and then do the pre-stage.
    • Part of that would be to use min_grid() only on the model parameters, without preprocessing or post-processing parameters, essentially using it as, or creating a, minimal_model_grid().
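
A rough sketch of what that recursive breakup could look like, assuming we already know which grid columns belong to which stage; the function bodies are illustrative only, not proposed implementations:

```r
library(dplyr)
library(tidyr)

# Each function schedules "its" stage and pushes the remaining candidate
# columns into a nested tibble for the next stage (all names hypothetical).
schedule_preproc <- function(grid, model_cols, predict_cols) {
  grid |>
    nest(model_stage = all_of(c(model_cols, predict_cols))) |>
    mutate(
      model_stage = lapply(model_stage, schedule_model_stage, predict_cols = predict_cols)
    )
}

schedule_model_stage <- function(stage_grid, predict_cols) {
  stage_grid |>
    nest(predict_stage = all_of(predict_cols)) |>
    mutate(predict_stage = lapply(predict_stage, schedule_predict_stage))
}

schedule_predict_stage <- function(stage_grid) {
  # Nothing left to push down; a real version might apply the submodel trick here.
  stage_grid
}

grid <- expand_grid(num_comp = c(2, 4), mixture = c(0.5, 1), penalty = c(0.01, 0.1))
schedule_preproc(grid, model_cols = "mixture", predict_cols = "penalty")
```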

@topepo (Member, Author) commented Jan 14, 2025

Note: let's take a look at how this structure can work with agua, which does the model fits en masse via h2o.

@hfrick (Member) commented Jan 14, 2025

Another note: we currently use the parameter set to determine which parameter (i.e., which column in the grid) is associated with the preprocessing, model, or post-processing stage (via the $source column). With #660 in mind, we might want to get to that information without adding another dependency on the parameter set. Using tune_args(wflow) might be a solution, although, ideally, we want something that tells us the location of the tune() placeholders in the workflow, and judging from #660 that's not always what tune_args() does either. (But maybe it should? That's a different issue, though.)
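
For reference, here is a small example of the two places that stage information could come from, assuming a standard tidymodels setup (exact output shapes may differ by version):

```r
library(tidymodels)

rec <- recipe(mpg ~ ., data = mtcars) |>
  step_pca(all_numeric_predictors(), num_comp = tune())

wflow <- workflow(rec, linear_reg(penalty = tune(), engine = "glmnet"))

# Current approach: the dials parameter set carries a `source` column
# ("recipe" vs "model_spec") for each tuning parameter.
extract_parameter_set_dials(wflow)$source

# Possible alternative: tune_args() on the workflow also reports a `source`
# column, without constructing the full dials parameter set.
tune_args(wflow)
```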

@topepo (Member, Author) commented Jan 15, 2025

For agua... the current routine for processing the grid normally is here and the agua equivalent is here. Basically, agua consolidates the work that tune_grid() does via a loop into a single function call.

For loopy(), I think that we would have to take a similar strategy. With loopy(), there isn't a top-level postprocessing loop within the model loop. That's because of the three cases mentioned here. For agua, we are likely to need a post-processing loop since we have all of the model results at once, and the post-actions will happen on a model-by-model basis.

TL;DR

agua will work similarly to tune, except that the model loop it drops is offset by an added post-processing loop. It should still be very efficient.
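
To make the two loop shapes concrete, here is a toy, runnable sketch (placeholder stubs only, not tune or agua internals):

```r
# Hypothetical candidate configurations and stub functions for illustration.
model_stage <- list(list(trees = 500), list(trees = 1000))
fit_one     <- function(cfg)  paste("fit with", cfg$trees, "trees")
fit_all     <- function(cfgs) lapply(cfgs, fit_one)  # stands in for one h2o call
postprocess <- function(fit)  paste("postprocessed:", fit)

# tune-style: iterate over model candidates; post-processing happens inside
# the model iteration, so there is no separate top-level post loop.
tune_style <- lapply(model_stage, function(cfg) postprocess(fit_one(cfg)))

# agua-style: all fits come back from a single call, so the model loop is
# replaced by a post-processing loop over the returned fits.
agua_style <- lapply(fit_all(model_stage), postprocess)
```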
