Revisit data split approach for training and validation #82

Open · Tracked by #91
emmamendelsohn opened this issue Mar 15, 2024 · 8 comments
Comments

@emmamendelsohn
Collaborator

  • Initial split randomly by date-district combinations. While we are not doing any spatial extraction, we need to have spatial splits because the outbreak data is so clustered in time that we wouldn't have enough coverage in our splits.
  • Mask from the training set the three months following the holdout dates for the given district and the surrounding districts. The reasoning for this is a) our data has three month lags for weather and NDVI, and so we want to avoid data leakage from our holdout set into the lags of our training set, and b) spatial masking prevents the model from relying too heavily on surrounding districts to make predictions for the holdout district. Given the logic of preventing data leakage from the holdout set, we should also mask the surrounding districts for the following three months in case the "future" surrounding data has an impact on the "current" predictions.
  • Cross validation on the training set should basically mirror the approach above. We need to enforce at least 3 months between each test date, and mask out surrounding districts.

We talked about how the immunity and recent outbreak layers present a challenge for data leakage from the holdout set into the training set. Because these layers are cumulative over longer periods, it isn't possible to mask them out the way we do with the three-month lags, so "future" information could be hidden in the training set. If anything, this could be more of a giveaway than future NDVI or weather data, because it is explicitly about the outcome variable. While we may not be able to solve the problem entirely, we could try masking out more than three months, maybe one year, to at least deal with the leakage from the recent outbreak layer. We would have to look at how much data that leaves us to work with.
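A rough sketch of the masking rule described above (the district and date column names and the neighbours lookup are illustrative stand-ins, not the actual pipeline objects; it assumes neighbours pairs each district with itself as well as its surrounding districts):

# Rough sketch only: for each holdout (district, date), drop training rows
# from that district and its neighbours for the months_ahead months after
# the holdout date. Column names and the neighbours table are assumptions.
library(dplyr)
library(lubridate)

mask_training <- function(training, holdout, neighbours, months_ahead = 3) {
  # (district, window) pairs to mask: each holdout date opens a window over
  # the holdout district and its neighbours
  windows <- holdout %>%
    inner_join(neighbours, by = "district", relationship = "many-to-many") %>%
    transmute(district = neighbour,
              start    = date,
              end      = date %m+% months(months_ahead))

  # training rows that fall inside any masked window
  to_drop <- training %>%
    inner_join(windows, by = "district", relationship = "many-to-many") %>%
    filter(date >= start, date <= end) %>%
    distinct(district, date)

  anti_join(training, to_drop, by = c("district", "date"))
}

Setting months_ahead = 12 would be the one-year version suggested for the recent outbreak layer.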

cc @noamross


emmamendelsohn added a commit that referenced this issue Mar 29, 2024
@emmamendelsohn
Collaborator Author

emmamendelsohn commented Mar 30, 2024

Step 2 above reduces the training data from 17,721 to 399 rows. So, we can't mask dates AND surrounding districts for 3 months.

Now trying to a) mask the district itself for 3 months and b) mask the surrounding districts only on the holdout date itself. This leaves 6575 rows and 58 of the 192 outbreaks in the training dataset. There will be further data reduction when we apply the masking approach within each split.

So far I have been masking the holdout dataset against the full training dataset. I think we could probably mask only the analysis splits of the CV, i.e., leave masked data in the assessment splits. This wouldn't improve data availability for training, but could give us more assessment data points.
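A rough sketch of the "mask only the analysis splits" idea using rsample's make_splits()/manual_rset(); cv_folds and the is_masked() helper are hypothetical placeholders for the real objects and the fold-specific masking rule:

# Hypothetical sketch: drop masked rows from each fold's analysis indices
# only, leaving them available in the assessment set. cv_folds is an
# existing rsample rset; is_masked() stands in for the fold-specific rule
# (3-month / neighbouring-district windows around the assessment dates).
library(rsample)
library(purrr)

masked_splits <- map(cv_folds$splits, function(s) {
  drop_rows <- which(is_masked(training_data, s))            # fold-specific mask
  make_splits(list(analysis   = setdiff(s$in_id, drop_rows), # analysis minus masked rows
                   assessment = complement(s)),              # assessment unchanged
              training_data)
})

masked_cv <- manual_rset(masked_splits, cv_folds$id)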

@emmamendelsohn
Collaborator Author

emmamendelsohn commented Apr 23, 2024

I think the bigger concern will be the immunity and recent outbreak layers, as these can cause data leakage of the outcome variable, whereas lagged weather/NDVI data is much less informative.

I tried masking only the surrounding districts on the given holdout day, and that still reduces the dataset by 70%.

The surrounding areas on the given holdout day, though, are likely a true leakage issue.

@emmamendelsohn
Collaborator Author

Our current approach is to split the data so that training includes pre-2018 outbreaks and validation covers the 2018 outbreak onward. This presents a challenge because 2018 had a single outbreak in a single district, so we may need to rearrange the splits to include more outbreaks in the validation set.

@emmamendelsohn
Collaborator Author

I believe this leaves just two points in the validation dataset, so we may need to revisit this approach. I would say focus on building out the model first, then come back to this.

@emmamendelsohn changed the title from "Training split approach" to "Revisit data split approach for training and validation" on Jun 28, 2024
@n8layman
Collaborator

n8layman commented Jan 17, 2025

@emmamendelsohn, this was very smart! Combining both spatial and temporal cross-validation using a skip of n - 1. It depends on having only one row per date/shape combo: all but one adm gets the expanding window treatment, with a different one left out each time. Unfortunately, with lagged (and forecast) data I've got multiple rows per date/shape, so I will have to come up with my own clever solution... which is, I guess, just to use your approach but first make the data wide.

# With one row per date/district, windows of rolling_n rows each cover every
# district once (the intent being one expanding-window step per fold).
tar_target(rolling_n, n_distinct(model_data$shapeName)),
tar_target(splits, rolling_origin(training_data,
                                  initial = rolling_n,
                                  assess  = rolling_n,
                                  skip    = rolling_n - 1))

@n8layman
Collaborator

Nope, I misunderstood. I thought this was a way to remove one population at a time iteratively. Instead, skip is the number of additional rows to move forward each time; the window always advances at least one row, hence the n - 1. So the above isn't spatial at all.
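A tiny illustration of that behaviour on toy data, assuming rolling_origin's documented defaults:

# With skip = k, each successive origin advances by k + 1 rows, so
# skip = rolling_n - 1 only thins the resamples; it never leaves a
# location out.
library(rsample)
toy <- data.frame(day = 1:10)
rolling_origin(toy, initial = 3, assess = 2, skip = 1)
# the assessment window steps forward 2 rows (skip + 1) at a time, while the
# cumulative analysis window keeps every earlier row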

@n8layman
Collaborator

n8layman commented Jan 26, 2025

Goals:

  1. Come up with hyperparameters that can generalize across both space and time.
  2. Minimize both temporal and spatial data leakage.
  3. Evaluate the performance of different model specifications.

The full plan:

  1. Use a semi-nested cross-validation approach.
  2. Set the outer loop as an expanding window using rsample.
  3. Set the inner loop as leave-one-location-out resampling from spatialsample (see the rough sketch after this list).
  4. For each outer fold, track the performance of every inner fold across the hyperparameter grid.
  5. Instead of traditional nested cross-validation which uses the outer folds for model selection, we're logging all hyperparameter combinations and their performance across all folds, both outer and inner. Then we aggregate performance metrics across all folds to identify the best set of hyperparameters generalized across both time and space.
  6. Next, we re-train the model using all available training data and using the chosen best set of hyper-parameters.
  7. Finally, we need to evaluate the performance of model specification + best hyper-parameter set against another hold-out dataset so we can compare different model specifications.
  8. In Emma's description above, we use pre-2018 data for training and post-2018 data to measure model performance. Strictly speaking this makes the comparison across model specifications "flat cross-validation", but that should be fine per Wainer and Cawley (2021).
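A rough sketch of the outer/inner resampling setup, assuming roughly monthly data with one row per district per month, a shapeName column, and purely illustrative window sizes:

# Outer loop: expanding window over time; inner loop: leave-one-location-out.
# The 24-months-in / 12-months-assessed window sizes are illustrative only.
library(dplyr)
library(rsample)
library(spatialsample)

n_loc <- n_distinct(model_data$shapeName)

outer_folds <- model_data %>%
  arrange(date) %>%
  rolling_origin(initial = 24 * n_loc,       # first training window
                 assess  = 12 * n_loc,       # assessment window
                 skip    = 12 * n_loc - 1,   # advance roughly a year per fold
                 cumulative = TRUE)

# For each outer fold, resample its analysis set leaving one district out.
inner_folds <- lapply(outer_folds$splits, function(s) {
  spatial_leave_location_out_cv(analysis(s), group = shapeName)
})

If the analysis sets are plain data frames rather than sf objects, rsample::group_vfold_cv(analysis(s), group = shapeName) should give the same leave-one-location-out behaviour.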


The simplified plan:

  1. Don't use spatial cross-validation (the inner loops)
  2. Use expanding-window temporal cross-validation only, lumping all data from all districts together for each fold (sketched below).
  3. Next, we re-train the model using all available training data and the best set of hyper-parameters identified.
  4. Finally, we need to evaluate the performance of model specification + best hyper-parameter set against another hold-out dataset so we can compare different model specifications.
  5. In Emma's description above, we use pre-2018 data for training and post-2018 data to measure model performance. Strictly speaking this makes the comparison across model specifications "flat cross-validation", but that should be fine per Wainer and Cawley (2021).
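A rough sketch of the simplified, temporal-only tuning loop; the workflow wf, the grid hyper_grid, and the roc_auc metric are hypothetical placeholders:

# Expanding-window temporal CV only, all districts lumped into each fold.
# wf (a tidymodels workflow) and hyper_grid are placeholders; roc_auc
# assumes a classification outcome.
library(dplyr)
library(rsample)
library(workflows)
library(tune)

folds <- training_data %>%
  arrange(date) %>%
  rolling_origin(initial = 24 * n_distinct(training_data$shapeName),
                 assess  = 12 * n_distinct(training_data$shapeName),
                 cumulative = TRUE)

tuned <- tune_grid(wf, resamples = folds, grid = hyper_grid)
best  <- select_best(tuned, metric = "roc_auc")      # best hyper-parameters
final <- finalize_workflow(wf, best) %>% fit(training_data)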

Question:

  1. Do both temporal and spatial columns (date and district) need to be excluded from the model given the lag and CV setup?

Notes:

  1. For the future, it looks like lags can be set up within the tidymodels preprocessing step using recipes::step_lag() when building the recipe. That might not work here as-is because we have multiple rows per date (one per district). The advantage of this approach is that the lagged data for the training set is constructed only from training data, preventing data leakage. We built the lags by hand, which should be fine since we are ensuring both spatial and temporal isolation in other ways. Still, this is cool; a rough sketch follows.
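A hedged sketch of what that recipe step could look like (step_lag() appears to have no grouping argument, so the data would need to be one row per date, e.g. widened by district, for the lags to be meaningful):

# Rough sketch only: build 1-3 month lags of the outcome inside the recipe
# so they are computed from training data alone. Assumes one row per date;
# with one row per district/date the lags would cross districts.
library(recipes)

rec <- recipe(rvf_cases ~ ., data = training_data) %>%
  step_lag(rvf_cases, lag = 1:3) %>%      # lag_1_rvf_cases ... lag_3_rvf_cases
  step_naomit(all_predictors())           # drop rows made NA by lagging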

@n8layman
Collaborator

Here's a sample script with the workflow for the full plan. https://github.com/ecohealthalliance/open-rvfcast/blob/feature/tidymodels/scripts/spatiotemporal_cv.R

It takes quite a while to fit all the models during model tuning. I've been implementing this on the actual data in targets to take advantage of dynamic branching.
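A minimal sketch of the dynamic-branching idea in a _targets.R file; tune_one_fold() and hyper_grid are hypothetical placeholders for whatever the script does per resample:

# Hypothetical sketch: branch over the resamples so each fold is tuned as
# its own target. tune_one_fold() and hyper_grid are placeholders.
library(targets)

tar_option_set(packages = c("dplyr", "rsample"))

list(
  tar_target(rolling_n, n_distinct(model_data$shapeName)),
  tar_target(splits, rolling_origin(training_data,
                                    initial = rolling_n,
                                    assess  = rolling_n)),
  tar_target(fold_metrics,
             tune_one_fold(splits$splits[[1]], grid = hyper_grid),
             pattern = map(splits))   # one branch per row (resample) of splits
)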
