Revisit data split approach for training and validation #82
Step 2 above reduces the training data from 17,721 to 399 rows, so we can't mask both dates AND surrounding districts for 3 months. Instead I'm trying to a) mask the district itself for 3 months and b) mask the surrounding districts only on the date itself. This leaves 6,575 rows and 58 of the 192 outbreaks in the training dataset. There will be further data reduction when we apply the masking approach within each split. So far I have been masking the holdout dataset against the full training dataset. I think we could probably mask only the analysis splits of the CV, i.e., leave masked data in the assessment splits. This wouldn't improve data availability for training, but it could give us more assessment data points.
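For reference, here is a minimal sketch of the revised masking rule. The column names (`district`, `date`), the `outbreaks` table, the `neighbors` lookup of adjacent districts, and the `mask_training()` helper are all hypothetical, not code from the repo:

```r
library(dplyr)
library(lubridate)

# Sketch: for each outbreak, drop (a) the outbreak district itself for the
# window around the outbreak date, and (b) its surrounding districts on the
# outbreak date only.
mask_training <- function(train, outbreaks, neighbors, window_months = 3) {
  masked <- train
  for (i in seq_len(nrow(outbreaks))) {
    d  <- outbreaks$district[i]
    t  <- outbreaks$date[i]
    nb <- neighbors$neighbor[neighbors$district == d]
    masked <- masked |>
      filter(
        # (a) same district within +/- window_months of the outbreak
        !(district == d &
            between(date, t %m-% months(window_months), t %m+% months(window_months))),
        # (b) neighboring districts on the outbreak date only
        !(district %in% nb & date == t)
      )
  }
  masked
}
```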
Our current approach is to split the data so that training includes pre-2018 outbreaks and validation covers the 2018 outbreak onward. This presents a challenge because 2018 was a single outbreak in a single district, so we may need to rearrange the splits to include more outbreaks in the validation set.
I believe this leaves just two points in the validation dataset, so we may need to revisit this approach. I would say focus on building out the model first and revisit this later.
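The current split is just a date cutoff; a minimal sketch, assuming a hypothetical `rvf_data` frame with a `date` column:

```r
library(dplyr)

# Pre-2018 outbreaks train the model; the 2018 outbreak onward is held out.
cutoff <- as.Date("2018-01-01")
training_data   <- filter(rvf_data, date <  cutoff)
validation_data <- filter(rvf_data, date >= cutoff)
```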
@emmamendelsohn, this was very smart! Combining both spatial and temporal cross-validation using a skip of n-1. It depends on having only one row per date/shape combo: all but one adm gets the expanding-window treatment, with a different one left out each time. Unfortunately, with lagged (and forecast) data I have multiple rows per date/shape, so I'll have to come up with my own clever solution... which I guess is just to use your approach but first make the data wide.
Nope, I misunderstood. I thought this was a way to remove one population at a time iteratively. Instead, skip is the number of additional data rows to move forward: each resample always advances one row, plus skip more, hence the n-1. So the above isn't spatial at all.
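To illustrate what skip actually does, here is a sketch with `rsample::rolling_origin()`, assuming hypothetical data `model_data` with one row per date/district combination, sorted by date. With n districts per date, skip = n - 1 advances each expanding-window resample by exactly one date; it does not leave a district out:

```r
library(rsample)
library(dplyr)

n_districts <- n_distinct(model_data$district)

folds <- model_data |>
  arrange(date) |>
  rolling_origin(
    initial    = 36 * n_districts,  # e.g. first 36 dates in the analysis set
    assess     = 1  * n_districts,  # one date's worth of rows to assess
    cumulative = TRUE,              # expanding window
    skip       = n_districts - 1    # advance one full date per resample
  )
```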
Goals:
The full plan:
The simplified plan:
Question:
Notes:
Here's a sample script with the workflow for the full plan: https://github.com/ecohealthalliance/open-rvfcast/blob/feature/tidymodels/scripts/spatiotemporal_cv.R. It takes quite a while to fit all the models during tuning, so I've been implementing this on the actual data in targets to take advantage of dynamic branching.
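For anyone unfamiliar with dynamic branching, this is roughly the shape of the pipeline. It is a hypothetical sketch (not the actual `_targets.R`); `build_model_data()`, `make_spatiotemporal_splits()`, and `fit_one_split()` are assumed helper functions:

```r
library(targets)

list(
  tar_target(model_data, build_model_data()),
  # iteration = "list" so map() can branch over the list of rsplit objects
  tar_target(cv_splits, make_spatiotemporal_splits(model_data), iteration = "list"),
  tar_target(
    split_fits,
    fit_one_split(cv_splits),
    pattern = map(cv_splits)  # one cached, parallelizable branch per CV split
  )
)
```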
We talked about how the immunity and recent-outbreak layers present a data-leakage challenge: information from the holdout set can leak into the training set. Because these layers are longer-term cumulative, it isn't possible to mask them out as we do with the three-month lags, so "future" information can remain hidden in the training set. If anything, this could be more of a giveaway than future NDVI or weather data, because it's explicitly about the outcome variable. While we may not be able to solve the problem entirely, we could try masking out more than three months, maybe one year, to at least deal with the leakage from the recent-outbreak layer. We'd have to check how much data that leaves us to work with.
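With the hypothetical `mask_training()` sketch above, widening the window would just be a parameter change, e.g.:

```r
# Widen the mask from 3 months to a year (hypothetical helper from the sketch above)
masked_train <- mask_training(train, outbreaks, neighbors, window_months = 12)
```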
cc @noamross