Clean up data loading and randomization; add scheduler #29
Conversation
pokey commented Dec 23, 2022 (edited)
- Fixes "Implement a learning rate schedule" (#18)
- Fixes "Consistent train / dev split" (#15)
dataset_rng = torch.Generator().manual_seed(self.data_seed)
train_dataset, val_dataset = torch.utils.data.random_split(dataset, [self.dataset_size - split, split], generator=dataset_rng)
Note that the dataset randomizer now has its own seed, and we do the split once, rather than doing it once per ensemble member
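For reference, a minimal standalone sketch of that pattern (toy dataset and seed values, not the PR's actual fields): seeding a dedicated `torch.Generator` makes `random_split` reproducible, so the same train / val partition can be recovered, e.g. when resuming from a checkpoint.

```python
import torch
from torch.utils.data import TensorDataset, random_split

dataset = TensorDataset(torch.arange(100).float().unsqueeze(1))  # toy data
data_seed, val_size = 42, 20  # stand-ins for self.data_seed and `split`

dataset_rng = torch.Generator().manual_seed(data_seed)
train_dataset, val_dataset = random_split(
    dataset, [len(dataset) - val_size, val_size], generator=dataset_rng
)

# Re-seeding a fresh generator with the same seed reproduces the partition,
# so every ensemble member (and a resumed run) can share one train / val split.
rng_again = torch.Generator().manual_seed(data_seed)
train_again, val_again = random_split(
    dataset, [len(dataset) - val_size, val_size], generator=rng_again
)
assert val_dataset.indices == val_again.indices
```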
self.optimizers.append(optimizer)
self.schedulers.append(torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'min'))

self.train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=self.batch_size, shuffle=True)
We create a shuffling data loader, so we don't need to do shuffling during training; that happens for us automatically
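A quick sketch of what that buys us (toy data, not the PR's loader): a `DataLoader` built with `shuffle=True` draws a fresh permutation every time it is iterated, so each epoch sees the data in a new order without any manual shuffling in the training loop.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(8))
loader = DataLoader(dataset, batch_size=4, shuffle=True)

for epoch in range(2):
    # A new permutation is drawn each time the loader is iterated, so the
    # sample order differs between epochs without any extra work.
    order = [int(x) for batch in loader for x in batch[0]]
    print(f"epoch {epoch}: {order}")
```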
self.validation_loaders.append(torch.utils.data.DataLoader(dataset, batch_size=self.batch_size, sampler=valid_sampler))
optimizer = optim.SGD(self.nets[i].parameters(), lr=0.003, momentum=0.9, nesterov=True)
self.optimizers.append(optimizer)
self.schedulers.append(torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'min'))
Added a scheduler
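A sketch of how `ReduceLROnPlateau` is typically driven (illustrative values; the epoch loop and `patience` setting here are not from this PR, which creates the scheduler with the defaults): it is stepped with the validation metric, and lowers the learning rate once that metric stops improving.

```python
import torch.nn as nn
import torch.optim as optim

net = nn.Linear(10, 1)  # placeholder model standing in for an ensemble member
optimizer = optim.SGD(net.parameters(), lr=0.003, momentum=0.9, nesterov=True)
# patience=1 just to make the LR drop visible quickly; the PR keeps the defaults
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'min', patience=1)

for epoch in range(4):
    val_loss = 1.0  # stand-in for the real validation loss (flat, so it "plateaus")
    scheduler.step(val_loss)  # 'min' mode: reduce LR when val_loss stops decreasing
    print(epoch, optimizer.param_groups[0]['lr'])
```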
I initially gave each net a different train / validation split to make sure each net would see different data (and thus perform better as an ensemble than if they all saw the same data). Otherwise the only variation they would have after many epochs would be the starting position, right? My question would be: since 3 nets using different train / validation sets can use more data for their training, wouldn't they perform better on novel data? Given a test set which none of the models have seen, wouldn't 3 ensembled models with different training data perform better than an ensemble where each model saw exactly the same data? (From a combined ensemble validation score I can definitely see the benefits of a single train / validation split, because with an ensemble where the validation set has been seen by some of the models there's obvious data pollution and the results will be skewed.)
Huh, interesting idea. Makes me a bit nervous, e.g. we'd want to make sure we know which data seed each ensemble member was using in case we want to resume from checkpoint. But I guess it could work? @ym-han any thoughts? Is this something you've seen before? Reminds me of k-fold cross validation tbh, tho not exactly the same.
I haven't looked at the code so I can't be sure, but it sounds like this could be bagging (or something similar). There's some discussion and a link to some references on the sklearn bagging classifier page (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html), which I'll quote:
Breiman 1996 "Bagging Predictors", the chapter on bagging in Richard Berk's Statistical Learning from a Regression Perspective, and the section on bagging in Elements of Statistical Learning seem useful if you want to read up more.
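A minimal usage sketch of the linked `BaggingClassifier` (toy data, not tied to this repo): each ensemble member is fit on a random resample of the training set, which is roughly the "each net sees different data" idea, while the held-out test set stays untouched.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)  # toy data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 10 estimators, each trained on a random 80% bootstrap resample of the training set.
bag = BaggingClassifier(n_estimators=10, max_samples=0.8, random_state=0)
bag.fit(X_train, y_train)
print(bag.score(X_test, y_test))  # the held-out test set is never resampled
```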
Yeah, that sounds right. But I presume all of those methods assume you're sampling the training data with a fixed held-out validation set, whereas here the validation set of one ensemble member is used as training data for another.
Did some reading on KFold and cross validation, and Pokey is right in the sense that there is a held-out set kept separate from the validation sets for each model. I.e. given a data set of A, B, C ... Z and two models, these are the current splits (note the overlap):
Whereas, in my opinion, the best split would be:
Where we keep 10 percent of the total data set held out to test the ensemble on, use the remaining 90 percent for training and validation, and give the models validation sets that do not overlap with one another. The issue then is still that we need to persist the random seed to keep the split available for checkpointing, but I think this would give the best model results without any data pollution in the validation sets.
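A hypothetical sketch (names and sizes made up for illustration, not code from this PR) of how that proposed split could be wired up in torch: hold out roughly 10 percent as an ensemble-level test set, then carve the remaining indices into non-overlapping validation folds, one per model.

```python
import torch
from torch.utils.data import Subset, TensorDataset, random_split

dataset = TensorDataset(torch.arange(26).float())  # stand-in for A, B, C ... Z
n_models = 2
data_seed = 42

rng = torch.Generator().manual_seed(data_seed)
test_size = len(dataset) // 10  # ~10% held out for the ensemble
rest, test_set = random_split(dataset, [len(dataset) - test_size, test_size], generator=rng)

# Split the remaining indices into n_models non-overlapping folds; each model
# validates on its own fold and trains on every other fold.
folds = [[int(i) for i in f] for f in torch.chunk(torch.tensor(rest.indices), n_models)]
splits = []
for m in range(n_models):
    val_idx = folds[m]
    train_idx = [i for k, f in enumerate(folds) if k != m for i in f]
    splits.append((Subset(dataset, train_idx), Subset(dataset, val_idx)))
```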
This table is helpful. Yes, I agree that it's important that whatever data is in the test set for the ensemble is not data that has been used to either train or tune the hyperparams of any of the models of the ensemble.
hmm I'd be tempted to either
Happy to hash this one out on Discord tho
The second option seems fine for now; we can revisit the held-out data set at a later date.
Ok, I'll close this one for now; at some point it's prob worth cleaning up the split code and pulling the LR schedule stuff out of here, but this PR would need a lot of tweaking to get there. I pointed to this PR from the relevant issues.