Multi-threaded data loading and augmentation? #50

Open
GilesStrong opened this issue Jun 5, 2020 · 0 comments
Labels: improvement (Something which would improve current status, but not add anything new), low priority (Not urgent and won't degrade with time)

Current state

The current process for loading data during training is:

  1. A complete fold of data is loaded from the hard drive (HDF5) by a FoldYielder
  2. Any requested data augmentation is applied to the fold
  3. The fold is then passed to a BatchYielder, which either loads the entire fold to the device at once, or loads mini-batches to the device one at a time
  4. Mini-batches are passed through the model and parameters are updated
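
For concreteness, a minimal sketch of these four steps (the HDF5 layout, augmentation, and model below are placeholders for illustration, not the actual FoldYielder/BatchYielder API):

```python
import h5py
import numpy as np
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = torch.nn.Linear(30, 1).to(device)
opt = torch.optim.Adam(model.parameters())
loss_fn = torch.nn.MSELoss()

with h5py.File('train.hdf5', 'r') as f:       # 1. load a complete fold (placeholder layout)
    x = f['fold_0/inputs'][()]
    y = f['fold_0/targets'][()].reshape(-1, 1)
x = x + np.random.normal(0, 1e-3, x.shape)    # 2. placeholder augmentation

bs = 256
for i in range(0, len(x), bs):                # 3. mini-batches loaded to device one at a time
    # (alternatively, the whole fold could be moved to the device in one go)
    xb = torch.as_tensor(x[i:i + bs], dtype=torch.float32, device=device)
    yb = torch.as_tensor(y[i:i + bs], dtype=torch.float32, device=device)
    loss = loss_fn(model(xb), yb)             # 4. forward pass and parameter update
    loss.backward()
    opt.step()
    opt.zero_grad()
```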

The current process for loading data during predicting is:

  1. A complete fold of data is loaded from the hard drive (HDF5) by a FoldYielder
  2. Any requested data augmentation is applied to the fold
  3. The entire fold is passed through the model, or mini-batches are passed separately
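
And a matching sketch for prediction, reusing the placeholder names from the sketch above:

```python
model.eval()
with torch.no_grad():
    x_t = torch.as_tensor(x, dtype=torch.float32, device=device)
    preds = model(x_t)                             # entire fold in one pass, or...
    preds = torch.cat([model(x_t[i:i + bs])        # ...mini-batch by mini-batch
                       for i in range(0, len(x_t), bs)])
```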

Problems

  • Data augmentation currently causes perceptible slowdowns during training and testing
  • Loading data to the device can be slow: it is quicker to load the entire fold at once, but this requires a large amount of device memory

Possible solutions

  • Data augmentation could be applied using multi-threading (see the first sketch after this list). This should be trivial, but the splitting and concatenating of DataFrames may actually slow down the process. Maybe Dask could be useful?
  • Worker processes could be used by BatchYielder to load mini-batches to the device in the background, reducing the memory overhead without introducing delays.
    • Could perhaps replace BatchYielder with, or inherit from, a PyTorch DataLoader, which includes multi-process workers (although I find that they're slower than single-core...); see the second sketch after this list.
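
A rough sketch of the first idea, parallelising augmentation over chunks of the fold with standard-library worker processes. The augmentation itself is a placeholder, and the chunking plus pd.concat here are exactly the split/concatenate overhead mentioned above:

```python
import numpy as np
import pandas as pd
from concurrent.futures import ProcessPoolExecutor

def _augment_chunk(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()                               # assumes an all-numeric DataFrame
    out += np.random.normal(0, 1e-3, out.shape)   # placeholder augmentation
    return out

def augment_parallel(df: pd.DataFrame, n_workers: int = 4) -> pd.DataFrame:
    size = int(np.ceil(len(df) / n_workers))
    chunks = [df.iloc[i:i + size] for i in range(0, len(df), size)]
    with ProcessPoolExecutor(max_workers=n_workers) as ex:
        return pd.concat(ex.map(_augment_chunk, chunks))
```

And a sketch of the second idea using a plain PyTorch Dataset/DataLoader (again with placeholder data and augmentation). Because the augmentation happens inside __getitem__, the DataLoader's worker processes perform it in the background, and pin_memory=True plus non_blocking=True lets the host-to-device copy overlap with compute:

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class FoldDataset(Dataset):
    def __init__(self, inputs: np.ndarray, targets: np.ndarray):
        self.inputs, self.targets = inputs, targets

    def __len__(self) -> int:
        return len(self.inputs)

    def __getitem__(self, i: int):
        # placeholder per-sample augmentation, run inside the worker process
        x = self.inputs[i] + np.random.normal(0, 1e-3, self.inputs[i].shape)
        return (torch.as_tensor(x, dtype=torch.float32),
                torch.as_tensor(self.targets[i], dtype=torch.float32))

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
ds = FoldDataset(np.random.rand(10_000, 30), np.random.rand(10_000, 1))
dl = DataLoader(ds, batch_size=256, shuffle=True, num_workers=4,
                pin_memory=True, persistent_workers=True)
for xb, yb in dl:
    xb = xb.to(device, non_blocking=True)
    yb = yb.to(device, non_blocking=True)
    # forward pass and parameter update as usual
```

On tabular data, much of the DataLoader overhead tends to come from calling __getitem__ once per sample and then collating, which may explain workers appearing slower than single-core. Disabling automatic batching (batch_size=None with a BatchSampler passed as sampler=, so __getitem__ receives a whole batch of indices at once) and persistent_workers=True can reduce this, though both are worth benchmarking.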