Multi-threaded data loading and augmentation? #50

Open
GilesStrong opened this issue Jun 5, 2020 · 0 comments
Labels: improvement (Something which would improve current status, but not add anything new), low priority (Not urgent and won't degrade with time)

Current state

The current process for loading data during training is:

  1. A complete fold of data is loaded from the hard drive (HDF5) by a FoldYielder
  2. Any requested data augmentation is applied to the fold
  3. The fold is then passed to a BatchYielder, which either loads the entire fold to the device at once, or loads mini-batches to the device one at a time
  4. Mini-batches are passed through the model and parameters are updated
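
For concreteness, a minimal sketch of these four steps (the HDF5 layout, augmentation, and model below are placeholders for illustration, not the actual FoldYielder/BatchYielder API):

```python
import h5py
import numpy as np
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = torch.nn.Linear(30, 1).to(device)
opt = torch.optim.Adam(model.parameters())
loss_fn = torch.nn.MSELoss()

with h5py.File('train.hdf5', 'r') as f:       # 1. load a complete fold (placeholder layout)
    x = f['fold_0/inputs'][()]
    y = f['fold_0/targets'][()].reshape(-1, 1)
x = x + np.random.normal(0, 1e-3, x.shape)    # 2. placeholder augmentation

bs = 256
for i in range(0, len(x), bs):                # 3. mini-batches loaded to device one at a time
    # (alternatively, the whole fold could be moved to the device in one go)
    xb = torch.as_tensor(x[i:i + bs], dtype=torch.float32, device=device)
    yb = torch.as_tensor(y[i:i + bs], dtype=torch.float32, device=device)
    loss = loss_fn(model(xb), yb)             # 4. forward pass and parameter update
    loss.backward()
    opt.step()
    opt.zero_grad()
```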

The current process for loading data during predicting is:

  1. A complete fold of data is loaded from the hard drive (HDF5) by a FoldYielder
  2. Any requested data augmentation is applied to the fold
  3. The entire fold is passed through the model, or mini-batches are passed separately
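
And a matching sketch for prediction, reusing the placeholder names from the sketch above:

```python
model.eval()
with torch.no_grad():
    x_t = torch.as_tensor(x, dtype=torch.float32, device=device)
    preds = model(x_t)                             # entire fold in one pass, or...
    preds = torch.cat([model(x_t[i:i + bs])        # ...mini-batch by mini-batch
                       for i in range(0, len(x_t), bs)])
```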

Problems

  • Data augmentation currently causes perceptible slowdowns during training and testing
  • Loading data to the device can be slow: it is quicker to load the entire fold at once, but this requires a large amount of device memory

Possible solutions

  • Data augmentation could be applied using multi-threading (see the first sketch after this list). This should be trivial, but the splitting and concatenating of DataFrames may actually slow down the process. Maybe Dask could be useful?
  • Worker processes could be used by BatchYielder to load mini-batches to the device in the background, reducing the memory overhead without introducing delays.
    • Could perhaps replace BatchYielder with, or inherit from, a PyTorch DataLoader, which includes multi-process workers (although I find that they're slower than single-core...); see the second sketch after this list.
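
A rough sketch of the first idea, parallelising augmentation over chunks of the fold with standard-library worker processes. The augmentation itself is a placeholder, and the chunking plus pd.concat here are exactly the split/concatenate overhead mentioned above:

```python
import numpy as np
import pandas as pd
from concurrent.futures import ProcessPoolExecutor

def _augment_chunk(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()                               # assumes an all-numeric DataFrame
    out += np.random.normal(0, 1e-3, out.shape)   # placeholder augmentation
    return out

def augment_parallel(df: pd.DataFrame, n_workers: int = 4) -> pd.DataFrame:
    size = int(np.ceil(len(df) / n_workers))
    chunks = [df.iloc[i:i + size] for i in range(0, len(df), size)]
    with ProcessPoolExecutor(max_workers=n_workers) as ex:
        return pd.concat(ex.map(_augment_chunk, chunks))
```

And a sketch of the second idea using a plain PyTorch Dataset/DataLoader (again with placeholder data and augmentation). Because the augmentation happens inside __getitem__, the DataLoader's worker processes perform it in the background, and pin_memory=True plus non_blocking=True lets the host-to-device copy overlap with compute:

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class FoldDataset(Dataset):
    def __init__(self, inputs: np.ndarray, targets: np.ndarray):
        self.inputs, self.targets = inputs, targets

    def __len__(self) -> int:
        return len(self.inputs)

    def __getitem__(self, i: int):
        # placeholder per-sample augmentation, run inside the worker process
        x = self.inputs[i] + np.random.normal(0, 1e-3, self.inputs[i].shape)
        return (torch.as_tensor(x, dtype=torch.float32),
                torch.as_tensor(self.targets[i], dtype=torch.float32))

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
ds = FoldDataset(np.random.rand(10_000, 30), np.random.rand(10_000, 1))
dl = DataLoader(ds, batch_size=256, shuffle=True, num_workers=4,
                pin_memory=True, persistent_workers=True)
for xb, yb in dl:
    xb = xb.to(device, non_blocking=True)
    yb = yb.to(device, non_blocking=True)
    # forward pass and parameter update as usual
```

On tabular data, much of the DataLoader overhead tends to come from calling __getitem__ once per sample and then collating, which may explain workers appearing slower than single-core. Disabling automatic batching (batch_size=None with a BatchSampler passed as sampler=, so __getitem__ receives a whole batch of indices at once) and persistent_workers=True can reduce this, though both are worth benchmarking.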