Multiprocessing not improving I/O-bound performance - how to train effectively with larger datasets? #20458
openSourcerer9000 asked this question in Q&A
So I'm trying to train a model on some n-dimensional data. The data is stored locally as a .zarr and read into Python as an xarray Dataset object (lazily loaded, reading from disk during `__getitem__`). The whole dataset is larger than memory. This is my code:
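(The snippet below is a minimal sketch of that setup rather than the original code, assuming a `keras.utils.PyDataset` subclass; the store path `"data.zarr"`, the variable names `"x"`/`"y"`, and the batch dimension `"sample"` are placeholders.)

```python
import math
import xarray as xr
import keras

class ZarrDataset(keras.utils.PyDataset):
    """Serves batches from a larger-than-memory zarr store via xarray."""

    def __init__(self, path, batch_size=32, **kwargs):
        # workers / use_multiprocessing / max_queue_size are forwarded to PyDataset
        super().__init__(**kwargs)
        self.ds = xr.open_zarr(path)  # lazily loaded; nothing is read from disk yet
        self.batch_size = batch_size

    def __len__(self):
        return math.ceil(self.ds.sizes["sample"] / self.batch_size)

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        batch = self.ds.isel(sample=sl)
        # .values triggers the actual disk read, but only for this slice
        return batch["x"].values, batch["y"].values
```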
The docs recommend PyDataset for multiprocessing. I tested, and xarray Datasets can be pickled. If I have the multiprocessing flag on, my script just freezes at model.fit for several minutes before starting. Whether or not multiprocessing is on, it runs excruciatingly slowly (the same time/epoch either way), with my GPU just sitting at 4%, and I'm not entirely convinced it's actually doing anything. With the flag off, the GPU sits at 0% with occasional blips of 4%; with it on, the GPU holds a continuous 4%, but the run hangs for several minutes before starting and again upon completion, making it much slower overall than not using multiprocessing.
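(For context, in Keras 3 the `workers` and `use_multiprocessing` flags are constructor arguments of `PyDataset` rather than arguments to `model.fit`; a hedged usage sketch with the hypothetical `ZarrDataset` above, assuming `model` is an already compiled Keras model:)

```python
# Multiprocessing flag on: worker processes must pickle and re-open the dataset.
train = ZarrDataset("data.zarr", batch_size=32,
                    workers=4, use_multiprocessing=True, max_queue_size=10)
model.fit(train, epochs=10)

# Multiprocessing flag off (the default): batches are produced in the main process.
train = ZarrDataset("data.zarr", batch_size=32)
model.fit(train, epochs=10)
```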
Has anyone successfully used Keras with n-dimensional data in xarray, or with larger-than-memory datasets at all, for that matter? Xarray supports state-of-the-art parallelization via dask distributed, so there seems to be a giant wrench in the machine somewhere with whatever Keras is doing for parallelization.