Multiprocessing not improving I/O-bound performance - how to train effectively with larger datasets? #20458
openSourcerer9000 asked this question in Q&A
So I'm trying to train a model on some n-dimensional data. The data is stored locally as a .zarr and read into Python as an xarray Dataset object (lazily loaded, reading from disk during `__getitem__`). The whole dataset is larger than memory. This is my code:
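(The snippet below is a minimal sketch of that setup rather than the original code, assuming a `keras.utils.PyDataset` subclass; the store path `"data.zarr"`, the variable names `"x"`/`"y"`, and the batch dimension `"sample"` are placeholders.)

```python
import math
import xarray as xr
import keras

class ZarrDataset(keras.utils.PyDataset):
    """Serves batches from a larger-than-memory zarr store via xarray."""

    def __init__(self, path, batch_size=32, **kwargs):
        # workers / use_multiprocessing / max_queue_size are forwarded to PyDataset
        super().__init__(**kwargs)
        self.ds = xr.open_zarr(path)  # lazily loaded; nothing is read from disk yet
        self.batch_size = batch_size

    def __len__(self):
        return math.ceil(self.ds.sizes["sample"] / self.batch_size)

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        batch = self.ds.isel(sample=sl)
        # .values triggers the actual disk read, but only for this slice
        return batch["x"].values, batch["y"].values
```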
The docs recommend PyDataset for multiprocessing. I tested, and xarray Datasets can be pickled. If I have the multiprocessing flag on, my script just freezes at model.fit for several minutes before starting. Whether or not multiprocessing is on, it runs excruciatingly slowly (the same time/epoch either way), with my GPU just sitting at 4%, and I'm not entirely convinced it's actually doing anything. With the flag off, the GPU sits at 0% with occasional blips of 4%; with it on, the GPU holds a continuous 4%, but the run hangs for several minutes before starting and again upon completion, making it much slower overall than not using multiprocessing.
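(For context, in Keras 3 the `workers` and `use_multiprocessing` flags are constructor arguments of `PyDataset` rather than arguments to `model.fit`; a hedged usage sketch with the hypothetical `ZarrDataset` above, assuming `model` is an already compiled Keras model:)

```python
# Multiprocessing flag on: worker processes must pickle and re-open the dataset.
train = ZarrDataset("data.zarr", batch_size=32,
                    workers=4, use_multiprocessing=True, max_queue_size=10)
model.fit(train, epochs=10)

# Multiprocessing flag off (the default): batches are produced in the main process.
train = ZarrDataset("data.zarr", batch_size=32)
model.fit(train, epochs=10)
```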
Has anyone successfully used Keras with n-dimensional data in xarray, or with larger-than-memory datasets at all, for that matter? Xarray supports state-of-the-art parallelization via dask distributed, so there seems to be a giant wrench in the machine somewhere with whatever Keras is doing for parallelization.