Make a dataset shuffler which can shuffle per batch without reading all of the batches into memory first. Saves memory on some of the excessively large datasets, such as DE_HDT.
1 parent 82f7872, commit e4c2273
Showing 2 changed files with 19 additions and 11 deletions.
@@ -280,4 +280,20 @@ def resolve_none(data):
                    data[sent_idx][tok_idx][feat_idx] = '_'
    return data

class ShuffledDataset:
    def __init__(self, datasets, batch_size):
        self.batch_size = batch_size
        self.datasets = datasets
        self.loaders = [x.to_loader(batch_size=self.batch_size, shuffle=True) for x in self.datasets]

    def __iter__(self):
        iterators = [iter(x) for x in self.loaders]
[Minimized inline comment from AngledLuffa (Author, Collaborator)]

        lengths = [len(x) for x in self.loaders]
        indices = [[x] * y for x, y in enumerate(lengths)]
        indices = [idx for inner in indices for idx in inner]

        for idx in indices:
            yield(next(iterators[idx]))

    def __len__(self):
        return sum(len(x) for x in self.datasets)
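The interleaving idea in `__iter__` can be tried in isolation. The following is a minimal sketch, not the committed code: `InterleavedBatches` is a hypothetical stand-in, plain lists replace the loaders returned by `to_loader`, and the `shuffle` call on `indices` is an assumption, since no such call is visible in the diff as shown here.

```python
import random


class InterleavedBatches:
    """Hypothetical stand-in for the ShuffledDataset in the diff.

    `loaders` is any list of sized, re-iterable batch sources. Plain lists
    are used here; the diff uses the objects returned by `to_loader`.
    """

    def __init__(self, loaders, seed=None):
        self.loaders = loaders
        self.rng = random.Random(seed)

    def __iter__(self):
        iterators = [iter(x) for x in self.loaders]
        # One entry per batch, tagged with the index of the loader it comes from.
        indices = [i for i, loader in enumerate(self.loaders) for _ in range(len(loader))]
        # Assumption: randomize the draw order. This call is not in the diff as shown.
        self.rng.shuffle(indices)
        for idx in indices:
            # Only the batch being yielded is pulled from its loader.
            yield next(iterators[idx])

    def __len__(self):
        # Counts batches across loaders (the diff sums len() over the datasets instead).
        return sum(len(x) for x in self.loaders)


loaders = [["a0", "a1", "a2"], ["b0", "b1"]]
batches = list(InterleavedBatches(loaders, seed=0))
```

Every batch is yielded exactly once, and batches from the same source keep their relative order. Only the choice of which source to draw from next is randomized, so any within-source shuffling is left to the loaders themselves.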
I believe this does exactly the same per-iteration loading as the previous loop, so I'm worried it won't solve the OOM issues we saw in German.
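One way to test a concern like this is to measure peak allocation for the two iteration styles. The sketch below uses only the standard-library `tracemalloc` module and a hypothetical `make_batches` generator. It does not exercise the actual loaders, so it only illustrates the measurement, not the outcome for this change.

```python
import tracemalloc


def make_batches(n_batches, batch_size):
    """Hypothetical batch source: builds one fresh list per batch on demand."""
    for i in range(n_batches):
        yield [i] * batch_size


def peak_bytes(consume):
    """Run `consume()` and return the peak traced allocation in bytes."""
    tracemalloc.start()
    try:
        consume()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def eager():
    # Materializes every batch before iterating over them.
    batches = list(make_batches(200, 10_000))
    for _ in batches:
        pass


def lazy():
    # Pulls batches one at a time and lets each one go before the next.
    for _ in make_batches(200, 10_000):
        pass


eager_peak = peak_bytes(eager)
lazy_peak = peak_bytes(lazy)
print(f"eager peak: {eager_peak} bytes, lazy peak: {lazy_peak} bytes")
```

If the new iterator really loads lazily, its peak should look like the `lazy` case. A peak close to the `eager` case would support the worry that the underlying loading is unchanged.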