Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat: adds support for reading mosaic mds written dataset #210

Merged
merged 28 commits into from
Jul 8, 2024

Conversation

bhimrazy
Copy link
Collaborator

@bhimrazy bhimrazy commented Jul 6, 2024

What does this PR do?

Fixes #195.

@bhimrazy bhimrazy marked this pull request as draft July 6, 2024 10:32
Copy link

codecov bot commented Jul 6, 2024

Codecov Report

Attention: Patch coverage is 96.36364% with 2 lines in your changes missing coverage. Please review.

Please upload report for BASE (main@70dc4a4). Learn more about missing BASE report.

Additional details and impacted files
@@          Coverage Diff          @@
##             main   #210   +/-   ##
=====================================
  Coverage        ?    77%           
=====================================
  Files           ?     33           
  Lines           ?   4704           
  Branches        ?      0           
=====================================
  Hits            ?   3625           
  Misses          ?   1079           
  Partials        ?      0           

@bhimrazy bhimrazy marked this pull request as ready for review July 6, 2024 12:00
Copy link
Collaborator

@deependujha deependujha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work. Few suggestions from my side.

And update the wrong spelling of Parallelize in readme, which is causing pre-commit-ci to fail.

README.md:55: Paralellize ==> Parallelize
README.md:653: Paralellize ==> Parallelize

src/litdata/utilities/dataset_utilities.py Outdated Show resolved Hide resolved
tests/streaming/test_dataset.py Outdated Show resolved Hide resolved
@bhimrazy bhimrazy requested a review from deependujha July 7, 2024 18:07
Copy link
Collaborator

@deependujha deependujha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

Added tests for checking if the mds dataset supports Litdata streamingDataset features like train_test_split, subsample, dataloader (last_drop=True | False).

One test fails on mac (timeout), even though I've not changed anything in any function.

@bhimrazy
Copy link
Collaborator Author

bhimrazy commented Jul 8, 2024

Thank you @deependujha for adding these additional tests.

Copy link
Collaborator

@tchaton tchaton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work ! It would be good to benchmark this once more.

@tchaton tchaton merged commit 14a47f2 into Lightning-AI:main Jul 8, 2024
28 checks passed
@bhimrazy bhimrazy deleted the feat/adds-mosaic-mds-support branch July 8, 2024 09:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for Mosaic Streaming WDS data format
3 participants