2.17.0
Dataset Features
- [WebDataset] Audio support and bug fixes by @lhoestq in #6573
- Add concurrent loading of shards to datasets.load_from_disk by @kkoutini in #6464
- Support data_dir parameter in push_to_hub by @albertvillanova in #6634
- Support push_to_hub without org/user to default to logged-in user by @albertvillanova in #6629
- Allow concatenation of datasets with mixed structs by @Dref360 in #6587
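As a rough illustration of the concurrent shard loading idea in #6464, the sketch below reads several shard files in a thread pool while keeping rows in shard order. It uses only the standard library and is not the implementation in datasets.load_from_disk: the JSON shard format and the helper names (load_shard, load_shards_concurrently, demo) are assumptions made for this example.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def load_shard(path: Path) -> list:
    """Read one shard (here: a JSON file holding a list of rows)."""
    with path.open() as f:
        return json.load(f)


def load_shards_concurrently(paths, max_workers=4):
    """Load every shard in a thread pool and return rows in shard order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, regardless of which
        # thread finishes first, so the final row order is deterministic.
        shards = list(pool.map(load_shard, paths))
    return [row for shard in shards for row in shard]


def demo():
    """Write three tiny hypothetical shards to a temp dir and load them back."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i in range(3):
            p = Path(tmp) / f"shard-{i}.json"
            p.write_text(json.dumps([{"id": i * 2}, {"id": i * 2 + 1}]))
            paths.append(p)
        return load_shards_concurrently(paths)
```

Threads suit this kind of work because reading shards is mostly I/O bound; the actual speedup depends on the storage backend.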
General improvements and bug fixes
- Fix parallel downloads for datasets without scripts by @lhoestq in #6551
- Fix imagefolder with one image by @lhoestq in #6556
- Fix tests based on datasets that used to have scripts by @lhoestq in #6574
- remove eli5 test by @lhoestq in #6583
- [IterableDataset] Fix drop_last_batch in map after shuffling or sharding by @lhoestq in #6575
- Support standalone yaml by @lhoestq in #6557
- Drop redundant None guard. by @xkszltl in #6596
- fix os.listdir return name is empty string by @d710055071 in #6581
- Fix CI: pyarrow 15, pandas 2.2 and sqlalchemy by @lhoestq in #6617
- Dedicated RNG object for fingerprinting by @mariosasko in #6606
- Migrate from setup.cfg to pyproject.toml by @mariosasko in #6619
- keep more info in DatasetInfo.from_merge #6585 by @JochenSiegWork in #6586
- Read GeoParquet files using parquet reader by @weiji14 in #6508
- Use schema metadata only if it matches features by @lhoestq in #6616
- Raise error on bad split name by @lhoestq in #6626
- Disable tqdm bars in non-interactive environments by @mariosasko in #6627
- Add with_rank param to Dataset.filter by @mariosasko in #6608
- Bump max range of dill to 0.3.8 by @ringohoffman in #6630
- Fix filelock: use current umask for filelock >= 3.10 by @lhoestq in #6631
- Faster webdataset streaming by @lhoestq in #6578
- Multi gpu docs by @lhoestq in #6550
- dataset viewer requires no-script by @severo in #6633
- Make split slicing consistent with list slicing by @mariosasko in #5891
- Do not use Parquet exports if revision is passed by @albertvillanova in #6555
- Make CLI test support multi-processing by @albertvillanova in #6628
- Fix reload cache with data dir by @lhoestq in #6632
- Fix array cast/embed with null values by @mariosasko in #6283
- Faster column validation and reordering by @psmyth94 in #6636
- Better multi-gpu example by @lhoestq in #6646
- Fix missing info when loading some datasets from Parquet export by @lhoestq in #6635
- Minor multi gpu doc improvement by @lhoestq in #6649
- Document usage of hfh cli instead of git by @lhoestq in #6648
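The entry for #6606 does not spell out the mechanism, but the general motivation for a dedicated RNG object can be sketched with the standard library: a private random.Random instance produces values without consuming or reseeding the global generator that user code may have seeded. The names below (_fingerprint_rng, new_fingerprint) are hypothetical and do not come from the datasets codebase.

```python
import random

# Hypothetical private generator used only for fingerprints. It is seeded
# from OS entropy and is independent of the module-level `random` state.
_fingerprint_rng = random.Random()


def new_fingerprint() -> str:
    """Return a random 64-bit hex fingerprint without touching the global RNG."""
    return f"{_fingerprint_rng.getrandbits(64):016x}"


def global_draws_with_fingerprinting(seed: int) -> list:
    """Seed the global RNG, then interleave a fingerprint between two draws."""
    random.seed(seed)
    first = random.random()
    new_fingerprint()  # draws from the private generator only
    second = random.random()
    return [first, second]


def global_draws_plain(seed: int) -> list:
    """Same two global draws, with no fingerprint generated in between."""
    random.seed(seed)
    return [random.random(), random.random()]
```

Both functions return the same pair for a given seed, which shows that generating a fingerprint leaves the user's seeded sequence untouched.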
New Contributors
- @xkszltl made their first contribution in #6596
- @kkoutini made their first contribution in #6464
- @JochenSiegWork made their first contribution in #6586
- @weiji14 made their first contribution in #6508
- @ringohoffman made their first contribution in #6630
- @psmyth94 made their first contribution in #6636
Full Changelog: 2.16.1...2.17.0