Releases: huggingface/datasets
2.14.3
Bug fixes
- Fix error when loading from GCP bucket by @albertvillanova in #6105
- Fix deprecation of use_auth_token in file_utils by @albertvillanova in #6107
Full Changelog: 2.14.2...2.14.3
2.14.2
Bug fixes
- Fix deprecation of use_auth_token in DownloadConfig by @albertvillanova in #6094
- Fix deprecation of errors in TextConfig by @albertvillanova in #6095
Full Changelog: 2.14.1...2.14.2
2.14.1
Bug fixes
- fix tqdm lock by @lhoestq in #6067
- fix tqdm lock deletion by @lhoestq in #6068
- Fix fsspec storage_options from load_dataset by @lhoestq in #6072
- No gzip encoding from github by @lhoestq in #6076
Other improvements
- Fix `Overview.ipynb` & detach Jupyter Notebooks from `datasets` repository by @alvarobartt in #5902
- Fix Quickstart notebook link by @mariosasko in #6070
- Remove README link to deprecated Colab notebook by @mariosasko in #6080
- Misc doc improvements by @mariosasko in #6074
Full Changelog: 2.14.0...2.14.1
2.14.0
Important: caching
- Datasets downloaded and cached using `datasets>=2.14.0` may not be reloaded from cache using older versions of `datasets` (and are therefore re-downloaded).
- Datasets that were already cached are still supported.
- This affects datasets on Hugging Face without dataset scripts, e.g. made of pure parquet, csv, jsonl, etc. files.
- This is because the default configuration name for those datasets has been fixed (from "username--dataset_name" to "default") in #5331.
Dataset Configuration
- Support for multiple configs via metadata yaml info by @polinaeterna in #5331
  - Configure your dataset using YAML at the top of your dataset card (docs here)
  - Choose which file goes into which split

        ---
        configs:
        - config_name: default
          data_files:
          - split: train
            path: data.csv
          - split: test
            path: holdout.csv
        ---

  - Define multiple dataset configurations

        ---
        configs:
        - config_name: main_data
          data_files: main_data.csv
        - config_name: additional_data
          data_files: additional_data.csv
        ---
Dataset Features
- Support for multiple configs via metadata yaml info by @polinaeterna in #5331
  - `push_to_hub()` additional dataset configurations

        ds.push_to_hub("username/dataset_name", config_name="additional_data")
        # reload later
        ds = load_dataset("username/dataset_name", "additional_data")

- Support returning dataframe in map transform by @mariosasko in #5995
What's Changed
- Deprecate `errors` param in favor of `encoding_errors` in text builder by @mariosasko in #5974
- Fix select_columns columns order by @lhoestq in #5994
- Replace metadata utils with `huggingface_hub`'s RepoCard API by @mariosasko in #5949
- Pin `joblib` to avoid `joblibspark` test failures by @mariosasko in #6000
- Align `column_names` type check with type hint in `sort` by @mariosasko in #6001
- Deprecate `use_auth_token` in favor of `token` by @mariosasko in #5996
- Drop Python 3.7 support by @mariosasko in #6005
- Misc improvements by @mariosasko in #6004
- Make IterableDataset.from_spark more efficient by @mathewjacob1002 in #5986
- Fix cast for dictionaries with no keys by @mariosasko in #6009
- Avoid stuck map operation when subprocesses crashes by @pappacena in #5976
- Deprecate task api by @mariosasko in #5865
- Add metadata ui screenshot in docs by @lhoestq in #6015
- Fix `ClassLabel` min max check for `None` values by @mariosasko in #6023
- [docs] Update return statement of index search by @stevhliu in #6021
- Improve logging by @mariosasko in #6019
- Fix style with ruff 0.0.278 by @lhoestq in #6026
- Don't reference self in Spark._validate_cache_dir by @maddiedawson in #6024
- Delete `task_templates` in `IterableDataset` when they are no longer valid by @mariosasko in #6027
- [docs] Fix link by @stevhliu in #6029
- fixed typo in comment by @NightMachinery in #6030
- Fix legacy_dataset_infos by @lhoestq in #6040
- Flatten repository_structure docs on yaml by @lhoestq in #6041
- Use new hffs by @lhoestq in #6028
- Bump dev version by @lhoestq in #6047
- Fix unused DatasetInfosDict code in push_to_hub by @lhoestq in #6042
- Rename "pattern" to "path" in YAML data_files configs by @lhoestq in #6044
- Remove `HfFileSystem` and deprecate `S3FileSystem` by @mariosasko in #6052
- Dill 3.7 support by @mariosasko in #6061
- Improve `Dataset.from_list` docstring by @mariosasko in #6062
- Check if column names match in Parquet loader only when config `features` are specified by @mariosasko in #6045
- Release: 2.14.0 by @lhoestq in #6063
New Contributors
- @mathewjacob1002 made their first contribution in #5986
- @pappacena made their first contribution in #5976
Full Changelog: 2.13.1...2.14.0
2.13.1
General improvements and bug fixes
- Fix JSON generation in benchmarks CI by @mariosasko in #5966
- Always return list in `list_datasets` by @mariosasko in #5964
- Add `encoding` and `errors` params to JSON loader by @mariosasko in #5969
- Filter unsupported extensions by @lhoestq in #5972
Full Changelog: 2.13.0...2.13.1
2.13.0
Dataset Features
- Add IterableDataset.from_spark by @maddiedawson in #5770
  - Stream the data from your Spark DataFrame directly to your training pipeline

        from datasets import IterableDataset
        from torch.utils.data import DataLoader

        ids = IterableDataset.from_spark(df)
        ids = ids.map(...).filter(...).with_format("torch")
        for batch in DataLoader(ids, batch_size=16, num_workers=4):
            ...

- IterableDataset formatting for PyTorch, TensorFlow, Jax, NumPy and Arrow:
  - IterableDataset Arrow formatting by @lhoestq in #5821
  - Iterable torch formatting by @lhoestq in #5852

        from datasets import load_dataset

        ids = load_dataset("c4", "en", split="train", streaming=True)
        ids = ids.map(...).with_format("torch")  # to get PyTorch tensors - also works with tf, np, jax etc.

- Add IterableDataset.from_file to load local dataset as iterable by @mariusz-jachimowicz-83 in #5893

        from datasets import IterableDataset

        ids = IterableDataset.from_file("path/to/data.arrow")

- Arrow dataset builder to be able to load and stream Arrow datasets by @mariusz-jachimowicz-83 in #5944

        from datasets import load_dataset

        ds = load_dataset("arrow", data_files={"train": "train.arrow", "test": "test.arrow"})
Experimental
General improvements and bug fixes
- Preserve `stopping_strategy` of shuffled interleaved dataset (random cycling case) by @mariosasko in #5816
- Fix incomplete docstring for `BuilderConfig` by @Laurent2916 in #5824
- [docs] Custom decoding transforms by @stevhliu in #5836
- Add `accelerate` as metric's test dependency to fix CI error by @mariosasko in #5848
- Add `date_format` param to the CSV reader by @mariosasko in #5845
- [docs] Redirects, migrated from nginx by @julien-c in #5853
- Fix infer module for uppercase extensions by @albertvillanova in #5872
- Minor tqdm optim by @lhoestq in #5860
- Always set nullable fields in the writer by @lhoestq in #5835
- Add `fn_kwargs` to `map` and `filter` of `IterableDataset` and `IterableDatasetDict` by @yuukicammy in #5810
- Better error message when combining dataset dicts instead of datasets by @lhoestq in #5861
- Force overwrite existing filesystem protocol by @baskrahmer in #5894
- Support working_dir in from_spark by @maddiedawson in #5826
- Raise TypeError when indexing a dataset with bool by @albertvillanova in #5859
- Fix minor typo in docs loading.mdx by @albertvillanova in #5900
- Fix `FixedSizeListArray` casting by @mariosasko in #5897
- Unpin responses by @mariosasko in #5916
- Validate name parameter in make_file_instructions by @albertvillanova in #5904
- Raise error in `DatasetBuilder.as_dataset` when `file_format` is not `"arrow"` by @mariosasko in #5915
- Refactor extensions by @albertvillanova in #5917
- Use more efficient and idiomatic way to construct list. by @ttsugriy in #5909
- Add `flatten_indices` to `DatasetDict` by @maximxlss in #5907
- Optimize IterableDataset.from_file using ArrowExamplesIterable by @lhoestq in #5920
- Make prepare_split more robust if errors in metadata dataset_info splits by @albertvillanova in #5901
- Fix streaming parquet with image feature in schema by @lhoestq in #5921
- canonicalize data dir in config ID hash by @kylrth in #5899
- Fix link to quickstart docs in README.md by @mariosasko in #5928
- Fix string-encoding, make `batch_size` optional, and minor improvements in `Dataset.to_tf_dataset` by @alvarobartt in #5883
- Use a new low-memory approach for tf dataset index shuffling by @Rocketknight1 in #5863
- [doc build] Use secrets by @mishig25 in #5932
- Fix `to_numpy` when None values in the sequence by @qgallouedec in #5933
- Better row group size in push_to_hub by @lhoestq in #5935
- Avoid parallel redownload in cache by @albertvillanova in #5937
- Better filenotfound for gated by @lhoestq in #5954
- Make get_from_cache use custom temp filename that is locked by @albertvillanova in #5938
- Fix ArrowExamplesIterable.shard_data_sources by @lhoestq in #5956
- Add Arrow builder docs by @lhoestq in #5952
- Fix sequence of array support for most dtype by @qgallouedec in #5948
New Contributors
- @Laurent2916 made their first contribution in #5824
- @yuukicammy made their first contribution in #5810
- @baskrahmer made their first contribution in #5894
- @ttsugriy made their first contribution in #5909
- @maximxlss made their first contribution in #5907
- @mariusz-jachimowicz-83 made their first contribution in #5893
- @kylrth made their first contribution in #5899
- @qgallouedec made their first contribution in #5933
- @es94129 made their first contribution in #5924
Full Changelog: 2.12.0...2.13.0
2.12.0
Datasets Features
- Add Dataset.from_spark by @maddiedawson in #5701
  - Get a Dataset from a Spark DataFrame (docs):

        >>> from datasets import Dataset
        >>> ds = Dataset.from_spark(df)

- Support streaming Beam datasets from HF GCS preprocessed data by @albertvillanova in #5689
  - Stream data from Wikipedia:

        >>> from datasets import load_dataset
        >>> ds = load_dataset("wikipedia", "20220301.de", streaming=True)
        >>> next(iter(ds["train"]))
        {'id': '1', 'url': 'https://de.wikipedia.org/wiki/Alan%20Smithee', 'title': 'Alan Smithee', 'text': 'Alan Smithee steht als Pseudonym für einen fiktiven Regisseur...}

- Implement sharding on merged iterable datasets by @Hubert-Bonisseur in #5735
  - Use interleaved datasets in a distributed setup or with a DataLoader

        >>> from datasets import load_dataset, interleave_datasets
        >>> from torch.utils.data import DataLoader
        >>> wiki = load_dataset("wikipedia", "20220301.en", split="train", streaming=True)
        >>> c4 = load_dataset("c4", "en", split="train", streaming=True)
        >>> merged = interleave_datasets([wiki, c4], probabilities=[0.1, 0.9], seed=42, stopping_strategy="all_exhausted")
        >>> dataloader = DataLoader(merged, num_workers=4)

- Consistent ArrayND Python formatting + better NumPy/Pandas formatting by @mariosasko in #5751
  - Return a list of lists instead of a list of NumPy arrays when converting the variable-shaped ArrayND to Python
  - Improve the NumPy conversion by returning a numeric NumPy array when the offsets are equal or a NumPy object array when they aren't
  - Allow converting the variable-shaped ArrayND to Pandas
General improvements and bug fixes
- Fix a description error for interleave_datasets. by @QizhiPei in #5680
- [docs] Split pattern search order by @stevhliu in #5693
- Raise an error on missing distributed seed by @lhoestq in #5697
- Fix xnumpy_load for .npz files by @albertvillanova in #5714
- Temporarily pin fsspec by @albertvillanova in #5731
- Unpin fsspec by @albertvillanova in #5733
- Fix CI warnings by @albertvillanova in #5741
- Fix CI mock filesystem fixtures by @albertvillanova in #5740
- Fix link in docs by @bbbxyz in #5746
- fix typo: "mow" -> "now" by @csris in #5763
- [docs] Compress data files by @stevhliu in #5691
- Fix style by @lhoestq in #5774
- Minor tqdm fixes by @mariosasko in #5754
- Fixes #5757 by @eli-osherovich in #5758
- Fix JSON builder when missing keys in first row by @albertvillanova in #5772
- Warning specifying future change in to_tf_dataset behaviour by @amyeroberts in #5742
- Prepare tests for hfh 0.14 by @Wauplin in #5788
- Call fs.makedirs in save_to_disk by @lhoestq in #5779
- Allow to run CI on push to ci-branch by @albertvillanova in #5790
- Fix nondeterministic sharded data split order by @albertvillanova in #5729
- Raise subprocesses traceback when interrupting by @lhoestq in #5784
- Fix spark imports by @lhoestq in #5795
- Change downloaded file permission based on umask by @albertvillanova in #5800
- Fix inferring module for unsupported data files by @albertvillanova in #5787
- Reorder default data splits to have validation before test by @albertvillanova in #5718
- Validate non-empty data_files by @albertvillanova in #5802
- Spark docs by @lhoestq in #5796
- Release: 2.12.0 by @lhoestq in #5803
New Contributors
- @QizhiPei made their first contribution in #5680
- @bbbxyz made their first contribution in #5746
- @csris made their first contribution in #5763
- @eli-osherovich made their first contribution in #5758
- @maddiedawson made their first contribution in #5701
Full Changelog: 2.11.0...2.12.0
2.11.0
Important
- Use soundfile for mp3 decoding instead of torchaudio by @polinaeterna in #5573
  - this removes the dependency on PyTorch for decoding audio files
  - this is possible since soundfile 0.12, which bundles recent libsndfile binaries with MP3 support
- Deprecated `batch_size` on `Dataset.to_dict()`
Datasets Features
- Add writer_batch_size for ArrowBasedBuilder by @lhoestq in #5565
  - allows specifying the row group / record batch size when you `download_and_prepare()` a dataset
- Experimental support of cloud storage in `load_dataset()`
- Support PyArrow arrays as column values in `from_dict` by @mariosasko in #5643
- Allow direct cast from binary to Audio/Image by @mariosasko in #5644
- Add column_names to IterableDataset by @patrickloeber in #5582
- pass the dataset features to the IterableDataset.from_generator function by @Hubert-Bonisseur in #5569
- add Dataset.to_list by @kyoto7250 in #5611
General improvements and bug fixes
- Update csv.py by @XDoubleU in #5562
- Remove instructions for `ffmpeg` system package installation on Colab by @polinaeterna in #5558
- Apply ruff flake8-comprehension checks by @Skylion007 in #5549
- Fix `datasets.load_from_disk`, `DatasetDict.load_from_disk` and `Dataset.load_from_disk` by @alvarobartt in #5529
- Add pre-commit config yaml file to enable automatic code formatting by @polinaeterna in #5561
- Add `huggingface_hub` version to env cli command by @mariosasko in #5578
- Do not write index by default when exporting a dataset by @mariosasko in #5583
- Flatten dataset on the fly in `save_to_disk` by @mariosasko in #5588
- Fix `sort` with indices mapping by @mariosasko in #5587
- Fix docstring example by @stevhliu in #5592
- Fix push_to_hub with no dataset_infos by @lhoestq in #5598
- Don't compute checksums if not necessary in `datasets-cli test` by @lhoestq in #5603
- Update README logo by @gary149 in #5605
- Fix CI by temporarily pinning fsspec < 2023.3.0 by @albertvillanova in #5617
- Fix archive fs test by @lhoestq in #5614
- unpin fsspec by @lhoestq in #5619
- Bump pyarrow to 8.0.0 by @lhoestq in #5620
- Remove set_access_token usage + fail tests if FutureWarning by @Wauplin in #5623
- Fix outdated `verification_mode` values by @polinaeterna in #5607
- Adding Oracle Cloud to docs by @ahosler in #5621
- Fix CI: ignore C901 ("some_func" is too complex) in `ruff` by @polinaeterna in #5636
- add kwargs to index search by @SaulLu in #5628
- Less zip false positives by @lhoestq in #5640
- Allow self as key in `Features` by @mariosasko in #5646
- Bump hfh to 0.11.0 by @lhoestq in #5642
- Support streaming datasets with numpy.load by @albertvillanova in #5626
- Fix unnecessary dict comprehension by @albertvillanova in #5662
- Fix CI by temporarily pinning tensorflow < 2.12.0 by @albertvillanova in #5664
- Copy features by @lhoestq in #5652
- Improve features decoding in to_iterable_dataset by @lhoestq in #5655
- Fix `fsspec.open` when using an HTTP proxy by @bryant1410 in #5656
- Jax requires jaxlib by @lhoestq in #5667
- docs: Update num_shards docs to mention num_proc on Dataset and DatasetDict by @connor-henderson in #5658
- Allow loading/saving of FAISS index using fsspec by @Dref360 in #5526
- Fix verification_mode when ignore_verifications is passed by @albertvillanova in #5683
- Release: 2.11.0 by @lhoestq in #5684
New Contributors
- @XDoubleU made their first contribution in #5562
- @Skylion007 made their first contribution in #5549
- @Hubert-Bonisseur made their first contribution in #5569
- @ahosler made their first contribution in #5621
- @patrickloeber made their first contribution in #5582
- @SaulLu made their first contribution in #5628
- @connor-henderson made their first contribution in #5658
- @kyoto7250 made their first contribution in #5611
Full Changelog: 2.10.0...2.11.0
2.10.1
What's Changed
- Fix sort with indices mapping by @mariosasko in #5587
  - Fix `IndexError` when doing `ds.filter(...).sort(...)` or `ds.select(...).sort(...)`
Full Changelog: 2.10.0...2.10.1
2.10.0
Important
- Avoid saving sparse ChunkedArrays in pyarrow tables by @marioga in #5542
  - Big improvements on the speed of `.flatten_indices()` (x2) + `save/load_from_disk` (x100) on selected/shuffled datasets
- Skip dataset verifications by default by @mariosasko in #5303
  - introduces multiple `verification_mode` values you can pass to `load_dataset()`
  - the new default verification steps are much faster (no need to compute expensive checksums)
Datasets features
- Single TQDM bar in multi-proc map by @mariosasko in #5455
  - No more stacked TQDM bars when calling `.map()` in multiprocessing
- Map-style Dataset to IterableDataset by @lhoestq in #5410
  - introduces `.to_iterable_dataset()` to get an `IterableDataset` from a `Dataset`
  - see all the advantages of `IterableDataset` in the documentation about the differences between Dataset and IterableDataset
- Select columns of Dataset or DatasetDict by @daskol in #5480
  - introduces `.select_columns()` to return a dataset only containing the requested columns
- Added functionality: sort datasets by multiple keys by @MichlF in #5502
  - introduces `ds = ds.sort(['col_1', 'col_2'], reverse=[True, False])`
- Add JAX device selection when formatting by @alvarobartt in #5547
  - introduces `ds = ds.with_format("jax", device=device)`
- Reload features from Parquet metadata by @MFreidank in #5516
- Speed up batched PyTorch DataLoader by @lhoestq in #5512
Documentation
- Add section in tutorial for IterableDataset by @stevhliu in #5485
- Tutorial for creating a dataset by @stevhliu in #5540
- Add JAX-formatting documentation by @alvarobartt in #5535
General improvements and bug fixes
- Pin sqlalchemy by @lhoestq in #5476
- Update dataset card creation by @stevhliu in #5470
- Add num_test_batches option by @amyeroberts in #5471
- Tip for recomputing metadata by @stevhliu in #5478
- Disable aiohttp requoting of redirection URL by @albertvillanova in #5459
- [MINOR] Typo by @cakiki in #5491
- Pin dill lower version by @albertvillanova in #5489
- Improved error message for gated/private repos by @osanseviero in #5497
- Update docs for `nyu_depth_v2` dataset by @awsaf49 in #5484
- don't zero copy timestamps by @dwyatte in #5504
- Remove unused `load_from_cache_file` arg from `Dataset.shard()` docstring by @polinaeterna in #5493
- Do not add index column by default when exporting to CSV by @albertvillanova in #5490
- Fix bug when casting empty array to class labels by @marioga in #5521
- Fix benchmarks CI - pin protobuf by @lhoestq in #5527
- Remove py.typed by @mariosasko in #5518
- Add missing license in `NumpyFormatter` by @alvarobartt in #5530
- Unify `load_from_cache_file` type and logic by @HallerPatrick in #5515
- Format code with `ruff` by @mariosasko in #5519
- Minor changes in JAX-formatting docstrings & type-hints by @alvarobartt in #5522
- Resolve four broken refs in the docs by @tomaarsen in #5550
- Use default audio resampling type by @lhoestq in #5556
  - resampy is no longer needed to resample audio data
- improved message error row formatting by @Plutone11011 in #5553
- Make tiktoken tokenizers hashable by @mariosasko in #5552
- Suggest scikit-learn instead of sklearn by @osbm in #5551
- Add filter desc by @lhoestq in #5557
- Fix map suffix_template by @lhoestq in #5559
- Ensure last tqdm update in map by @mariosasko in #5560
New Contributors
- @amyeroberts made their first contribution in #5471
- @awsaf49 made their first contribution in #5484
- @dwyatte made their first contribution in #5504
- @marioga made their first contribution in #5521
- @MFreidank made their first contribution in #5516
- @daskol made their first contribution in #5480
- @Plutone11011 made their first contribution in #5553
- @osbm made their first contribution in #5551
- @MichlF made their first contribution in #5502
Full Changelog: 2.9.0...2.10.0