
Releases: huggingface/datasets

2.14.3

03 Aug 10:31
33f736e

Bug fixes

Full Changelog: 2.14.2...2.14.3

2.14.2

31 Jul 06:39

Bug fixes

Full Changelog: 2.14.1...2.14.2

2.14.1

27 Jul 17:09
029956a

Bug fixes

Other improvements

Full Changelog: 2.14.0...2.14.1

2.14.0

24 Jul 15:54
88896a7

Important: caching

  • Datasets downloaded and cached with datasets>=2.14.0 may not be reloadable from the cache by older versions of datasets (and will therefore be re-downloaded).
  • Datasets that were already cached are still supported.
  • This affects datasets on the Hugging Face Hub that don't have a dataset script, e.g. datasets made of pure Parquet, CSV, JSONL, etc. files.
  • This is because the default configuration name for those datasets was fixed (from "username--dataset_name" to "default") in #5331.

Dataset Configuration

  • Support for multiple configs via metadata yaml info by @polinaeterna in #5331

    • Configure your dataset using YAML at the top of your dataset card (docs here)
    • Choose which file goes into which split
      ---
      configs:
      - config_name: default
        data_files:
        - split: train
          path: data.csv
        - split: test
          path: holdout.csv
      ---
    • Define multiple dataset configurations
      ---
      configs:
      - config_name: main_data
        data_files: main_data.csv
      - config_name: additional_data
        data_files: additional_data.csv
      ---

Dataset Features

  • Support for multiple configs via metadata yaml info by @polinaeterna in #5331

    • Push additional dataset configurations with push_to_hub()
    from datasets import load_dataset

    ds.push_to_hub("username/dataset_name", config_name="additional_data")
    # reload the additional configuration later
    ds = load_dataset("username/dataset_name", "additional_data")
  • Support returning dataframe in map transform by @mariosasko in #5995
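    • A minimal sketch of this, with an illustrative toy dataset and column names (not from the release notes): a batched map function can now return a pandas DataFrame, and its columns become columns of the resulting dataset
    import pandas as pd
    from datasets import Dataset

    ds = Dataset.from_dict({"text": ["a", "bb", "ccc"]})
    # the batched map function returns a DataFrame instead of a dict of lists
    ds = ds.map(
        lambda batch: pd.DataFrame({"text": batch["text"], "length": [len(t) for t in batch["text"]]}),
        batched=True,
    )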

What's Changed

New Contributors

Full Changelog: 2.13.1...2.14.0

2.13.1

22 Jun 18:31
682d21e

General improvements and bug fixes

Full Changelog: 2.13.0...2.13.1

2.13.0

14 Jun 16:25
9aaee6f

Dataset Features

  • Add IterableDataset.from_spark by @maddiedawson in #5770

    • Stream the data from your Spark DataFrame directly to your training pipeline
    from datasets import IterableDataset
    from torch.utils.data import DataLoader
    
    ids = IterableDataset.from_spark(df)
    ids = ids.map(...).filter(...).with_format("torch")
    for batch in DataLoader(ids, batch_size=16, num_workers=4):
        ...
  • IterableDataset formatting for PyTorch, TensorFlow, Jax, NumPy and Arrow:

    from datasets import load_dataset
    
    ids = load_dataset("c4", "en", split="train", streaming=True)
    ids = ids.map(...).with_format("torch")  # to get PyTorch tensors - also works with tf, np, jax etc.
  • Add IterableDataset.from_file to load local dataset as iterable by @mariusz-jachimowicz-83 in #5893

    from datasets import IterableDataset
    
    ids = IterableDataset.from_file("path/to/data.arrow")
  • Arrow dataset builder to be able to load and stream Arrow datasets by @mariusz-jachimowicz-83 in #5944

    from datasets import load_dataset
    
    ds = load_dataset("arrow", data_files={"train": "train.arrow", "test": "test.arrow"})

Experimental

  • Add parallel module using joblib for Spark by @es94129 in #5924
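    • A hedged sketch of how this might be used; the entry point and backend registration below are assumptions based on the PR title, so check the documentation before relying on them
    from datasets import load_dataset
    from datasets.parallel import parallel_backend  # experimental module added in #5924

    # the "spark" joblib backend is provided by the joblibspark package (assumed prerequisite)
    from joblibspark import register_spark
    register_spark()

    # distribute the download/prepare steps over Spark (exact keyword arguments are an assumption)
    with parallel_backend("spark"):
        ds = load_dataset("c4", "en")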

General improvements and bug fixes

New Contributors

Full Changelog: 2.12.0...2.13.0

2.12.0

28 Apr 10:09
8e1af7b

Datasets Features

  • Add Dataset.from_spark by @maddiedawson in #5701

    • Get a Dataset from a Spark DataFrame (docs):
    >>> from datasets import Dataset
    >>> ds = Dataset.from_spark(df)
  • Support streaming Beam datasets from HF GCS preprocessed data by @albertvillanova in #5689

    • Stream data from Wikipedia:
    >>> from datasets import load_dataset
    >>> ds = load_dataset("wikipedia", "20220301.de", streaming=True)
    >>> next(iter(ds["train"]))
    {'id': '1', 'url': 'https://de.wikipedia.org/wiki/Alan%20Smithee', 'title': 'Alan Smithee', 'text': 'Alan Smithee steht als Pseudonym für einen fiktiven Regisseur...'}
  • Implement sharding on merged iterable datasets by @Hubert-Bonisseur in #5735

    • Use interleaved datasets in a distributed setup or with a DataLoader
    >>> from datasets import load_dataset, interleave_datasets
    >>> from torch.utils.data import DataLoader
    >>> wiki = load_dataset("wikipedia", "20220301.en", split="train", streaming=True)
    >>> c4 = load_dataset("c4", "en", split="train", streaming=True)
    >>> merged = interleave_datasets([wiki, c4], probabilities=[0.1, 0.9], seed=42, stopping_strategy="all_exhausted")
    >>> dataloader = DataLoader(merged, num_workers=4)
  • Consistent ArrayND Python formatting + better NumPy/Pandas formatting by @mariosasko in #5751

    • Return a list of lists instead of a list of NumPy arrays when converting the variable-shaped ArrayND to Python
    • Improve the NumPy conversion by returning a numeric NumPy array when the offsets are equal or a NumPy object array when they aren't
    • Allow converting the variable-shaped ArrayND to Pandas

General improvements and bug fixes

New Contributors

Full Changelog: 2.11.0...2.12.0

2.11.0

29 Mar 18:23
3b16e08

Important

  • Use soundfile for mp3 decoding instead of torchaudio by @polinaeterna in #5573
    • this removes the dependency on PyTorch (torchaudio) for decoding audio files
    • this is possible because soundfile 0.12 bundles recent libsndfile binaries with MP3 support
  • Deprecated batch_size on Dataset.to_dict()

Datasets Features

  • Add writer_batch_size for ArrowBasedBuilder by @lhoestq in #5565
    • allows specifying the row group / record batch size when you download_and_prepare() a dataset
  • Experimental support for cloud storage in load_dataset()
  • Support PyArrow arrays as column values in from_dict by @mariosasko in #5643 (see the sketch below)
  • Allow direct cast from binary to Audio/Image by @mariosasko in #5644 (also covered in the sketch below)
  • Add column_names to IterableDataset by @patrickloeber in #5582
  • pass the dataset features to the IterableDataset.from_generator function by @Hubert-Bonisseur in #5569
  • add Dataset.to_list by @kyoto7250 in #5611
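  • A minimal sketch of the from_dict and binary-cast features above (the toy data and file name are illustrative, not from the release notes):

    import pyarrow as pa
    from datasets import Dataset, Image

    # PyArrow arrays can now be passed directly as column values to from_dict
    ds = Dataset.from_dict({"ids": pa.array([1, 2, 3])})

    # a column of raw encoded image bytes can now be cast directly to Image (or Audio)
    with open("cat.png", "rb") as f:
        ds_img = Dataset.from_dict({"img": [f.read()]})
    ds_img = ds_img.cast_column("img", Image())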

General improvements and bug fixes

New Contributors

Full Changelog: 2.10.0...2.11.0

2.10.1

28 Feb 18:08
2843fce

What's Changed

  • Fix sort with indices mapping by @mariosasko in #5587
    • Fix IndexError when doing ds.filter(...).sort(...) or ds.select(...).sort(...)

Full Changelog: 2.10.0...2.10.1

2.10.0

22 Feb 12:58
cac733f

Important

  • Avoid saving sparse ChunkedArrays in pyarrow tables by @marioga in #5542
    • Big speed improvements for .flatten_indices() (2x) and save_to_disk / load_from_disk (100x) on selected/shuffled datasets
  • Skip dataset verifications by default by @mariosasko in #5303
    • introduces multiple verification_mode values you can pass to load_dataset() (see the sketch below)
    • the new default verification steps are much faster (no need to compute expensive checksums)
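  • A minimal sketch of verification_mode (the dataset name is just an example; "basic_checks" is the faster new default, "all_checks" restores the previous behavior):

    from datasets import load_dataset

    # skip the expensive checksum verification entirely
    ds = load_dataset("squad", verification_mode="no_checks")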

Datasets features

  • Single TQDM bar in multi-proc map by @mariosasko in #5455
    • No more stacked TQDM bars when calling .map() in multiprocessing
  • Map-style Dataset to IterableDataset by @lhoestq in #5410 (see the sketch below)
  • Select columns of Dataset or DatasetDict by @daskol in #5480
    • introduces .select_columns() to return a dataset containing only the requested columns
  • Added functionality: sort datasets by multiple keys by @MichlF in #5502
    • introduces ds = ds.sort(['col_1', 'col_2'], reverse=[True, False])
  • Add JAX device selection when formatting by @alvarobartt in #5547
    • introduces ds = ds.with_format("jax", device=device)
  • Reload features from Parquet metadata by @MFreidank in #5516
  • Speed up batched PyTorch DataLoader by @lhoestq in #5512
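  • A minimal sketch of the map-style to iterable conversion above (the toy dataset is illustrative):

    from datasets import Dataset

    ds = Dataset.from_dict({"text": ["a", "b", "c"]})
    # convert the map-style dataset into an IterableDataset for streaming-style iteration
    ids = ds.to_iterable_dataset()
    for example in ids:
        ...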

Documentation

General improvements and bug fixes

New Contributors

Full Changelog: 2.9.0...2.10.0