Releases: huggingface/datasets

1.9.0

05 Jul 17:25

Datasets Changes

Datasets Features

Task templates

  • Add task templates for tydiqa and xquad #2518 (@lewtun)
  • Insert text classification template for Emotion dataset #2521 (@lewtun)
  • Add summarization template #2529 (@lewtun)
  • Add task template for automatic speech recognition #2533 (@lewtun)
  • Remove task templates if required features are removed during Dataset.map #2540 (@lewtun)
  • Inject templates for ASR datasets #2565 (@lewtun)

General improvements and bug fixes

Dataset cards

Docs

1.8.0

08 Jun 18:23

Datasets Changes

Datasets Features

  • Add desc parameter in map for DatasetDict object #2423 (@bhavitvyamalik)
  • Support sliced list arrays in cast #2461 (@lhoestq)
    • Dataset.cast can now change the feature types of Sequence fields
  • Revert default in-memory for small datasets #2460 (@albertvillanova) Breaking:
    • previously, the datasets IN_MEMORY_MAX_SIZE setting defaulted to 250MB, so smaller datasets were copied into memory
    • it now defaults to zero: datasets are loaded from disk with memory mapping and are not copied into memory
    • users can still pass keep_in_memory=True when loading a dataset to load it into memory

Datasets Cards

General improvements and bug fixes

Experimental and work in progress: Format a dataset for specific tasks

  • Update text classification template labels in DatasetInfo post_init #2392 (@lewtun)
  • Insert task templates for text classification #2389 (@lewtun)
  • Rename QuestionAnswering template to QuestionAnsweringExtractive #2429 (@lewtun)
  • Insert Extractive QA templates for SQuAD-like datasets #2435 (@lewtun)

1.7.0

27 May 10:00

Dataset Changes

Dataset Features

Metric Changes

General improvements and bug fixes

Experimental and work in progress: Format a dataset for specific tasks

1.6.2

30 Apr 13:20

Fix memory issue: don't copy recordbatches in memory during a table deepcopy #2291 (@lhoestq)
This affected methods like concatenate_datasets, multiprocessed map and load_from_disk.

Breaking change:

  • When using Dataset.map with the input_columns parameter, the resulting dataset only has the columns listed in input_columns plus the columns added by the map function. All other columns are discarded.

1.6.1

26 Apr 13:33

Fix memory issue in multiprocessing: Don't pickle table index #2264 (@lhoestq)

1.6.0

20 Apr 17:05

Dataset changes

Dataset features

Metrics changes

Dataset cards

General improvements and bug fixes

Docs

1.5.0

18 Mar 14:21

Datasets changes

Datasets Features

General Bug fixes and improvements

1.4.1

04 Mar 09:16

Fix issue #1981 with WMT downloads #1982 (@albertvillanova)

1.4.0

03 Mar 17:13

Datasets Changes

Datasets Features

  • Add to_dict and to_pandas for Dataset #1889 (@SBrandeis)
  • Add to_csv for Dataset #1887 (@SBrandeis)
  • Add keep_linebreaks parameter to text loader #1913 (@lhoestq)
  • Add not-in-place implementations for several dataset transforms #1883 (@SBrandeis):
    • This introduces new methods for Dataset objects: rename_column, remove_columns, flatten and cast.
    • The old in-place methods rename_column_, remove_columns_, flatten_ and cast_ are now deprecated.
  • Make DownloadManager downloaded/extracted paths accessible #1846 (@albertvillanova)
  • Add cross-platform support for datasets-cli #1951 (@mariosasko)

Metrics Changes

Offline loading

General improvements and bugfixes

1.3.0

15 Feb 16:54

Dataset Features

  • On-the-fly data transforms (#1795)
  • Add S3 support for downloading and uploading processed datasets (#1723)
  • Allow loading dataset in-memory (#1792)
  • Support future datasets (#1813)
  • Enable/disable caching (#1703)
  • Offline dataset loading (#1726)

Datasets Hub Features

Dataset Changes

  • New: LJ Speech (#1878)
  • New: Add Hindi Discourse Analysis Natural Language Inference Dataset (#1822)
  • New: cord 19 (#1850)
  • New: Tweet Eval Dataset (#1829)
  • New: CIFAR-100 Dataset (#1812)
  • New: SICK (#1804)
  • New: BBC Hindi NLI Dataset (#1158)
  • New: Freebase QA Dataset (#1814)
  • New: Arabic sarcasm (#1798)
  • New: Semantic Scholar Open Research Corpus (#1606)
  • New: DuoRC Dataset (#1800)
  • New: Aggregated dataset for the GEM benchmark (#1807)
  • New: CC-News dataset of English language articles (#1323)
  • New: irc disentangle (#1586)
  • New: Narrative QA Manual (#1778)
  • New: Universal Morphologies (#1174)
  • New: SILICONE (#1761)
  • New: Librispeech ASR (#1767)
  • New: OSCAR (#1694, #1868, #1833)
  • New: CANER Corpus (#1684)
  • New: Arabic Speech Corpus (#1852)
  • New: id_liputan6 (#1740)
  • New: Structured Argument Extraction for Korean dataset (#1748)
  • New: TurkCorpus (#1732)
  • New: Hatexplain Dataset (#1716)
  • New: adversarialQA (#1714)
  • Update: Doc2dial - reading comprehension update to latest version (#1816)
  • Update: OPUS Open Subtitles - add with metadata information (#1865)
  • Update: SWDA - use all metadata features (#1799)
  • Update: SWDA - add metadata and correct splits (#1749)
  • Update: CommonGen - update citation information (#1787)
  • Update: SciFact - update URL (#1780)
  • Update: BrWaC - update features name (#1736)
  • Update: TLC - update urls to be github links (#1737)
  • Update: Ted Talks IWSLT - add new version: WIT3 (#1676)
  • Fix: multi_woz_v22 - fix checksums (#1880)
  • Fix: limit - fix url (#1861)
  • Fix: WebNLG - fix test test + more field (#1739)
  • Fix: PAWS-X - fix csv Dictreader splitting data on quotes (#1763)
  • Fix: reuters - add missing "brief" entries (#1744)
  • Fix: thainer - empty token bug (#1734)
  • Fix: lst20 - empty token bug (#1734)

Metrics Changes

  • New: Word Error Metric (#1847)
  • New: COMET (#1577, #1753)
  • Fix: bert_score - set version dependency (#1851)

Metric Docs

  • Add metrics usage examples and tests (#1820)

CLI Changes

  • [BREAKING] remove outdated commands (#1869):
    • remove outdated "datasets-cli upload_dataset" and "datasets-cli upload_metric"
    • instead, use the huggingface-hub CLI

Bug fixes

  • fix writing GPU Faiss index (#1862)
  • update pyarrow import warning (#1782)
  • Ignore definition line number of functions for caching (#1779)
  • update saving and loading methods for faiss index to accept path-like objects (#1663)
  • Print error message with filename when malformed CSV (#1826)
  • Fix default tensors precision when format is set to PyTorch and TensorFlow (#1795)

Refactoring

  • Refactoring: Create config module (#1848)
  • Use a config id in the cache directory names for custom configs (#1754)

Logging

  • Enable logging propagation and remove logging handler (#1845)