Releases: huggingface/datasets

0.2.1

12 Jun 16:27

New datasets:

  • ELI5
  • CompGuessWhat?!
  • BookCorpus
  • Piaf
  • Allociné
  • BlendedSkillTalk

New features:

  • .filter method
  • option to do batching for metrics
  • make datasets deterministic
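
As a rough illustration of what a .filter method does, here is a minimal pure-Python sketch. It is not the library's implementation: `filter_rows`, the `Row` alias, and the toy rows are hypothetical names chosen for this example, and it assumes a dataset can be treated as a list of dictionaries.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for one dataset example: a mapping of column name to value.
Row = Dict[str, object]


def filter_rows(rows: List[Row], keep: Callable[[Row], bool]) -> List[Row]:
    """Return only the rows for which `keep` is true, preserving order."""
    return [row for row in rows if keep(row)]


# Toy data invented for this sketch (not taken from any real dataset).
rows = [
    {"question": "Why is the sky blue?", "score": 12},
    {"question": "What is a monad?", "score": 3},
    {"question": "How do tides work?", "score": 8},
]

# Keep only the examples whose score is at least 5.
popular = filter_rows(rows, lambda row: row["score"] >= 5)
```

The predicate receives one example at a time and returns a boolean, which is the general shape a filter call expects; the original list is left unchanged.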

New commands:

  • nlp-cli upload_dataset
  • nlp-cli upload_metric
  • nlp-cli s3_datasets {ls,rm}
  • nlp-cli s3_metrics {ls,rm}

New datasets + Apache Beam, new metrics, bug fixes

29 May 15:43

Datasets changes:

  • New: germeval14
  • New: wmt
  • New: Ubuntu dialog corpus
  • New: squad spanish
  • New: Quanta
  • New: arcd
  • New: Natural Questions (needs to be processed using a Beam pipeline)
  • New: C4 (needs to be processed using a Beam pipeline)
  • Skip the processing: wikipedia (the English and French versions are now available already processed)
  • Skip the processing: wiki40b (the English version is now available already processed)
  • Renamed: anli -> art
  • Better instructions: xsum
  • Add .filter() for arrow datasets
  • Add instruction message for manual data when required

Metrics changes:

  • New: BERTScore
  • Allow adding examples one at a time or by batch when computing a metric score
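
The idea of feeding a metric element by element or batch by batch can be sketched as follows. This is a hedged, self-contained illustration and not the library's metric class: `ExactMatchAccumulator` and its method names are hypothetical, and exact match is used only because it is simple to verify.

```python
from typing import List, Sequence


class ExactMatchAccumulator:
    """Hypothetical metric that collects predictions, then scores them once."""

    def __init__(self) -> None:
        self.predictions: List[str] = []
        self.references: List[str] = []

    def add(self, prediction: str, reference: str) -> None:
        # Add a single (prediction, reference) pair.
        self.predictions.append(prediction)
        self.references.append(reference)

    def add_batch(self, predictions: Sequence[str], references: Sequence[str]) -> None:
        # Add several pairs at once; both sequences must line up.
        if len(predictions) != len(references):
            raise ValueError("predictions and references must have the same length")
        self.predictions.extend(predictions)
        self.references.extend(references)

    def compute(self) -> float:
        # Fraction of predictions that equal their reference exactly.
        if not self.predictions:
            raise ValueError("no examples were added")
        matches = sum(p == r for p, r in zip(self.predictions, self.references))
        return matches / len(self.predictions)


metric = ExactMatchAccumulator()
metric.add("paris", "paris")                            # one element
metric.add_batch(["rome", "oslo"], ["rome", "bergen"])  # one batch
score = metric.compute()                                # 2 matches out of 3
```

Accumulating first and computing once at the end means the two ways of adding examples can be mixed freely without changing the final score.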

Commands:

  • New: nlp-cli dummy_data, which helps generate dummy data files for testing dataset scripts
  • New: nlp-cli run_beam, which runs an Apache Beam pipeline to process a dataset in the cloud

Bug fixes:

  • .map now returns the correct values when run on different splits of the same dataset
  • Fix the input format of the squad metric to match the format of the squad dataset
  • Fix downloads of small files from Google Drive
  • For datasets like glue or scientific paper, require the user to pick one sub-dataset to make things less confusing

More tests:

  • Local tests of dataset processing scripts
  • AWS tests of dataset processing scripts
  • Tests for arrow dataset methods
  • Tests for arrow reader methods

First release

15 May 11:48

First release of the nlp library.

Read the README.md for an introduction: https://github.com/huggingface/nlp/blob/master/README.md

Tutorial: https://colab.research.google.com/github/huggingface/nlp/blob/master/notebooks/Overview.ipynb

This is a beta release and the API is not yet stable (in particular the API for the metrics).

Documentation and tests are also still sparse.