Releases · huggingface/datasets
0.2.1
New datasets:
- ELI5
- CompGuessWhat?!
- BookCorpus
- Piaf
- Allociné
- BlendedSkillTalk
New features:
- .filter method (see the sketch after this list)
- Option to add examples to metrics in batches
- Deterministic dataset processing
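A minimal sketch of the new `.filter` method, assuming a text dataset loaded through `nlp.load_dataset` (the `imdb` name and the length threshold are illustrative, not taken from these notes):

```python
import nlp

# Load a dataset split (the name "imdb" is illustrative; any text dataset works).
dataset = nlp.load_dataset("imdb", split="train")

# .filter keeps only the examples for which the function returns True.
filtered = dataset.filter(lambda example: len(example["text"]) > 1000)

print(len(dataset), len(filtered))
```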
New commands:
- nlp-cli upload_dataset
- nlp-cli upload_metric
- nlp-cli s3_datasets {ls,rm}
- nlp-cli s3_metrics {ls,rm}
New datasets + Apache Beam, new metrics, bug fixes
Datasets changes:
- New: germeval14
- New: wmt
- New: Ubuntu Dialogue Corpus
- New: Spanish squad
- New: qanta
- New: arcd
- New: Natural Questions (needs to be processed using a beam pipeline)
- New: C4 (needs to be processed using a beam pipeline)
- Skip the processing: wikipedia (the English and French versions are now already processed; see the loading example after this list)
- Skip the processing: wiki40b (the English version is now already processed)
- Renamed: anli -> art
- Better instructions: xsum
- Add .filter() for Arrow datasets
- Add an instruction message for datasets that require manual data download
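Since the English and French Wikipedia dumps now ship pre-processed, they can be loaded directly without running the Apache Beam pipeline yourself. A minimal sketch; the config string `20200501.en` (dump date + language) is an assumption about the wikipedia script's naming scheme, not confirmed by these notes:

```python
import nlp

# The English Wikipedia config is already processed, so no Apache Beam run is needed.
# The config name "20200501.en" (dump date + language code) is an assumption.
wiki = nlp.load_dataset("wikipedia", "20200501.en", split="train")

print(wiki[0]["title"])
```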
Metrics changes:
- New: BERTScore
- Allow adding examples one element at a time or in batches when computing a metric score
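A minimal sketch of both accumulation modes with the new BERTScore metric; the exact keyword argument names and the `lang` argument are assumptions, not confirmed by these notes:

```python
import nlp

# Load the newly added BERTScore metric.
metric = nlp.load_metric("bertscore")

# Add examples in a batch... (keyword names are an assumption)
metric.add_batch(
    predictions=["the cat sat on the mat"],
    references=["a cat sat on the mat"],
)

# ...or one element at a time.
metric.add(prediction="hello world", reference="hello there world")

# The "lang" argument selects the underlying model; its name is an assumption here.
score = metric.compute(lang="en")
print(score)
```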
Commands:
- New: nlp-cli dummy_data: to help generate dummy data files to test dataset scripts
- New: nlp-cli run_beam: to run an Apache Beam pipeline to process a dataset in the cloud
Bug fixes:
- .map now returns the right values when run on different splits of the same dataset
- Fix the input format of the squad metric to match the format of the squad dataset
- Fix downloads of small files from Google Drive
- For datasets with several sub-datasets, like glue or scientific_papers, force the user to pick one sub-dataset to make things less confusing (see the example after this list)
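For instance, loading glue now requires naming one of its sub-datasets explicitly. A minimal sketch, where mrpc is just one of the GLUE tasks:

```python
import nlp

# glue bundles several sub-datasets, so one must be picked explicitly;
# loading "glue" without a sub-dataset name is now rejected.
mrpc = nlp.load_dataset("glue", "mrpc", split="train")

print(mrpc.features)
```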
More tests:
- Local tests of dataset processing scripts
- AWS tests of dataset processing scripts
- Tests for arrow dataset methods
- Tests for arrow reader methods
First release
First release of the nlp library.
Read the README.md for an introduction: https://github.com/huggingface/nlp/blob/master/README.md
Tutorial: https://colab.research.google.com/github/huggingface/nlp/blob/master/notebooks/Overview.ipynb
This is a beta release and the API is not expected to be stable yet (in particular the metrics API).
Documentation and tests are also still sparse.
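As a quickstart, the two entry points shown in the README are `nlp.load_dataset` and `nlp.load_metric`. A minimal sketch using squad for both, following the README example (any bundled dataset or metric name works):

```python
import nlp

# Datasets and metrics are loaded by name and cached after the first download.
dataset = nlp.load_dataset("squad", split="validation")
metric = nlp.load_metric("squad")

print(dataset[0]["question"])
```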