Releases · huggingface/datasets
0.2.1
New datasets:
- ELI5
- CompGuessWhat?!
- BookCorpus
- Piaf
- Allociné
- BlendedSkillTalk
New features:
- .filter method (see the sketch after this list)
- Option to add examples to metrics in batches
- Deterministic dataset processing
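A minimal sketch of the new `.filter` method, assuming a text dataset loaded through `nlp.load_dataset` (the `imdb` name and the length threshold are illustrative, not taken from these notes):

```python
import nlp

# Load a dataset split (the name "imdb" is illustrative; any text dataset works).
dataset = nlp.load_dataset("imdb", split="train")

# .filter keeps only the examples for which the function returns True.
filtered = dataset.filter(lambda example: len(example["text"]) > 1000)

print(len(dataset), len(filtered))
```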
New commands:
- nlp-cli upload_dataset
- nlp-cli upload_metric
- nlp-cli s3_datasets {ls,rm}
- nlp-cli s3_metrics {ls,rm}
New datasets + Apache Beam, new metrics, bug fixes
Datasets changes:
- New: germeval14
- New: wmt
- New: Ubuntu Dialogue Corpus
- New: Spanish squad
- New: qanta
- New: arcd
- New: Natural Questions (needs to be processed using a beam pipeline)
- New: C4 (needs to be processed using a beam pipeline)
- Skip the processing: wikipedia (the English and French versions are now already processed; see the loading example after this list)
- Skip the processing: wiki40b (the English version is now already processed)
- Renamed: anli -> art
- Better instructions: xsum
- Add .filter() for Arrow datasets
- Add an instruction message for datasets that require manual data download
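Since the English and French Wikipedia dumps now ship pre-processed, they can be loaded directly without running the Apache Beam pipeline yourself. A minimal sketch; the config string `20200501.en` (dump date + language) is an assumption about the wikipedia script's naming scheme, not confirmed by these notes:

```python
import nlp

# The English Wikipedia config is already processed, so no Apache Beam run is needed.
# The config name "20200501.en" (dump date + language code) is an assumption.
wiki = nlp.load_dataset("wikipedia", "20200501.en", split="train")

print(wiki[0]["title"])
```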
Metrics changes:
- New: BERTScore
- Allow adding examples one element at a time or in batches when computing a metric score
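A minimal sketch of both accumulation modes with the new BERTScore metric; the exact keyword argument names and the `lang` argument are assumptions, not confirmed by these notes:

```python
import nlp

# Load the newly added BERTScore metric.
metric = nlp.load_metric("bertscore")

# Add examples in a batch... (keyword names are an assumption)
metric.add_batch(
    predictions=["the cat sat on the mat"],
    references=["a cat sat on the mat"],
)

# ...or one element at a time.
metric.add(prediction="hello world", reference="hello there world")

# The "lang" argument selects the underlying model; its name is an assumption here.
score = metric.compute(lang="en")
print(score)
```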
Commands:
- New: nlp-cli dummy_data: to help generate dummy data files to test dataset scripts
- New: nlp-cli run_beam: to run an Apache Beam pipeline to process a dataset in the cloud
Bug fixes:
- .map now returns the right values when run on different splits of the same dataset
- Fix the input format of the squad metric to match the format of the squad dataset
- Fix downloads of small files from Google Drive
- For datasets with several sub-datasets, like glue or scientific_papers, force the user to pick one sub-dataset to make things less confusing (see the example after this list)
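For instance, loading glue now requires naming one of its sub-datasets explicitly. A minimal sketch, where mrpc is just one of the GLUE tasks:

```python
import nlp

# glue bundles several sub-datasets, so one must be picked explicitly;
# loading "glue" without a sub-dataset name is now rejected.
mrpc = nlp.load_dataset("glue", "mrpc", split="train")

print(mrpc.features)
```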
More tests:
- Local tests of dataset processing scripts
- AWS tests of dataset processing scripts
- Tests for arrow dataset methods
- Tests for arrow reader methods
First release
First release of the nlp library.
Read the README.md for an introduction: https://github.com/huggingface/nlp/blob/master/README.md
Tutorial: https://colab.research.google.com/github/huggingface/nlp/blob/master/notebooks/Overview.ipynb
This is a beta release and the API is not expected to be stable yet (in particular the metrics API).
Documentation and tests are also still sparse.
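As a quickstart, the two entry points shown in the README are `nlp.load_dataset` and `nlp.load_metric`. A minimal sketch using squad for both, following the README example (any bundled dataset or metric name works):

```python
import nlp

# Datasets and metrics are loaded by name and cached after the first download.
dataset = nlp.load_dataset("squad", split="validation")
metric = nlp.load_metric("squad")

print(dataset[0]["question"])
```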