Releases · huggingface/datasets
1.9.0
Datasets Changes
- New: C4 #2575 #2592 (@lhoestq)
- New: mC4 #2576 (@lhoestq)
- New: MasakhaNER #2465 (@dadelani)
- New: Eduge #2492 (@enod)
- Update: xor_tydi_qa - update version #2455 (@cccntu)
- Update: kilt-TriviaQA - original answers #2410 (@PaulLerner)
- Update: udpos - change features structure #2466 (@JerryIsHere)
- Update: WebNLG - update checksums #2558 (@lhoestq)
- Fix: climate_fever - adjust indexing for the labels #2464 (@drugilsberg)
- Fix: proto_qa - fix download link #2463 (@mariosasko)
- Fix: ProductReviews - fix label parsing #2530 (@yavuzKomecoglu)
- Fix: DROP - fix DuplicatedKeysError #2545 (@albertvillanova)
- Fix: code_search_net - fix keys #2555 (@lhoestq)
- Fix: discofuse - fix link cc #2541 (@VictorSanh)
- Fix: fever - fix keys #2557 (@lhoestq)
Datasets Features
- Dataset Streaming #2375 #2582 (@lhoestq) (see the examples after this list)
  - Fast, on-the-fly download and processing of your data while iterating over your dataset
  - Works with huge datasets like OSCAR, C4, mC4 and hundreds of other datasets
- JAX integration #2502 (@lhoestq)
- Add Parquet loader + from_parquet and to_parquet #2537 (@lhoestq)
- Implement ClassLabel encoding in JSON loader #2468 (@albertvillanova)
- Set configurable downloaded datasets path #2488 (@albertvillanova)
- Set configurable extracted datasets path #2487 (@albertvillanova)
- Add align_labels_with_mapping function #2457 (@lewtun) #2510 (@lhoestq)
- Add interleave_datasets for map-style datasets #2568 (@lhoestq)
- Add load_dataset_builder #2500 (@mariosasko)
- Support Zstandard compressed files #2578 (@albertvillanova)
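
A short, hedged sketch of the streaming and Parquet items above; the dataset names and the output file name are placeholders, not part of the release notes:

```python
from datasets import load_dataset, Dataset

# Streaming: data is downloaded and processed on the fly while you iterate,
# so even very large datasets (OSCAR, C4, mC4, ...) can be used without a full download.
streamed = load_dataset("oscar", "unshuffled_deduplicated_en", split="train", streaming=True)
for i, example in enumerate(streamed):
    print(example["text"][:80])
    if i == 2:
        break

# Parquet support: export a map-style dataset and load it back.
ds = load_dataset("imdb", split="train")
ds.to_parquet("imdb_train.parquet")
reloaded = Dataset.from_parquet("imdb_train.parquet")
```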
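Similarly, a small sketch of interleave_datasets and load_dataset_builder; the probabilities, seed, column values and dataset name are illustrative:

```python
from datasets import Dataset, interleave_datasets, load_dataset_builder

# Interleave two map-style datasets, sampling from each with the given probabilities.
d1 = Dataset.from_dict({"text": ["a", "b", "c"]})
d2 = Dataset.from_dict({"text": ["x", "y"]})
mixed = interleave_datasets([d1, d2], probabilities=[0.7, 0.3], seed=42)

# Inspect a dataset without downloading the data, via its builder.
builder = load_dataset_builder("imdb")
print(builder.info.description)
print(builder.info.features)
```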
Task templates
- Add task templates for tydiqa and xquad #2518 (@lewtun)
- Insert text classification template for Emotion dataset #2521 (@lewtun)
- Add summarization template #2529 (@lewtun)
- Add task template for automatic speech recognition #2533 (@lewtun)
- Remove task templates if required features are removed during Dataset.map #2540 (@lewtun)
- Inject templates for ASR datasets #2565 (@lewtun)
General improvements and bug fixes
- Allow to use tqdm>=4.50.0 #2482 (@lhoestq)
- Use gc.collect only when needed to avoid slow downs #2483 (@lhoestq)
- Allow latest pyarrow version #2490 (@albertvillanova)
- Use default cast for sliced list arrays if pyarrow >= 4 #2497 (@albertvillanova)
- Add Zenodo metadata file with license #2501 (@albertvillanova)
- add tensorflow-macos support #2493 (@slayerjain)
- Keep original features order #2453 (@albertvillanova)
- Add course banner #2506 (@sgugger)
- Rearrange JSON field names to match passed features schema field names #2507 (@albertvillanova)
- Fix typo in MatthewsCorrelation class name #2517 (@albertvillanova)
- Use scikit-learn package rather than sklearn in setup.py #2525 (@lesteve)
- Improve performance of pandas arrow extractor #2519 (@albertvillanova)
- Fix fingerprint when moving cache dir #2509 (@lhoestq)
- Replace bad n>1M size tag #2527 (@lhoestq)
- Fix dev version #2531 (@lhoestq)
- Sync with transformers disabling NOTSET #2534 (@albertvillanova)
- Fix logging levels #2544 (@albertvillanova)
- Add support for Split.ALL #2259 (@mariosasko)
- Raise FileNotFoundError in WindowsFileLock #2524 (@mariosasko)
- Make numpy arrow extractor faster #2505 (@lhoestq)
- fix Dataset.map when num_procs > num rows #2566 (@connor-mccarthy)
- Add ASR task and new languages to resources #2567 (@lewtun)
- Filter expected warning log from transformers #2571 (@albertvillanova)
- Fix BibTeX entry #2579 (@albertvillanova)
- Fix Counter import #2580 (@albertvillanova)
- Add aiohttp to tests extras require #2587 (@albertvillanova)
- Add language tags #2590 (@lewtun)
- Support pandas 1.3.0 read_csv #2593 (@lhoestq)
Dataset cards
- Updated Dataset Description #2420 (@binny-mathew)
- Update DatasetMetadata and ReadMe #2436 (@gchhablani)
- CRD3 dataset card #2515 (@wilsonyhlee)
- Add license to the Cambridge English Write & Improve + LOCNESS dataset card #2546 (@lhoestq)
- wi_locness: reference latest leaderboard on codalab #2584 (@aseifert)
Docs
- Fix typo: load_datasets → load_dataset #2479 (@julien-c)
- Fix docs custom stable version #2477 (@albertvillanova)
- Improve Features docs #2535 (@albertvillanova)
- Update README.md #2414 (@cryoff)
- Fix FileSystems documentation #2551 (@connor-mccarthy)
- Minor fix in loading metrics docs #2562 (@albertvillanova)
- Minor fix docs format for bertscore #2570 (@albertvillanova)
- Add streaming in load a dataset docs #2574 (@lhoestq)
1.8.0
Datasets Changes
- New: Microsoft CodeXGlue Datasets #2357 (@madlag @ncoop57)
- New: KLUE benchmark #2416 (@jungwhank)
- New: HendrycksTest #2370 (@andyzoujm)
- Update: xor_tydi_qa - update url to v1.1 #2449 (@cccntu)
- Fix: adversarial_qa - DuplicatedKeysError #2433 (@mariosasko)
- Fix: bn_hate_speech and covid_tweets_japanese - fix broken URLs #2445 (@lewtun)
- Fix: flores - fix download link #2448 (@mariosasko)
Datasets Features
- Add desc parameter in map for DatasetDict object #2423 (@bhavitvyamalik)
- Support sliced list arrays in cast #2461 (@lhoestq)
  - Dataset.cast can now change the feature types of Sequence fields
- Revert default in-memory for small datasets #2460 (@albertvillanova). Breaking:
  - the datasets IN_MEMORY_MAX_SIZE config used to default to 250MB
  - it now defaults to zero: by default, datasets are loaded from disk with memory mapping and are not copied into memory
  - users can still pass keep_in_memory=True when loading a dataset to load it in memory (see the sketch after this list)
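
A minimal sketch of the keep_in_memory behavior and the new desc parameter; the dataset name and the map function are illustrative:

```python
from datasets import load_dataset

# Default: the dataset is memory-mapped from disk, not copied into RAM.
ds = load_dataset("imdb")

# Opt back in to fully in-memory loading.
ds_in_memory = load_dataset("imdb", keep_in_memory=True)

# The new desc parameter labels the progress bar when mapping a DatasetDict.
ds = ds.map(lambda example: {"n_chars": len(example["text"])}, desc="counting characters")
```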
Datasets Cards
- adds license information for DailyDialog. #2419 (@aditya2211)
- add english language tags for ~100 datasets #2442 (@VictorSanh)
- Add copyright info to MLSUM dataset #2427 (@PhilipMay)
- Add copyright info for wiki_lingua dataset #2428 (@PhilipMay)
- Mention that there are no answers in adversarial_qa test set #2451 (@lhoestq)
General improvements and bug fixes
- Add DOI badge to README #2411 (@albertvillanova)
- Make datasets PEP-561 compliant #2417 (@SBrandeis)
- Fix save_to_disk nested features order in dataset_info.json #2422 (@lhoestq)
- Fix CI six installation on linux #2432 (@lhoestq)
- Fix Docstring Mistake: dataset vs. metric #2425 (@PhilipMay)
- Fix NQ features loading: reorder fields of features to match nested fields order in arrow data #2438 (@lhoestq)
- doc: fix typo HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES #2421 (@borisdayma)
- add utf-8 while reading README #2418 (@bhavitvyamalik)
- Better error message when trying to access elements of a DatasetDict without specifying the split #2439 (@lhoestq)
- Rename config and environment variable for in memory max size #2454 (@albertvillanova)
- Add version-specific BibTeX #2430 (@albertvillanova)
- Fix cross-reference typos in documentation #2456 (@albertvillanova)
- Better error message when using the wrong load_from_disk #2437 (@lhoestq)
Experimental and work in progress: Format a dataset for specific tasks
1.7.0
Dataset Changes
- New: NLU evaluation data #2238 (@dkajtoch)
- New: Add SLR32, SLR52, SLR53 to OpenSLR #2241, #2311 (@cahya-wirawan)
- New: Bbaw egyptian #2290 (@phiwi)
- New: GooAQ #2260 (@bhavitvyamalik)
- New: SubjQA #2302 (@lewtun)
- New: Ascent KB #2341, #2349 (@phongnt570)
- New: HLGD #2325 (@tingofurro)
- New: Qasper #2346 (@cceyda)
- New: ConvQuestions benchmark #2372 (@PhilippChr)
- Update: Wikihow - Clarify how to load wikihow #2240 (@albertvillanova)
- Update: multi_woz_v22 - update checksum #2281 (@lhoestq)
- Update: OSCAR - Set encoding in OSCAR dataset #2321 (@albertvillanova)
- Update: XTREME - Enable auto-download for PAN-X / Wikiann domain in XTREME #2326 (@lewtun)
- Update: GEM - update the DART file checksums #2334 (@yjernite)
- Update: web_science - fixed download link #2338 (@bhavitvyamalik)
- Update: SNLI, MNLI - update READMEs for SNLI and MNLI #2364 (@bhavitvyamalik)
- Update: conll2003 - correct labels #2369 (@philschmid)
- Update: offenseval_dravidian - update citations #2385 (@adeepH)
- Update: ai2_arc - Add dataset tags #2405 (@OyvindTafjord)
- Fix: newsph_nli - test data added, dataset_infos updated #2263 (@bhavitvyamalik)
- Fix: hyperpartisan news detection - Remove getchildren #2367 (@ghomasHudson)
- Fix: indic_glue - Fix number of classes in indic_glue sna.bn dataset #2397 (@albertvillanova)
- Fix: head_qa - Fix keys #2408 (@lhoestq)
Dataset Features
- Implement Dataset add_item #1870 (@albertvillanova)
- Implement Dataset add_column #2145 (@albertvillanova)
- Implement Dataset to JSON #2248, #2352 (@albertvillanova)
- Add rename_columns method #2312 (@SBrandeis)
- Add desc to tqdm in Dataset.map() #2374 (@bhavitvyamalik)
- Add env variable HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES #2399, #2409 (@albertvillanova)
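
A short, hedged sketch of several of the features above; the column names and values are illustrative:

```python
from datasets import Dataset

ds = Dataset.from_dict({"text": ["hello", "world"]})

# Append a whole column, then a single row.
ds = ds.add_column("label", [0, 1])
ds = ds.add_item({"text": "third example", "label": 1})

# Export to JSON Lines.
ds.to_json("example.jsonl")

# desc now labels the tqdm bar shown by Dataset.map().
ds = ds.map(lambda example: {"n_chars": len(example["text"])}, desc="counting characters")
```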
Metric Changes
- New: CUAD metrics #2273 (@bhavitvyamalik)
- New: Matthews/Pearson/Spearman correlation metrics #2328 (@lhoestq)
- Update: CER - Docs, CER above 1 #2342 (@borisdayma)
General improvements and bug fixes
- Update black #2265 (@lhoestq)
- Fix incorrect update_metadata_with_features calls in ArrowDataset #2258 (@mariosasko)
- Faster map w/ input_columns & faster slicing w/ Iterable keys #2246 (@norabelrose)
- Don't use pyarrow 4.0.0 since it segfaults when casting a sliced ListArray of integers #2268 (@lhoestq)
- Fix query table with iterable #2269 (@lhoestq)
- Perform minor refactoring: use config #2253 (@albertvillanova)
- Update format, fingerprint and indices after add_item #2254 (@lhoestq)
- Always update metadata in arrow schema #2274 (@lhoestq)
- Make tests run faster #2266 (@lhoestq)
- Fix metadata validation with config names #2286 (@lhoestq)
- Fixed typo seperate->separate #2292 (@laksh9950)
- Allow collaborators to self-assign issues #2289 (@albertvillanova)
- Mapping in the distributed setting #2298 (@TevenLeScao)
- Fix conda release #2309 (@lhoestq)
- Fix incorrect version specification for the pyarrow package #2317 (@cemilcengiz)
- Set default name in init_dynamic_modules #2320 (@albertvillanova)
- Fix duplicate keys #2333 (@lhoestq)
- Add note about indices mapping in save_to_disk docstring #2332 (@lhoestq)
- Metadata validation #2107 (@theo-m)
- Add Validation For README #2121 (@gchhablani)
- Fix overflow issue in interpolation search #2336 (@mariosasko)
- Datasets cli improvements #2315 (@mariosasko)
- Add key type and duplicates verification with hashing #2245 (@NikhilBartwal)
- More consistent copy logic #2340 (@mariosasko)
- Update README validation rules #2353 (@gchhablani)
- normalized TOCs and titles in data cards #2355 (@yjernite)
- simplify faiss index save #2351 (@Guitaricet)
- Allow "other-X" in licenses #2368 (@gchhablani)
- Improve ReadInstruction logic and update docs #2261 (@mariosasko)
- Disallow duplicate keys in yaml tags #2379 (@lhoestq)
- maintain YAML structure reading from README #2380 (@bhavitvyamalik)
- add dataset card title #2381 (@bhavitvyamalik)
- Add tests for dataset cards #2348 (@gchhablani)
- Improve example in rounding docs #2383 (@mariosasko)
- Paperswithcode dataset mapping #2404 (@julien-c)
- Free datasets with cache file in temp dir on exit #2403 (@mariosasko)
Experimental and work in progress: Format a dataset for specific tasks
- Task formatting for text classification & question answering #2255 (@SBrandeis)
- Add check for task templates on dataset load #2390 (@lewtun)
- Add args description to DatasetInfo #2384 (@lewtun)
- Improve task api code quality #2376 (@mariosasko)
1.6.2
Fix memory issue: don't copy recordbatches in memory during a table deepcopy #2291 (@lhoestq)
This affected methods like concatenate_datasets, multiprocessed map and load_from_disk.
Breaking change:
- when using Dataset.map with the input_columns parameter, the resulting dataset will only have the columns from input_columns plus the columns added by the map function; the other columns are discarded (see the sketch below).
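
A toy sketch of the new behavior; the column names are illustrative:

```python
from datasets import Dataset

ds = Dataset.from_dict({"text": ["foo", "bar"], "label": [0, 1]})

# The mapped function receives only the listed column(s); the result keeps
# those columns plus the newly added ones, and drops everything else.
mapped = ds.map(lambda text: {"n_chars": len(text)}, input_columns=["text"])
print(mapped.column_names)  # ['text', 'n_chars'] -- 'label' is discarded
```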
1.6.1
1.6.0
Dataset changes
- New: MOROCO #2002 (@MihaelaGaman)
- New: CBT dataset #2044 (@gchhablani)
- New: MDD Dataset #2051 (@gchhablani)
- New: Multilingual dIalogAct benchMark (miam) #2047 (@eusip)
- New: bAbI QA tasks #2053 (@gchhablani)
- New: machine translated multilingual STS benchmark dataset #2090 (@PhilipMay)
- New: EURLEX legal NLP dataset #2114 (@iliaschalkidis)
- New: ECtHR legal NLP dataset #2114 (@iliaschalkidis)
- New: EU-REG-IR legal NLP dataset #2114 (@iliaschalkidis)
- New: NorNE dataset for Norwegian POS and NER #2154 (@versae)
- New: banking77 #2140 (@dkajtoch)
- New: OpenSLR #2173 #2215 #2221 (@cahya-wirawan)
- New: CUAD dataset #2219 (@bhavitvyamalik)
- Update: GEM V1.1 + new challenge sets #2142 #2186 (@yjernite)
- Update: Wikiann - added spans field #2141 (@rabeehk)
- Update: XTREME - Add tel to xtreme tatoeba #2180 (@lhoestq)
- Update: GLUE MRPC - added real label to test set #2216 (@philschmid)
- Fix: MultiWoz22 - fix dialogue action slot name and value #2136 (@adamlin120)
- Fix: wikiauto - fix link #2171 (@mounicam)
- Fix: wino_bias - use right splits #1930 (@JieyuZhao)
- Fix: lc_quad - update download checksum #2213 (@mariosasko)
- Fix: newsgroup - fix one instance of 'train' to 'test' #2225 (@alexwdong)
- Fix: xnli - fix tuple key #2233 (@NikhilBartwal)
Dataset features
- Allow stateful function in dataset.map #1960 (@mariosasko)
- MIAM dataset - new citation details #2101 (@eusip)
- [Refactor] Use in-memory/memory-mapped/concatenation tables in Dataset #2025 (@lhoestq)
- Allow pickling of big in-memory tables #2150 (@lhoestq)
- updated user permissions based on umask #2086 #2157 (@bhavitvyamalik)
- Fast table queries with interpolation search #2122 (@lhoestq)
- Concat only unique fields in DatasetInfo.from_merge #2163 (@mariosasko)
- Implementation of class_encode_column #2184 #2227 (@SBrandeis) (see the example after this list)
- Add support for axis in concatenate datasets #2151 (@albertvillanova)
- Set default in-memory value depending on the dataset size #2182 (@albertvillanova)
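
A brief sketch of class_encode_column and the new axis argument of concatenate_datasets; the column names and values are placeholders:

```python
from datasets import Dataset, concatenate_datasets

ds = Dataset.from_dict({"text": ["a", "b", "c"], "label": ["pos", "neg", "pos"]})

# Turn a string column into a ClassLabel feature (strings become integer ids).
ds = ds.class_encode_column("label")
print(ds.features["label"].names)

# Concatenate column-wise with axis=1 (datasets must have the same number of rows).
extra = Dataset.from_dict({"score": [0.1, 0.5, 0.9]})
combined = concatenate_datasets([ds, extra], axis=1)
print(combined.column_names)  # ['text', 'label', 'score']
```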
Metrics changes
- New: CER metric #2138 (@chutaklee)
- Update: WER - Compute metric iteratively #2111 (@albertvillanova)
- Update: seqeval - configurable options for the seqeval metric #2204 (@marrodion)
Dataset cards
- REFreSD: Updated card using information from data statement and datasheet #2082 (@mcmillanmajora)
- WinoBias: fix split infos #2152 (@JieyuZhao)
- all: Fix size categories in YAML Tags #2074 (@gchhablani)
- LinCE: Updating citation information on LinCE readme #2205 (@gaguilar)
- Swda: Update README.md #2235 (@PierreColombo)
General improvements and bug fixes
- Refactorize Metric.compute signature to force keyword arguments only #2079 (@albertvillanova)
- Fix max_wait_time in requests #2085 (@lhoestq)
- Fix copy snippet in docs #2091 (@mariosasko)
- Fix deprecated warning message and docstring #2100 (@albertvillanova)
- Move Dataset.to_csv to csv module #2102 (@albertvillanova)
- Fix: Allow a feature to be named "_type" #2093 (@dcfidalgo)
- copy.deepcopy os.environ instead of copy #2119 (@NihalHarish)
- Replace legacy torch.Tensor constructor with torch.tensor #2126 (@mariosasko)
- Implement Dataset as context manager #2113 (@albertvillanova)
- Fix missing infos from concurrent dataset loading #2137 (@lhoestq)
- Pin fsspec lower than 0.9.0 #2172 (@lhoestq)
- Replace assertTrue(isinstance with assertIsInstance in tests #2164 (@mariosasko)
- add social thumbnail #2177 (@philschmid)
- Fix s3fs tests for py36 and py37+ #2183 (@lhoestq)
- Fix typo in huggingface hub #2192 (@LysandreJik)
- Update metadata if dataset features are modified #2087 (@mariosasko)
- fix missing indices_files in load_from_disk #2197 (@lhoestq)
- Fix backward compatibility in Dataset.load_from_disk #2199 (@albertvillanova)
- Fix ArrowWriter overwriting features in ArrowBasedBuilder #2201 (@lhoestq)
- Fix incorrect assertion in builder.py #2110 (@dreamgonfly)
- Remove Python2 leftovers #2208 (@mariosasko)
- Revert breaking change in cache_files property #2217 (@lhoestq)
- Set test cache config #2223 (@albertvillanova)
- Fix map when removing columns on a formatted dataset #2231 (@lhoestq)
- Refactorize tests to use Dataset as context manager #2191 (@albertvillanova)
- Preserve split type when reloading dataset #2168 (@mariosasko)
Docs
- make documentation more clear to use different cloud storage #2127 (@philschmid)
- Render docstring return type as inline #2147 (@albertvillanova)
- Add table classes to the documentation #2155 (@lhoestq)
- Pin docutils for better doc #2174 (@sgugger)
- Fix docstrings issues #2081 (@albertvillanova)
- Add code of conduct to the project #2209 (@albertvillanova)
- Add classes GenerateMode, DownloadConfig and Version to the documentation #2202 (@albertvillanova)
- Fix bash snippet formatting in ADD_NEW_DATASET.md #2234 (@mariosasko)
1.5.0
Datasets changes
- New: Europarl Bilingual #1874 (@lucadiliello)
- New: Stanford Sentiment Treebank #1961 (@patpizio)
- New: RO-STS #1978 (@lorinczb)
- New: newspop #1871 (@frankier)
- New: FashionMNIST #1999 (@gchhablani)
- New: Common voice #1886 (@BirgerMoell), #2063 (@patrickvonplaten)
- New: Cryptonite #2013 (@theo-m)
- New: RoSent #2011 (@gchhablani)
- New: PersiNLU reading-comprehension #2028 (@danyaljj)
- New: conllpp #1991 (@ZihanWangKi)
- New: LaRoSeDa #2004 (@MihaelaGaman)
- Update: remove unnecessary docstart check in conll-like datasets #2020 (@mariosasko)
- Update: semeval 2020 task 11 - add article_id and process test set template #1979 (@hemildesai)
- Update: Md gender - card update #2018 (@mcmillanmajora)
- Update: XQuAD - add Romanian #2023 (@M-Salti)
- Update: DROP - all answers #1980 (@KaijuML)
- Fix: TIMIT ASR - Make sure not only the first sample is used #1995 (@patrickvonplaten)
- Fix: Wikipedia - save memory by replacing root.clear with elem.clear #2037 (@miyamonz)
- Fix: Doc2dial update data_infos and data_loaders #2041 (@songfeng)
- Fix: ZEST - update download link #2057 (@matt-peters)
- Fix: ted_talks_iwslt - fix version error #2064 (@mariosasko)
Datasets Features
- Implement Dataset from CSV #1946 (@albertvillanova)
- Implement Dataset from JSON and JSON Lines #1943 (@albertvillanova)
- Implement Dataset from text #2030 (@albertvillanova) (see the sketch after this list)
- Optimize int precision for tokenization #1985 (@albertvillanova)
  - This allows saving 75%+ of space when tokenizing a dataset
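
A hedged sketch of the new constructors; the file paths are placeholders:

```python
from datasets import Dataset

ds_csv = Dataset.from_csv("data.csv")        # one CSV file, or a list of files
ds_json = Dataset.from_json("data.jsonl")    # JSON Lines
ds_text = Dataset.from_text("corpus.txt")    # one example per line, in a "text" column
```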
General Bug fixes and improvements
- Fix ArrowWriter closes stream at exit #1971 (@albertvillanova)
- feat(docs): navigate with left/right arrow keys #1974 (@ydcjeff)
- Fix various typos/grammar in the docs #2008 (@mariosasko)
- Update format columns in Dataset.rename_columns #2027 (@mariosasko)
- Replace print with logging in dataset scripts #2019 (@mariosasko)
- Raise an error for outdated sacrebleu versions #2033 (@lhoestq)
- Not all languages have 2 digit codes. #2016 (@asiddhant)
- Fix arrow memory checks issue in tests #2042 (@lhoestq)
- Support pickle protocol for dataset splits defined as ReadInstruction #2043 (@mariosasko)
- Preserve column ordering in Dataset.rename_column #2045 (@mariosasko)
- Fix text-classification tags #2049 (@gchhablani)
- Fix docstring rendering of Dataset/DatasetDict.from_csv args #2066 (@albertvillanova)
- Fixes check of TF_AVAILABLE and TORCH_AVAILABLE #2073 (@philschmid)
- Add and fix docstring for NamedSplit #2069 (@albertvillanova)
- Bump huggingface_hub version #2077 (@SBrandeis)
- Fix docstring issues #2072 (@albertvillanova)
1.4.1
1.4.0
Datasets Changes
- New: iapp_wiki_qa_squad #1873 (@cstorm125)
- New: Financial PhraseBank #1866 (@frankier)
- New: CoVoST2 #1935 (@patil-suraj)
- New: TIMIT #1903 (@vrindaprabhu)
- New: mLAMA (multilingual LAMA) #1931 (@pdufter)
- New: FewRel #1823 (@gchhablani)
- New: CCAligned Multilingual Dataset #1815 (@gchhablani)
- New: Turkish News Category Lite #1967 (@yavuzKomecoglu)
- Update: WMT - use mirror links for better download speed #1912 (@lhoestq)
- Update: multi_nli - add missing fields #1950 (@bhavitvyamalik)
- Fix: ALT - fix duplicated examples in alt-parallel #1899 (@lhoestq)
- Fix: WMT datasets - fix download errors #1901 (@YangWang92), #1902 (@lhoestq)
- Fix: QA4MRE - fix download URLs #1918 (@M-Salti)
- Fix: Wiki_dpr - fix when with_embeddings is False or index_name is "no_index" #1925 (@lhoestq)
- Fix: Wiki_dpr - add missing scalar quantizer #1926 (@lhoestq)
- Fix: GEM - fix the URL filtering for bad MLSUM examples in GEM #1970 (@yjernite)
Datasets Features
- Add to_dict and to_pandas for Dataset #1889 (@SBrandeis) (see the sketch after this list)
- Add to_csv for Dataset #1887 (@SBrandeis)
- Add keep_linebreaks parameter to text loader #1913 (@lhoestq)
- Add not-in-place implementations for several dataset transforms #1883 (@SBrandeis):
- This introduces new methods for Dataset objects: rename_column, remove_columns, flatten and cast.
- The old in-place methods rename_column_, remove_columns_, flatten_ and cast_ are now deprecated.
- Make DownloadManager downloaded/extracted paths accessible #1846 (@albertvillanova)
- Add cross-platform support for datasets-cli #1951 (@mariosasko)
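
A small sketch of the export helpers and the not-in-place transforms; the file name and column names are illustrative:

```python
from datasets import Dataset

ds = Dataset.from_dict({"text": ["foo", "bar"], "label": [0, 1]})

ds.to_csv("export.csv")   # write to CSV
df = ds.to_pandas()       # pandas DataFrame
d = ds.to_dict()          # plain Python dict of columns

# Not-in-place variants return a new dataset instead of mutating ds.
renamed = ds.rename_column("label", "target")
trimmed = renamed.remove_columns(["text"])
```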
Metrics Changes
Offline loading
- Handle timeouts #1952 (@lhoestq)
- Add datasets full offline mode with HF_DATASETS_OFFLINE #1976 (@lhoestq)
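
A sketch of full offline mode, assuming the dataset is already in the local cache from a previous run; the environment variable must be set before importing datasets:

```python
import os

os.environ["HF_DATASETS_OFFLINE"] = "1"  # or: export HF_DATASETS_OFFLINE=1 in the shell

from datasets import load_dataset

ds = load_dataset("imdb", split="train")  # served from the local cache, no network calls
```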
General improvements and bugfixes
- Replace flatten_nested #1879 (@albertvillanova)
- add missing info on how to add large files #1885 (@stas00)
- Docs for adding new column on formatted dataset #1888 (@lhoestq)
- Fix PandasArrayExtensionArray conversion to native type #1897 (@lhoestq)
- Bugfix for string_to_arrow timestamp[ns] support #1900 (@justin-yan)
- Fix to_pandas for boolean ArrayXD #1904 (@lhoestq)
- Fix logging imports and make all datasets use library logger #1914 (@albertvillanova)
- Standardizing datasets dtypes #1921 (@justin-yan)
- Remove unused py_utils objects #1916 (@albertvillanova)
- Fix save_to_disk with relative path #1923 (@lhoestq)
- Updating old cards #1928 (@mcmillanmajora)
- Improve typing and style and fix some inconsistencies #1929 (@mariosasko)
- Fix builder config creation with data_dir #1932 (@lhoestq)
- Disallow ClassLabel with no names #1938 (@lhoestq)
- Update documentation with not in place transforms and update DatasetDict #1947 (@lhoestq)
- Documentation for to_csv, to_pandas and to_dict #1953 (@lhoestq)
- typos + grammar #1955 (@stas00)
- Fix unused arguments #1962 (@mariosasko)
- Fix metrics collision in separate multiprocessed experiments #1966 (@lhoestq)
1.3.0
Dataset Features
- On-the-fly data transforms (#1795) (see the sketch after this list)
- Add S3 support for downloading and uploading processed datasets (#1723)
- Allow loading dataset in-memory (#1792)
- Support future datasets (#1813)
- Enable/disable caching (#1703)
- Offline dataset loading (#1726)
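
A minimal sketch of an on-the-fly transform and the caching switch; the transform itself is purely illustrative:

```python
from datasets import Dataset, set_caching_enabled

ds = Dataset.from_dict({"text": ["foo", "bar"]})

def upper_case(batch):
    # Applied lazily at access time instead of being materialized on disk.
    return {"text": [t.upper() for t in batch["text"]]}

ds.set_transform(upper_case)
print(ds[:2])  # {'text': ['FOO', 'BAR']}

# Caching of processed datasets can be turned off globally.
set_caching_enabled(False)
```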
Datasets Hub Features
- Loading from the Datasets Hub (#1860)
This allows users to create their own dataset repositories in the Datasets Hub and then load them using the library.
Repositories can be created on the website: https://huggingface.co/new-dataset or using the huggingface-cli. More information in the dataset sharing section of the documentation
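
A sketch of loading a community dataset repository from the Hub; the repository name is a placeholder:

```python
from datasets import load_dataset

ds = load_dataset("username/my_dataset")  # any dataset repo created on the Hub
```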
Dataset Changes
- New: LJ Speech (#1878)
- New: Add Hindi Discourse Analysis Natural Language Inference Dataset (#1822)
- New: cord 19 (#1850)
- New: Tweet Eval Dataset (#1829)
- New: CIFAR-100 Dataset (#1812)
- New: SICK (#1804)
- New: BBC Hindi NLI Dataset (#1158)
- New: Freebase QA Dataset (#1814)
- New: Arabic sarcasm (#1798)
- New: Semantic Scholar Open Research Corpus (#1606)
- New: DuoRC Dataset (#1800)
- New: Aggregated dataset for the GEM benchmark (#1807)
- New: CC-News dataset of English language articles (#1323)
- New: irc disentangle (#1586)
- New: Narrative QA Manual (#1778)
- New: Universal Morphologies (#1174)
- New: SILICONE (#1761)
- New: Librispeech ASR (#1767)
- New: OSCAR (#1694, #1868, #1833)
- New: CANER Corpus (#1684)
- New: Arabic Speech Corpus (#1852)
- New: id_liputan6 (#1740)
- New: Structured Argument Extraction for Korean dataset (#1748)
- New: TurkCorpus (#1732)
- New: Hatexplain Dataset (#1716)
- New: adversarialQA (#1714)
- Update: Doc2dial - reading comprehension update to latest version (#1816)
- Update: OPUS Open Subtitles - add metadata information (#1865)
- Update: SWDA - use all metadata features (#1799)
- Update: SWDA - add metadata and correct splits (#1749)
- Update: CommonGen - update citation information (#1787)
- Update: SciFact - update URL (#1780)
- Update: BrWaC - update features name (#1736)
- Update: TLC - update urls to be github links (#1737)
- Update: Ted Talks IWSLT - add new version: WIT3 (#1676)
- Fix: multi_woz_v22 - fix checksums (#1880)
- Fix: limit - fix url (#1861)
- Fix: WebNLG - fix test set + more fields (#1739)
- Fix: PAWS-X - fix csv Dictreader splitting data on quotes (#1763)
- Fix: reuters - add missing "brief" entries (#1744)
- Fix: thainer: empty token bug (#1734)
- Fix: lst20: empty token bug (#1734)
Metrics Changes
- New: Word Error Metric (#1847)
- New: COMET (#1577, #1753)
- Fix: bert_score - set version dependency (#1851)
Metric Docs
- Add metrics usage examples and tests (#1820)
CLI Changes
- [BREAKING] remove outdated commands (#1869):
- remove outdated "datasets-cli upload_dataset" and "datasets-cli upload_metric"
- instead, use the huggingface-hub CLI
Bug fixes
- fix writing GPU Faiss index (#1862)
- update pyarrow import warning (#1782)
- Ignore definition line number of functions for caching (#1779)
- update saving and loading methods for faiss index so to accept path like objects (#1663)
- Print error message with filename when malformed CSV (#1826)
- Fix default tensors precision when format is set to PyTorch and TensorFlow (#1795)
Refactoring
- Refactoring: Create config module (#1848)
- Use a config id in the cache directory names for custom configs (#1754)
Logging
- Enable logging propagation and remove logging handler (#1845)