Releases · huggingface/datasets
1.9.0
Datasets Changes
- New: C4 #2575 #2592 (@lhoestq)
- New: mC4 #2576 (@lhoestq)
- New: MasakhaNER #2465 (@dadelani)
- New: Eduge #2492 (@enod)
- Update: xor_tydi_qa - update version #2455 (@cccntu)
- Update: kilt-TriviaQA - original answers #2410 (@PaulLerner)
- Update: udpos - change features structure #2466 (@JerryIsHere)
- Update: WebNLG - update checksums #2558 (@lhoestq)
- Fix: climate_fever - adjust indexing for the labels #2464 (@drugilsberg)
- Fix: proto_qa - fix download link #2463 (@mariosasko)
- Fix: ProductReviews - fix label parsing #2530 (@yavuzKomecoglu)
- Fix: DROP - fix DuplicatedKeysError #2545 (@albertvillanova)
- Fix: code_search_net - fix keys #2555 (@lhoestq)
- Fix: discofuse - fix link cc #2541 (@VictorSanh)
- Fix: fever - fix keys #2557 (@lhoestq)
Datasets Features
- Dataset Streaming #2375 #2582 (@lhoestq) (see the examples after this list)
  - Fast, on-the-fly download and processing of your data while iterating over your dataset
  - Works with huge datasets like OSCAR, C4, mC4 and hundreds of other datasets
- JAX integration #2502 (@lhoestq)
- Add Parquet loader + from_parquet and to_parquet #2537 (@lhoestq)
- Implement ClassLabel encoding in JSON loader #2468 (@albertvillanova)
- Set configurable downloaded datasets path #2488 (@albertvillanova)
- Set configurable extracted datasets path #2487 (@albertvillanova)
- Add align_labels_with_mapping function #2457 (@lewtun) #2510 (@lhoestq)
- Add interleave_datasets for map-style datasets #2568 (@lhoestq)
- Add load_dataset_builder #2500 (@mariosasko)
- Support Zstandard compressed files #2578 (@albertvillanova)
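
A short, hedged sketch of the streaming and Parquet items above; the dataset names and the output file name are placeholders, not part of the release notes:

```python
from datasets import load_dataset, Dataset

# Streaming: data is downloaded and processed on the fly while you iterate,
# so even very large datasets (OSCAR, C4, mC4, ...) can be used without a full download.
streamed = load_dataset("oscar", "unshuffled_deduplicated_en", split="train", streaming=True)
for i, example in enumerate(streamed):
    print(example["text"][:80])
    if i == 2:
        break

# Parquet support: export a map-style dataset and load it back.
ds = load_dataset("imdb", split="train")
ds.to_parquet("imdb_train.parquet")
reloaded = Dataset.from_parquet("imdb_train.parquet")
```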
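Similarly, a small sketch of interleave_datasets and load_dataset_builder; the probabilities, seed, column values and dataset name are illustrative:

```python
from datasets import Dataset, interleave_datasets, load_dataset_builder

# Interleave two map-style datasets, sampling from each with the given probabilities.
d1 = Dataset.from_dict({"text": ["a", "b", "c"]})
d2 = Dataset.from_dict({"text": ["x", "y"]})
mixed = interleave_datasets([d1, d2], probabilities=[0.7, 0.3], seed=42)

# Inspect a dataset without downloading the data, via its builder.
builder = load_dataset_builder("imdb")
print(builder.info.description)
print(builder.info.features)
```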
Task templates
- Add task templates for tydiqa and xquad #2518 (@lewtun)
- Insert text classification template for Emotion dataset #2521 (@lewtun)
- Add summarization template #2529 (@lewtun)
- Add task template for automatic speech recognition #2533 (@lewtun)
- Remove task templates if required features are removed during Dataset.map #2540 (@lewtun)
- Inject templates for ASR datasets #2565 (@lewtun)
General improvements and bug fixes
- Allow to use tqdm>=4.50.0 #2482 (@lhoestq)
- Use gc.collect only when needed to avoid slow downs #2483 (@lhoestq)
- Allow latest pyarrow version #2490 (@albertvillanova)
- Use default cast for sliced list arrays if pyarrow >= 4 #2497 (@albertvillanova)
- Add Zenodo metadata file with license #2501 (@albertvillanova)
- add tensorflow-macos support #2493 (@slayerjain)
- Keep original features order #2453 (@albertvillanova)
- Add course banner #2506 (@sgugger)
- Rearrange JSON field names to match passed features schema field names #2507 (@albertvillanova)
- Fix typo in MatthewsCorrelation class name #2517 (@albertvillanova)
- Use scikit-learn package rather than sklearn in setup.py #2525 (@lesteve)
- Improve performance of pandas arrow extractor #2519 (@albertvillanova)
- Fix fingerprint when moving cache dir #2509 (@lhoestq)
- Replace bad n>1M size tag #2527 (@lhoestq)
- Fix dev version #2531 (@lhoestq)
- Sync with transformers disabling NOTSET #2534 (@albertvillanova)
- Fix logging levels #2544 (@albertvillanova)
- Add support for Split.ALL #2259 (@mariosasko)
- Raise FileNotFoundError in WindowsFileLock #2524 (@mariosasko)
- Make numpy arrow extractor faster #2505 (@lhoestq)
- fix Dataset.map when num_procs > num rows #2566 (@connor-mccarthy)
- Add ASR task and new languages to resources #2567 (@lewtun)
- Filter expected warning log from transformers #2571 (@albertvillanova)
- Fix BibTeX entry #2579 (@albertvillanova)
- Fix Counter import #2580 (@albertvillanova)
- Add aiohttp to tests extras require #2587 (@albertvillanova)
- Add language tags #2590 (@lewtun)
- Support pandas 1.3.0 read_csv #2593 (@lhoestq)
Dataset cards
- Updated Dataset Description #2420 (@binny-mathew)
- Update DatasetMetadata and ReadMe #2436 (@gchhablani)
- CRD3 dataset card #2515 (@wilsonyhlee)
- Add license to the Cambridge English Write & Improve + LOCNESS dataset card #2546 (@lhoestq)
- wi_locness: reference latest leaderboard on codalab #2584 (@aseifert)
Docs
- Fix typo: load_datasets → load_dataset #2479 (@julien-c)
- Fix docs custom stable version #2477 (@albertvillanova)
- Improve Features docs #2535 (@albertvillanova)
- Update README.md #2414 (@cryoff)
- Fix FileSystems documentation #2551 (@connor-mccarthy)
- Minor fix in loading metrics docs #2562 (@albertvillanova)
- Minor fix docs format for bertscore #2570 (@albertvillanova)
- Add streaming in load a dataset docs #2574 (@lhoestq)
1.8.0
Datasets Changes
- New: Microsoft CodeXGlue Datasets #2357 (@madlag @ncoop57)
- New: KLUE benchmark #2416 (@jungwhank)
- New: HendrycksTest #2370 (@andyzoujm)
- Update: xor_tydi_qa - update url to v1.1 #2449 (@cccntu)
- Fix: adversarial_qa - DuplicatedKeysError #2433 (@mariosasko)
- Fix: bn_hate_speech and covid_tweets_japanese - fix broken URLs #2445 (@lewtun)
- Fix: flores - fix download link #2448 (@mariosasko)
Datasets Features
- Add desc parameter in map for DatasetDict object #2423 (@bhavitvyamalik)
- Support sliced list arrays in cast #2461 (@lhoestq)
  - Dataset.cast can now change the feature types of Sequence fields
- Revert default in-memory for small datasets #2460 (@albertvillanova). Breaking:
  - the datasets IN_MEMORY_MAX_SIZE config used to default to 250MB
  - it now defaults to zero: by default, datasets are loaded from disk with memory mapping and are not copied into memory
  - users can still pass keep_in_memory=True when loading a dataset to load it in memory (see the sketch after this list)
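
A minimal sketch of the keep_in_memory behavior and the new desc parameter; the dataset name and the map function are illustrative:

```python
from datasets import load_dataset

# Default: the dataset is memory-mapped from disk, not copied into RAM.
ds = load_dataset("imdb")

# Opt back in to fully in-memory loading.
ds_in_memory = load_dataset("imdb", keep_in_memory=True)

# The new desc parameter labels the progress bar when mapping a DatasetDict.
ds = ds.map(lambda example: {"n_chars": len(example["text"])}, desc="counting characters")
```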
Datasets Cards
- adds license information for DailyDialog. #2419 (@aditya2211)
- add english language tags for ~100 datasets #2442 (@VictorSanh)
- Add copyright info to MLSUM dataset #2427 (@PhilipMay)
- Add copyright info for wiki_lingua dataset #2428 (@PhilipMay)
- Mention that there are no answers in adversarial_qa test set #2451 (@lhoestq)
General improvements and bug fixes
- Add DOI badge to README #2411 (@albertvillanova)
- Make datasets PEP-561 compliant #2417 (@SBrandeis)
- Fix save_to_disk nested features order in dataset_info.json #2422 (@lhoestq)
- Fix CI six installation on linux #2432 (@lhoestq)
- Fix Docstring Mistake: dataset vs. metric #2425 (@PhilipMay)
- Fix NQ features loading: reorder fields of features to match nested fields order in arrow data #2438 (@lhoestq)
- doc: fix typo HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES #2421 (@borisdayma)
- add utf-8 while reading README #2418 (@bhavitvyamalik)
- Better error message when trying to access elements of a DatasetDict without specifying the split #2439 (@lhoestq)
- Rename config and environment variable for in memory max size #2454 (@albertvillanova)
- Add version-specific BibTeX #2430 (@albertvillanova)
- Fix cross-reference typos in documentation #2456 (@albertvillanova)
- Better error message when using the wrong load_from_disk #2437 (@lhoestq)
Experimental and work in progress: Format a dataset for specific tasks
1.7.0
Dataset Changes
- New: NLU evaluation data #2238 (@dkajtoch)
- New: Add SLR32, SLR52, SLR53 to OpenSLR #2241, #2311 (@cahya-wirawan)
- New: Bbaw egyptian #2290 (@phiwi)
- New: GooAQ #2260 (@bhavitvyamalik)
- New: SubjQA #2302 (@lewtun)
- New: Ascent KB #2341, #2349 (@phongnt570)
- New: HLGD #2325 (@tingofurro)
- New: Qasper #2346 (@cceyda)
- New: ConvQuestions benchmark #2372 (@PhilippChr)
- Update: Wikihow - Clarify how to load wikihow #2240 (@albertvillanova)
- Update: multi_woz_v22 - update checksum #2281 (@lhoestq)
- Update: OSCAR - Set encoding in OSCAR dataset #2321 (@albertvillanova)
- Update: XTREME - Enable auto-download for PAN-X / Wikiann domain in XTREME #2326 (@lewtun)
- Update: GEM - update the DART file checksums #2334 (@yjernite)
- Update: web_science - fixed download link #2338 (@bhavitvyamalik)
- Update: SNLI, MNLI - update READMEs for SNLI and MNLI #2364 (@bhavitvyamalik)
- Update: conll2003 - correct labels #2369 (@philschmid)
- Update: offenseval_dravidian - update citations #2385 (@adeepH)
- Update: ai2_arc - Add dataset tags #2405 (@OyvindTafjord)
- Fix: newsph_nli - test data added, dataset_infos updated #2263 (@bhavitvyamalik)
- Fix: hyperpartisan news detection - Remove getchildren #2367 (@ghomasHudson)
- Fix: indic_glue - Fix number of classes in indic_glue sna.bn dataset #2397 (@albertvillanova)
- Fix: head_qa - Fix keys #2408 (@lhoestq)
Dataset Features
- Implement Dataset add_item #1870 (@albertvillanova)
- Implement Dataset add_column #2145 (@albertvillanova)
- Implement Dataset to JSON #2248, #2352 (@albertvillanova)
- Add rename_columns method #2312 (@SBrandeis)
- Add desc to tqdm in Dataset.map() #2374 (@bhavitvyamalik)
- Add env variable HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES #2399, #2409 (@albertvillanova)
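
A short, hedged sketch of several of the features above; the column names and values are illustrative:

```python
from datasets import Dataset

ds = Dataset.from_dict({"text": ["hello", "world"]})

# Append a whole column, then a single row.
ds = ds.add_column("label", [0, 1])
ds = ds.add_item({"text": "third example", "label": 1})

# Export to JSON Lines.
ds.to_json("example.jsonl")

# desc now labels the tqdm bar shown by Dataset.map().
ds = ds.map(lambda example: {"n_chars": len(example["text"])}, desc="counting characters")
```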
Metric Changes
- New: CUAD metrics #2273 (@bhavitvyamalik)
- New: Matthews/Pearson/Spearman correlation metrics #2328 (@lhoestq)
- Update: CER - Docs, CER above 1 #2342 (@borisdayma)
General improvements and bug fixes
- Update black #2265 (@lhoestq)
- Fix incorrect update_metadata_with_features calls in ArrowDataset #2258 (@mariosasko)
- Faster map w/ input_columns & faster slicing w/ Iterable keys #2246 (@norabelrose)
- Don't use pyarrow 4.0.0 since it segfaults when casting a sliced ListArray of integers #2268 (@lhoestq)
- Fix query table with iterable #2269 (@lhoestq)
- Perform minor refactoring: use config #2253 (@albertvillanova)
- Update format, fingerprint and indices after add_item #2254 (@lhoestq)
- Always update metadata in arrow schema #2274 (@lhoestq)
- Make tests run faster #2266 (@lhoestq)
- Fix metadata validation with config names #2286 (@lhoestq)
- Fixed typo seperate->separate #2292 (@laksh9950)
- Allow collaborators to self-assign issues #2289 (@albertvillanova)
- Mapping in the distributed setting #2298 (@TevenLeScao)
- Fix conda release #2309 (@lhoestq)
- Fix incorrect version specification for the pyarrow package #2317 (@cemilcengiz)
- Set default name in init_dynamic_modules #2320 (@albertvillanova)
- Fix duplicate keys #2333 (@lhoestq)
- Add note about indices mapping in save_to_disk docstring #2332 (@lhoestq)
- Metadata validation #2107 (@theo-m)
- Add Validation For README #2121 (@gchhablani)
- Fix overflow issue in interpolation search #2336 (@mariosasko)
- Datasets cli improvements #2315 (@mariosasko)
- Add key type and duplicates verification with hashing #2245 (@NikhilBartwal)
- More consistent copy logic #2340 (@mariosasko)
- Update README validation rules #2353 (@gchhablani)
- normalized TOCs and titles in data cards #2355 (@yjernite)
- simplify faiss index save #2351 (@Guitaricet)
- Allow "other-X" in licenses #2368 (@gchhablani)
- Improve ReadInstruction logic and update docs #2261 (@mariosasko)
- Disallow duplicate keys in yaml tags #2379 (@lhoestq)
- maintain YAML structure reading from README #2380 (@bhavitvyamalik)
- add dataset card title #2381 (@bhavitvyamalik)
- Add tests for dataset cards #2348 (@gchhablani)
- Improve example in rounding docs #2383 (@mariosasko)
- Paperswithcode dataset mapping #2404 (@julien-c)
- Free datasets with cache file in temp dir on exit #2403 (@mariosasko)
Experimental and work in progress: Format a dataset for specific tasks
- Task formatting for text classification & question answering #2255 (@SBrandeis)
- Add check for task templates on dataset load #2390 (@lewtun)
- Add args description to DatasetInfo #2384 (@lewtun)
- Improve task api code quality #2376 (@mariosasko)
1.6.2
Fix memory issue: don't copy recordbatches in memory during a table deepcopy #2291 (@lhoestq)
This affected methods like concatenate_datasets, multiprocessed map and load_from_disk.
Breaking change:
- when using Dataset.map with the input_columns parameter, the resulting dataset will only have the columns from input_columns plus the columns added by the map function; the other columns are discarded (see the sketch below).
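
A toy sketch of the new behavior; the column names are illustrative:

```python
from datasets import Dataset

ds = Dataset.from_dict({"text": ["foo", "bar"], "label": [0, 1]})

# The mapped function receives only the listed column(s); the result keeps
# those columns plus the newly added ones, and drops everything else.
mapped = ds.map(lambda text: {"n_chars": len(text)}, input_columns=["text"])
print(mapped.column_names)  # ['text', 'n_chars'] -- 'label' is discarded
```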
1.6.1
1.6.0
Dataset changes
- New: MOROCO #2002 (@MihaelaGaman)
- New: CBT dataset #2044 (@gchhablani)
- New: MDD Dataset #2051 (@gchhablani)
- New: Multilingual dIalogAct benchMark (miam) #2047 (@eusip)
- New: bAbI QA tasks #2053 (@gchhablani)
- New: machine translated multilingual STS benchmark dataset #2090 (@PhilipMay)
- New: EURLEX legal NLP dataset #2114 (@iliaschalkidis)
- New: ECtHR legal NLP dataset #2114 (@iliaschalkidis)
- New: EU-REG-IR legal NLP dataset #2114 (@iliaschalkidis)
- New: NorNE dataset for Norwegian POS and NER #2154 (@versae)
- New: banking77 #2140 (@dkajtoch)
- New: OpenSLR #2173 #2215 #2221 (@cahya-wirawan)
- New: CUAD dataset #2219 (@bhavitvyamalik)
- Update: GEM V1.1 + new challenge sets #2142 #2186 (@yjernite)
- Update: Wikiann - added spans field #2141 (@rabeehk)
- Update: XTREME - Add tel to xtreme tatoeba #2180 (@lhoestq)
- Update: GLUE MRPC - added real label to test set #2216 (@philschmid)
- Fix: MultiWoz22 - fix dialogue action slot name and value #2136 (@adamlin120)
- Fix: wikiauto - fix link #2171 (@mounicam)
- Fix: wino_bias - use right splits #1930 (@JieyuZhao)
- Fix: lc_quad - update download checksum #2213 (@mariosasko)
- Fix: newsgroup - fix one instance of 'train' to 'test' #2225 (@alexwdong)
- Fix: xnli - fix tuple key #2233 (@NikhilBartwal)
Dataset features
- Allow stateful function in dataset.map #1960 (@mariosasko)
- MIAM dataset - new citation details #2101 (@eusip)
- [Refactor] Use in-memory/memory-mapped/concatenation tables in Dataset #2025 (@lhoestq)
- Allow pickling of big in-memory tables #2150 (@lhoestq)
- updated user permissions based on umask #2086 #2157 (@bhavitvyamalik)
- Fast table queries with interpolation search #2122 (@lhoestq)
- Concat only unique fields in DatasetInfo.from_merge #2163 (@mariosasko)
- Implementation of class_encode_column #2184 #2227 (@SBrandeis) (see the example after this list)
- Add support for axis in concatenate datasets #2151 (@albertvillanova)
- Set default in-memory value depending on the dataset size #2182 (@albertvillanova)
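
A brief sketch of class_encode_column and the new axis argument of concatenate_datasets; the column names and values are placeholders:

```python
from datasets import Dataset, concatenate_datasets

ds = Dataset.from_dict({"text": ["a", "b", "c"], "label": ["pos", "neg", "pos"]})

# Turn a string column into a ClassLabel feature (strings become integer ids).
ds = ds.class_encode_column("label")
print(ds.features["label"].names)

# Concatenate column-wise with axis=1 (datasets must have the same number of rows).
extra = Dataset.from_dict({"score": [0.1, 0.5, 0.9]})
combined = concatenate_datasets([ds, extra], axis=1)
print(combined.column_names)  # ['text', 'label', 'score']
```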
Metrics changes
- New: CER metric #2138 (@chutaklee)
- Update: WER - Compute metric iteratively #2111 (@albertvillanova)
- Update: seqeval - configurable options for the seqeval metric #2204 (@marrodion)
Dataset cards
- REFreSD: Updated card using information from data statement and datasheet #2082 (@mcmillanmajora)
- WinoBias: fix split infos #2152 (@JieyuZhao)
- all: Fix size categories in YAML Tags #2074 (@gchhablani)
- LinCE: Updating citation information on LinCE readme #2205 (@gaguilar)
- Swda: Update README.md #2235 (@PierreColombo)
General improvements and bug fixes
- Refactorize Metric.compute signature to force keyword arguments only #2079 (@albertvillanova)
- Fix max_wait_time in requests #2085 (@lhoestq)
- Fix copy snippet in docs #2091 (@mariosasko)
- Fix deprecated warning message and docstring #2100 (@albertvillanova)
- Move Dataset.to_csv to csv module #2102 (@albertvillanova)
- Fix: Allow a feature to be named "_type" #2093 (@dcfidalgo)
- copy.deepcopy os.environ instead of copy #2119 (@NihalHarish)
- Replace legacy torch.Tensor constructor with torch.tensor #2126 (@mariosasko)
- Implement Dataset as context manager #2113 (@albertvillanova)
- Fix missing infos from concurrent dataset loading #2137 (@lhoestq)
- Pin fsspec lower than 0.9.0 #2172 (@lhoestq)
- Replace assertTrue(isinstance with assertIsInstance in tests #2164 (@mariosasko)
- add social thumbnail #2177 (@philschmid)
- Fix s3fs tests for py36 and py37+ #2183 (@lhoestq)
- Fix typo in huggingface hub #2192 (@LysandreJik)
- Update metadata if dataset features are modified #2087 (@mariosasko)
- fix missing indices_files in load_from_disk #2197 (@lhoestq)
- Fix backward compatibility in Dataset.load_from_disk #2199 (@albertvillanova)
- Fix ArrowWriter overwriting features in ArrowBasedBuilder #2201 (@lhoestq)
- Fix incorrect assertion in builder.py #2110 (@dreamgonfly)
- Remove Python2 leftovers #2208 (@mariosasko)
- Revert breaking change in cache_files property #2217 (@lhoestq)
- Set test cache config #2223 (@albertvillanova)
- Fix map when removing columns on a formatted dataset #2231 (@lhoestq)
- Refactorize tests to use Dataset as context manager #2191 (@albertvillanova)
- Preserve split type when reloading dataset #2168 (@mariosasko)
Docs
- make documentation more clear to use different cloud storage #2127 (@philschmid)
- Render docstring return type as inline #2147 (@albertvillanova)
- Add table classes to the documentation #2155 (@lhoestq)
- Pin docutils for better doc #2174 (@sgugger)
- Fix docstrings issues #2081 (@albertvillanova)
- Add code of conduct to the project #2209 (@albertvillanova)
- Add classes GenerateMode, DownloadConfig and Version to the documentation #2202 (@albertvillanova)
- Fix bash snippet formatting in ADD_NEW_DATASET.md #2234 (@mariosasko)
1.5.0
Datasets changes
- New: Europarl Bilingual #1874 (@lucadiliello)
- New: Stanford Sentiment Treebank #1961 (@patpizio)
- New: RO-STS #1978 (@lorinczb)
- New: newspop #1871 (@frankier)
- New: FashionMNIST #1999 (@gchhablani)
- New: Common voice #1886 (@BirgerMoell), #2063 (@patrickvonplaten)
- New: Cryptonite #2013 (@theo-m)
- New: RoSent #2011 (@gchhablani)
- New: PersiNLU reading-comprehension #2028 (@danyaljj)
- New: conllpp #1991 (@ZihanWangKi)
- New: LaRoSeDa #2004 (@MihaelaGaman)
- Update: remove unnecessary docstart check in conll-like datasets #2020 (@mariosasko)
- Update: semeval 2020 task 11 - add article_id and process test set template #1979 (@hemildesai)
- Update: Md gender - card update #2018 (@mcmillanmajora)
- Update: XQuAD - add Romanian #2023 (@M-Salti)
- Update: DROP - all answers #1980 (@KaijuML)
- Fix: TIMIT ASR - Make sure not only the first sample is used #1995 (@patrickvonplaten)
- Fix: Wikipedia - save memory by replacing root.clear with elem.clear #2037 (@miyamonz)
- Fix: Doc2dial update data_infos and data_loaders #2041 (@songfeng)
- Fix: ZEST - update download link #2057 (@matt-peters)
- Fix: ted_talks_iwslt - fix version error #2064 (@mariosasko)
Datasets Features
- Implement Dataset from CSV #1946 (@albertvillanova)
- Implement Dataset from JSON and JSON Lines #1943 (@albertvillanova)
- Implement Dataset from text #2030 (@albertvillanova) (see the sketch after this list)
- Optimize int precision for tokenization #1985 (@albertvillanova)
  - This allows saving 75%+ of space when tokenizing a dataset
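
A hedged sketch of the new constructors; the file paths are placeholders:

```python
from datasets import Dataset

ds_csv = Dataset.from_csv("data.csv")        # one CSV file, or a list of files
ds_json = Dataset.from_json("data.jsonl")    # JSON Lines
ds_text = Dataset.from_text("corpus.txt")    # one example per line, in a "text" column
```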
General Bug fixes and improvements
- Fix ArrowWriter closes stream at exit #1971 (@albertvillanova)
- feat(docs): navigate with left/right arrow keys #1974 (@ydcjeff)
- Fix various typos/grammar in the docs #2008 (@mariosasko)
- Update format columns in Dataset.rename_columns #2027 (@mariosasko)
- Replace print with logging in dataset scripts #2019 (@mariosasko)
- Raise an error for outdated sacrebleu versions #2033 (@lhoestq)
- Not all languages have 2 digit codes. #2016 (@asiddhant)
- Fix arrow memory checks issue in tests #2042 (@lhoestq)
- Support pickle protocol for dataset splits defined as ReadInstruction #2043 (@mariosasko)
- Preserve column ordering in Dataset.rename_column #2045 (@mariosasko)
- Fix text-classification tags #2049 (@gchhablani)
- Fix docstring rendering of Dataset/DatasetDict.from_csv args #2066 (@albertvillanova)
- Fixes check of TF_AVAILABLE and TORCH_AVAILABLE #2073 (@philschmid)
- Add and fix docstring for NamedSplit #2069 (@albertvillanova)
- Bump huggingface_hub version #2077 (@SBrandeis)
- Fix docstring issues #2072 (@albertvillanova)
1.4.1
1.4.0
Datasets Changes
- New: iapp_wiki_qa_squad #1873 (@cstorm125)
- New: Financial PhraseBank #1866 (@frankier)
- New: CoVoST2 #1935 (@patil-suraj)
- New: TIMIT #1903 (@vrindaprabhu)
- New: mLAMA (multilingual LAMA) #1931 (@pdufter)
- New: FewRel #1823 (@gchhablani)
- New: CCAligned Multilingual Dataset #1815 (@gchhablani)
- New: Turkish News Category Lite #1967 (@yavuzKomecoglu)
- Update: WMT - use mirror links for better download speed #1912 (@lhoestq)
- Update: multi_nli - add missing fields #1950 (@bhavitvyamalik)
- Fix: ALT - fix duplicated examples in alt-parallel #1899 (@lhoestq)
- Fix: WMT datasets - fix download errors #1901 (@YangWang92), #1902 (@lhoestq)
- Fix: QA4MRE - fix download URLs #1918 (@M-Salti)
- Fix: Wiki_dpr - fix when with_embeddings is False or index_name is "no_index" #1925 (@lhoestq)
- Fix: Wiki_dpr - add missing scalar quantizer #1926 (@lhoestq)
- Fix: GEM - fix the URL filtering for bad MLSUM examples in GEM #1970 (@yjernite)
Datasets Features
- Add to_dict and to_pandas for Dataset #1889 (@SBrandeis) (see the sketch after this list)
- Add to_csv for Dataset #1887 (@SBrandeis)
- Add keep_linebreaks parameter to text loader #1913 (@lhoestq)
- Add not-in-place implementations for several dataset transforms #1883 (@SBrandeis):
- This introduces new methods for Dataset objects: rename_column, remove_columns, flatten and cast.
- The old in-place methods rename_column_, remove_columns_, flatten_ and cast_ are now deprecated.
- Make DownloadManager downloaded/extracted paths accessible #1846 (@albertvillanova)
- Add cross-platform support for datasets-cli #1951 (@mariosasko)
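
A small sketch of the export helpers and the not-in-place transforms; the file name and column names are illustrative:

```python
from datasets import Dataset

ds = Dataset.from_dict({"text": ["foo", "bar"], "label": [0, 1]})

ds.to_csv("export.csv")   # write to CSV
df = ds.to_pandas()       # pandas DataFrame
d = ds.to_dict()          # plain Python dict of columns

# Not-in-place variants return a new dataset instead of mutating ds.
renamed = ds.rename_column("label", "target")
trimmed = renamed.remove_columns(["text"])
```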
Metrics Changes
Offline loading
- Handle timeouts #1952 (@lhoestq)
- Add datasets full offline mode with HF_DATASETS_OFFLINE #1976 (@lhoestq)
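
A sketch of full offline mode, assuming the dataset is already in the local cache from a previous run; the environment variable must be set before importing datasets:

```python
import os

os.environ["HF_DATASETS_OFFLINE"] = "1"  # or: export HF_DATASETS_OFFLINE=1 in the shell

from datasets import load_dataset

ds = load_dataset("imdb", split="train")  # served from the local cache, no network calls
```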
General improvements and bugfixes
- Replace flatten_nested #1879 (@albertvillanova)
- add missing info on how to add large files #1885 (@stas00)
- Docs for adding new column on formatted dataset #1888 (@lhoestq)
- Fix PandasArrayExtensionArray conversion to native type #1897 (@lhoestq)
- Bugfix for string_to_arrow timestamp[ns] support #1900 (@justin-yan)
- Fix to_pandas for boolean ArrayXD #1904 (@lhoestq)
- Fix logging imports and make all datasets use library logger #1914 (@albertvillanova)
- Standardizing datasets dtypes #1921 (@justin-yan)
- Remove unused py_utils objects #1916 (@albertvillanova)
- Fix save_to_disk with relative path #1923 (@lhoestq)
- Updating old cards #1928 (@mcmillanmajora)
- Improve typing and style and fix some inconsistencies #1929 (@mariosasko)
- Fix builder config creation with data_dir #1932 (@lhoestq)
- Disallow ClassLabel with no names #1938 (@lhoestq)
- Update documentation with not in place transforms and update DatasetDict #1947 (@lhoestq)
- Documentation for to_csv, to_pandas and to_dict #1953 (@lhoestq)
- typos + grammar #1955 (@stas00)
- Fix unused arguments #1962 (@mariosasko)
- Fix metrics collision in separate multiprocessed experiments #1966 (@lhoestq)
1.3.0
Dataset Features
- On-the-fly data transforms (#1795) (see the sketch after this list)
- Add S3 support for downloading and uploading processed datasets (#1723)
- Allow loading dataset in-memory (#1792)
- Support future datasets (#1813)
- Enable/disable caching (#1703)
- Offline dataset loading (#1726)
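
A minimal sketch of an on-the-fly transform and the caching switch; the transform itself is purely illustrative:

```python
from datasets import Dataset, set_caching_enabled

ds = Dataset.from_dict({"text": ["foo", "bar"]})

def upper_case(batch):
    # Applied lazily at access time instead of being materialized on disk.
    return {"text": [t.upper() for t in batch["text"]]}

ds.set_transform(upper_case)
print(ds[:2])  # {'text': ['FOO', 'BAR']}

# Caching of processed datasets can be turned off globally.
set_caching_enabled(False)
```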
Datasets Hub Features
- Loading from the Datasets Hub (#1860)
This allows users to create their own dataset repositories in the Datasets Hub and then load them using the library.
Repositories can be created on the website: https://huggingface.co/new-dataset or using the huggingface-cli. More information in the dataset sharing section of the documentation
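
A sketch of loading a community dataset repository from the Hub; the repository name is a placeholder:

```python
from datasets import load_dataset

ds = load_dataset("username/my_dataset")  # any dataset repo created on the Hub
```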
Dataset Changes
- New: LJ Speech (#1878)
- New: Add Hindi Discourse Analysis Natural Language Inference Dataset (#1822)
- New: cord 19 (#1850)
- New: Tweet Eval Dataset (#1829)
- New: CIFAR-100 Dataset (#1812)
- New: SICK (#1804)
- New: BBC Hindi NLI Dataset (#1158)
- New: Freebase QA Dataset (#1814)
- New: Arabic sarcasm (#1798)
- New: Semantic Scholar Open Research Corpus (#1606)
- New: DuoRC Dataset (#1800)
- New: Aggregated dataset for the GEM benchmark (#1807)
- New: CC-News dataset of English language articles (#1323)
- New: irc disentangle (#1586)
- New: Narrative QA Manual (#1778)
- New: Universal Morphologies (#1174)
- New: SILICONE (#1761)
- New: Librispeech ASR (#1767)
- New: OSCAR (#1694, #1868, #1833)
- New: CANER Corpus (#1684)
- New: Arabic Speech Corpus (#1852)
- New: id_liputan6 (#1740)
- New: Structured Argument Extraction for Korean dataset (#1748)
- New: TurkCorpus (#1732)
- New: Hatexplain Dataset (#1716)
- New: adversarialQA (#1714)
- Update: Doc2dial - reading comprehension update to latest version (#1816)
- Update: OPUS Open Subtitles - add metadata information (#1865)
- Update: SWDA - use all metadata features (#1799)
- Update: SWDA - add metadata and correct splits (#1749)
- Update: CommonGen - update citation information (#1787)
- Update: SciFact - update URL (#1780)
- Update: BrWaC - update features name (#1736)
- Update: TLC - update urls to be github links (#1737)
- Update: Ted Talks IWSLT - add new version: WIT3 (#1676)
- Fix: multi_woz_v22 - fix checksums (#1880)
- Fix: limit - fix url (#1861)
- Fix: WebNLG - fix test set + more fields (#1739)
- Fix: PAWS-X - fix csv Dictreader splitting data on quotes (#1763)
- Fix: reuters - add missing "brief" entries (#1744)
- Fix: thainer: empty token bug (#1734)
- Fix: lst20: empty token bug (#1734)
Metrics Changes
- New: Word Error Metric (#1847)
- New: COMET (#1577, #1753)
- Fix: bert_score - set version dependency (#1851)
Metric Docs
- Add metrics usage examples and tests (#1820)
CLI Changes
- [BREAKING] remove outdated commands (#1869):
- remove outdated "datasets-cli upload_dataset" and "datasets-cli upload_metric"
- instead, use the huggingface-hub CLI
Bug fixes
- fix writing GPU Faiss index (#1862)
- update pyarrow import warning (#1782)
- Ignore definition line number of functions for caching (#1779)
- update saving and loading methods for faiss index so to accept path like objects (#1663)
- Print error message with filename when malformed CSV (#1826)
- Fix default tensors precision when format is set to PyTorch and TensorFlow (#1795)
Refactoring
- Refactoring: Create config module (#1848)
- Use a config id in the cache directory names for custom configs (#1754)
Logging
- Enable logging propagation and remove logging handler (#1845)