Dataset changes

New: MOROCO #2002 (@MihaelaGaman)
New: CBT dataset #2044 (@gchhablani)
New: MDD Dataset #2051 (@gchhablani)
New: Multilingual dIalogAct benchMark (miam) #2047 (@eusip)
New: bAbI QA tasks #2053 (@gchhablani)
New: machine translated multilingual STS benchmark dataset #2090 (@PhilipMay)
New: EURLEX legal NLP dataset #2114 (@iliaschalkidis)
New: ECtHR legal NLP dataset #2114 (@iliaschalkidis)
New: EU-REG-IR legal NLP dataset #2114 (@iliaschalkidis)
New: NorNE dataset for Norwegian POS and NER #2154 (@versae)
New: banking77 #2140 (@dkajtoch)
New: OpenSLR #2173 #2215 #2221 (@cahya-wirawan)
New: CUAD dataset #2219 (@bhavitvyamalik)
Update: Gem V1.1 + new challenge sets #2142 #2186 (@yjernite)
Update: Wikiann - added spans field #2141 (@rabeehk)
Update: XTREME - Add tel to xtreme tatoeba #2180 (@lhoestq)
Update: GLUE MRPC - added real label to test set #2216 (@philschmid)
Fix: MultiWoz22 - fix dialogue action slot name and value #2136 (@adamlin120)
Fix: wikiauto - fix link #2171 (@mounicam)
Fix: wino_bias - use right splits #1930 (@JieyuZhao)
Fix: lc_quad - update download checksum #2213 (@mariosasko)
Fix: newsgroup - fix one instance of 'train' to 'test' #2225 (@alexwdong)
Fix: xnli - fix tuple key #2233 (@NikhilBartwal)

Dataset features

Allow stateful function in dataset.map #1960 (@mariosasko)
MIAM dataset - new citation details #2101 (@eusip)
[Refactor] Use in-memory/memory-mapped/concatenation tables in Dataset #2025 (@lhoestq)
Allow pickling of big in-memory tables #2150 (@lhoestq)
Update user permissions based on umask #2086 #2157 (@bhavitvyamalik)
Fast table queries with interpolation search #2122 (@lhoestq)
Concat only unique fields in DatasetInfo.from_merge #2163 (@mariosasko)
Implementation of class_encode_column #2184 #2227 (@SBrandeis)
Add support for axis in concatenate datasets #2151 (@albertvillanova)
Set default in-memory value depending on the dataset size #2182 (@albertvillanova)
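
Note on #2122: interpolation search is a general technique for finding a position in sorted, roughly evenly spaced values. The sketch below shows the idea for locating which batch of a table holds a given row. It is an illustration under stated assumptions, not the library's implementation: the function name `find_batch` and the `offsets` layout (cumulative row counts starting at 0) are hypothetical.

```python
def find_batch(offsets, row):
    """Return i such that offsets[i] <= row < offsets[i + 1].

    `offsets` is a sorted list of cumulative row counts starting at 0
    (hypothetical layout, chosen for this illustration).
    """
    if len(offsets) < 2 or not (offsets[0] <= row < offsets[-1]):
        raise IndexError(f"row {row} is out of range")
    lo, hi = 0, len(offsets) - 1
    # Invariant: offsets[lo] <= row < offsets[hi]
    while True:
        # Guess the position by assuming batches have similar sizes.
        span = offsets[hi] - offsets[lo]
        guess = lo + (row - offsets[lo]) * (hi - lo) // span
        if offsets[guess] <= row < offsets[guess + 1]:
            return guess
        if offsets[guess] <= row:
            lo = guess + 1  # the row lies in a later batch
        else:
            hi = guess      # the row lies in an earlier batch


# Four batches holding 10, 15, 5 and 70 rows.
offsets = [0, 10, 25, 30, 100]
batch_index = find_batch(offsets, 29)
```

When batch sizes are close to uniform, the first guess is usually correct or very near, which is why this can beat plain binary search for row lookups.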

Metrics changes

New: CER metric #2138 (@chutaklee)
Update: WER - Compute metric iteratively #2111 (@albertvillanova)
Update: seqeval - configurable options to seqeval metric #2204 (@marrodion)
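
Note on #2138 and #2111: character error rate (CER) is commonly defined as the character-level edit distance between prediction and reference, divided by the reference length. The sketch below computes it pair by pair, accumulating counts rather than joining all texts first. It is a plain-Python illustration of the definition, not the code shipped with the metric; the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, one row at a time."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def char_error_rate(predictions, references):
    """Accumulate edits and reference lengths over all pairs."""
    errors = 0
    total = 0
    for pred, ref in zip(predictions, references):
        errors += edit_distance(ref, pred)
        total += len(ref)
    return errors / total


score = char_error_rate(["abd", "x"], ["abc", "xy"])
```

Accumulating per pair keeps memory use bounded by the longest single example, which is the motivation usually given for iterative computation.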

Dataset cards

REFreSD: Updated card using information from data statement and datasheet #2082 (@mcmillanmajora)
Winobias: fix split infos #2152 (@JieyuZhao)
all: Fix size categories in YAML Tags #2074 (@gchhablani)
LinCE: Updating citation information on LinCE readme #2205 (@gaguilar)
Swda: Update README.md #2235 (@PierreColombo)

General improvements and bug fixes

Refactorize Metric.compute signature to force keyword arguments only #2079 (@albertvillanova)
Fix max_wait_time in requests #2085 (@lhoestq)
Fix copy snippet in docs #2091 (@mariosasko)
Fix deprecated warning message and docstring #2100 (@albertvillanova)
Move Dataset.to_csv to csv module #2102 (@albertvillanova)
Fix: Allows a feature to be named "_type" #2093 (@dcfidalgo)
Use copy.deepcopy on os.environ instead of copy #2119 (@NihalHarish)
Replace legacy torch.Tensor constructor with torch.tensor #2126 (@mariosasko)
Implement Dataset as context manager #2113 (@albertvillanova)
Fix missing infos from concurrent dataset loading #2137 (@lhoestq)
Pin fsspec lower than 0.9.0 #2172 (@lhoestq)
Replace assertTrue(isinstance with assertIsInstance in tests #2164 (@mariosasko)
Add social thumbnail #2177 (@philschmid)
Fix s3fs tests for py36 and py37+ #2183 (@lhoestq)
Fix typo in huggingface hub #2192 (@LysandreJik)
Update metadata if dataset features are modified #2087 (@mariosasko)
Fix missing indices_files in load_from_disk #2197 (@lhoestq)
Fix backward compatibility in Dataset.load_from_disk #2199 (@albertvillanova)
Fix ArrowWriter overwriting features in ArrowBasedBuilder #2201 (@lhoestq)
Fix incorrect assertion in builder.py #2110 (@dreamgonfly)
Remove Python2 leftovers #2208 (@mariosasko)
Revert breaking change in cache_files property #2217 (@lhoestq)
Set test cache config #2223 (@albertvillanova)
Fix map when removing columns on a formatted dataset #2231 (@lhoestq)
Refactorize tests to use Dataset as context manager #2191 (@albertvillanova)
Preserve split type when reloading dataset #2168 (@mariosasko)
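
Note on #2079: forcing keyword-only arguments means callers must name each argument, so predictions and references cannot be swapped by accident. The sketch below shows the Python mechanism (a bare `*` in the signature) on a hypothetical `ToyAccuracy` class. It does not reproduce the library's `Metric` class.

```python
class ToyAccuracy:
    """Hypothetical metric used only to illustrate keyword-only arguments."""

    def compute(self, *, predictions, references):
        # The bare `*` makes every following parameter keyword-only.
        if len(predictions) != len(references):
            raise ValueError("predictions and references differ in length")
        correct = sum(p == r for p, r in zip(predictions, references))
        return {"accuracy": correct / len(references)}


metric = ToyAccuracy()
result = metric.compute(predictions=[1, 0, 1, 1], references=[1, 0, 0, 1])

try:
    # Positional arguments are rejected by Python itself.
    metric.compute([1, 0], [1, 0])
except TypeError:
    positional_call_rejected = True
else:
    positional_call_rejected = False
```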

Docs

Make the documentation clearer about using different cloud storage #2127 (@philschmid)
Render docstring return type as inline #2147 (@albertvillanova)
Add table classes to the documentation #2155 (@lhoestq)
Pin docutils for better doc #2174 (@sgugger)
Fix docstring issues #2081 (@albertvillanova)
Add code of conduct to the project #2209 (@albertvillanova)
Add classes GenerateMode, DownloadConfig and Version to the documentation #2202 (@albertvillanova)
Fix bash snippet formatting in ADD_NEW_DATASET.md #2234 (@mariosasko)

1.6.0
