1.6.0
Dataset changes
- New: MOROCO #2002 (@MihaelaGaman)
- New: CBT dataset #2044 (@gchhablani)
- New: MDD Dataset #2051 (@gchhablani)
- New: Multilingual dIalogAct benchMark (miam) #2047 (@eusip)
- New: bAbI QA tasks #2053 (@gchhablani)
- New: machine translated multilingual STS benchmark dataset #2090 (@PhilipMay)
- New: EURLEX legal NLP dataset #2114 (@iliaschalkidis)
- New: ECtHR legal NLP dataset #2114 (@iliaschalkidis)
- New: EU-REG-IR legal NLP dataset #2114 (@iliaschalkidis)
- New: NorNE dataset for Norwegian POS and NER #2154 (@versae)
- New: banking77 #2140 (@dkajtoch)
- New: OpenSLR #2173 #2215 #2221 (@cahya-wirawan)
- New: CUAD dataset #2219 (@bhavitvyamalik)
- Update: GEM V1.1 + new challenge sets #2142 #2186 (@yjernite)
- Update: Wikiann - added spans field #2141 (@rabeehk)
- Update: XTREME - Add tel to xtreme tatoeba #2180 (@lhoestq)
- Update: GLUE MRPC - added real label to test set #2216 (@philschmid)
- Fix: MultiWoz22 - fix dialogue action slot name and value #2136 (@adamlin120)
- Fix: wikiauto - fix link #2171 (@mounicam)
- Fix: wino_bias - use right splits #1930 (@JieyuZhao)
- Fix: lc_quad - update download checksum #2213 (@mariosasko)
- Fix: newsgroup - fix one instance of 'train' to 'test' #2225 (@alexwdong)
- Fix: xnli - fix tuple key #2233 (@NikhilBartwal)
Dataset features
- Allow stateful function in dataset.map #1960 (@mariosasko)
- MIAM dataset - new citation details #2101 (@eusip)
- [Refactor] Use in-memory/memory-mapped/concatenation tables in Dataset #2025 (@lhoestq)
- Allow pickling of big in-memory tables #2150 (@lhoestq)
- updated user permissions based on umask #2086 #2157 (@bhavitvyamalik)
- Fast table queries with interpolation search #2122 (@lhoestq)
- Concat only unique fields in DatasetInfo.from_merge #2163 (@mariosasko)
- Implementation of class_encode_column #2184 #2227 (@SBrandeis)
- Add support for axis in concatenate datasets #2151 (@albertvillanova)
- Set default in-memory value depending on the dataset size #2182 (@albertvillanova)
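The "fast table queries with interpolation search" item (#2122) concerns finding which underlying chunk of a table holds a given row. A minimal sketch of the general technique in plain Python follows. It assumes chunk boundaries are kept as a sorted list of cumulative row offsets. The function name `find_chunk` and the data layout are hypothetical and are not taken from the library's code.

```python
def find_chunk(offsets, row):
    """Return i such that offsets[i] <= row < offsets[i + 1].

    `offsets` is a sorted list of cumulative row counts starting at 0,
    e.g. [0, 100, 250, 400] for chunks of 100, 150 and 150 rows.
    (Hypothetical layout, chosen for illustration only.)
    """
    if not 0 <= row < offsets[-1]:
        raise IndexError(f"row {row} out of range")
    lo, hi = 0, len(offsets) - 1
    # Invariant: offsets[lo] <= row < offsets[hi]
    while hi - lo > 1:
        # Interpolate: guess the position as if offsets grew linearly.
        span = offsets[hi] - offsets[lo]
        guess = lo + (row - offsets[lo]) * (hi - lo) // span
        if offsets[guess] > row:
            hi = guess
        elif row < offsets[guess + 1]:
            return guess
        else:
            lo = guess + 1
    return lo


offsets = [0, 100, 250, 400]
print(find_chunk(offsets, 249))  # chunk holding row 249
```

When chunks have similar sizes, the interpolated guess usually lands on or next to the right chunk, so the search needs fewer probes than a plain binary search.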
Metrics changes
- New: CER metric #2138 (@chutaklee)
- Update: WER - Compute metric iteratively #2111 (@albertvillanova)
- Update: seqeval - configurable options to seqeval metric #2204 (@marrodion)
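For readers unfamiliar with the metrics named above: character error rate (CER) and word error rate (WER) are both edit distance divided by reference length, counted over characters or words respectively. The sketch below shows the definition in plain Python, accumulating errors pair by pair in the spirit of the "compute iteratively" change. The helper names `edit_distance` and `error_rate` are illustrative assumptions, not the library's metric implementation.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(references, predictions, unit="char"):
    """CER (unit="char") or WER (unit="word") over a batch of pairs.

    Errors and reference lengths are accumulated one pair at a time,
    so no concatenated copy of the whole corpus is built.
    """
    errors = total = 0
    for ref, hyp in zip(references, predictions):
        if unit == "word":
            ref, hyp = ref.split(), hyp.split()
        errors += edit_distance(ref, hyp)
        total += len(ref)
    return errors / total


print(error_rate(["kitten"], ["sitting"]))  # 3 edits / 6 reference chars = 0.5
```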
Dataset cards
- REFreSD: Updated card using information from data statement and datasheet #2082 (@mcmillanmajora)
- Winobias: fix split infos #2152 (@JieyuZhao)
- all: Fix size categories in YAML Tags #2074 (@gchhablani)
- LinCE: Updating citation information on LinCE readme #2205 (@gaguilar)
- Swda: Update README.md #2235 (@PierreColombo)
General improvements and bug fixes
- Refactorize Metric.compute signature to force keyword arguments only #2079 (@albertvillanova)
- Fix max_wait_time in requests #2085 (@lhoestq)
- Fix copy snippet in docs #2091 (@mariosasko)
- Fix deprecated warning message and docstring #2100 (@albertvillanova)
- Move Dataset.to_csv to csv module #2102 (@albertvillanova)
- Fix: Allows a feature to be named "_type" #2093 (@dcfidalgo)
- copy.deepcopy os.environ instead of copy #2119 (@NihalHarish)
- Replace legacy torch.Tensor constructor with torch.tensor #2126 (@mariosasko)
- Implement Dataset as context manager #2113 (@albertvillanova)
- Fix missing infos from concurrent dataset loading #2137 (@lhoestq)
- Pin fsspec lower than 0.9.0 #2172 (@lhoestq)
- Replace assertTrue(isinstance with assertIsInstance in tests #2164 (@mariosasko)
- add social thumbnail #2177 (@philschmid)
- Fix s3fs tests for py36 and py37+ #2183 (@lhoestq)
- Fix typo in huggingface hub #2192 (@LysandreJik)
- Update metadata if dataset features are modified #2087 (@mariosasko)
- fix missing indices_files in load_from_disk #2197 (@lhoestq)
- Fix backward compatibility in Dataset.load_from_disk #2199 (@albertvillanova)
- Fix ArrowWriter overwriting features in ArrowBasedBuilder #2201 (@lhoestq)
- Fix incorrect assertion in builder.py #2110 (@dreamgonfly)
- Remove Python2 leftovers #2208 (@mariosasko)
- Revert breaking change in cache_files property #2217 (@lhoestq)
- Set test cache config #2223 (@albertvillanova)
- Fix map when removing columns on a formatted dataset #2231 (@lhoestq)
- Refactorize tests to use Dataset as context manager #2191 (@albertvillanova)
- Preserve split type when reloading dataset #2168 (@mariosasko)
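Two items in this list change how objects are called or used: keyword-only arguments for `Metric.compute` (#2079) and context-manager support for `Dataset` (#2113). The sketch below illustrates the two underlying Python mechanisms with hypothetical stand-in classes (`ToyMetric`, `ToyResource`). It does not reproduce the library's classes or signatures.

```python
class ToyMetric:
    """Stand-in showing a keyword-only `compute` signature."""

    def compute(self, *, predictions, references):
        # The bare `*` rejects positional arguments, so callers cannot
        # accidentally swap predictions and references.
        matches = sum(p == r for p, r in zip(predictions, references))
        return {"accuracy": matches / len(references)}


class ToyResource:
    """Stand-in showing the context-manager protocol."""

    def __init__(self):
        self.closed = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Cleanup runs even if the `with` body raises.
        self.closed = True
        return False  # do not swallow exceptions


metric = ToyMetric()
print(metric.compute(predictions=[1, 0, 1], references=[1, 1, 1]))

with ToyResource() as res:
    print(res.closed)  # still open inside the block
print(res.closed)      # released after the block
```

Calling `metric.compute([1, 0, 1], [1, 1, 1])` positionally raises `TypeError`, which is the point of the keyword-only form.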
Docs
- make documentation clearer on using different cloud storage #2127 (@philschmid)
- Render docstring return type as inline #2147 (@albertvillanova)
- Add table classes to the documentation #2155 (@lhoestq)
- Pin docutils for better doc #2174 (@sgugger)
- Fix docstring issues #2081 (@albertvillanova)
- Add code of conduct to the project #2209 (@albertvillanova)
- Add classes GenerateMode, DownloadConfig and Version to the documentation #2202 (@albertvillanova)
- Fix bash snippet formatting in ADD_NEW_DATASET.md #2234 (@mariosasko)