Dill 3.7 support #6061

mariosasko · 2023-07-24T12:33:58Z

Adds support for dill 0.3.7.

HuggingFaceDocBuilderDev · 2023-07-24T12:41:05Z

The documentation is not available anymore as the PR was closed or merged.

github-actions · 2023-07-24T13:46:28Z

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007700 / 0.011353 (-0.003653)	0.004680 / 0.011008 (-0.006328)	0.098812 / 0.038508 (0.060304)	0.085062 / 0.023109 (0.061952)	0.371472 / 0.275898 (0.095574)	0.412552 / 0.323480 (0.089072)	0.004700 / 0.007986 (-0.003285)	0.003765 / 0.004328 (-0.000564)	0.074267 / 0.004250 (0.070017)	0.063003 / 0.037052 (0.025951)	0.391842 / 0.258489 (0.133353)	0.436955 / 0.293841 (0.143114)	0.035291 / 0.128546 (-0.093255)	0.009309 / 0.075646 (-0.066338)	0.313097 / 0.419271 (-0.106174)	0.060098 / 0.043533 (0.016565)	0.350726 / 0.255139 (0.095587)	0.402692 / 0.283200 (0.119493)	0.029321 / 0.141683 (-0.112361)	1.671806 / 1.452155 (0.219651)	1.743760 / 1.492716 (0.251044)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.242281 / 0.018006 (0.224275)	0.505054 / 0.000490 (0.504564)	0.006595 / 0.000200 (0.006395)	0.000091 / 0.000054 (0.000037)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.032174 / 0.037411 (-0.005238)	0.094483 / 0.014526 (0.079957)	0.108527 / 0.176557 (-0.068030)	0.178983 / 0.737135 (-0.558152)	0.113766 / 0.296338 (-0.182572)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.419764 / 0.215209 (0.204555)	4.282650 / 2.077655 (2.204995)	2.075325 / 1.504120 (0.571205)	1.897668 / 1.541195 (0.356473)	2.027109 / 1.468490 (0.558619)	0.519983 / 4.584777 (-4.064794)	4.134603 / 3.745712 (0.388891)	6.586711 / 5.269862 (1.316849)	3.811726 / 4.565676 (-0.753951)	0.058628 / 0.424275 (-0.365647)	0.007586 / 0.007607 (-0.000021)	0.502180 / 0.226044 (0.276136)	5.101588 / 2.268929 (2.832660)	2.534295 / 55.444624 (-52.910330)	2.220170 / 6.876477 (-4.656307)	2.441110 / 2.142072 (0.299038)	0.644775 / 4.805227 (-4.160452)	0.144716 / 6.500664 (-6.355948)	0.067018 / 0.075469 (-0.008451)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.431279 / 1.841788 (-0.410508)	21.947814 / 8.074308 (13.873506)	15.548236 / 10.191392 (5.356844)	0.174774 / 0.680424 (-0.505650)	0.021182 / 0.534201 (-0.513019)	0.441320 / 0.579283 (-0.137963)	0.476685 / 0.434364 (0.042321)	0.506277 / 0.540337 (-0.034060)	0.809943 / 1.386936 (-0.576993)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007172 / 0.011353 (-0.004181)	0.004358 / 0.011008 (-0.006650)	0.068604 / 0.038508 (0.030096)	0.083956 / 0.023109 (0.060847)	0.402579 / 0.275898 (0.126681)	0.444714 / 0.323480 (0.121235)	0.005940 / 0.007986 (-0.002046)	0.003607 / 0.004328 (-0.000722)	0.073134 / 0.004250 (0.068883)	0.061722 / 0.037052 (0.024669)	0.410957 / 0.258489 (0.152468)	0.458819 / 0.293841 (0.164978)	0.033710 / 0.128546 (-0.094836)	0.010230 / 0.075646 (-0.065417)	0.084678 / 0.419271 (-0.334593)	0.058203 / 0.043533 (0.014670)	0.444972 / 0.255139 (0.189833)	0.470962 / 0.283200 (0.187763)	0.029222 / 0.141683 (-0.112461)	1.671460 / 1.452155 (0.219306)	1.759471 / 1.492716 (0.266754)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.238894 / 0.018006 (0.220888)	0.493605 / 0.000490 (0.493115)	0.001979 / 0.000200 (0.001780)	0.000084 / 0.000054 (0.000030)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.036498 / 0.037411 (-0.000913)	0.095245 / 0.014526 (0.080719)	0.112147 / 0.176557 (-0.064409)	0.171128 / 0.737135 (-0.566007)	0.115295 / 0.296338 (-0.181044)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.461067 / 0.215209 (0.245858)	4.723932 / 2.077655 (2.646277)	2.432697 / 1.504120 (0.928578)	2.237302 / 1.541195 (0.696107)	2.351320 / 1.468490 (0.882830)	0.509963 / 4.584777 (-4.074813)	4.194817 / 3.745712 (0.449105)	6.689529 / 5.269862 (1.419667)	3.351198 / 4.565676 (-1.214478)	0.064563 / 0.424275 (-0.359712)	0.008605 / 0.007607 (0.000998)	0.575590 / 0.226044 (0.349546)	5.644179 / 2.268929 (3.375250)	3.021375 / 55.444624 (-52.423249)	2.595305 / 6.876477 (-4.281172)	2.839228 / 2.142072 (0.697156)	0.657148 / 4.805227 (-4.148079)	0.144831 / 6.500664 (-6.355834)	0.067882 / 0.075469 (-0.007587)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.595580 / 1.841788 (-0.246208)	22.431609 / 8.074308 (14.357301)	15.700845 / 10.191392 (5.509453)	0.164675 / 0.680424 (-0.515749)	0.021322 / 0.534201 (-0.512879)	0.455270 / 0.579283 (-0.124013)	0.451547 / 0.434364 (0.017183)	0.520955 / 0.540337 (-0.019383)	0.687803 / 1.386936 (-0.699133)

lhoestq

LGTM :)

github-actions · 2023-07-24T13:51:07Z

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008171 / 0.011353 (-0.003182)	0.005563 / 0.011008 (-0.005445)	0.102265 / 0.038508 (0.063757)	0.074755 / 0.023109 (0.051646)	0.431317 / 0.275898 (0.155419)	0.472179 / 0.323480 (0.148699)	0.006153 / 0.007986 (-0.001833)	0.003832 / 0.004328 (-0.000496)	0.078480 / 0.004250 (0.074230)	0.056250 / 0.037052 (0.019197)	0.432938 / 0.258489 (0.174449)	0.480983 / 0.293841 (0.187142)	0.048861 / 0.128546 (-0.079685)	0.016252 / 0.075646 (-0.059394)	0.343508 / 0.419271 (-0.075763)	0.065057 / 0.043533 (0.021524)	0.468418 / 0.255139 (0.213279)	0.463692 / 0.283200 (0.180492)	0.032912 / 0.141683 (-0.108771)	1.795194 / 1.452155 (0.343039)	1.833047 / 1.492716 (0.340331)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.197980 / 0.018006 (0.179974)	0.500662 / 0.000490 (0.500172)	0.007380 / 0.000200 (0.007181)	0.000110 / 0.000054 (0.000055)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.028323 / 0.037411 (-0.009089)	0.089817 / 0.014526 (0.075291)	0.102923 / 0.176557 (-0.073633)	0.173851 / 0.737135 (-0.563284)	0.104006 / 0.296338 (-0.192333)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.580277 / 0.215209 (0.365068)	5.878739 / 2.077655 (3.801085)	2.404673 / 1.504120 (0.900553)	2.071765 / 1.541195 (0.530571)	2.106024 / 1.468490 (0.637534)	0.855217 / 4.584777 (-3.729560)	4.918602 / 3.745712 (1.172890)	5.354984 / 5.269862 (0.085122)	3.141288 / 4.565676 (-1.424389)	0.099553 / 0.424275 (-0.324723)	0.008152 / 0.007607 (0.000545)	0.709857 / 0.226044 (0.483813)	7.144602 / 2.268929 (4.875673)	3.137637 / 55.444624 (-52.306987)	2.379851 / 6.876477 (-4.496626)	2.346426 / 2.142072 (0.204353)	1.033416 / 4.805227 (-3.771811)	0.213120 / 6.500664 (-6.287544)	0.076037 / 0.075469 (0.000568)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.597742 / 1.841788 (-0.244046)	21.745366 / 8.074308 (13.671058)	20.830698 / 10.191392 (10.639306)	0.238727 / 0.680424 (-0.441697)	0.027923 / 0.534201 (-0.506278)	0.466073 / 0.579283 (-0.113210)	0.548647 / 0.434364 (0.114283)	0.549245 / 0.540337 (0.008908)	0.977148 / 1.386936 (-0.409788)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008252 / 0.011353 (-0.003101)	0.004653 / 0.011008 (-0.006356)	0.084012 / 0.038508 (0.045504)	0.077418 / 0.023109 (0.054309)	0.440748 / 0.275898 (0.164850)	0.464279 / 0.323480 (0.140799)	0.005762 / 0.007986 (-0.002224)	0.004909 / 0.004328 (0.000581)	0.086441 / 0.004250 (0.082190)	0.057883 / 0.037052 (0.020831)	0.466655 / 0.258489 (0.208166)	0.479751 / 0.293841 (0.185910)	0.047166 / 0.128546 (-0.081380)	0.014480 / 0.075646 (-0.061166)	0.092599 / 0.419271 (-0.326672)	0.062454 / 0.043533 (0.018921)	0.449753 / 0.255139 (0.194614)	0.461876 / 0.283200 (0.178676)	0.034828 / 0.141683 (-0.106855)	1.752249 / 1.452155 (0.300095)	1.865449 / 1.492716 (0.372732)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.245028 / 0.018006 (0.227022)	0.509564 / 0.000490 (0.509074)	0.003930 / 0.000200 (0.003730)	0.000110 / 0.000054 (0.000056)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.034746 / 0.037411 (-0.002665)	0.096563 / 0.014526 (0.082037)	0.107581 / 0.176557 (-0.068975)	0.184952 / 0.737135 (-0.552184)	0.108747 / 0.296338 (-0.187591)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.613091 / 0.215209 (0.397882)	5.994985 / 2.077655 (3.917330)	2.711276 / 1.504120 (1.207156)	2.415862 / 1.541195 (0.874668)	2.391055 / 1.468490 (0.922565)	0.868723 / 4.584777 (-3.716054)	4.953992 / 3.745712 (1.208280)	4.606542 / 5.269862 (-0.663319)	2.942162 / 4.565676 (-1.623515)	0.102737 / 0.424275 (-0.321538)	0.008634 / 0.007607 (0.001027)	0.722122 / 0.226044 (0.496078)	7.245097 / 2.268929 (4.976168)	3.428232 / 55.444624 (-52.016393)	2.709539 / 6.876477 (-4.166938)	2.857956 / 2.142072 (0.715884)	1.045594 / 4.805227 (-3.759634)	0.213344 / 6.500664 (-6.287320)	0.073601 / 0.075469 (-0.001868)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.651954 / 1.841788 (-0.189834)	22.458646 / 8.074308 (14.384338)	19.583203 / 10.191392 (9.391811)	0.246932 / 0.680424 (-0.433492)	0.025730 / 0.534201 (-0.508471)	0.473475 / 0.579283 (-0.105808)	0.521411 / 0.434364 (0.087047)	0.562038 / 0.540337 (0.021700)	0.767673 / 1.386936 (-0.619263)

mariosasko · 2023-07-24T14:04:27Z

The CI error is unrelated.

github-actions · 2023-07-24T14:13:19Z

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006649 / 0.011353 (-0.004703)	0.003963 / 0.011008 (-0.007045)	0.084564 / 0.038508 (0.046056)	0.075668 / 0.023109 (0.052559)	0.314233 / 0.275898 (0.038335)	0.343320 / 0.323480 (0.019841)	0.005405 / 0.007986 (-0.002581)	0.003356 / 0.004328 (-0.000973)	0.065094 / 0.004250 (0.060844)	0.058774 / 0.037052 (0.021722)	0.320772 / 0.258489 (0.062283)	0.353546 / 0.293841 (0.059705)	0.030921 / 0.128546 (-0.097625)	0.008463 / 0.075646 (-0.067184)	0.287490 / 0.419271 (-0.131781)	0.053188 / 0.043533 (0.009656)	0.324023 / 0.255139 (0.068884)	0.337828 / 0.283200 (0.054628)	0.024764 / 0.141683 (-0.116918)	1.458028 / 1.452155 (0.005873)	1.521615 / 1.492716 (0.028899)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.209360 / 0.018006 (0.191353)	0.461331 / 0.000490 (0.460841)	0.000386 / 0.000200 (0.000186)	0.000052 / 0.000054 (-0.000002)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.028405 / 0.037411 (-0.009006)	0.081074 / 0.014526 (0.066548)	0.094868 / 0.176557 (-0.081689)	0.151050 / 0.737135 (-0.586085)	0.095854 / 0.296338 (-0.200484)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.393957 / 0.215209 (0.178748)	3.938649 / 2.077655 (1.860994)	1.938190 / 1.504120 (0.434070)	1.766458 / 1.541195 (0.225263)	1.818028 / 1.468490 (0.349538)	0.483926 / 4.584777 (-4.100851)	3.641957 / 3.745712 (-0.103755)	4.883845 / 5.269862 (-0.386016)	2.960300 / 4.565676 (-1.605377)	0.057227 / 0.424275 (-0.367048)	0.007285 / 0.007607 (-0.000322)	0.475928 / 0.226044 (0.249884)	4.756757 / 2.268929 (2.487828)	2.502659 / 55.444624 (-52.941966)	2.178067 / 6.876477 (-4.698410)	2.378298 / 2.142072 (0.236226)	0.578639 / 4.805227 (-4.226588)	0.132512 / 6.500664 (-6.368152)	0.059656 / 0.075469 (-0.015813)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.272673 / 1.841788 (-0.569115)	19.266884 / 8.074308 (11.192576)	14.272930 / 10.191392 (4.081538)	0.165897 / 0.680424 (-0.514527)	0.018436 / 0.534201 (-0.515765)	0.395177 / 0.579283 (-0.184107)	0.420134 / 0.434364 (-0.014229)	0.460781 / 0.540337 (-0.079557)	0.645376 / 1.386936 (-0.741560)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006504 / 0.011353 (-0.004849)	0.003942 / 0.011008 (-0.007066)	0.064936 / 0.038508 (0.026428)	0.075015 / 0.023109 (0.051905)	0.396871 / 0.275898 (0.120973)	0.423448 / 0.323480 (0.099968)	0.005239 / 0.007986 (-0.002747)	0.003265 / 0.004328 (-0.001063)	0.064910 / 0.004250 (0.060660)	0.055006 / 0.037052 (0.017953)	0.392818 / 0.258489 (0.134329)	0.429735 / 0.293841 (0.135894)	0.031847 / 0.128546 (-0.096699)	0.008626 / 0.075646 (-0.067021)	0.071591 / 0.419271 (-0.347681)	0.049006 / 0.043533 (0.005473)	0.384913 / 0.255139 (0.129774)	0.408969 / 0.283200 (0.125769)	0.023573 / 0.141683 (-0.118110)	1.490271 / 1.452155 (0.038117)	1.564620 / 1.492716 (0.071904)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.225917 / 0.018006 (0.207911)	0.450369 / 0.000490 (0.449880)	0.000375 / 0.000200 (0.000175)	0.000055 / 0.000054 (0.000000)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.031196 / 0.037411 (-0.006215)	0.090486 / 0.014526 (0.075960)	0.102326 / 0.176557 (-0.074231)	0.157483 / 0.737135 (-0.579653)	0.103670 / 0.296338 (-0.192668)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.417577 / 0.215209 (0.202368)	4.170798 / 2.077655 (2.093143)	2.123689 / 1.504120 (0.619569)	1.948231 / 1.541195 (0.407037)	2.040277 / 1.468490 (0.571787)	0.497919 / 4.584777 (-4.086858)	3.633270 / 3.745712 (-0.112442)	4.851698 / 5.269862 (-0.418164)	2.691992 / 4.565676 (-1.873684)	0.058641 / 0.424275 (-0.365634)	0.007719 / 0.007607 (0.000112)	0.500652 / 0.226044 (0.274607)	4.988657 / 2.268929 (2.719728)	2.604488 / 55.444624 (-52.840136)	2.329829 / 6.876477 (-4.546648)	2.468239 / 2.142072 (0.326167)	0.598724 / 4.805227 (-4.206503)	0.135959 / 6.500664 (-6.364706)	0.061088 / 0.075469 (-0.014381)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.352107 / 1.841788 (-0.489681)	19.973976 / 8.074308 (11.899668)	14.292812 / 10.191392 (4.101420)	0.163855 / 0.680424 (-0.516568)	0.018402 / 0.534201 (-0.515799)	0.393128 / 0.579283 (-0.186155)	0.407379 / 0.434364 (-0.026985)	0.462324 / 0.540337 (-0.078013)	0.607501 / 1.386936 (-0.779435)
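
Each cell in the benchmark tables above has the form `new / old (diff)`, and the printed values are consistent with `diff = new - old`. Below is a minimal sketch for parsing one tab-separated result row into numbers, which makes the tables easier to compare across runs. The names `parse_cell` and `parse_row` are hypothetical helpers written for this illustration; they are not part of the benchmark tooling used by the bot.

```python
import re

# Matches one cell such as "0.007700 / 0.011353 (-0.003653)".
CELL_RE = re.compile(r"^\s*(-?[\d.]+)\s*/\s*(-?[\d.]+)\s*\((-?[\d.]+)\)\s*$")


def parse_cell(cell):
    """Return (new, old, diff) as floats for a single 'new / old (diff)' cell."""
    match = CELL_RE.match(cell)
    if match is None:
        raise ValueError(f"unrecognized benchmark cell: {cell!r}")
    new, old, diff = (float(group) for group in match.groups())
    return new, old, diff


def parse_row(line):
    """Parse a tab-separated result row; the first field is the row label."""
    _label, *cells = line.rstrip("\n").split("\t")
    return [parse_cell(cell) for cell in cells]


# Example: the first two cells of a benchmark_getitem_100B.json row.
row = parse_row(
    "new / old (diff)\t0.209360 / 0.018006 (0.191353)\t0.461331 / 0.000490 (0.460841)"
)
for new, old, diff in row:
    # The reported diff should match new - old up to rounding in the last digit.
    assert abs((new - old) - diff) < 2e-6
```

Pairing the parsed tuples with the `metric` header row (split on tabs in the same way) gives a per-metric view of each run.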

mariosasko added 2 commits July 24, 2023 14:15

Dill 3.7 support

9dacf1c

Nit

87d0422

lhoestq reviewed Jul 24, 2023

setup.py (outdated; resolved)

mariosasko and others added 2 commits July 24, 2023 15:36

Update setup.py

7d19574

Co-authored-by: Quentin Lhoest

Nit

3869d99

lhoestq approved these changes Jul 24, 2023

mariosasko merged commit ae126ac into main Jul 24, 2023

mariosasko deleted the support-dill37 branch July 24, 2023 14:04
