
Speed up BytesRefHash#sort #12775

Closed
wants to merge 7 commits

Conversation

@gf2121 (Contributor) commented Nov 6, 2023

Description

I noticed in #91 that StableRadixSorter was faster than RadixSorter, and it became even faster after we switched from InPlaceMergeSorter to a MergeSorter in #12652. This PR proposes using a StableRadixSorter instead of the RadixSorter for BytesRefHash, taking advantage of the fact that BytesRefHash always guarantees 2x space for the ids.
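To illustrate the idea, here is a minimal, self-contained sketch (not this PR's actual patch). A stable MSB radix sorter needs temporary storage in order to reorder entries stably, and the spare half of the ids buffer that BytesRefHash already reserves can play that role. The sketch assumes Lucene's StableMSBRadixSorter (presumably the StableRadixSorter referred to above) with its save/restore hooks; the names sortIds, ids, scratch and terms are illustrative only.

```java
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.StableMSBRadixSorter;

class StableIdSortSketch {

  /** Stably sort ids[0..count) by the unsigned byte order of their terms. */
  static void sortIds(final int[] ids, final int count, final BytesRef[] terms) {
    // Temporary slots for the stable reordering; in BytesRefHash this role could
    // be played by the unused half of the (2x sized) ids array.
    final int[] scratch = new int[count];

    new StableMSBRadixSorter(Integer.MAX_VALUE) {
      @Override
      protected void swap(int i, int j) {
        int tmp = ids[i];
        ids[i] = ids[j];
        ids[j] = tmp;
      }

      @Override
      protected void save(int i, int j) {
        // stash the i-th entry into temporary slot j
        scratch[j] = ids[i];
      }

      @Override
      protected void restore(int i, int j) {
        // copy the stably reordered slice [i, j) back into the original array
        System.arraycopy(scratch, i, ids, i, j - i);
      }

      @Override
      protected int byteAt(int i, int k) {
        // k-th byte of the term behind ids[i], or -1 once the term is exhausted
        BytesRef term = terms[ids[i]];
        return k < term.length ? term.bytes[term.offset + k] & 0xFF : -1;
      }
    }.sort(0, count);
  }
}
```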

In my benchmark, which indexes 16-byte UUIDs, the time to sort ~5,000,000 terms decreased by ~35% (a sketch of the harness follows the table).

| flush round | flush terms | radix sort (ms) | stable radix sort (ms) | diff |
|---|---|---|---|---|
| 1 | 5169965 | 2370 | 1572 | -33.67% |
| 2 | 5169965 | 2387 | 1492 | -37.49% |
| 3 | 5169965 | 2341 | 1544 | -34.05% |
| 4 | 5169965 | 2402 | 1446 | -39.80% |
| 5 | 5169965 | 2259 | 1471 | -34.88% |
| 6 | 5169965 | 2340 | 1516 | -35.21% |
| 7 | 5169965 | 2276 | 1465 | -35.63% |
| 8 | 5169965 | 2330 | 1495 | -35.84% |
| 9 | 5169965 | 2351 | 1570 | -33.22% |
| 10 | 5169965 | 2350 | 1572 | -33.11% |
| avg | 5169965 | 2340.6 | 1514.3 | -35.30% |
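For reference, a hypothetical harness in the spirit of the benchmark above (not the exact code used): fill a BytesRefHash with random 16-byte terms and time sort(). Toggling between the radix and stable radix sorters requires the patched BytesRefHash from this PR; a stock build simply times whichever sorter BytesRefHash currently uses.

```java
import java.util.Random;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.BytesRefHash;

public class SortBench {
  public static void main(String[] args) {
    Random random = new Random(42);
    BytesRefHash hash = new BytesRefHash();
    byte[] term = new byte[16];
    for (int i = 0; i < 5_000_000; i++) {
      random.nextBytes(term);          // stand-in for 16-byte UUID terms
      hash.add(new BytesRef(term));    // BytesRefHash copies the bytes into its pool
    }
    long start = System.nanoTime();
    int[] sortedIds = hash.sort();     // the call this PR speeds up
    long tookMs = (System.nanoTime() - start) / 1_000_000;
    System.out.println("sort " + sortedIds.length + " terms, took: " + tookMs + "ms");
  }
}
```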

@gf2121 (Contributor Author) commented Nov 7, 2023

Update:

  • Make MergeSorter take advantage of setPivot to reduce decoding (implementing setPivot should help binarySort too); a sketch of the idea follows this list.
  • Fix the problem that StringSorter#getFallBackSorter did not use the passed-in comparator, which can skip comparing the first k bytes.
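A rough illustration of the setPivot idea from the first item (hypothetical code, not this PR's diff): Lucene's Sorter exposes setPivot/comparePivot hooks, so a string sorter can decode the pivot term once into a reusable buffer instead of re-fetching it on every comparison during binarySort or merging. The base class InPlaceMergeSorter is chosen only to keep the sketch compilable, and get(i) stands in for however the i-th term is actually fetched (e.g. from the byte pool).

```java
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.BytesRefBuilder;
import org.apache.lucene.util.InPlaceMergeSorter;

/** Illustrative only: sorts entries by term, decoding the pivot term once per setPivot. */
abstract class PivotCachingSorter extends InPlaceMergeSorter {

  private final BytesRefBuilder pivot = new BytesRefBuilder();

  /** Fetch the term of the i-th entry (placeholder for pool/hash access). */
  protected abstract BytesRef get(int i);

  @Override
  protected int compare(int i, int j) {
    // plain comparison still decodes both sides
    return get(i).compareTo(get(j));
  }

  @Override
  protected void setPivot(int i) {
    // decode the pivot term once, up front
    pivot.copyBytes(get(i));
  }

  @Override
  protected int comparePivot(int j) {
    // every probe against the pivot reuses the cached bytes
    return pivot.get().compareTo(get(j));
  }
}
```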

New benchmark result:

| flush round | flush terms | radix sort (ms) | stable radix sort (ms) | diff |
|---|---|---|---|---|
| 1 | 5169965 | 2328 | 1426 | -38.75% |
| 2 | 5169965 | 2269 | 1448 | -36.18% |
| 3 | 5169965 | 2314 | 1460 | -36.91% |
| 4 | 5169965 | 2306 | 1408 | -38.94% |
| 5 | 5169965 | 2253 | 1399 | -37.91% |
| 6 | 5169965 | 2329 | 1377 | -40.88% |
| 7 | 5169965 | 2211 | 1366 | -38.22% |
| 8 | 5169965 | 2363 | 1425 | -39.70% |
| 9 | 5169965 | 2233 | 1389 | -37.80% |
| 10 | 5169965 | 2294 | 1367 | -40.41% |
| avg | 5169965 | 2290 | 1406.5 | -38.58% |

@gf2121 (Contributor Author) commented Nov 7, 2023

For more realistic data, I indexed wikimedium10m with ramBuffer = 1024m and accumulated the time taken by all invocations of BytesRefHash#sort. The total decreased from 27161ms to 12203ms, which is much better than I expected.
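For context, the relevant indexing setting would look roughly like this (a generic sketch, not luceneutil's actual configuration): flush by RAM usage with a large 1024 MB buffer, so each flush sorts a large BytesRefHash and the cost of BytesRefHash#sort is easy to observe.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;

// Flush only when the 1024 MB RAM buffer fills up, never by document count.
IndexWriterConfig config =
    new IndexWriterConfig(new StandardAnalyzer())
        .setRAMBufferSizeMB(1024)
        .setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH);
```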

@dweiss (Contributor) commented Nov 7, 2023

Wow. Nice improvement!

@mikemccand (Member) commented

This is an incredible speedup :)

@gf2121 (Contributor Author) commented Nov 8, 2023

Something odd: I happened to run the benchmark on another Mac with an Intel chip and the result is disappointing (no obvious improvement):

use stable sort: false, sort 5169965 terms, took: 5377ms
use stable sort: false, sort 5169965 terms, took: 5293ms
use stable sort: false, sort 5169965 terms, took: 5347ms
use stable sort: false, sort 5169965 terms, took: 5285ms
use stable sort: false, sort 5169965 terms, took: 5307ms
use stable sort: false, sort 5169965 terms, took: 5373ms
use stable sort: false, sort 5169965 terms, took: 5416ms
use stable sort: false, sort 5169965 terms, took: 5321ms
use stable sort: false, sort 5169965 terms, took: 5343ms
use stable sort: false, sort 5169965 terms, took: 5278ms
use stable sort: false, sort 5169965 terms, took: 5323ms

use stable sort: true, sort 5169965 terms, took: 5261ms
use stable sort: true, sort 5169965 terms, took: 5329ms
use stable sort: true, sort 5169965 terms, took: 5304ms
use stable sort: true, sort 5169965 terms, took: 5268ms
use stable sort: true, sort 5169965 terms, took: 5331ms
use stable sort: true, sort 5169965 terms, took: 5306ms
use stable sort: true, sort 5169965 terms, took: 5232ms
use stable sort: true, sort 5169965 terms, took: 5245ms
use stable sort: true, sort 5169965 terms, took: 5248ms
use stable sort: true, sort 5169965 terms, took: 5218ms
use stable sort: true, sort 5169965 terms, took: 5193ms

But I can easily reproduce the speedup on the M2 Mac that produced all the benchmark results above:

use stable sort: false, sort 5169965 terms, took: 2340ms
use stable sort: false, sort 5169965 terms, took: 2197ms
use stable sort: false, sort 5169965 terms, took: 2288ms
use stable sort: false, sort 5169965 terms, took: 2320ms
use stable sort: false, sort 5169965 terms, took: 2227ms
use stable sort: false, sort 5169965 terms, took: 2238ms
use stable sort: false, sort 5169965 terms, took: 2256ms
use stable sort: false, sort 5169965 terms, took: 2287ms
use stable sort: false, sort 5169965 terms, took: 2244ms
use stable sort: false, sort 5169965 terms, took: 2240ms
use stable sort: false, sort 5169965 terms, took: 2320ms

use stable sort: true, sort 5169965 terms, took: 1418ms
use stable sort: true, sort 5169965 terms, took: 1303ms
use stable sort: true, sort 5169965 terms, took: 1290ms
use stable sort: true, sort 5169965 terms, took: 1349ms
use stable sort: true, sort 5169965 terms, took: 1293ms
use stable sort: true, sort 5169965 terms, took: 1329ms
use stable sort: true, sort 5169965 terms, took: 1332ms
use stable sort: true, sort 5169965 terms, took: 1307ms
use stable sort: true, sort 5169965 terms, took: 1280ms
use stable sort: true, sort 5169965 terms, took: 1336ms
use stable sort: true, sort 5169965 terms, took: 1302ms

@gf2121 (Contributor Author) commented Nov 8, 2023

I checked Linux x86 as well: no obvious speedup there either. Maybe the stable sort somehow helps the ARM CPU cache:

use stable sort: false, sort 5169965 terms, took: 4900ms
use stable sort: false, sort 5169965 terms, took: 5373ms
use stable sort: false, sort 5169965 terms, took: 4998ms
use stable sort: false, sort 5169965 terms, took: 5345ms
use stable sort: false, sort 5169965 terms, took: 5051ms
use stable sort: false, sort 5169965 terms, took: 4906ms
use stable sort: false, sort 5169965 terms, took: 4869ms
use stable sort: false, sort 5169965 terms, took: 4840ms
use stable sort: false, sort 5169965 terms, took: 4816ms
use stable sort: false, sort 5169965 terms, took: 4760ms
use stable sort: false, sort 5169965 terms, took: 4903ms

use stable sort: true, sort 5169965 terms, took: 4883ms
use stable sort: true, sort 5169965 terms, took: 4852ms
use stable sort: true, sort 5169965 terms, took: 4920ms
use stable sort: true, sort 5169965 terms, took: 4863ms
use stable sort: true, sort 5169965 terms, took: 4836ms
use stable sort: true, sort 5169965 terms, took: 4917ms
use stable sort: true, sort 5169965 terms, took: 4836ms
use stable sort: true, sort 5169965 terms, took: 4900ms
use stable sort: true, sort 5169965 terms, took: 4888ms
use stable sort: true, sort 5169965 terms, took: 4921ms
use stable sort: true, sort 5169965 terms, took: 4835ms

@dweiss (Contributor) commented Nov 8, 2023

Maybe Apple chips are tuned for sorting (they need to sort out what to do with all these income bills, after all)? :)

And seriously - thank you for checking on different hardware. There may be no single solution to fit them all - as the PR with vectorization benchmarks clearly shows...

@jpountz (Contributor) commented Nov 8, 2023

FWIW I think it's fine if this yields a speedup on some platforms but not other platforms. From your results, it doesn't seem to slow down flushing either?

@gf2121 (Contributor Author) commented Nov 8, 2023

> it's fine if this yields a speedup on some platforms but not other platforms

+1

> it doesn't seem to slow down flushing either?

Yes! Nothing got slowed down.

@gf2121 (Contributor Author) commented Nov 8, 2023

I came up with #12784 as another idea to speed up BytesRefHash#sort; it has shown performance improvements on Intel chips as well.

@gf2121 (Contributor Author) commented Nov 10, 2023

Closing this in favor of #12784.

@gf2121 gf2121 closed this Nov 10, 2023