
Speed up BytesRefHash#sort #12775

Closed
wants to merge 7 commits

Conversation

@gf2121 (Contributor) commented Nov 6, 2023

Description

I noticed in #91 that StableRadixSorter was faster than RadixSorter, and it became even faster after we switched from InPlaceMergeSorter to a MergeSorter in #12652. This PR proposes using a StableRadixSorter instead of the RadixSorter for BytesRefHash, taking advantage of the fact that BytesRefHash always guarantees 2x space for the ids.
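To illustrate the idea, here is a minimal, self-contained sketch (not this PR's actual patch). A stable MSB radix sorter needs temporary storage in order to reorder entries stably, and the spare half of the ids buffer that BytesRefHash already reserves can play that role. The sketch assumes Lucene's StableMSBRadixSorter (presumably the StableRadixSorter referred to above) with its save/restore hooks; the names sortIds, ids, scratch and terms are illustrative only.

```java
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.StableMSBRadixSorter;

class StableIdSortSketch {

  /** Stably sort ids[0..count) by the unsigned byte order of their terms. */
  static void sortIds(final int[] ids, final int count, final BytesRef[] terms) {
    // Temporary slots for the stable reordering; in BytesRefHash this role could
    // be played by the unused half of the (2x sized) ids array.
    final int[] scratch = new int[count];

    new StableMSBRadixSorter(Integer.MAX_VALUE) {
      @Override
      protected void swap(int i, int j) {
        int tmp = ids[i];
        ids[i] = ids[j];
        ids[j] = tmp;
      }

      @Override
      protected void save(int i, int j) {
        // stash the i-th entry into temporary slot j
        scratch[j] = ids[i];
      }

      @Override
      protected void restore(int i, int j) {
        // copy the stably reordered slice [i, j) back into the original array
        System.arraycopy(scratch, i, ids, i, j - i);
      }

      @Override
      protected int byteAt(int i, int k) {
        // k-th byte of the term behind ids[i], or -1 once the term is exhausted
        BytesRef term = terms[ids[i]];
        return k < term.length ? term.bytes[term.offset + k] & 0xFF : -1;
      }
    }.sort(0, count);
  }
}
```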

In my benchmark, which indexes 16-byte UUIDs, the time to sort ~5,000,000 terms decreased by ~35% (a sketch of the harness follows the table).

| flush round | flush terms | radix sort (ms) | stable radix sort (ms) | diff |
|---|---|---|---|---|
| 1 | 5169965 | 2370 | 1572 | -33.67% |
| 2 | 5169965 | 2387 | 1492 | -37.49% |
| 3 | 5169965 | 2341 | 1544 | -34.05% |
| 4 | 5169965 | 2402 | 1446 | -39.80% |
| 5 | 5169965 | 2259 | 1471 | -34.88% |
| 6 | 5169965 | 2340 | 1516 | -35.21% |
| 7 | 5169965 | 2276 | 1465 | -35.63% |
| 8 | 5169965 | 2330 | 1495 | -35.84% |
| 9 | 5169965 | 2351 | 1570 | -33.22% |
| 10 | 5169965 | 2350 | 1572 | -33.11% |
| avg | 5169965 | 2340.6 | 1514.3 | -35.30% |
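For reference, a hypothetical harness in the spirit of the benchmark above (not the exact code used): fill a BytesRefHash with random 16-byte terms and time sort(). Toggling between the radix and stable radix sorters requires the patched BytesRefHash from this PR; a stock build simply times whichever sorter BytesRefHash currently uses.

```java
import java.util.Random;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.BytesRefHash;

public class SortBench {
  public static void main(String[] args) {
    Random random = new Random(42);
    BytesRefHash hash = new BytesRefHash();
    byte[] term = new byte[16];
    for (int i = 0; i < 5_000_000; i++) {
      random.nextBytes(term);          // stand-in for 16-byte UUID terms
      hash.add(new BytesRef(term));    // BytesRefHash copies the bytes into its pool
    }
    long start = System.nanoTime();
    int[] sortedIds = hash.sort();     // the call this PR speeds up
    long tookMs = (System.nanoTime() - start) / 1_000_000;
    System.out.println("sort " + sortedIds.length + " terms, took: " + tookMs + "ms");
  }
}
```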

@gf2121 (Contributor Author) commented Nov 7, 2023

Update:

  • Make MergeSorter take advantage of setPivot to reduce decoding (implementing setPivot should help binarySort too); a sketch of the idea follows this list.
  • Fix the problem that StringSorter#getFallBackSorter did not use the passed-in comparator, which can skip comparing the first k bytes.
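A rough illustration of the setPivot idea from the first item (hypothetical code, not this PR's diff): Lucene's Sorter exposes setPivot/comparePivot hooks, so a string sorter can decode the pivot term once into a reusable buffer instead of re-fetching it on every comparison during binarySort or merging. The base class InPlaceMergeSorter is chosen only to keep the sketch compilable, and get(i) stands in for however the i-th term is actually fetched (e.g. from the byte pool).

```java
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.BytesRefBuilder;
import org.apache.lucene.util.InPlaceMergeSorter;

/** Illustrative only: sorts entries by term, decoding the pivot term once per setPivot. */
abstract class PivotCachingSorter extends InPlaceMergeSorter {

  private final BytesRefBuilder pivot = new BytesRefBuilder();

  /** Fetch the term of the i-th entry (placeholder for pool/hash access). */
  protected abstract BytesRef get(int i);

  @Override
  protected int compare(int i, int j) {
    // plain comparison still decodes both sides
    return get(i).compareTo(get(j));
  }

  @Override
  protected void setPivot(int i) {
    // decode the pivot term once, up front
    pivot.copyBytes(get(i));
  }

  @Override
  protected int comparePivot(int j) {
    // every probe against the pivot reuses the cached bytes
    return pivot.get().compareTo(get(j));
  }
}
```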

New benchmark result:

| flush round | flush terms | radix sort (ms) | stable radix sort (ms) | diff |
|---|---|---|---|---|
| 1 | 5169965 | 2328 | 1426 | -38.75% |
| 2 | 5169965 | 2269 | 1448 | -36.18% |
| 3 | 5169965 | 2314 | 1460 | -36.91% |
| 4 | 5169965 | 2306 | 1408 | -38.94% |
| 5 | 5169965 | 2253 | 1399 | -37.91% |
| 6 | 5169965 | 2329 | 1377 | -40.88% |
| 7 | 5169965 | 2211 | 1366 | -38.22% |
| 8 | 5169965 | 2363 | 1425 | -39.70% |
| 9 | 5169965 | 2233 | 1389 | -37.80% |
| 10 | 5169965 | 2294 | 1367 | -40.41% |
| avg | 5169965 | 2290 | 1406.5 | -38.58% |

@gf2121 (Contributor Author) commented Nov 7, 2023

For more realistic data, I indexed wikimedium10m with ramBuffer = 1024m and accumulated the time taken by all invocations of BytesRefHash#sort. The total decreased from 27161ms to 12203ms, which is much better than I expected.
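For context, the relevant indexing setting would look roughly like this (a generic sketch, not luceneutil's actual configuration): flush by RAM usage with a large 1024 MB buffer, so each flush sorts a large BytesRefHash and the cost of BytesRefHash#sort is easy to observe.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;

// Flush only when the 1024 MB RAM buffer fills up, never by document count.
IndexWriterConfig config =
    new IndexWriterConfig(new StandardAnalyzer())
        .setRAMBufferSizeMB(1024)
        .setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH);
```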

@dweiss (Contributor) commented Nov 7, 2023

Wow. Nice improvement!

@mikemccand (Member) commented

This is an incredible speedup :)

@gf2121 (Contributor Author) commented Nov 8, 2023

Something odd: I happened to run the benchmark on another Mac with an Intel chip and the result is disappointing (no obvious improvement):

use stable sort: false, sort 5169965 terms, took: 5377ms
use stable sort: false, sort 5169965 terms, took: 5293ms
use stable sort: false, sort 5169965 terms, took: 5347ms
use stable sort: false, sort 5169965 terms, took: 5285ms
use stable sort: false, sort 5169965 terms, took: 5307ms
use stable sort: false, sort 5169965 terms, took: 5373ms
use stable sort: false, sort 5169965 terms, took: 5416ms
use stable sort: false, sort 5169965 terms, took: 5321ms
use stable sort: false, sort 5169965 terms, took: 5343ms
use stable sort: false, sort 5169965 terms, took: 5278ms
use stable sort: false, sort 5169965 terms, took: 5323ms

use stable sort: true, sort 5169965 terms, took: 5261ms
use stable sort: true, sort 5169965 terms, took: 5329ms
use stable sort: true, sort 5169965 terms, took: 5304ms
use stable sort: true, sort 5169965 terms, took: 5268ms
use stable sort: true, sort 5169965 terms, took: 5331ms
use stable sort: true, sort 5169965 terms, took: 5306ms
use stable sort: true, sort 5169965 terms, took: 5232ms
use stable sort: true, sort 5169965 terms, took: 5245ms
use stable sort: true, sort 5169965 terms, took: 5248ms
use stable sort: true, sort 5169965 terms, took: 5218ms
use stable sort: true, sort 5169965 terms, took: 5193ms

But I can easily reproduce the speedup on the M2 Mac that produced all the benchmark results above:

use stable sort: false, sort 5169965 terms, took: 2340ms
use stable sort: false, sort 5169965 terms, took: 2197ms
use stable sort: false, sort 5169965 terms, took: 2288ms
use stable sort: false, sort 5169965 terms, took: 2320ms
use stable sort: false, sort 5169965 terms, took: 2227ms
use stable sort: false, sort 5169965 terms, took: 2238ms
use stable sort: false, sort 5169965 terms, took: 2256ms
use stable sort: false, sort 5169965 terms, took: 2287ms
use stable sort: false, sort 5169965 terms, took: 2244ms
use stable sort: false, sort 5169965 terms, took: 2240ms
use stable sort: false, sort 5169965 terms, took: 2320ms

use stable sort: true, sort 5169965 terms, took: 1418ms
use stable sort: true, sort 5169965 terms, took: 1303ms
use stable sort: true, sort 5169965 terms, took: 1290ms
use stable sort: true, sort 5169965 terms, took: 1349ms
use stable sort: true, sort 5169965 terms, took: 1293ms
use stable sort: true, sort 5169965 terms, took: 1329ms
use stable sort: true, sort 5169965 terms, took: 1332ms
use stable sort: true, sort 5169965 terms, took: 1307ms
use stable sort: true, sort 5169965 terms, took: 1280ms
use stable sort: true, sort 5169965 terms, took: 1336ms
use stable sort: true, sort 5169965 terms, took: 1302ms

@gf2121 (Contributor Author) commented Nov 8, 2023

I checked Linux x86 as well: no obvious speedup there either. Maybe the stable sort somehow helps the ARM CPU cache:

use stable sort: false, sort 5169965 terms, took: 4900ms
use stable sort: false, sort 5169965 terms, took: 5373ms
use stable sort: false, sort 5169965 terms, took: 4998ms
use stable sort: false, sort 5169965 terms, took: 5345ms
use stable sort: false, sort 5169965 terms, took: 5051ms
use stable sort: false, sort 5169965 terms, took: 4906ms
use stable sort: false, sort 5169965 terms, took: 4869ms
use stable sort: false, sort 5169965 terms, took: 4840ms
use stable sort: false, sort 5169965 terms, took: 4816ms
use stable sort: false, sort 5169965 terms, took: 4760ms
use stable sort: false, sort 5169965 terms, took: 4903ms

use stable sort: true, sort 5169965 terms, took: 4883ms
use stable sort: true, sort 5169965 terms, took: 4852ms
use stable sort: true, sort 5169965 terms, took: 4920ms
use stable sort: true, sort 5169965 terms, took: 4863ms
use stable sort: true, sort 5169965 terms, took: 4836ms
use stable sort: true, sort 5169965 terms, took: 4917ms
use stable sort: true, sort 5169965 terms, took: 4836ms
use stable sort: true, sort 5169965 terms, took: 4900ms
use stable sort: true, sort 5169965 terms, took: 4888ms
use stable sort: true, sort 5169965 terms, took: 4921ms
use stable sort: true, sort 5169965 terms, took: 4835ms

@dweiss (Contributor) commented Nov 8, 2023

Maybe Apple chips are tuned for sorting (they need to sort out what to do with all these income bills, after all)? :)

And seriously - thank you for checking on different hardware. There may be no single solution to fit them all - as the PR with vectorization benchmarks clearly shows...

@jpountz (Contributor) commented Nov 8, 2023

FWIW I think it's fine if this yields a speedup on some platforms but not other platforms. From your results, it doesn't seem to slow down flushing either?

@gf2121 (Contributor Author) commented Nov 8, 2023

> it's fine if this yields a speedup on some platforms but not other platforms

+1

> it doesn't seem to slow down flushing either?

Yes! Nothing got slowed down.

@gf2121 (Contributor Author) commented Nov 8, 2023

I came up with #12784 as another idea to speed up BytesRefHash#sort; it has shown performance improvements on Intel chips as well.

@gf2121 (Contributor Author) commented Nov 10, 2023

Closing this in favor of #12784.

@gf2121 gf2121 closed this Nov 10, 2023