Use insertion sort for small `sort_and_dedup_dst_buf` inputs (#459)
As a follow-up to #456 -- kate mentioned that `examples/words` still had the same outliers (#456 (comment)). As hashing dropped lower in the profile, the disproportionate amount of time spent in `sort_and_dedup_dst_buf` showed up. This makes sense, because the outliers in the `words` output come from blab's successive seeds generating specific outputs (such as combining 855 instances of the regex `GyADewtfILV53SSPua4IKbEZbug6BWTJ8o22K6ydeXs0NWsMADsGyADewtfILV53SSPua4IKbEZbug6BWTJ8o22K6ydeXs0NWsMA` ...) that are pathological knots for determinisation to untangle, not just a gradually increasing input size.

Currently, `sort_and_dedup_dst_buf` handles its input with a couple of strategies. The outlier appeared to come from frequently calling it with inputs that had only a few values but a large delta between the min and max, so it was effectively zeroing a couple of KB on the stack, setting just a few bits, and then sweeping over the whole bitset again to write out an array of the set bits. When the input array is fairly small (`<= SMALL_INPUT_LIMIT`), it's faster to append ascending values into a temporary buffer (so already-sorted input just copies over) and then shift entries forward as necessary to plug in the remaining values. In other words, use insertion sort when the entire data set fits within a few cache lines. It often does.

When using a
`words.sh` output test data set kate sent me for benchmarking, this brings the total runtime for `words -dt` from about 11.7 seconds to about 8.4 seconds on my laptop.

I also added a check for whether the input is already sorted and deduplicated, so the function doesn't need to do anything at all. During benchmarking that case was only reached a small percentage of the time, but it's very cheap to note that while sweeping through the input to find the min and max values.