Use insertion sort for small `sort_and_dedup_dst_buf` inputs (#459)
As a follow-up to #456 -- kate mentioned that `examples/words` still had the same outliers (#456 (comment)). As hashing dropped lower in the profile, the disproportionate amount of time spent in `sort_and_dedup_dst_buf` showed up. This makes sense, because the outliers in the `words` output come from blab's successive seeds generating specific outputs (such as combining 855 instances of the regex `GyADewtfILV53SSPua4IKbEZbug6BWTJ8o22K6ydeXs0NWsMADsGyADewtfILV53SSPua4IKbEZbug6BWTJ8o22K6ydeXs0NWsMA` ...) that are pathological knots for determinisation to untangle, not just a gradually increasing input size.

Currently, `sort_and_dedup_dst_buf` handles its input with a couple of strategies. The outlier appeared to come from frequently calling it with inputs that had only a few values but a large delta between the min and max, so it was effectively zeroing a couple of KB on the stack, setting just a few bits, and then sweeping over the whole bitset again to write out an array of the set bits. When the input array is fairly small (`<= SMALL_INPUT_LIMIT`), it's faster to append ascending values into a temporary buffer (so already-sorted input just copies over) and then shift entries forward as necessary to plug in the remaining values. In other words, use insertion sort when the entire data set fits within a few cache lines. It often does.

When using a
`words.sh` output test data set kate sent me for benchmarking, this brings the total runtime for `words -dt` from about 11.7 seconds to about 8.4 seconds on my laptop.

I also added a check for whether the input is already sorted and deduplicated, so the function doesn't need to do anything at all. During benchmarking that case was only reached a small percentage of the time, but it's very cheap to note that while sweeping through the input to find the min and max values.