Optimize ANDing uncompressed and compressed sorted runs:
Unfortunately, the fast-case code is quite ugly, but the speedup is worth it: AND gets almost 200% faster, which translates to almost 10% faster total time, depending on the workload [1].
On the other hand, experiments with AVX2 were slightly disappointing. They give another 5% performance, but the code is even uglier and becomes dependent on CPU flags (so we would probably need a fallback anyway). Most people on the internet manage to get a larger speedup from AVX, so this may be a skill issue. Anyway, after this PR "and" stops being a bottleneck again.
Similarly, tests with SSE were maybe a bit faster (within the test margin of error), but they made the code more complicated. I still believe it's possible to make this significantly faster with enough SIMD magic, but this is good enough, I guess.
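
For readers unfamiliar with the operation, here is a minimal scalar sketch of ANDing an uncompressed sorted run with a compressed one. It is not the code from this PR: the encoding (delta gaps stored as LEB128-style varints) and the names `encode_run`, `decode_run` and `and_runs` are assumptions made for illustration only.

```python
from typing import Iterator, List


def encode_run(sorted_ids: List[int]) -> bytes:
    """Hypothetical encoder: store gaps between sorted ids as varints."""
    out = bytearray()
    prev = 0
    for value in sorted_ids:
        gap = value - prev
        prev = value
        # Emit 7 bits per byte, low bits first; high bit means "more follows".
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)


def decode_run(data: bytes) -> Iterator[int]:
    """Lazily yield ids from a run produced by encode_run."""
    value = 0
    gap = 0
    shift = 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            value += gap
            yield value
            gap = 0
            shift = 0


def and_runs(uncompressed: List[int], compressed: bytes) -> List[int]:
    """Intersect a sorted list with a compressed run without materializing it."""
    result = []
    i = 0
    n = len(uncompressed)
    for value in decode_run(compressed):
        # Advance the uncompressed cursor until it catches up.
        while i < n and uncompressed[i] < value:
            i += 1
        if i == n:
            break  # nothing left to match, stop decoding early
        if uncompressed[i] == value:
            result.append(value)
            i += 1
    return result
```

Decoding lazily lets the merge stop as soon as the uncompressed side is exhausted, so the compressed run is never expanded in full.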
[1] If you wonder why the numbers don't add up: speeding up set operations also inexplicably slows down disk reads. This is probably due to disk prefetching going on under the hood.