-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve substring search performance for corner cases #11
Comments
I realized the haystack:
as consolation, regex is even worse.. |
Awesome, thanks for this! I'm certainly not surprised that there are pathological cases, but having them pointed out in test cases is even nicer. One thing I assume will help is to not increment by one byte on a false positive. Any highlights of normal things to avoid before I read your tests and the standard lib implementation? |
Ouch. Lots to fix! |
To avoid all algorithmic trouble, employ a substring search algorithm, i.e. some system to make sure it advances enough per trial. Here's an nice overview http://www-igm.univ-mlv.fr/~lecroq/string/ I don't know enough to have specific advice. I'm obviously trying to impl the two way algorithm with pcmp_find, but I'm not sure it's the best algorithm to pick for SSE4.2 substring search. |
The MB/s is bytes of haystack searched per second, of course. Most of the testcases don't match, so they effectively search it all. |
One thought is that the |
absolutely, that's what you need to to for the two way algorithm. Without the factorization that two-way does of the needle, I don't think you can step to the failed match position in general. Not sure I can recommend diving into string search algorithms. 😄 My revival PR of two-way for libstd took me a week of work, and I didn't even implement it from scratch |
That said, I think the pcmp_find is part way there. But I'd consider a different algorithm entirely maybe. Maybe Boyer-Moore-Horspool? |
Yeah, I'll skim through and see if anything jumps out as a key match. One nice thing is that you and I can collaborate on ideas and compete on implementations, and everyone wins! |
Note that the byteset optimization is why |
Have you seen strstr_sse42 in http://www.strchr.com/strcmp_and_strlen_using_sse_4.2? |
@ArtemGr that's a quadratic algorithm too (it's a naive search)
|
I wondered if the next byte meant the one after the 15 already checked. Duh, at least I tried. ) BTW, you ever run those benchmars with native optimizations on (target=native)? |
I didn't think about it. The pcmpestri instruction is emitted anyway since it's in an explicit |
The SSE version won't improve, but the non-SSE versions might. We could see how LLVM fares against manual assembly. Without "target=native" it can't compete. |
That's true. Most of libcore's substring search (str::find) is bounds checked, and that often makes it impossible for the optimizer to do autovectorization. Crucially the things in the tightest loop in I think it's harder to vectorize the more advanced string search algorithm (by hand or by compiler). I've been experimenting with that, (pcmp_find) but my question is which string search algorithm is the best for this. Since I posted this my pcmp_find experiment has actually progressed, it's an improvement over str::find in every case now, but of course jetscii can beat it depending on the case. In the easy cases, the simpler algorithm wins, in the tricky cases the string search algorithm should win. I'm sure I've tried target-cpu=native with that at some point, and I'll do it again now. |
when I run the benchmarks in the "twsearch" repo now, I see no difference in the twoway_find algorithm with native or not, unfortuately. It could probably be improved to be nicer for optimization some way, but the parts of the algorithm that are in the tightest loop (those parts here) are not so easy. The byteset check is random access of a single byte, and the "right part of needle" loop needs to count the number of bytes that matched. |
I hadn't much luck with "cargo rustc" so I did "cargo bench --verbose" and repeated a few "rustc" commands with "-C target-cpu=native" added. What I've noticed (normal vs. native): test aaab_in_aab::twoway_rfind ... bench: 1,503,030 ns/iter (+/- 5,837) = 199 MB/s test pathological_two_way_rev::twoway_rfind ... bench: 308,092 ns/iter (+/- 382) = 194 MB/s Might be just quirks, I haven't examined the assembly.
Yeah, I can see how this would be hard for LLVM to optimize. I rather hoped that the scratchspace contained some naive searchers that we'd see turned by LLVM into SSE instructions. |
cool ok, I have to look at tht! |
I'm only using one particular platform (llvm calls it corei7-avx), so of course that is kind of a one eyed view on optimization. |
Well, give me a shout if you want to run some Opteron-3365 benchmarks. |
Ok that's exciting, with native it seems to emit some avx instructions here, but yes, I think it's in a location that's outside of the innermost search loop. Getting it to emit a vectorized loop for the first counted loop (match first half of needle) would be great. |
Alright, I guess I should think about multiple different platforms and microarches at some point |
I've subjected jetscii 0.3.1 to a benchmark suite of just a few pathological examples for substring search.
_jetscii benches very well_, there's just evidence of pathlogical cases
Result (may be noisy, i.e. individual benchmarks may be flukes)
find
-str::find
with&str
patternrfind
-str::rfind
with&str
patternjetscii_find
str::find
withSubstring
patternpcmp_find
experimental function twsearch::pcmp::find, an incomplete, incorrect two-way searchregex_find
using regexConclusions:
bad_naive
libstd
's StrSearcher, see testcasepathological_aa_100k
Benchmark source, very messy crate
The text was updated successfully, but these errors were encountered: