Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve substring search performance for corner cases #11

Open
bluss opened this issue Jul 29, 2015 · 24 comments
Open

Improve substring search performance for corner cases #11

bluss opened this issue Jul 29, 2015 · 24 comments

Comments

@bluss
Copy link
Contributor

bluss commented Jul 29, 2015

I've subjected jetscii 0.3.1 to a benchmark suite of just a few pathological examples for substring search.

_jetscii benches very well_, there's just evidence of pathlogical cases

Result (may be noisy, i.e. individual benchmarks may be flukes)

  • find - str::find with &str pattern
  • rfind - str::rfind with &str pattern
  • jetscii_find str::find with Substring pattern
  • pcmp_find experimental function twsearch::pcmp::find, an incomplete, incorrect two-way search
  • regex_find using regex
test bad_naive::find                                      ... bench:         274 ns/iter (+/- 19) = 270 MB/s
test bad_naive::jetscii_find                              ... bench:         504 ns/iter (+/- 118) = 146 MB/s
test bad_naive::pcmp_find                                 ... bench:          96 ns/iter (+/- 1) = 770 MB/s
test bad_naive::regex_find                                ... bench:       1,077 ns/iter (+/- 89) = 68 MB/s
test bad_naive::rfind                                     ... bench:         231 ns/iter (+/- 16) = 320 MB/s
test bad_naive_reversed::find                             ... bench:         175 ns/iter (+/- 9) = 422 MB/s
test bad_naive_reversed::jetscii_find                     ... bench:          54 ns/iter (+/- 4) = 1370 MB/s
test bad_naive_reversed::pcmp_find                        ... bench:          43 ns/iter (+/- 4) = 1720 MB/s
test bad_naive_reversed::regex_find                       ... bench:          36 ns/iter (+/- 3) = 2055 MB/s
test bad_naive_reversed::rfind                            ... bench:         367 ns/iter (+/- 54) = 201 MB/s
test pathological_aa_100k::find                           ... bench:       3,084 ns/iter (+/- 149) = 32425 MB/s
test pathological_aa_100k::jetscii_find                   ... bench:      24,116 ns/iter (+/- 1,041) = 4146 MB/s
test pathological_aa_100k::pcmp_find                      ... bench:      27,241 ns/iter (+/- 2,031) = 3670 MB/s
test pathological_aa_100k::regex_find                     ... bench:       4,453 ns/iter (+/- 262) = 22456 MB/s
test pathological_aa_100k::rfind                          ... bench:       2,824 ns/iter (+/- 207) = 35410 MB/s
test pathological_aab_100k::find                          ... bench:     889,518 ns/iter (+/- 55,787) = 337 MB/s
test pathological_aab_100k::jetscii_find                  ... bench:     265,300 ns/iter (+/- 15,265) = 1130 MB/s
test pathological_aab_100k::pcmp_find                     ... bench:     183,135 ns/iter (+/- 11,808) = 1638 MB/s
test pathological_aab_100k::regex_find                    ... bench:   3,750,744 ns/iter (+/- 392,444) = 79 MB/s
test pathological_aab_100k::rfind                         ... bench:     859,327 ns/iter (+/- 104,678) = 348 MB/s
test pathological_two_way_10k::find                       ... bench:     107,896 ns/iter (+/- 48,497) = 278 MB/s
test pathological_two_way_10k::jetscii_find               ... bench:       7,406 ns/iter (+/- 1,072) = 4050 MB/s
test pathological_two_way_10k::pcmp_find                  ... bench:      19,444 ns/iter (+/- 22,302) = 1542 MB/s
test pathological_two_way_10k::regex_find                 ... bench:       1,329 ns/iter (+/- 2,413) = 22573 MB/s
test pathological_two_way_10k::rfind                      ... bench:       2,463 ns/iter (+/- 4,146) = 12180 MB/s
test pathological_two_way_20k::find                       ... bench:     305,465 ns/iter (+/- 505,176) = 327 MB/s
test pathological_two_way_20k::jetscii_find               ... bench:      25,179 ns/iter (+/- 39,142) = 3971 MB/s
test pathological_two_way_20k::pcmp_find                  ... bench:      64,983 ns/iter (+/- 113,250) = 1538 MB/s
test pathological_two_way_20k::regex_find                 ... bench:       6,175 ns/iter (+/- 12,215) = 16194 MB/s
test pathological_two_way_20k::rfind                      ... bench:       3,673 ns/iter (+/- 6,390) = 27225 MB/s
test pathological_two_way_20k_reversed::find              ... bench:       2,492 ns/iter (+/- 4,523) = 24077 MB/s
test pathological_two_way_20k_reversed::jetscii_find      ... bench:      45,084 ns/iter (+/- 80,989) = 1330 MB/s
test pathological_two_way_20k_reversed::pcmp_find         ... bench:      39,623 ns/iter (+/- 65,178) = 1514 MB/s
test pathological_two_way_20k_reversed::regex_find        ... bench:     406,637 ns/iter (+/- 730,079) = 147 MB/s
test pathological_two_way_20k_reversed::rfind             ... bench:     155,639 ns/iter (+/- 284,944) = 385 MB/s
test short_long::find                                     ... bench:       3,924 ns/iter (+/- 4,175) = 650 MB/s
test short_long::jetscii_find                             ... bench:         676 ns/iter (+/- 1,291) = 3773 MB/s
test short_long::pcmp_find                                ... bench:         810 ns/iter (+/- 1,531) = 3149 MB/s
test short_long::regex_find                               ... bench:       2,821 ns/iter (+/- 748) = 904 MB/s
test short_long::rfind                                    ... bench:       2,253 ns/iter (+/- 3,445) = 1132 MB/s
test short_short::find                                    ... bench:          68 ns/iter (+/- 119) = 823 MB/s
test short_short::jetscii_find                            ... bench:         109 ns/iter (+/- 127) = 513 MB/s
test short_short::pcmp_find                               ... bench:          43 ns/iter (+/- 69) = 1302 MB/s
test short_short::regex_find                              ... bench:          54 ns/iter (+/- 122) = 1037 MB/s
test short_short::rfind                                   ... bench:         125 ns/iter (+/- 160) = 448 MB/s

Conclusions:

  • jetscii 0.3.1 has algorithmic trouble, probably O(n²) problematics, see testcase bad_naive
  • Both jetscii and pcmp are losing to the byteset optimization in libstd's StrSearcher, see testcase pathological_aa_100k

Benchmark source, very messy crate

@bluss
Copy link
Contributor Author

bluss commented Jul 29, 2015

I realized the bad_naive testcase was a bit short -- the needle below 16 bytes. A proper testcase is a worse pathological case for jetscii!

haystack: (0..100_000).map(|_| "a").collect::<String>()
needle: (0..100).map(|_| "a").collect::<String>() + "b"

test bad_naive_long::find                                 ... bench:     299,292 ns/iter (+/- 37,567) = 334 MB/s
test bad_naive_long::jetscii_find                         ... bench:  10,303,063 ns/iter (+/- 194,054) = 9 MB/s
test bad_naive_long::pcmp_find                            ... bench:      23,807 ns/iter (+/- 225) = 4200 MB/s

as consolation, regex is even worse..

@shepmaster
Copy link
Owner

Awesome, thanks for this! I'm certainly not surprised that there are pathological cases, but having them pointed out in test cases is even nicer.

One thing I assume will help is to not increment by one byte on a false positive. Any highlights of normal things to avoid before I read your tests and the standard lib implementation?

@shepmaster
Copy link
Owner

9 MB/s

Ouch. Lots to fix!

@bluss
Copy link
Contributor Author

bluss commented Jul 29, 2015

To avoid all algorithmic trouble, employ a substring search algorithm, i.e. some system to make sure it advances enough per trial. Here's an nice overview http://www-igm.univ-mlv.fr/~lecroq/string/

I don't know enough to have specific advice. I'm obviously trying to impl the two way algorithm with pcmp_find, but I'm not sure it's the best algorithm to pick for SSE4.2 substring search.

@bluss
Copy link
Contributor Author

bluss commented Jul 29, 2015

The MB/s is bytes of haystack searched per second, of course. Most of the testcases don't match, so they effectively search it all.

@shepmaster
Copy link
Owner

One thought is that the pcmpestrm instruction should be able to tell which byte fails strcmp, which I believe can be used to stride further.

@bluss
Copy link
Contributor Author

bluss commented Jul 29, 2015

absolutely, that's what you need to to for the two way algorithm.

Without the factorization that two-way does of the needle, I don't think you can step to the failed match position in general.

Not sure I can recommend diving into string search algorithms. 😄 My revival PR of two-way for libstd took me a week of work, and I didn't even implement it from scratch

@bluss
Copy link
Contributor Author

bluss commented Jul 29, 2015

That said, I think the pcmp_find is part way there. But I'd consider a different algorithm entirely maybe. Maybe Boyer-Moore-Horspool?

@shepmaster
Copy link
Owner

a different algorithm

Yeah, I'll skim through and see if anything jumps out as a key match. One nice thing is that you and I can collaborate on ideas and compete on implementations, and everyone wins!

@bluss
Copy link
Contributor Author

bluss commented Jul 30, 2015

Note that the byteset optimization is why str::find is so fast in the substring benchmark in your Readme. If you add an a anywhere in the needle, that optimization won't work.

@ArtemGr
Copy link

ArtemGr commented Aug 7, 2015

Have you seen strstr_sse42 in http://www.strchr.com/strcmp_and_strlen_using_sse_4.2?

@bluss
Copy link
Contributor Author

bluss commented Aug 7, 2015

@ArtemGr that's a quadratic algorithm too (it's a naive search)

; continue searching from the next byte

@ArtemGr
Copy link

ArtemGr commented Aug 7, 2015

I wondered if the next byte meant the one after the 15 already checked. Duh, at least I tried. )

BTW, you ever run those benchmars with native optimizations on (target=native)?

@bluss
Copy link
Contributor Author

bluss commented Aug 7, 2015

I didn't think about it. The pcmpestri instruction is emitted anyway since it's in an explicit asm!() block, so I'm not sure what could improve.

@ArtemGr
Copy link

ArtemGr commented Aug 7, 2015

The pcmpestri instruction is emitted anyway since it's in an explicit asm!() block, so I'm not sure what could improve.

The SSE version won't improve, but the non-SSE versions might. We could see how LLVM fares against manual assembly. Without "target=native" it can't compete.

@bluss
Copy link
Contributor Author

bluss commented Aug 7, 2015

That's true. Most of libcore's substring search (str::find) is bounds checked, and that often makes it impossible for the optimizer to do autovectorization. Crucially the things in the tightest loop in str::find are not very amenable to vectorization.

I think it's harder to vectorize the more advanced string search algorithm (by hand or by compiler). I've been experimenting with that, (pcmp_find) but my question is which string search algorithm is the best for this. Since I posted this my pcmp_find experiment has actually progressed, it's an improvement over str::find in every case now, but of course jetscii can beat it depending on the case.

In the easy cases, the simpler algorithm wins, in the tricky cases the string search algorithm should win.

I'm sure I've tried target-cpu=native with that at some point, and I'll do it again now.

@bluss
Copy link
Contributor Author

bluss commented Aug 7, 2015

when I run the benchmarks in the "twsearch" repo now, I see no difference in the twoway_find algorithm with native or not, unfortuately. It could probably be improved to be nicer for optimization some way, but the parts of the algorithm that are in the tightest loop (those parts here) are not so easy. The byteset check is random access of a single byte, and the "right part of needle" loop needs to count the number of bytes that matched.

@ArtemGr
Copy link

ArtemGr commented Aug 7, 2015

I hadn't much luck with "cargo rustc" so I did "cargo bench --verbose" and repeated a few "rustc" commands with "-C target-cpu=native" added.
Normal: http://pastebin.com/4PGJMkSv
Native: http://pastebin.com/EbJKjaVT

What I've noticed (normal vs. native):

test aaab_in_aab::twoway_rfind ... bench: 1,503,030 ns/iter (+/- 5,837) = 199 MB/s
test aaab_in_aab::twoway_rfind ... bench: 857,109 ns/iter (+/- 1,914) = 349 MB/s

test pathological_two_way_rev::twoway_rfind ... bench: 308,092 ns/iter (+/- 382) = 194 MB/s
test pathological_two_way_rev::twoway_rfind ... bench: 150,717 ns/iter (+/- 255) = 398 MB/s

Might be just quirks, I haven't examined the assembly.
Just my two cents.

The byteset check is random access of a single byte

Yeah, I can see how this would be hard for LLVM to optimize. I rather hoped that the scratchspace contained some naive searchers that we'd see turned by LLVM into SSE instructions.

@bluss
Copy link
Contributor Author

bluss commented Aug 7, 2015

cool ok, I have to look at tht!

@bluss
Copy link
Contributor Author

bluss commented Aug 7, 2015

I'm only using one particular platform (llvm calls it corei7-avx), so of course that is kind of a one eyed view on optimization.

@ArtemGr
Copy link

ArtemGr commented Aug 7, 2015

Well, give me a shout if you want to run some Opteron-3365 benchmarks.

@bluss
Copy link
Contributor Author

bluss commented Aug 7, 2015

Ok that's exciting, with native it seems to emit some avx instructions here, but yes, I think it's in a location that's outside of the innermost search loop. Getting it to emit a vectorized loop for the first counted loop (match first half of needle) would be great.

@bluss
Copy link
Contributor Author

bluss commented Aug 7, 2015

Alright, I guess I should think about multiple different platforms and microarches at some point

@dralley
Copy link

dralley commented Jun 25, 2022

@bluss If you still happen to have this benchmarking code (repo is longer available), it would be great to adapt for #54

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants