[Grammar] Update 10-4 Multiple Compares Single Branch.md
dendibakh authored Sep 23, 2024
1 parent f4e7304 commit c985ca8
Showing 1 changed file with 1 addition and 1 deletion.
@@ -70,7 +70,7 @@ We start by preparing an 8-byte mask filled with `eol` symbols. The inner loop l
If the mask is zero, that means there are no `eol` characters in the current chunk and we can skip it (see line 11). This is a critical optimization that provides large speedups for input strings with long lines. If a mask is not zero, that means there are `eol` characters and we need to find their positions. To do so, we use the `tzcnt` function, which counts the number of trailing zero bits in an 8-bit mask (the position of the rightmost set bit). For example, for the mask `0b00101000`, it will return 3. Most ISAs support implementing the `tzcnt` function with a single instruction.[^3] Line 14 calculates the length of the current line using the result of the `tzcnt` function. We shift right the mask and repeat until there are no set bits in the mask.
-For an input string with a single very long line (best case scenario), the SIMD version will execute eight times fewer branch instructions. However, in the worst case scenario with zero-length lines (i.e., only `eol` characters in the input string), the original approach is faster. We benchmarked this technique using AVX2 implementation (with chunks of 16 characters) on several different inputs, including textbooks, and source code files. The result was 5--6 times fewer branch instructions and more than 4x better performance when running on Intel Core i7-1260P (12th Gen, Alderlake).
+For an input string with a single very long line (best case scenario), the SIMD version will execute eight times fewer branch instructions. However, in the worst-case scenario with zero-length lines (i.e., only `eol` characters in the input string), the original approach is faster. We benchmarked this technique using AVX2 implementation (with chunks of 16 characters) on several different inputs, including textbooks, and source code files. The result was 5--6 times fewer branch instructions and more than 4x better performance when running on Intel Core i7-1260P (12th Gen, Alderlake).
[^1]: Assuming that compiler will avoid generating branch instructions for `std::max`.
[^2]: Performance Ninja: compiler intrinsics 2 - [https://github.com/dendibakh/perf-ninja/tree/main/labs/core_bound/compiler_intrinsics_2](https://github.com/dendibakh/perf-ninja/tree/main/labs/core_bound/compiler_intrinsics_2).