Implement generic CRC-32 #24

ahomescu · 2024-02-07T04:51:17Z

Add a naive implementation of CRC-32 based
on the C code from the PNG specification.

codecov · 2024-02-13T16:47:18Z

Codecov Report

Attention: 20 lines in your changes are missing coverage. Please review.

Comparison is base (fc8e11e) 77.64% compared to head (25708f1) 85.99%.
Report is 27 commits behind head on main.

Files	Patch %	Lines
zlib-rs/src/crc32.rs	82.30%	20 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #24      +/-   ##
==========================================
+ Coverage   77.64%   85.99%   +8.35%     
==========================================
  Files          28       29       +1     
  Lines        6388     6591     +203     
==========================================
+ Hits         4960     5668     +708     
+ Misses       1428      923     -505

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

folkertdev

left some comments.

For usage in the gzip logic it's important to have a function that can take in an existing checksum value and more bytes, and update the checksum with those bytes. It's not obvious how this implementation would do that to me (but hopefully that is simple to add?)

zlib-rs/src/crc32.rs

folkertdev · 2024-02-13T18:44:53Z

also as a general observation this already seems bigger and more complex than what crc32fast does in the general case . e.g.

https://github.com/srijs/rust-crc32fast/blob/master/src/baseline.rs

is quite simple to follow. Am I missing something that is implemented here but not there (besides the tables, which apparently should preferably be stored in a static, not a const srijs/rust-crc32fast@e61ce6a I did not know this, but that is useful information).

ahomescu · 2024-02-13T21:54:01Z

For usage in the gzip logic it's important to have a function that can take in an existing checksum value and more bytes, and update the checksum with those bytes. It's not obvious how this implementation would do that to me (but hopefully that is simple to add?)

Should be pretty simple, we can replace START_CHECKSUM at its uses with the given checksum value. I think even crc32_interleaved can support that.

ahomescu · 2024-02-13T21:57:49Z

also as a general observation this already seems bigger and more complex than what crc32fast does in the general case

That should be pretty simple to add by using the tables from build_crc32_table. My implementation here is based on zlib-ng's "braid" algorithm, i.e., the "interleaved" approach from https://github.com/zlib-ng/zlib-ng/blob/develop/doc/crc-doc.1.0.pdf. Is that overkill? Up to you, let me know how you'd like to proceed.

folkertdev · 2024-02-13T22:44:11Z

In that case I think it makes more sense to follow the zlib-ng implementation more closely (https://github.com/zlib-ng/zlib-ng/blob/develop/arch/generic/crc32_braid_c.c). it seems to have diverged somewhat from the original paper. This is often my experience with CS literature: the way to write something up for an academic paper is generally not how you actually implement things. It will be interesting to see how this compares to the crc32fast baseline algorithm (I'd hope the zlib variant is faster, otherwise what are they doing, right?)

The implementation in that C file seems straightforward, but we can probably clean it up given that we're using rust. But I think it's a good idea to use the same name (braid), type signature, and generally follow its structure.

I think using an array for the N + 1 crcs and other values is a clever idea. LLVM should have no problem with that. The braid function can take N as a const N: usize> const generic so it's clearer why we're doing things that way.

There is some arithmetic with 0xFFFF_FFFF that we can simplify because we know a u32 is 32 bits, (not a c_int that might be wider). so an xor 0xFFFF_FFFF is just a bitwise negation for us.

using macros to unroll without duplicating too much code is fine (our adler32 implementation does this too).

Add a naive implementation of CRC-32 based on the C code from the PNG specification. Add an optimized word-size implementation based on Kadatch and Jenkins. Add optimized interleaved implementation from section 4.11 of the same paper.

folkertdev · 2024-02-15T12:06:16Z

could you add some #[allow(unused)] to make the clippy warnings go away? I'd do it myself but I think your git settings don't allow for it.

I've been playing around with the simd version, and from what I see zlib-ng does much more manual work than it really needs. The crc32fast implementation is basically the same speed at a fraction of the code size.

Benchmark 1 (87 runs): cargo run --release --example crc32_bench sse silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          57.3ms ± 3.38ms    55.5ms … 84.6ms          7 ( 8%)        0%
  peak_rss           21.8MB ±  138KB    21.6MB … 22.2MB          0 ( 0%)        0%
  cpu_cycles          151M  ± 3.99M      147M  …  181M           7 ( 8%)        0%
  instructions        325M  ± 46.0K      325M  …  325M           2 ( 2%)        0%
  cache_references   3.70M  ± 49.5K     3.59M  … 3.83M           0 ( 0%)        0%
  cache_misses        980K  ±  109K      832K  … 1.36M           5 ( 6%)        0%
  branch_misses       746K  ± 30.6K      732K  … 1.02M           3 ( 3%)        0%
Benchmark 2 (88 runs): cargo run --release --example crc32_bench crc32fast silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          57.4ms ± 2.19ms    55.5ms … 69.7ms          3 ( 3%)          +  0.1% ±  1.5%
  peak_rss           21.8MB ±  156KB    21.6MB … 22.2MB          0 ( 0%)          +  0.1% ±  0.2%
  cpu_cycles          152M  ± 4.85M      147M  …  185M           2 ( 2%)          +  0.6% ±  0.9%
  instructions        323M  ± 46.4K      323M  …  323M           1 ( 1%)          -  0.6% ±  0.0%
  cache_references   3.71M  ±  100K     3.61M  … 4.40M           3 ( 3%)          +  0.2% ±  0.6%
  cache_misses        995K  ±  135K      849K  … 1.40M           6 ( 7%)          +  1.6% ±  3.7%
  branch_misses       746K  ± 14.4K      732K  …  823K           5 ( 6%)          +  0.1% ±  0.9%

But that benchmark measures performance when all the data is available. Zlib can en/decode in a streaming fashion, and the crc32 algorithm it uses is tuned for that: crc32fast only stores a u32 as its state, but zlib-ng actually keeps 4 simd values around to store the state. So when the data comes in in small increments, zlib-ng should be saving a bunch of work. It's hard to tell how much that matters in practice though. I'll try to run some benchmarks for that too

Also we do need the fused crc with memcpy, and I don't think that fits into the current crc32fast design. Somehow all of these bits need to be synthesized together.

With all that, if you want to make more open source contributions, I think you all have a better setup for running avx-512 SIMD code, so you might look into adding a 512-bit simd version based on https://github.com/zlib-ng/zlib-ng/blob/develop/arch/x86/crc32_fold_vpclmulqdq_tpl.h to crc32fast itself. Assuming our code will be close to what crc32fast does we can then use that for zlib-rs down the line.

folkertdev · 2024-02-15T12:21:31Z

hah, wow

Benchmark 1 (84 runs): cargo run --release --example crc32_bench sse-chunked silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          59.3ms ± 1.77ms    57.0ms … 64.6ms          4 ( 5%)        0%
  peak_rss           21.8MB ±  156KB    21.5MB … 22.1MB          0 ( 0%)        0%
  cpu_cycles          158M  ± 3.51M      154M  …  174M           6 ( 7%)        0%
  instructions        354M  ± 49.4K      354M  …  354M           2 ( 2%)        0%
  cache_references   3.76M  ± 57.0K     3.67M  … 3.94M           1 ( 1%)        0%
  cache_misses       1.09M  ±  161K      880K  … 1.52M           3 ( 4%)        0%
  branch_misses       742K  ± 6.52K      731K  …  771K           1 ( 1%)        0%
Benchmark 2 (55 runs): cargo run --release --example crc32_bench crc32fast-chunked silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          90.8ms ± 3.19ms    87.9ms …  111ms          1 ( 2%)        💩+ 53.2% ±  1.4%
  peak_rss           21.8MB ±  151KB    21.5MB … 22.2MB          0 ( 0%)          +  0.1% ±  0.2%
  cpu_cycles          280M  ± 7.58M      274M  …  329M           1 ( 2%)        💩+ 77.1% ±  1.2%
  instructions        434M  ± 60.2K      434M  …  434M           3 ( 5%)        💩+ 22.6% ±  0.0%
  cache_references   3.78M  ±  115K     3.59M  … 4.48M           1 ( 2%)          +  0.6% ±  0.8%
  cache_misses       1.18M  ±  173K      934K  … 1.63M           0 ( 0%)        💩+  8.3% ±  5.2%
  branch_misses       747K  ± 14.8K      735K  …  831K           3 ( 5%)          +  0.6% ±  0.5%

that makes a big difference! This is running

            let mut h = crc32fast::Hasher::new_with_initial(0);

            for c in input.chunks(32) {
                h.update(c);
            }
            println!("{:#x}", h.finalize());

and the equivalent thing for the zlib-ng approach. So, that is something that we might also consider upstreaming after we've cleaned it up considerably (the crc32fast code looks a lot nicer than the zlib-ng version)

folkertdev

thanks for implementing this!

I'll just fix the clippy issues on main.

ahomescu · 2024-02-17T01:01:14Z

Most of the clippy issues should go away once another part of the code starts using these functions.

ahomescu force-pushed the ahomescu/crc32 branch 4 times, most recently from 72d4279 to 390d0e1 Compare February 13, 2024 08:16

ahomescu marked this pull request as ready for review February 13, 2024 08:18

folkertdev reviewed Feb 13, 2024

View reviewed changes

zlib-rs/src/crc32.rs Outdated Show resolved Hide resolved

zlib-rs/src/crc32.rs Outdated Show resolved Hide resolved

zlib-rs/src/crc32.rs Outdated Show resolved Hide resolved

zlib-rs/src/crc32.rs Outdated Show resolved Hide resolved

zlib-rs/src/crc32.rs Outdated Show resolved Hide resolved

ahomescu changed the title ~~First part of issue #14~~ Implement generic CRC-32 Feb 14, 2024

ahomescu force-pushed the ahomescu/crc32 branch 3 times, most recently from 33a1230 to 5bec7b6 Compare February 14, 2024 03:08

ahomescu force-pushed the ahomescu/crc32 branch from 5bec7b6 to 25708f1 Compare February 15, 2024 03:59

folkertdev approved these changes Feb 16, 2024

View reviewed changes

folkertdev merged commit b348386 into trifectatechfoundation:main Feb 16, 2024
4 of 5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement generic CRC-32 #24

Implement generic CRC-32 #24

ahomescu commented Feb 7, 2024

codecov bot commented Feb 13, 2024 •

edited

Loading

folkertdev left a comment

folkertdev commented Feb 13, 2024

ahomescu commented Feb 13, 2024

ahomescu commented Feb 13, 2024

folkertdev commented Feb 13, 2024

folkertdev commented Feb 15, 2024

folkertdev commented Feb 15, 2024

folkertdev left a comment

ahomescu commented Feb 17, 2024

Implement generic CRC-32 #24

Implement generic CRC-32 #24

Conversation

ahomescu commented Feb 7, 2024

codecov bot commented Feb 13, 2024 • edited Loading

Codecov Report

folkertdev left a comment

Choose a reason for hiding this comment

folkertdev commented Feb 13, 2024

ahomescu commented Feb 13, 2024

ahomescu commented Feb 13, 2024

folkertdev commented Feb 13, 2024

folkertdev commented Feb 15, 2024

folkertdev commented Feb 15, 2024

folkertdev left a comment

Choose a reason for hiding this comment

ahomescu commented Feb 17, 2024

codecov bot commented Feb 13, 2024 •

edited

Loading