Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement generic CRC-32 #24

Merged
merged 1 commit into from
Feb 16, 2024

Conversation

ahomescu
Copy link
Contributor

@ahomescu ahomescu commented Feb 7, 2024

Add a naive implementation of CRC-32 based
on the C code from the PNG specification.

@ahomescu ahomescu force-pushed the ahomescu/crc32 branch 4 times, most recently from 72d4279 to 390d0e1 Compare February 13, 2024 08:16
@ahomescu ahomescu marked this pull request as ready for review February 13, 2024 08:18
Copy link

codecov bot commented Feb 13, 2024

Codecov Report

Attention: 20 lines in your changes are missing coverage. Please review.

Comparison is base (fc8e11e) 77.64% compared to head (25708f1) 85.99%.
Report is 27 commits behind head on main.

Files Patch % Lines
zlib-rs/src/crc32.rs 82.30% 20 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #24      +/-   ##
==========================================
+ Coverage   77.64%   85.99%   +8.35%     
==========================================
  Files          28       29       +1     
  Lines        6388     6591     +203     
==========================================
+ Hits         4960     5668     +708     
+ Misses       1428      923     -505     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator

@folkertdev folkertdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left some comments.

For usage in the gzip logic it's important to have a function that can take in an existing checksum value and more bytes, and update the checksum with those bytes. It's not obvious how this implementation would do that to me (but hopefully that is simple to add?)

zlib-rs/src/crc32.rs Outdated Show resolved Hide resolved
zlib-rs/src/crc32.rs Outdated Show resolved Hide resolved
zlib-rs/src/crc32.rs Outdated Show resolved Hide resolved
zlib-rs/src/crc32.rs Outdated Show resolved Hide resolved
zlib-rs/src/crc32.rs Outdated Show resolved Hide resolved
@folkertdev
Copy link
Collaborator

also as a general observation this already seems bigger and more complex than what crc32fast does in the general case . e.g.

https://github.com/srijs/rust-crc32fast/blob/master/src/baseline.rs

is quite simple to follow. Am I missing something that is implemented here but not there (besides the tables, which apparently should preferably be stored in a static, not a const srijs/rust-crc32fast@e61ce6a I did not know this, but that is useful information).

@ahomescu
Copy link
Contributor Author

For usage in the gzip logic it's important to have a function that can take in an existing checksum value and more bytes, and update the checksum with those bytes. It's not obvious how this implementation would do that to me (but hopefully that is simple to add?)

Should be pretty simple, we can replace START_CHECKSUM at its uses with the given checksum value. I think even crc32_interleaved can support that.

@ahomescu
Copy link
Contributor Author

also as a general observation this already seems bigger and more complex than what crc32fast does in the general case

That should be pretty simple to add by using the tables from build_crc32_table. My implementation here is based on zlib-ng's "braid" algorithm, i.e., the "interleaved" approach from https://github.com/zlib-ng/zlib-ng/blob/develop/doc/crc-doc.1.0.pdf. Is that overkill? Up to you, let me know how you'd like to proceed.

@folkertdev
Copy link
Collaborator

In that case I think it makes more sense to follow the zlib-ng implementation more closely (https://github.com/zlib-ng/zlib-ng/blob/develop/arch/generic/crc32_braid_c.c). it seems to have diverged somewhat from the original paper. This is often my experience with CS literature: the way to write something up for an academic paper is generally not how you actually implement things. It will be interesting to see how this compares to the crc32fast baseline algorithm (I'd hope the zlib variant is faster, otherwise what are they doing, right?)

The implementation in that C file seems straightforward, but we can probably clean it up given that we're using rust. But I think it's a good idea to use the same name (braid), type signature, and generally follow its structure.

I think using an array for the N + 1 crcs and other values is a clever idea. LLVM should have no problem with that. The braid function can take N as a const N: usize> const generic so it's clearer why we're doing things that way.

There is some arithmetic with 0xFFFF_FFFF that we can simplify because we know a u32 is 32 bits, (not a c_int that might be wider). so an xor 0xFFFF_FFFF is just a bitwise negation for us.

using macros to unroll without duplicating too much code is fine (our adler32 implementation does this too).

@ahomescu ahomescu changed the title First part of issue #14 Implement generic CRC-32 Feb 14, 2024
@ahomescu ahomescu force-pushed the ahomescu/crc32 branch 3 times, most recently from 33a1230 to 5bec7b6 Compare February 14, 2024 03:08
Add a naive implementation of CRC-32 based
on the C code from the PNG specification.
Add an optimized word-size implementation
based on Kadatch and Jenkins.
Add optimized interleaved implementation from
section 4.11 of the same paper.
@folkertdev
Copy link
Collaborator

could you add some #[allow(unused)] to make the clippy warnings go away? I'd do it myself but I think your git settings don't allow for it.


I've been playing around with the simd version, and from what I see zlib-ng does much more manual work than it really needs. The crc32fast implementation is basically the same speed at a fraction of the code size.

Benchmark 1 (87 runs): cargo run --release --example crc32_bench sse silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          57.3ms ± 3.38ms    55.5ms … 84.6ms          7 ( 8%)        0%
  peak_rss           21.8MB ±  138KB    21.6MB … 22.2MB          0 ( 0%)        0%
  cpu_cycles          151M  ± 3.99M      147M  …  181M           7 ( 8%)        0%
  instructions        325M  ± 46.0K      325M  …  325M           2 ( 2%)        0%
  cache_references   3.70M  ± 49.5K     3.59M  … 3.83M           0 ( 0%)        0%
  cache_misses        980K  ±  109K      832K  … 1.36M           5 ( 6%)        0%
  branch_misses       746K  ± 30.6K      732K  … 1.02M           3 ( 3%)        0%
Benchmark 2 (88 runs): cargo run --release --example crc32_bench crc32fast silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          57.4ms ± 2.19ms    55.5ms … 69.7ms          3 ( 3%)          +  0.1% ±  1.5%
  peak_rss           21.8MB ±  156KB    21.6MB … 22.2MB          0 ( 0%)          +  0.1% ±  0.2%
  cpu_cycles          152M  ± 4.85M      147M  …  185M           2 ( 2%)          +  0.6% ±  0.9%
  instructions        323M  ± 46.4K      323M  …  323M           1 ( 1%)          -  0.6% ±  0.0%
  cache_references   3.71M  ±  100K     3.61M  … 4.40M           3 ( 3%)          +  0.2% ±  0.6%
  cache_misses        995K  ±  135K      849K  … 1.40M           6 ( 7%)          +  1.6% ±  3.7%
  branch_misses       746K  ± 14.4K      732K  …  823K           5 ( 6%)          +  0.1% ±  0.9%

But that benchmark measures performance when all the data is available. Zlib can en/decode in a streaming fashion, and the crc32 algorithm it uses is tuned for that: crc32fast only stores a u32 as its state, but zlib-ng actually keeps 4 simd values around to store the state. So when the data comes in in small increments, zlib-ng should be saving a bunch of work. It's hard to tell how much that matters in practice though. I'll try to run some benchmarks for that too

Also we do need the fused crc with memcpy, and I don't think that fits into the current crc32fast design. Somehow all of these bits need to be synthesized together.

With all that, if you want to make more open source contributions, I think you all have a better setup for running avx-512 SIMD code, so you might look into adding a 512-bit simd version based on https://github.com/zlib-ng/zlib-ng/blob/develop/arch/x86/crc32_fold_vpclmulqdq_tpl.h to crc32fast itself. Assuming our code will be close to what crc32fast does we can then use that for zlib-rs down the line.

@folkertdev
Copy link
Collaborator

hah, wow

Benchmark 1 (84 runs): cargo run --release --example crc32_bench sse-chunked silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          59.3ms ± 1.77ms    57.0ms … 64.6ms          4 ( 5%)        0%
  peak_rss           21.8MB ±  156KB    21.5MB … 22.1MB          0 ( 0%)        0%
  cpu_cycles          158M  ± 3.51M      154M  …  174M           6 ( 7%)        0%
  instructions        354M  ± 49.4K      354M  …  354M           2 ( 2%)        0%
  cache_references   3.76M  ± 57.0K     3.67M  … 3.94M           1 ( 1%)        0%
  cache_misses       1.09M  ±  161K      880K  … 1.52M           3 ( 4%)        0%
  branch_misses       742K  ± 6.52K      731K  …  771K           1 ( 1%)        0%
Benchmark 2 (55 runs): cargo run --release --example crc32_bench crc32fast-chunked silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          90.8ms ± 3.19ms    87.9ms …  111ms          1 ( 2%)        💩+ 53.2% ±  1.4%
  peak_rss           21.8MB ±  151KB    21.5MB … 22.2MB          0 ( 0%)          +  0.1% ±  0.2%
  cpu_cycles          280M  ± 7.58M      274M  …  329M           1 ( 2%)        💩+ 77.1% ±  1.2%
  instructions        434M  ± 60.2K      434M  …  434M           3 ( 5%)        💩+ 22.6% ±  0.0%
  cache_references   3.78M  ±  115K     3.59M  … 4.48M           1 ( 2%)          +  0.6% ±  0.8%
  cache_misses       1.18M  ±  173K      934K  … 1.63M           0 ( 0%)        💩+  8.3% ±  5.2%
  branch_misses       747K  ± 14.8K      735K  …  831K           3 ( 5%)          +  0.6% ±  0.5%

that makes a big difference! This is running

            let mut h = crc32fast::Hasher::new_with_initial(0);

            for c in input.chunks(32) {
                h.update(c);
            }
            println!("{:#x}", h.finalize());

and the equivalent thing for the zlib-ng approach. So, that is something that we might also consider upstreaming after we've cleaned it up considerably (the crc32fast code looks a lot nicer than the zlib-ng version)

Copy link
Collaborator

@folkertdev folkertdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for implementing this!

I'll just fix the clippy issues on main.

@folkertdev folkertdev merged commit b348386 into trifectatechfoundation:main Feb 16, 2024
4 of 5 checks passed
@ahomescu
Copy link
Contributor Author

Most of the clippy issues should go away once another part of the code starts using these functions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants