SHA256 or SHA-256? #80

Open
nclarkekb opened this issue Jun 20, 2022 · 17 comments

Comments

@nclarkekb

nclarkekb commented Jun 20, 2022

I would like to know the position on this.
JWAT used the algorithm specified in the digest header directly.
So JWAT expects "SHA-256" since that seems to be the official name and the name supported by JCE.
But now I see webrecorder uses "SHA256" which then fails.
Maybe Python uses SHA256 instead of SHA-256?

@JustAnotherArchivist

'SHA-256' is indeed the official name. But that's true for 'SHA-1' as well, yet the digest headers almost universally use the format sha1:foo (since that's what's used in the specification's examples and other documents), so that doesn't mean much. I'd probably go with sha256:foo if I were to implement writing such records, simply for consistency. On parsing, I'd keep it generic and support any capitalisation and an optional dash.

In general though, the digest format is very underspecified, which is understandable but a shame. Not only is the format of the digest algorithm undefined, the same is true for the digest encoding as well: base32 is the common one, but hexadecimal ones exist as well, for example.
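To make the encoding ambiguity concrete, here is a minimal Python sketch that renders one SHA-1 digest both ways. The payload and the `sha1:` label are illustrative assumptions; only standard library calls are used.

```python
import base64
import hashlib

payload = b"hello world"                      # illustrative payload, not from any WARC
raw = hashlib.sha1(payload).digest()          # 20 raw bytes

hex_form = raw.hex()                          # 40 lowercase hexadecimal characters
b32_form = base64.b32encode(raw).decode()     # 32 uppercase Base32 characters

# Two different-looking header values for the same digest.
print("sha1:" + b32_form)
print("sha1:" + hex_form)

# Both strings decode back to the same raw bytes.
assert bytes.fromhex(hex_form) == raw
assert base64.b32decode(b32_form) == raw
```

A reader that compares the two strings directly would treat them as different digests, which is why decoding before comparison matters.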

@ato
Member

ato commented Jun 21, 2022

Observations of WARCs found in the wild (web searches, looking at various tools):

  • Observed algorithms: SHA-1, MD5, SHA-256
  • SHA-1 is usually uppercase Base32 encoded (as recommended by the spec).
  • MD5 and SHA-256 seem to be usually lowercase Base16 (hexadecimal) encoded.
  • warcio.js originally used sha-256: but changed to using sha256:

(Proposing some guidelines.)

Guidelines for writing WARC files:

  • Output the labels in lowercase as shown in the table below. Do not output the compatibility labels.
  • When outputting Base16 use lowercase.
  • When outputting Base32 use uppercase.
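The writing guidelines above could be sketched roughly as follows in Python. `format_digest` is a hypothetical helper name, not part of any existing WARC library, and the record block is made up for illustration.

```python
import base64
import hashlib


def format_digest(label: str, raw: bytes, encoding: str = "base16") -> str:
    """Hypothetical helper: render a labelled digest per the writing guidelines."""
    label = label.lower().rstrip(":")          # lowercase label, no compatibility dash added
    if encoding == "base16":
        value = raw.hex()                      # bytes.hex() produces lowercase hexadecimal
    elif encoding == "base32":
        value = base64.b32encode(raw).decode("ascii")  # uppercase, padded Base32
    else:
        raise ValueError(f"unsupported encoding: {encoding}")
    return f"{label}:{value}"


block = b"example record block"                # illustrative data
print("WARC-Block-Digest: " + format_digest("sha256", hashlib.sha256(block).digest()))
print("WARC-Block-Digest: " + format_digest("sha1", hashlib.sha1(block).digest(), "base32"))
```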

Guidelines for reading WARC files:

  • Normalize the label by ASCII lowercasing it and replacing compatibility labels with the recommended label (SHA-1: -> sha1:).
  • Determine whether the encoding is Base16 or Base32 using the length of the encoded digest (see table).
    • † In the case of MD5 the Base16 and Base32 encoding are both 32 characters long. Detect Base32 by the presence of a padding "=" character at the end.
  • When comparing digests to each other first decode or normalize the base encoding.
  • Accept both ASCII uppercase and ASCII lowercase letters when decoding the Base16 or Base32 encoding.

| Algorithm | Label | Compat. label | Popular encoding | Base16 length | Base32 length |
|---|---|---|---|---|---|
| MD5 | md5: | | lowercase Base16 | 32† | 32† |
| SHA-1 | sha1: | sha-1: | uppercase Base32 | 40 | 32 |
| SHA-224 | sha224: | sha-224: | | 56 | 48 |
| SHA-256 | sha256: | sha-256: | lowercase Base16 | 64 | 56 |
| SHA-384 | sha384: | sha-384: | | 96 | 80 |
| SHA-512 | sha512: | sha-512: | | 128 | 104 |
| SHA-512/224 | sha512-224: | | | 56 | 48 |
| SHA-512/256 | sha512-256: | | | 64 | 56 |
| SHA3-224 | sha3-224: | | | 56 | 48 |
| SHA3-256 | sha3-256: | | | 64 | 56 |
| SHA3-384 | sha3-384: | | | 96 | 80 |
| SHA3-512 | sha3-512: | | | 128 | 104 |
| BLAKE2s | blake2s: | | | 64 | 56 |
| BLAKE2b | blake2b: | | | 128 | 104 |
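
A rough Python sketch of the reading guidelines, assuming padded Base32 and covering only a subset of the labels in the table. `parse_digest`, `DIGEST_BYTES` and `COMPAT` are hypothetical names introduced for illustration.

```python
import base64
import hashlib

# Raw digest sizes in bytes for a subset of the table (assumption: extend as needed).
DIGEST_BYTES = {"md5": 16, "sha1": 20, "sha224": 28, "sha256": 32, "sha384": 48, "sha512": 64}
# Compatibility labels mapped to the recommended labels.
COMPAT = {"sha-1": "sha1", "sha-224": "sha224", "sha-256": "sha256",
          "sha-384": "sha384", "sha-512": "sha512"}


def parse_digest(field: str) -> tuple[str, bytes]:
    """Hypothetical helper: return (normalized label, raw digest bytes)."""
    label, _, value = field.strip().partition(":")
    label = label.lower()                      # ASCII-lowercase the label
    label = COMPAT.get(label, label)           # replace compatibility labels
    if label not in DIGEST_BYTES:
        raise ValueError(f"unknown digest label: {label}")
    size = DIGEST_BYTES[label]
    b16_len = 2 * size                         # two hex characters per byte
    b32_len = 8 * -(-size // 5)                # padded Base32: 8 chars per 5-byte group
    # For MD5 both lengths are 32, so trailing "=" padding marks Base32.
    if len(value) == b16_len and not value.endswith("="):
        return label, bytes.fromhex(value)     # accepts upper- and lowercase hex
    if len(value) == b32_len:
        return label, base64.b32decode(value, casefold=True)
    raise ValueError("digest length matches neither Base16 nor Base32")


raw = hashlib.sha1(b"abc").digest()
a = parse_digest("SHA-1:" + base64.b32encode(raw).decode())
b = parse_digest("sha1:" + raw.hex())
print(a == b)  # digests compare equal once decoded
```

Unpadded Base32 values, which some tools may write, are not handled by this sketch.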

@chris-ha458
Contributor

chris-ha458 commented Jul 15, 2023

Although this thread is not strictly about recommending hash algorithms, I was directed here by the CC developers to contribute to the discussion, so I will do so.

It is my personal opinion that both speed and cryptographic aspects must be considered.
Many LLMs now rely on Common Crawl and its derivatives as an important upstream data provider, so we must consider that Common Crawl and other WARC producers may become targets of "data supply chain" attacks.
As such, any additional security properties should be considered beneficial.
However, processing cost is an important consideration as well.
Under this assumption, I would like to recommend two different categories of algorithm.

Noncryptographic Algorithms

These algorithms aim for a Pareto optimum outside cryptographic security.
That is not to say they are not cryptographically secure; however, they have not been designed and validated according to cryptographic principles and likely will not undergo such scrutiny in the future.
That being said, for the reasons above, any algorithm with a known vulnerability should not be used.

  • The XXH family
    This algorithm library provides the vectorized XXH3 (64-bit) and XXH128 (128-bit), which are supported on both x86 and ARM platforms, and the non-vectorized XXH64 (there is also XXH32, but it is not faster).
    According to the official GitHub repository, they are 30 to 100 times faster than SHA-1.

    It is also considered a high quality hash according to smhasher

There are other widely used hash functions, but this is the only one I could find that is fast enough, passes all smhasher tests, and is portable (not just 32- and 64-bit x86 but also ARM).

Cryptographic Algorithms

I would like to recommend two algorithms, each serving a different purpose.
SHA-384 is a NIST-standardized algorithm, and BLAKE3 is a cryptographically secure algorithm that is very fast.

  • SHA-384
    Within the SHA-2 family, SHA-256 and SHA-512 are vulnerable to length extension attacks. SHA-384 has 128 bits of length extension resistance. Since the WARC format already records lengths, this is somewhat ameliorated, but it cannot hurt to have another layer of defence if we are aiming for cryptographic security. Because it is a NIST algorithm, it is among the most portable.

  • BLAKE3
    Among the non-NIST algorithms, BLAKE3 is among the fastest. Since it is a single algorithm (as opposed to the multiple variants of BLAKE2), the choice is easier. It is anywhere between 6 and 15 times faster than the NIST algorithms.
    It has official Rust and C implementations and is compatible with x86 and ARM architectures.

@jcushman

This thread doesn't yet mention SHA-512/256, which is very like SHA-384: it's a length-extension-resistant NIST algorithm based on truncating SHA-512, and has a similar security budget to SHA-384. Because it truncates to 256 bits instead of 384 the strings are two thirds as long, which I like for applications where humans see them.
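
The "two thirds as long" figure follows from the digest sizes. A small Python sketch of the arithmetic, assuming padded Base32 (`encoded_lengths` is a hypothetical helper name):

```python
import hashlib
import math


def encoded_lengths(digest_bytes: int) -> tuple[int, int]:
    """Return (Base16 length, padded Base32 length) for a digest of the given size."""
    base16 = 2 * digest_bytes                  # two hex characters per byte
    base32 = 8 * math.ceil(digest_bytes / 5)   # eight characters per 5-byte group
    return base16, base32


for name in ("sha256", "sha384", "sha512"):
    size = hashlib.new(name).digest_size
    print(name, size, encoded_lengths(size))

# SHA-512/256 truncates to 32 bytes, the same size as SHA-256,
# so its hex form is 64 characters versus 96 for SHA-384.
print(encoded_lengths(32)[0] / encoded_lengths(48)[0])
```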

A practical question related to the original issue would be what labels to use -- would sha512/256: work or would the slash cause a problem?

(As a total sidetrack, I like the idea of length extension resistance since it can be had for free, so why not? But I don't know of a threat model where length extension is relevant to digests.)

@chris-ha458
Contributor

I think SHA-512/256 is a good alternative too, but the practical question you raise would definitely have to be answered before any real usage. I didn't have a good answer that would definitely work downstream, so I just chose SHA-384.

As for what the threat model would be, I do not know. However, I do not want to find out the hard way, and it is already established that changing hashes is not easy because of legacy code and downstream dependencies. As such, maximalist security approaches are justified imo.

@jcushman

I agree about the precautionary principle!

It does strike me as worth writing down somewhere what the security model for digests is. I'm not qualified, but my starting place would be like:

  • Uniform distribution DOES matter, for deduplication and fixity checks
  • Preimage resistance DOES matter, so attackers can't make pages that mess with deduplication
  • Other properties do NOT matter(?), because digests aren't signatures

This could be relevant because there are archive provenance efforts like wacz-auth or C2PA which do involve signatures. E.g.:

  • If someone is picking a digest format for its crypto properties in the first place, that might mean they need to be referred to a signature scheme
  • If someone is designing a signature scheme, they might need to know which digest formats are sufficient to establish a link between headers and data

Just rambling on the weekend... :)

@JustAnotherArchivist

I mostly agree, but collision resistance might also be somewhat important, else bodies that are actually different might get deduped. For example, if there is a public web archival service that uses deduplication on the payload digest, an attacker could put one version of the content into the collection but then serve something different with the same hash on their website. Visitors would see the latter version, but the archival service would always return the former instead. (And this would not rely on things like detecting when the archival service accesses the attacker's website and returning a different response, which is of course always possible.)

Would it be worth allowing the WARC-Block-Digest and WARC-Payload-Digest headers to be repeatable? Then it would be possible to include hashes with more than one algorithm if so desired. While preimage attacks are still practically impossible even for otherwise horribly broken algorithms like MD2, this would provide some additional resistance into the future if/when preimage resistance does get broken. And executing a preimage or collision attack on multiple algorithms at the same time would be more difficult than a single one, possibly even when there exist practical attacks on each individual algorithm.
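
Whether the headers may be repeated is a question for the specification, but computing several digests in one pass over the data is cheap to sketch. This is a minimal Python illustration; `multi_digest`, the algorithm choice and the hex encoding are assumptions made for the example.

```python
import hashlib
import io


def multi_digest(stream, algorithms=("sha1", "sha256"), chunk_size=65536):
    """Hypothetical helper: hash a stream once with several algorithms."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        for hasher in hashers.values():
            hasher.update(chunk)               # each chunk feeds every hasher
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


digests = multi_digest(io.BytesIO(b"example payload"))
for name, value in digests.items():
    print(f"{name}:{value}")
```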

(Side note, WARC/1.1 calls SHA-1 'a strong digest function'. It isn't, and it wasn't at the time of publication either. WARC/1.1 was published in August 2017. The SHAttered collision came out half a year earlier in February 2017, and the SHAppening feasible collision attack was out in October 2015.)

@ato
Member

ato commented Jul 16, 2023

> would sha512/256: work or would the slash cause a problem?

Unfortunately / is disallowed by the grammar in the algorithm name as it is part of separators. I've added SHA-512/256 as sha512-256: to the guidelines table.

labelled-digest   = algorithm ":" digest-value
algorithm         = token
digest-value      = token
token         = 1*<any US-ASCII character
                except CTLs or separators>
separators    = "(" | ")" | "<" | ">" | "@"
              | "," | ";" | ":" | "\" | <">
              | "/" | "[" | "]" | "?" | "="
              | "{" | "}" | SP | HT
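
The grammar above can be checked mechanically. A minimal Python sketch, with `is_token` as a hypothetical helper name, shows why the slash is rejected while the dash is accepted:

```python
# Separator characters from the grammar, including SP and HT.
SEPARATORS = set('()<>@,;:\\"/[]?={} \t')


def is_token(text: str) -> bool:
    """Hypothetical helper: True if text is a non-empty run of printable
    US-ASCII characters containing no CTLs and no separators."""
    return bool(text) and all(
        0x20 < ord(ch) < 0x7F and ch not in SEPARATORS for ch in text
    )


for label in ("sha512/256", "sha512-256", "sha-256"):
    print(label, is_token(label))
```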

@jcushman

> I mostly agree, but collision resistance might also be somewhat important

Ah, good point. In the threat model, an adversary should not be able to add two documents to a web archive with the same digest.

@chris-ha458
Contributor

@ato would it be possible to add another column describing cryptographic security?

I would hope some rows could be added including xxh3 and blake3.

would I need to make a separate PR?

| Algorithm | Label | Compat. label | Typical encoding | Base16 length | Base32 length | Cryptographic |
|---|---|---|---|---|---|---|
| MD5 | md5: | | lowercase Base16 | 32 | 32 | Yes, but broken. |
| SHA-1 | sha1: | sha-1: | uppercase Base32 | 40 | 32 | Yes, but broken. |
| XXH3 | xxh3: | | | 16 | | No, but really fast |
| XXH128 | xxh128: | | | 32 | | No, but really fast |
| SHA-224 | sha224: | sha-224: | | 56 | 48 | Yes |
| SHA-256 | sha256: | sha-256: | lowercase Base16 | 64 | 56 | Yes, but vulnerable to length extension |
| SHA-384 | sha384: | sha-384: | | 96 | 80 | Yes |
| SHA-512 | sha512: | sha-512: | | 128 | 104 | Yes, but vulnerable to length extension |
| SHA-512/224 | sha512-224: | | | 56 | 48 | Yes |
| SHA-512/256 | sha512-256: | | | 64 | 56 | Yes |
| SHA3-224 | sha3-224: | | | 56 | 48 | Yes |
| SHA3-256 | sha3-256: | | | 64 | 56 | Yes |
| SHA3-384 | sha3-384: | | | 96 | 80 | Yes |
| SHA3-512 | sha3-512: | | | 128 | 104 | Yes |
| BLAKE2s | blake2s: | | | 64 | 56 | Yes |
| BLAKE2b | blake2b: | | | 128 | 104 | Yes |
| BLAKE3 | blake3: | | | 64 | | Yes, and really fast |

@chris-ha458
Contributor

> Would it be worth allowing the WARC-Block-Digest and WARC-Payload-Digest headers to be repeatable? Then it would be possible to include hashes with more than one algorithm if so desired. While preimage attacks are still practically impossible even for otherwise horribly broken algorithms like MD2, this would provide some additional resistance into the future if/when preimage resistance does get broken. And executing a preimage or collision attack on multiple algorithms at the same time would be more difficult than a single one, possibly even when there exist practical attacks on each individual algorithm.

I think allowing multiple digests would be the way to go.

First, it would make any attack exponentially more difficult. If each hash is cryptographically secure, it would provide at least the sum of their security in bits.

Secondly, it would provide a safe buffer for migration. Even if one of the hashes is deemed to have a security vulnerability, the existence of another would allow security to be maintained while an alternative is implemented.

@wumpus

wumpus commented Jul 16, 2023

I really prefer cryptographic hashes because they help me understand how to use them correctly. Currently I have not read about anyone trying to dedup Common Crawl using the existing digests. I've seen people dedup paragraphs, and that's how blekko deduped back in the day.

I'm OK with having multiple digests although I think we should test a bunch of existing warc libraries to see how many break.

@JustAnotherArchivist

The standard does not currently allow repeating the digest headers, so I'm not sure that's relevant. A library that currently accepts repeated headers is not implementing the standard correctly. It could be allowed in a future WARC version, and then libraries implementing support for that version would have to also support the repetitions.

@wumpus

wumpus commented Jul 16, 2023

Most libraries aren't standard-conforming, so it's worth checking how much of a change this proposal will be.

@chris-ha458
Contributor

> I really prefer cryptographic hashes because they help me understand how to use them correctly

The specification, as it stands, does not recommend any hash. The choice is therefore up to library developers. Common Crawl, for instance, uses SHA-1, which is neither fast nor cryptographically secure.
As a standard, I think it might be difficult to dictate a cryptographically secure algorithm, especially since such security depends on many things, including threat models. The specification might be able to mention that such qualities exist and should be considered by developers when selecting a hash.

> Currently I have not read about anyone trying to dedup Common Crawl using the existing digests. I've seen people dedup paragraphs, and that's how blekko deduped back in the day.

I am considering the viability of this right now. Whether it is a feasible choice, independently or otherwise, is outside the scope of this discussion (although I'd welcome discussing it further in another setting), but the fact remains that these digests exist in both the specification and individual downstream projects such as Common Crawl, which makes it a very enticing option.

@wumpus

wumpus commented Jul 17, 2023

I'm confused by your comment. I wasn't suggesting that the standard should recommend or dictate any hash. If you think my comment isn't in scope for the discussion, I'm happy to not participate.

@chris-ha458
Contributor

I apologize for the confusion.
It seems that I was the one who opened this issue up into a wider discussion, so it would be unfair for me to now arbitrarily limit its scope.
To address this, I will open a new PR and associated issue to discuss potential new community hash digest recommendations and any qualities (cryptographic security, speed) and functionalities (downstream dedup etc.) that such a hash must consider.

As I've said, I welcome such discussions, especially since I am working on such things right now. I merely wished to avoid this issue being diluted too much (which, I must admit, was my fault).

NGTmeaty added a commit to CorentinB/warc that referenced this issue Mar 20, 2024
Base16 appears to be the most common SHA-256 encoding. As such, we will check based on that.
iipc/warc-specifications#80 (comment)
CorentinB added a commit to CorentinB/warc that referenced this issue Apr 5, 2024
* Add: warc extract

* Add: results report printing

* Oops, forgot to push utils.go

* Add .gitignore for output folder

* fix: improve support for spaces in the msgtype.

* fix: add warc executable and warc files to ignore.

* Add: warc verify

* Update cmd/verify.go

* small cosmetic fix

* fix: we currently cannot process revisit records.

this is currently outside of the scope of this tool, but could be added in the future.

* feat: add gzip content decoding

* fix: revisit records in verify

* small cosmetic fix

* fix: revisit if statement

* feat: add folder structure to extract output.

* fix: add support for SHA-256 Base16 verify support

Base16 appears to be the most common SHA-256 encoding. As such, we will check based on that.
iipc/warc-specifications#80 (comment)

* Add: --host-sort

* Truncate filenames too long

* Cmd/extract: use filename from Content-Disposition only when it's not empty

* cmd/extract: replace / in filenames

* cmd/extract: handle mime parsing failure

* feat: add (default) support to suffix duplicate file names with a SHA1 hash if they are different.

* fix: resolve EOF read error

---------

Co-authored-by: Jake L <[email protected]>