SHA256 or SHA-256? #80
'SHA-256' is indeed the official name. But that's true for 'SHA-1' as well, yet the digest headers found in the wild almost universally use the unhyphenated form. In general, though, the digest format is very underspecified, which is understandable but a shame. Not only is the format of the digest algorithm label undefined, the same is true for the digest encoding: base32 is the common one, but hexadecimal digests exist as well, for example. |
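The encoding ambiguity described above can be illustrated with a short sketch. This is not from any particular WARC library; the payload bytes are hypothetical, and only standard-library `hashlib` and `base64` are used:

```python
import base64
import hashlib

payload = b"example payload bytes"  # hypothetical record payload

digest = hashlib.sha1(payload).digest()

# Base32, the encoding commonly seen in WARC digest headers. A 20-byte SHA-1
# digest encodes to exactly 32 base32 characters, so there is no padding.
print("sha1:" + base64.b32encode(digest).decode("ascii"))

# Hexadecimal (base16), which some tools emit instead for the same digest.
print("sha1:" + digest.hex())
```

A reader that only understands one of these encodings will reject valid records written with the other, which is the interoperability problem the comment describes.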
Observations of WARCs found in the wild (web searches, looking at various tools):
(Proposing some guidelines.) Guidelines for writing WARC files:
Guidelines for reading WARC files:
|
Although this thread is not strictly about recommending hash algorithms, since the Common Crawl developers directed me here to contribute to the discussion, I will do so. In my opinion, both speed and cryptographic strength must be considered.

**Noncryptographic algorithms**

These algorithms aim for a Pareto optimum outside of cryptographic security. There are other widely used hash functions, but this is the only one I could find that is fast enough, passes all SMHasher tests, and is portable (not just x86 32/64-bit but also ARM).

**Cryptographic algorithms**

I would like to recommend two algorithms, each serving a different function. |
|
This thread doesn't yet mention SHA-512/256, which is very like SHA-384: it's a length-extension-resistant NIST algorithm based on truncating SHA-512, with a similar security budget to SHA-384. Because it truncates to 256 bits instead of 384, the strings are two thirds as long, which I like for applications where humans see them. A practical question related to the original issue is what label to use for it. (As a total sidetrack, I like the idea of length-extension resistance since it can be had for free, so why not? But I don't know of a threat model where length extension is relevant to digests.) |
I think SHA-512/256 is a good alternative too, but the practical question you raise would definitely have to be answered before any real usage. I didn't have a good answer that would definitely work downstream, so I just chose SHA-384. As for what the threat model would be, I do not know. However, I do not want to find out the hard way, and it is already established that changing hashes is not easy due to legacy code and downstream dependencies. As such, maximalist security approaches are justified, in my opinion. |
I agree about the precautionary principle! It does strike me as worth writing down somewhere what the security model for digests is. I'm not qualified, but my starting place would be something like:
This could be relevant because there are archive provenance efforts like wacz-auth or C2PA which do involve signatures. E.g.:
Just rambling on the weekend... :) |
I mostly agree, but collision resistance might also be somewhat important, or else bodies that are actually different might get deduplicated together. For example, if a public web archival service deduplicates on the payload digest, an attacker could put one version of some content into the collection and then serve something different with the same hash on their website. Visitors would see the latter version, but the archival service would always return the former instead. (And this would not rely on things like detecting when the archival service accesses the attacker's website and returning a different response, which is of course always possible.) Would it be worth allowing repeated digest headers? (Side note: WARC/1.1 calls SHA-1 'a strong digest function'. It isn't, and it wasn't at the time of publication either. WARC/1.1 was published in August 2017; the SHAttered collision came out half a year earlier, in February 2017, and the SHAppening feasible collision attack in October 2015.) |
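The deduplication scenario described above can be sketched as follows. This is a minimal illustration, not any real archival service's implementation; the function and variable names are hypothetical:

```python
import hashlib

def payload_key(payload: bytes) -> str:
    # Hypothetical dedup key: algorithm label plus hex digest of the payload.
    return "sha256:" + hashlib.sha256(payload).hexdigest()

store: dict[str, bytes] = {}  # digest key -> first payload seen

def archive(payload: bytes) -> bytes:
    # First writer wins: a later payload with the same digest is deduplicated
    # to the stored copy. With a collision-broken hash, an attacker who plants
    # payload A first could later serve a colliding payload B on their site;
    # on replay the archive would silently return A instead of B. That is why
    # collision resistance matters for digest-based dedup.
    key = payload_key(payload)
    return store.setdefault(key, payload)
```

With a collision-resistant hash such as SHA-256, crafting two distinct payloads with the same key is infeasible, which closes the attack the comment describes.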
Unfortunately
|
Ah, good point. In the threat model, an adversary should not be able to add two documents to a web archive with the same digest. |
@ato would it be possible to add another column describing cryptographic security? I would hope some rows could be added, including xxh3 and blake3. Would I need to make a separate PR?
|
I think allowing multiple digests would be the way to go. First, it would make any attack exponentially more difficult: if each hash is cryptographically secure, together they would provide at least the sum of their security in bits. Secondly, it would provide a safe buffer for migration: even if one of the hashes is found to have a security vulnerability, the existence of another would allow security to be maintained while an alternative is implemented. |
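A verifier for such repeated digest headers might look like the sketch below. Note that the current WARC standard does not allow repeated digest headers, so this is hypothetical; the `label:hexvalue` format and the hex encoding are assumptions for illustration:

```python
import hashlib

def verify_all(payload: bytes, digests: list[str]) -> bool:
    # Hypothetical multi-digest check: every listed digest must match, so an
    # attacker would need a simultaneous collision across all algorithms.
    # Entries are assumed to look like "SHA-256:abc123..." (hex-encoded).
    for entry in digests:
        label, _, value = entry.partition(":")
        # Normalize labels like "SHA-256"/"SHA256" to hashlib names.
        h = hashlib.new(label.strip().lower().replace("-", ""))
        h.update(payload)
        if h.hexdigest() != value.strip().lower():
            return False
    return True
```

The all-must-match rule is what gives the migration buffer the comment mentions: a record stays verifiable through its remaining sound digests while a broken algorithm is phased out.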
I really prefer cryptographic hashes because they help me understand how to use them correctly. I haven't yet read about anyone trying to dedupe Common Crawl using the existing digests; I've seen people dedupe paragraphs, and that's how blekko deduped back in the day. I'm OK with having multiple digests, although I think we should test a bunch of existing WARC libraries to see how many break. |
The standard does not currently allow repeating the digest headers, so I'm not sure that's relevant. A library that currently accepts repeated headers is not implementing the standard correctly. It could be allowed in a future WARC version, and then libraries implementing support for that version would have to also support the repetitions. |
Most libraries aren't standards-conforming, so it's worth checking how much of a change this proposal would be. |
The specification as it stands does not recommend any hash, so the choice is already up to library developers. Common Crawl, for instance, uses SHA-1, which is neither fast nor cryptographically secure.
I am considering the viability of this right now. Whether it is a feasible choice, independently or not, is outside the scope of this discussion (although I'd welcome discussing it further in another setting), but the fact remains that these digests exist in both the specification and individual downstream projects such as Common Crawl, which makes this a very enticing option. |
I'm confused by your comment. I wasn't suggesting that the standard should recommend or dictate any hash. If you think my comment isn't in scope for the discussion, I'm happy to not participate. |
I apologize for the confusion. As I've said, I welcome such discussions, especially since I am working on such things right now. I merely wished to avoid this issue being diluted too much (which, I must admit, was my fault). |
Base16 appears to be the most common SHA-256 encoding. As such, we will check based on that. iipc/warc-specifications#80 (comment)
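A base16 check along those lines could look like this sketch. The function name is hypothetical, and only the Python standard library is used:

```python
import hashlib
import hmac

def verify_sha256_base16(payload: bytes, expected_hex: str) -> bool:
    # Base16 (hex) is assumed to be the digest encoding, per the observation
    # above that it is the most common encoding for SHA-256 in the wild.
    actual = hashlib.sha256(payload).hexdigest()
    # compare_digest gives a constant-time comparison; overkill for archival
    # verification, but harmless. Lowercasing makes the check case-insensitive.
    return hmac.compare_digest(actual, expected_hex.strip().lower())
```

A tool that also wants to accept base32-encoded SHA-256 digests would need to detect the encoding first, e.g. by digest-string length (64 hex characters versus 52 base32 characters).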
I would like to know the position on this.
JWAT uses the algorithm name specified in the digest header directly.
So JWAT expects "SHA-256", since that seems to be the official name and the name supported by JCE.
But now I see webrecorder uses "SHA256", which then fails.
Maybe Python uses SHA256 instead of SHA-256?
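One way to tolerate both spellings on the reading side is to normalize the label before looking up the algorithm. This is only a sketch of that approach (the helper name is hypothetical), using Python's `hashlib`, whose canonical names are lowercase and unhyphenated:

```python
import hashlib

def hashlib_name(label: str) -> str:
    # Normalize digest-header labels: "SHA-256" (the official spelling, which
    # JWAT and JCE expect) and "SHA256" (which webrecorder emits) both map to
    # the name Python's hashlib understands. Variants such as "SHA-512/256"
    # would need an explicit mapping and are not handled here.
    return label.strip().lower().replace("-", "")

# Both spellings now resolve to the same constructor.
h1 = hashlib.new(hashlib_name("SHA-256"))
h2 = hashlib.new(hashlib_name("SHA256"))
```

A Java reader could apply the inverse normalization before calling `MessageDigest.getInstance`, since JCE standard names are the hyphenated forms like "SHA-256".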