This repository has been archived by the owner on Oct 31, 2023. It is now read-only.
Error when Running 2020-34 dumps #16
Comments
Thanks for flagging. It seems that CC 2020-34 has added a new header: "WARC-Identified-Content-Language".
@gwenzek
2022-05 as well.
You can replace https://github.com/facebookresearch/cc_net/blob/main/cc_net/process_wet_file.py#L73-L79 with the following, which looks headers up by name so that a newly added header does not break parsing:

    headers_map = {}
    for header in headers[1:]:
        if not header:
            continue
        key, value = header.split(": ", 1)
        headers_map[key] = value

    warc_type = headers_map["WARC-Type"]
    if warc_type != "conversion":
        return None
    url = headers_map["WARC-Target-URI"]
    date = headers_map["WARC-Date"]
    digest = headers_map["WARC-Block-Digest"]
    length = int(headers_map["Content-Length"])
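For anyone who wants to try the idea outside of cc_net, here is a minimal standalone sketch of the same name-based lookup. The function name `parse_warc_headers` and the sample header values are made up for illustration and are not part of cc_net; the sketch also assumes the first line is the `WARC/1.0` version line, as in the snippet above.

```python
from typing import Dict, List, Optional


def parse_warc_headers(headers: List[str]) -> Optional[Dict[str, str]]:
    """Map header names to values; return None for non-"conversion" records."""
    headers_map: Dict[str, str] = {}
    # headers[0] is assumed to be the version line ("WARC/1.0"), so skip it.
    for header in headers[1:]:
        # Skip empty or malformed lines that have no "name: value" separator.
        if ": " not in header:
            continue
        key, value = header.split(": ", 1)
        headers_map[key] = value
    if headers_map.get("WARC-Type") != "conversion":
        return None
    return headers_map


# Illustrative header block (values invented), including the newer
# "WARC-Identified-Content-Language" header.
sample = [
    "WARC/1.0",
    "WARC-Type: conversion",
    "WARC-Target-URI: http://example.com/page",
    "WARC-Date: 2020-08-03T00:00:00Z",
    "WARC-Block-Digest: sha1:EXAMPLEDIGEST",
    "WARC-Identified-Content-Language: eng",
    "Content-Type: text/plain",
    "Content-Length: 42",
]

parsed = parse_warc_headers(sample)
if parsed is not None:
    url = parsed["WARC-Target-URI"]
    length = int(parsed["Content-Length"])
```

Because each value is looked up by name, the result does not depend on where an extra header appears in the block or on whether it is present at all.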
akeyhero added a commit to akeyhero/cc_net that referenced this issue on Jul 19, 2023: implemented by @shmpanski at facebookresearch#16 (comment)
When running the full pipeline with the newest dumps (e.g. 2020-34), there seems to be an issue with the header file format.
It only seems to occur on texts in non-Latin alphabets. Due to this issue, one cannot run the hashing pipeline on some of the newer dumps. The last dump I could process successfully was 2020-10.
Are there any quick fixes available for this problem?