I was recently working with a megawarc from the Google Reader crawl, around 25GB in size, on an Amazon EC2 server. It took a few hours to download, and from past experience with gunzip, I knew it would take a similar amount of time to decompress and write to disk.
I tried running warcat's extraction feature on it, reasoning that it would run at near-gunzip speed, since this is a stream-processing task and the IO to write each WARC to a different file should have minimal overhead. Instead, it was extremely slow despite being the only thing running on that server, taking what seemed like multiple seconds to extract each file. In top, warcat was using 100% CPU, even though un-gzipping should be IO-bound, not CPU-bound (which to me indicates an algorithmic problem somewhere). After 3 days, it still had not extracted all the files from the megawarc, and I believe it was less than three-quarters done; unfortunately, it crashed at some point on the third day, so I never found out how long it would take. (I also don't know why the crash happened, and at 3 days to reach another crash with more logging enabled, I wasn't going to find out.)
This slowness makes warcat not very useful for working with a megawarc, and I wound up taking a completely different approach: dd, using the offset/length fields from the CDX metadata to pull out the specific WARCs I needed.
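For reference, here is a minimal sketch of that CDX-offset approach in Python, roughly equivalent to a dd with skip/count. The file name and the offset/length values are placeholders rather than the actual values from this crawl; the idea is that each record in a .warc.gz is its own gzip member, so a byte slice can be decompressed on its own without touching the rest of the megawarc.

```python
import gzip

MEGAWARC = "greader-megawarc.warc.gz"  # placeholder file name
OFFSET = 123_456_789                   # compressed byte offset from the CDX line (placeholder)
LENGTH = 54_321                        # compressed record length from the CDX line (placeholder)

# Seek straight to the record instead of streaming the whole 25GB file.
with open(MEGAWARC, "rb") as f:
    f.seek(OFFSET)
    compressed = f.read(LENGTH)

# Each record in a .warc.gz is a separate gzip member, so this slice
# decompresses independently of the surrounding megawarc.
record = gzip.decompress(compressed)

with open("extracted-record.warc", "wb") as out:
    out.write(record)
```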
Yeah, the thing is ridiculously slow. You shouldn't expect gunzip-level performance because it's not just un-gzipping: during extraction it has to parse each WARC record, since the format is human-readable, and string processing in Python is not that great.
I've been meaning to write a faster implementation, but I haven't had the incentive to do so. 😞