
Releases: commoncrawl/cc-downloader

v0.5.1

21 Jan 22:34
v0.5.1
ac7af85
v0.5.1 Pre-release

Today we are happy to announce cc-downloader, an experimental command-line tool for downloading Common Crawl data over HTTPS. cc-downloader is intended to be a user-friendly and polite downloader. We built it in response to the significant increase in downloads of our data in recent months. That increase was exciting to see, since it reflects a large rise in interest in our dataset. But it has also made it harder for some users to download our data successfully, due to the quirks of downloading from a high-traffic storage bucket.

cc-downloader is our solution to this problem: it lets our users continue downloading our data over HTTPS reliably. We have designed cc-downloader with a polite retry mechanism that lets users make sure every requested file is downloaded. It also implements jitter and exponential backoff, in order to avoid overwhelming our infrastructure.
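
To illustrate the idea of retrying with jitter and exponential backoff, here is a minimal Python sketch. It is not cc-downloader's actual implementation (which is written in Rust); the function names, the "full jitter" variant, and the default values for `base`, `cap`, and `max_retries` are all assumptions chosen for illustration.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Return a sleep time in seconds for the given retry attempt (0-based).

    The upper bound doubles with each attempt (exponential backoff) up to
    `cap`, and the actual delay is a random fraction of that bound (jitter),
    so that many clients do not retry at the same instant.
    All defaults here are illustrative assumptions.
    """
    ceiling = min(cap, base * 2 ** attempt)
    return rng() * ceiling


def fetch_with_retries(fetch, max_retries=5, sleep=time.sleep):
    """Call `fetch()` until it succeeds, sleeping between failed attempts.

    `fetch` is a hypothetical zero-argument callable that raises OSError on
    a transient failure. After `max_retries` retries the last error is
    re-raised to the caller.
    """
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except OSError:
            if attempt == max_retries:
                raise
            sleep(backoff_delay(attempt))
```

The randomized delay is what keeps the retries polite: clients that fail at the same moment spread their next requests over a widening window instead of hitting the bucket together.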

If you wish to install cc-downloader, pre-compiled binaries are available for all major operating systems and architectures. cc-downloader is written in Rust and is distributed as a “crate”, so if you have cargo installed, you can also install cc-downloader with this command:

cargo install cc-downloader

Once you have installed it, you’ll see that cc-downloader has two subcommands:

First, download-paths downloads the file paths list for a given crawl and subset from our bucket, to a given destination folder path in your file system:

cc-downloader download-paths CC-MAIN-2024-46 wet path/to/folder

In this case, the paths file is written to path/to/folder/wet.paths.gz.

Next, download reads this file paths list and concurrently downloads the files to a given destination folder in your file system:

cc-downloader download path/to/folder/wet.paths.gz path/to/folder

By default, this preserves the directory tree structure that we use internally.
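
As a rough picture of what "preserving the tree structure" means, the following Python sketch reads a gzipped paths file and maps each listed path to a download URL and a local file path under the destination folder. This is only an illustration, not how cc-downloader is implemented: the function names are hypothetical, the base URL `https://data.commoncrawl.org/` is an assumption about the public HTTPS endpoint, and the sketch assumes the paths file contains one relative path per line.

```python
import gzip
from pathlib import Path, PurePosixPath

# Assumed public HTTPS endpoint; not stated in these release notes.
BASE_URL = "https://data.commoncrawl.org/"


def read_paths(paths_file):
    """Return the non-empty lines of a gzipped paths file as a list."""
    with gzip.open(paths_file, "rt", encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def plan_download(relative_path, dest_folder):
    """Map one relative path to (download URL, local destination path).

    The local path keeps every directory component of the relative path,
    so the remote tree is mirrored under `dest_folder`.
    """
    url = BASE_URL + relative_path
    local = Path(dest_folder).joinpath(*PurePosixPath(relative_path).parts)
    return url, local


if __name__ == "__main__":
    # Hypothetical usage with the file produced by download-paths.
    for rel in read_paths("path/to/folder/wet.paths.gz"):
        url, local = plan_download(rel, "path/to/folder")
        print(url, "->", local)
```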

cc-downloader is still under active development, so if you find any issues or would like to submit a feature request, please visit our GitHub repository: https://github.com/commoncrawl/cc-downloader/. Contributions are always welcome! We hope this tool makes it easier for our users to download and use our data.

If you encounter problems with cc-downloader that look like they are caused by high traffic, you can check our current traffic levels on our infrastructure status webpage.

v0.5.0

13 Jan 05:08
v0.5.0
0a76740
v0.5.0 Pre-release
First pre-release of cc-downloader