Make a WARC from Versionista Data in the EDGI WM DB

This is pretty much a one-off script to generate WARCs for permanent archival with our old data from Versionista. We don’t expect to need to run this again, but there may be a bunch of useful bits and bobs here for reference. The code here is not as nice or clean as it could be.

We use uv to manage this project.

Setup:
```
uv sync
```
Run script to output WARCs at paths like archive/edgi-wm-versionista--2018-02-13T012233.warg.gz (warnings and other logs will be written to archive/log.txt; you can tail that file to see live notices):
```
uv run edgi-versionista-warc.py --limit 100 archive/
```
Use --help to see options.

Notes

WARC spec: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1-annotated/

Implementation Guidelines: https://iipc.github.io/warc-specifications/guidelines/warc-implementation-guidelines/

WARC-Record-ID: warcio can invent one, but we’ve set custom ones based on our API and data structure (so they include the ID of version records, but are structured as a URN).

Remember to formulate multiple responses when source_metadata.redirects is populated. Last item is the actual URL we were redirected to. Also source_metadata.redirect_url.

Format a metadata record from the source_metadata field with additional info.

We have some FTP directories and files we monitored; those need a different kind of WARC record.

Need a lot of extra code to manage a set of WARC files so they don’t get too big. Unfortunately warcio doesn’t do this.

There were a lot of records with missing status codes (I expected this to be only 50k at most out of our 845k total, but it turns out to be more like 340k!), so this includes a port of the algorithm that guesses the effective status code in web-monitoring-db.

Questions

Wayback CDX uses SHA-1 base32 hashes; should we use same for WARC-Block-Digest? (Answer: yes; it's built-in to warcio and that's fine enough)

Name		Name	Last commit message	Last commit date
Latest commit History 36 Commits
.github/workflows		.github/workflows
edgi_versionista_warc		edgi_versionista_warc
.editorconfig		.editorconfig
.gitignore		.gitignore
.python-version		.python-version
README.md		README.md
create-log-warc.py		create-log-warc.py
edgi-versionista-warc.py		edgi-versionista-warc.py
example_record.json		example_record.json
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Make a WARC from Versionista Data in the EDGI WM DB

Notes

Questions

About

Releases

Sponsor this project

Packages

Languages

edgi-govdata-archiving/edgi-versionista-warc

Folders and files

Latest commit

History

Repository files navigation

Make a WARC from Versionista Data in the EDGI WM DB

Notes

Questions

About

Resources

Code of conduct

Stars

Watchers

Forks

Releases

Sponsor this project

Packages 0

Languages

Packages