An experiment in gathering together sources of information about digital preservation practices
The initial plan is to experiment with using Python to gather useful information sources, starting with iPRES, and then to see whether the results can usefully be transformed into something searchable using Datasette or Datasette Lite.
This originally relied on a tool called DVC. Why DVC? Because I wanted to manage how the data is compiled, and I liked the way it handles checking data dependencies. Very #DigiPres... It also offers, e.g., remote storage integration for data sets on Google Drive.
However, having tried both DVC and Snakemake, I found them very difficult to work with: lots of complex dependencies that don't always install easily, and over-engineered for this use case. So, instead, build pipelines are managed the old-fashioned way, using Make. There are lots of tutorials for Make (e.g.), and the Turing Way book has a really good section called Reproducibility with Make.
You need Python 3 and Make.
Clone this repo. Set up a Python 3 virtual env, e.g.
python3 -m venv .venv
source .venv/bin/activate
Install dependencies:
pip install .
Optionally, install NLP data required for some analysis/processing (not in production use):
python -m spacy download en_core_web_lg
Build the data:
Try the Datasette view:
datasette serve practice.db --setting truncate_cells_html 120
After which you should be able to go to e.g.
Other build targets generate other derivatives. Check the Makefile for details.
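As a rough illustration of the kind of SQLite file that Datasette serves, the sketch below loads a few records into a table using only the Python standard library. The table name `publications`, the column names and the sample records are all hypothetical; the real build targets in the Makefile may produce something quite different.

```python
import sqlite3

# Hypothetical records; the real pipeline's fields and values will differ.
records = [
    {"year": 2004, "title": "Example paper A", "type": "paper"},
    {"year": 2004, "title": "Example poster B", "type": "poster"},
]


def build_db(path, rows):
    """Write rows into a 'publications' table (table name is an assumption)."""
    conn = sqlite3.connect(path)
    with conn:  # commits on success, rolls back on error
        conn.execute("DROP TABLE IF EXISTS publications")
        conn.execute(
            "CREATE TABLE publications (year INTEGER, title TEXT, type TEXT)"
        )
        conn.executemany(
            "INSERT INTO publications VALUES (:year, :title, :type)", rows
        )
    return conn


# An in-memory database keeps the example side-effect free.
conn = build_db(":memory:", records)
count = conn.execute("SELECT COUNT(*) FROM publications").fetchone()[0]
print(count)
```

Passing a file path instead of `":memory:"` would write a database file that `datasette serve` could open in the same way as `practice.db`.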
Where are the papers and metadata... Links on are not complete.
It may make more sense to store this data as JSON, and use JSON Schema in VSCode to make it easier to edit. The JSON can then be consumed by the gathering scripts as well as being used to generate tabular forms like this.
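A minimal sketch of that idea, assuming a JSON file holding one object per conference: the field names (`year`, `name`, `proceedings_url`) and the sample values are hypothetical, and a simple required-keys check stands in for real JSON Schema validation, which would need a separate library.

```python
import json

# Hypothetical conference records; field names and values are illustrative only.
CONFERENCES_JSON = """
[
  {"year": 2004, "name": "Example Conference 2004", "proceedings_url": "https://example.org/2004"},
  {"year": 2005, "name": "Example Conference 2005"}
]
"""

REQUIRED_KEYS = ("year", "name")  # stand-in for a JSON Schema "required" list


def load_records(text):
    """Parse the JSON and fail early if a record lacks a required key."""
    records = json.loads(text)
    for record in records:
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"record {record!r} is missing {missing}")
    return records


def to_markdown_table(records, columns):
    """Render the records as a Markdown table, leaving absent fields blank."""
    header = "| " + " | ".join(columns) + " |"
    divider = "| " + " | ".join("---" for _ in columns) + " |"
    rows = [
        "| " + " | ".join(str(r.get(c, "")) for c in columns) + " |"
        for r in records
    ]
    return "\n".join([header, divider, *rows])


records = load_records(CONFERENCES_JSON)
table = to_markdown_table(records, ["year", "name", "proceedings_url"])
print(table)
```

The same `records` list could be handed to a gathering script, so the JSON stays the single source and the table is just a derived view.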
The information about each iPRES conference is now stored as a set of Markdown+metadata files in the publications
repository, and is summarised at
- Seems to contain PDFs of individual contributions and whole-conference proceedings documents.
- There are e.g. posters as well as articles and it should be possible to distinguish them.
- There do not seem to be any other materials, e.g. links to recordings.
- Has a kind of implicit API that seems to expose parts of Solr, e.g. items from the iPRES 2004 collection, which can be simplified to this
- Might be easier to just download the CSV for each iPRES collection manually, and then use the object IDs.
- Once you have an object ID for an article, it's straightforward to get:
- More recent conferences appear in OSF, which has a much more complicated structure, but allows more types of materials to be stored.
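The manual-CSV route mentioned in the list above could be sketched roughly as follows. The column names (`id`, `title`, `type`) and the sample rows are assumptions for illustration; the real export may use different headers, and the per-object URLs are deliberately left out because their pattern is not given here.

```python
import csv
import io

# Hypothetical stand-in for a manually downloaded collection CSV.
SAMPLE_CSV = """id,title,type
o:0001,Example article,article
o:0002,Example poster,poster
"""


def object_ids(csv_text, id_column="id", type_column="type", wanted_type=None):
    """Return object IDs from the CSV, optionally filtered by item type."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row[id_column]
        for row in reader
        if wanted_type is None or row[type_column] == wanted_type
    ]


all_ids = object_ids(SAMPLE_CSV)
article_ids = object_ids(SAMPLE_CSV, wanted_type="article")
print(all_ids, article_ids)
```

Filtering on a type column is one possible way to tell posters from articles, provided the export actually carries that information.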