This repository contains a few tools used for the publication of E-ARK specifications:
- A Java SAX-based XML processor for METS profiles, which performs the following tasks:
  - schema-aware parsing and validation of the METS profile;
  - for each `<structural_requirements>` sub-element of the profile, generates:
    - a markdown table of the section's metadata requirements from the profile requirements, written to `<sub-element>/requirements.md`;
    - a markdown heading and code block for any XML examples referenced in the profile requirements, written to `<sub-element>/examples.md`;
  - generates the following appendices:
    - `vocabs.md`, generated from the profile's `<controlled_vocabularies>` sub-elements;
    - `schema.md`, generated from the profile's `<external_schema>` elements;
    - `requirements.md`, a complete list of all profile requirements in a single table;
    - `examples.md`, larger examples generated from the profile's `<Appendix>` elements;
  - checks for duplicate requirement ID allocation; and
  - reports gaps in the ID sequence (in preparation for deprecation reporting).
- A Docker image for Pandoc v2.5, which is used in the E-ARK publication workflow. This Pandoc version was released in November 2018. An upgrade to v2.19, released in August 2022, might be a good idea. An upgrade to v3 or higher might require more work to fix breaking changes.
- Example metadata, templates and images to be used in the publication workflow.
- The common introductory elements used for all of the E-ARK specifications.
- The metadata and images required by the GitHub pages site and Jekyll build process.
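
The duplicate ID check and the gap report listed among the processor's tasks above can be illustrated with a short Java sketch. This is not the spec-publisher implementation: the class name `RequirementIdAudit`, its method names, and the assumption that requirement IDs consist of a fixed prefix followed by a number (for example `CSIP12`) are all hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration; not part of the spec-publisher code base.
final class RequirementIdAudit {
    private RequirementIdAudit() {
    }

    // Returns every ID that occurs more than once, in order of first repetition.
    static Set<String> duplicates(List<String> ids) {
        Set<String> seen = new LinkedHashSet<>();
        Set<String> repeated = new LinkedHashSet<>();
        for (String id : ids) {
            if (!seen.add(id)) {
                repeated.add(id);
            }
        }
        return repeated;
    }

    // Returns the numbers missing between the lowest and highest numeric suffix.
    // Assumes IDs of the form <prefix><number>; IDs in any other form are ignored.
    static List<Integer> gaps(List<String> ids, String prefix) {
        TreeSet<Integer> numbers = new TreeSet<>();
        for (String id : ids) {
            if (!id.startsWith(prefix)) {
                continue;
            }
            String suffix = id.substring(prefix.length());
            if (suffix.matches("[0-9]{1,9}")) {
                numbers.add(Integer.parseInt(suffix));
            }
        }
        List<Integer> missing = new ArrayList<>();
        if (numbers.isEmpty()) {
            return missing;
        }
        for (int n = numbers.first(); n < numbers.last(); n++) {
            if (!numbers.contains(n)) {
                missing.add(n);
            }
        }
        return missing;
    }
}
```

Called with the IDs `CSIP1`, `CSIP2`, `CSIP1`, `CSIP5` and the prefix `CSIP`, `duplicates` would report `CSIP1` and `gaps` would report 3 and 4.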
The tools here are usually invoked as part of a publication workflow, e.g. from the E-ARK CSIP project. There are a few prerequisites for running the tools:
As often as not these will need to be deployed on some kind of continuous integration environment. For the purposes of the documentation examples we will be using GitHub Actions workflows. The required tools are either already available there or are made available as required.
There are a few assumptions regarding the publication workflow:
- the specification to be published is based around a METS profile;
- the requirements, examples and appendices generated will be laid out in a specific structure; and
- the generated site and PDF document will have the standard DILCIS look and feel, see https://earkcsip.dilcis.eu/ and https://earkcsip.dilcis.eu/pdf/eark-csip.pdf.
All of the above can be overridden if required, but following these assumptions is the path of least resistance.
You can do the following:
- generate tables of requirements, examples and appendices from a METS profile as GitHub flavoured Markdown;
- generate the same as Markdown suitable for conversion to PDF using Pandoc; and
- use the templates and metadata to generate a GitHub pages site and PDF documents via the Pandoc Docker image.
- Preparation: ensure that the specification is ready for publication and that details like the version number and publication date are as required.
- Generate the pages site markdown from a METS profile: this is typically done via the spec-publisher Java project using the `master` branch, something like: `java -jar target/mets-profile-proc.jar -o ../profile/E-ARK-CSIP.xml`.
- Create the pages site markdown: usually by running a Pandoc script using the Docker image, something like `docker run --rm -v "$PWD:/source" -u "$(id -u):$(id -g)" --entrypoint /source/create-site.sh eark4all/spec-pdf-publisher`.
- Generate the PDF markdown from a METS profile: this is typically done via the spec-publisher Java project using the `feat/pdf-publication` branch, something like: `java -jar target/mets-profile-proc.jar -o ../profile/E-ARK-CSIP.xml`.
- Create the PDF: usually by running a Pandoc script using the Docker image, something like `docker run --rm -v "$PWD:/source" -u "$(id -u):$(id -g)" --entrypoint /source/create-pdf.sh eark4all/spec-pdf-publisher`.
- Use Jekyll to generate the website: this is usually done using the GitHub pages Docker box, e.g. `docker run --rm -v "$PWD"/docs:/usr/src/app -v "$PWD"/_site:/_site -u "$(id -u):$(id -g)" starefossen/github-pages jekyll build -d /_site`, which uses the `./docs` directory as a source and generates a site in `./_site`.
- Publish the generated site to GitHub.
The specification publication process produces an E-ARK specification website and the PDF specification document. These can be generated from the following sources, or a combination of sources:
- a METS profile XML document describing the specification and its requirements (these are extracted by the spec-publisher Java project described below);
- markdown files for text content, e.g. `schema.md`, `requirements.md`, `examples.md` or `appendices.md`;
- HTML or LaTeX files for the same;
- images to accompany the text content, which can be included in the text source using the appropriate markdown or HTML syntax; and
- metadata in a top-level YAML metadata file, e.g. `metadata.yml`.
The directory structure for the specification is fairly arbitrary. The files can be concatenated in any order to form a final specification document. That said, following a convention will make the process easier to manage and share. The following is a typical structure:
- `archived/`: old versions of the specification PDF documents.
- `examples/`: example information packages in archive format.
- `profile/`: the METS profile XML documents for all versions of the specification.
- `schema/`: any supporting XML schema documents, e.g. METS extensions, the METS Profile and METS schema documents, etc.
- `spec-publisher/`: the spec-publisher Java project, if required.
- `specification/`: the text and image files that comprise the specification source.
The metadata for the specification is stored in a top-level YAML file within the specification directory, e.g. `specification/metadata.md`. Currently the supported fields are:
- `title`: the title of the specification;
- `subtitle`: the subtitle of the specification;
- `abstract`: a short abstract of the specification;
- `version`: the version number of the specification (usually templated, see below); and
- `date`: the release date of the specification (usually templated, see below).
Here's the CSIP as an example:
---
title: E-ARK CSIP
subtitle: Common Specification for Information Packages
abstract: |
  This base profile describes the Common Specification for Information
  Packages (CSIP) and the implementation of METS for packaging OAIS
  rest of abstract here...
version: ${RELEASE_VERSION}
date: ${RELEASE_DATE}
---
The templated fields `${RELEASE_VERSION}` and `${RELEASE_DATE}` are replaced by the publication workflow, e.g. by the GitHub Actions workflow. The values are derived from the last git tag, or the current release tag.
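
The substitution of the templated fields can be pictured with the following minimal Java sketch. It is an illustration only: the actual workflow may well perform this step with a shell tool inside GitHub Actions, and the names `MetadataTemplate` and `fill` are hypothetical. The sketch assumes placeholders of the form `${NAME}` with upper-case names, and leaves any placeholder without a supplied value untouched.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the publication tools.
final class MetadataTemplate {
    // Matches placeholders such as ${RELEASE_VERSION}; the name is captured in group 1.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([A-Z_]+)\\}");

    private MetadataTemplate() {
    }

    // Replaces each known placeholder with its value; unknown placeholders are kept as-is.
    static String fill(String template, Map<String, String> values) {
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String replacement = values.getOrDefault(matcher.group(1), matcher.group());
            // quoteReplacement stops '$' and '\' in the value being read as group references.
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```

A caller would read the metadata file into a string, pass it to `fill` together with a map holding the values derived from the git tag, and write the result back before running Pandoc.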
This is a Java project and is built using Maven. You'll need a copy of this project sub-directory, either from a git clone, `git clone https://github.com/DILCISBoard/E-ARK-CSIP.git`, or from a source package download.
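Note that there are effectively 2 forks of this project, kept on separate branches: `master`, used to generate the pages site markdown, and `feat/pdf-publication`, used to generate the PDF markdown.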
From within this project sub-directory, e.g. `mets-profile-processor`, issue the Maven command `mvn clean package` to run tests and build.
It's just a basic SAX processor for the profile with some Markdown output.
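
To give a feel for the Markdown output side, here is a minimal Java sketch of rendering rows as a GitHub flavoured Markdown table. It does not reproduce the project's own output code: the class `MarkdownTable`, its `render` method, and any column names passed to it are hypothetical.

```java
import java.util.List;

// Hypothetical helper for illustration; not the project's output handler.
final class MarkdownTable {
    private MarkdownTable() {
    }

    // Renders a header row, a separator row and one line per data row.
    static String render(List<String> header, List<List<String>> rows) {
        StringBuilder md = new StringBuilder();
        appendRow(md, header);
        md.append("|");
        for (int i = 0; i < header.size(); i++) {
            md.append(" --- |");
        }
        md.append("\n");
        for (List<String> row : rows) {
            appendRow(md, row);
        }
        return md.toString();
    }

    // Escapes pipes and flattens newlines so a cell cannot break the table layout.
    private static void appendRow(StringBuilder md, List<String> cells) {
        md.append("|");
        for (String cell : cells) {
            md.append(" ").append(cell.replace("|", "\\|").replace("\n", " ")).append(" |");
        }
        md.append("\n");
    }
}
```

Escaping the pipe character matters here because requirement descriptions are free text, and an unescaped `|` would otherwise start a new column.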
Main entry point for the fat JAR package; sequences parsing user input and running the SAX handler.
Parses the String args array and records the user options in a dedicated class.
SAX event-driven handler for METS Profile; parses Requirements lists from the Profile XML document.
Buffers XML element text and handles output (for now...).
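
The event-driven approach described above can be sketched in a few lines of Java using the JDK's SAX API. This is a simplified illustration, not the project's `MetsProfileXmlHandler`: the class name `RequirementIdCollector` is hypothetical, and the sketch assumes that requirements appear as `requirement` elements carrying an `ID` attribute.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler for illustration; collects requirement IDs in document order.
final class RequirementIdCollector extends DefaultHandler {
    private final List<String> ids = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        // Assumption: requirements are <requirement ID="..."> elements.
        if ("requirement".equals(localName)) {
            String id = attrs.getValue("ID");
            if (id != null) {
                ids.add(id);
            }
        }
    }

    // Parses the XML string and returns the IDs found; wraps checked failures.
    static List<String> collect(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed for localName to be populated
            SAXParser parser = factory.newSAXParser();
            RequirementIdCollector handler = new RequirementIdCollector();
            parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
            return handler.ids;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse profile XML", e);
        }
    }
}
```

A list collected this way could then be used for checks such as spotting duplicate IDs. Unlike the real processor, this sketch does no schema validation.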
- Stronger data typing for `eu.dilcis.csip.profile.MetsProfileXmlHandler.Requirement`
- Requirement validation, e.g. non-empty fields etc.
- Group think for other validation activities.
- Markdown table generation
- `index.md` file template selection
- `index.md` file template substitution
- Generalise vanilla METS Profile handling to base class
- Fix SaxExceptions from OutputHandler class