A script to convert EMu XML exports to JSON for cultural data. Note: for record sets larger than 1000 records, this can be slow. It is based on a combination of standards: Darwin Core, Latimer Core, Audubon Core, and the early-draft H2I extension to ABCD. We are currently testing it as a sustainable way to share data with partners from Digital Benin and Mapping Philippine Material Culture.
-
Export a set of records as XML.
- See the list of exported EMu fields in the registry entries here: H2I_emu_reports_eregistry.csv
- Catalog export of record data called: H2I_JSON_Export_DarwinCore
- Corresponding Registry entry for report called 'H2I_DwC_Export_Report'
- Catalog export of associated Multimedia called: H2I_JSON_Export_AudubonCore
- Corresponding Registry entry for report called 'H2I_AC_Export_Report'
-
When possible, check that the XML output is well-formed.
- 'Scholarly XML' VSCode add-on
- xmllint on Mac/Unix/Linux or xsltproc on Windows -- Try this in terminal/shell:
xmllint --noout file.xml && echo $?
- Online XML validator. Warning: avoid online validators for sensitive data.
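
If none of the tools above are available, a well-formedness check can also be sketched in Python using only the standard library. This is an illustration rather than part of the script; the function name `is_well_formed` and the sample snippets are assumptions, not real EMu exports.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical snippets for illustration only
good = "<table><tuple><atom name='irn'>1</atom></tuple></table>"
bad = "<table><tuple><atom name='irn'>1</atom></table>"  # <tuple> is never closed

print(is_well_formed(good))
print(is_well_formed(bad))
```

Like `xmllint --noout`, this only checks that the XML parses; it does not validate the content against any schema.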
-
Set up these two CSVs following the instructions and examples in the 'Input' section:
- emu_fields.csv - to map EMu-column-names to corresponding standard-term names.
- emu_conditions.csv - for EMu-fields and values that need conditional mapping or redaction.
-
Clone this repo
-
Set local environment variables by adding a text file named .env in the root directory of this repo. Open it and follow the .env.example file. More info here if needed.
- IN_PATH = the path to your input XML file
- MAP_PATH = the path to your emu_conditions.csv
- OUT_PATH & LOG_PATH = the path where you want the output JSON and log files to go
- FROM_ADD & TO_ADD1 = the sender and recipient email addresses for notifications
- Other example variables are included for using 'mutt' to send notifications from a server
-
Install Python 3.9 or later. To send email notifications from a server, also install mutt. (e.g. Ubuntu wiki)
-
Install the Python packages listed in required.txt with pip or pip3 (json and xml are part of the Python standard library and do not need to be installed separately):
pip3 install charset-normalizer pandas python-decouple xmltodict
-
Run the script:
python3 emu_xml_to_json.py
-
The newest file in your .env file's 'IN_PATH' is the default XML-input.
-
Alternatively, you can manually specify a different XML-input like so:
python3 emu_xml_to_json.py data_in/2021-08-08/sample.xml
-
-
Output JSON, XML and log are zipped and emailed. See JSON output in emu_to_json.json, or check for errors in xml_log_YYYYMMDD.txt
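
The input-selection rule described in the steps above (use the XML file given on the command line, otherwise the newest file in IN_PATH) could be sketched roughly as follows. This is not the script's actual code: the function name `pick_input` is hypothetical, and it assumes that "newest" means the most recent modification time and that subfolders of IN_PATH are searched.

```python
from pathlib import Path
from typing import List, Optional


def pick_input(in_path: str, argv: List[str]) -> Optional[Path]:
    """Return the XML path given as the first argument, else the newest .xml under in_path."""
    if len(argv) > 1:
        # A manually specified file takes priority over the default
        return Path(argv[1])
    # Assumption: search IN_PATH recursively and compare modification times
    candidates = [p for p in Path(in_path).rglob("*.xml") if p.is_file()]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

Called as `pick_input("data_in", sys.argv)`, this returns `data_in/2021-08-08/sample.xml` when that path is passed on the command line, and otherwise the most recently modified XML file under `data_in`.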
An XML file containing records exported from EMu as XML, with some or all EMu-fields listed in emu_fields.csv
- example here of well-formed input
- example here of input with a badly-encoded character
emu_fields.csv (example)
A 5-column CSV that maps EMu-column-names to corresponding standard-term names, using the following columns:
- emu = EMu column-names
  - Exclude any "Ref" and "_tab" suffixes here. (This should match the lowest-level tag name in the EMu XML.)
- json_field = corresponding H2I standard term names
- repeatable = blank or 'yes' to indicate if multiple values can be assigned to json_field
- emu_group = the EMu export's 'Group' name, or the table or Reference column name
  - Include the "Ref" and/or "_tab" suffix (if any) for the corresponding emu field.
- json_container = the group name for a set of json_fields that should be nested together in the output JSON
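
To make the column roles concrete, here is a minimal sketch of how such a mapping might be applied to one flat record. It is not the script's actual logic: the EMu field names, term names, values, and the helpers `load_field_map` and `map_record` are all hypothetical, and emu_group is ignored for simplicity.

```python
import csv
import io
import json

# Hypothetical emu_fields.csv content; real rows come from the project's own CSV
FIELDS_CSV = """emu,json_field,repeatable,emu_group,json_container
irn,catalogNumber,,,
Material,material,yes,Material_tab,
Country,country,,Site,location
Locality,locality,,Site,location
"""


def load_field_map(csv_text):
    """Read the mapping rows into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def map_record(record, field_map):
    """Rename EMu fields to json_field names, nesting them under json_container if set."""
    out = {}
    for row in field_map:
        if row["emu"] not in record:
            continue
        value = record[row["emu"]]
        # Assumption: repeatable fields are always emitted as lists
        if row["repeatable"].strip().lower() == "yes" and not isinstance(value, list):
            value = [value]
        container = row["json_container"]
        target = out.setdefault(container, {}) if container else out
        target[row["json_field"]] = value
    return out


# Hypothetical record, as it might look after flattening one EMu XML tuple
record = {"irn": "123", "Material": "wood", "Country": "X", "Locality": "Y"}
mapped = map_record(record, load_field_map(FIELDS_CSV))
print(json.dumps(mapped, indent=2))
```

In this sketch the two rows that share the json_container value `location` end up nested together under one `location` object, while rows with a blank json_container stay at the top level.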
emu_conditions.csv (example)
A 7-column CSV that defines logic for conditionally redacting or mapping rows in multi-value-tables to standard terms.
- if_field1 = the input EMu-field whose value defines a condition
- if_logic1 = the logical comparison for the condition (e.g. if the field "IS" or "IS NOT" equal to if_value1)
- if_value1 = the input value.
  - Use "NOT NULL" for "any non-blank input value"
- then_field = the input EMu-field (if any) whose value should be transformed or redacted.
- json_field = the output json_field that should be set (conditionally) to the value in static_value.
  - Use "NULL" to redact an output value from the then_field
- static_value = the output value used if an input field matches conditions in the if_field1 & if_value1
- json_container = the output field's group, if any
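
One possible reading of these columns is sketched below. It is an interpretation rather than the script's actual logic: the helper names `condition_met` and `apply_rule`, the field names, and the values are hypothetical, and it assumes that a static_value of "NULL" removes the json_field from the output.

```python
def condition_met(record, rule):
    """Evaluate if_field1 / if_logic1 / if_value1 against one input record."""
    actual = record.get(rule["if_field1"]) or ""
    if rule["if_value1"] == "NOT NULL":
        matched = actual != ""  # any non-blank input value
    else:
        matched = actual == rule["if_value1"]
    return matched if rule["if_logic1"] == "IS" else not matched


def apply_rule(record, output, rule):
    """Set or redact json_field in output (modified in place) when the condition holds."""
    if not condition_met(record, rule):
        return output
    container = rule["json_container"]
    if rule["static_value"] == "NULL":
        # Assumption: "NULL" means the output value is dropped entirely
        target = output.get(container, {}) if container else output
        target.pop(rule["json_field"], None)
    else:
        target = output.setdefault(container, {}) if container else output
        target[rule["json_field"]] = rule["static_value"]
    return output


# Hypothetical rules and data for illustration only
rules = [
    {"if_field1": "Restricted", "if_logic1": "IS", "if_value1": "yes",
     "then_field": "Locality", "json_field": "locality",
     "static_value": "NULL", "json_container": "location"},
    {"if_field1": "Locality", "if_logic1": "IS", "if_value1": "NOT NULL",
     "then_field": "", "json_field": "hasLocality",
     "static_value": "true", "json_container": ""},
]
record = {"Restricted": "yes", "Locality": "Y"}
output = {"catalogNumber": "123", "location": {"country": "X", "locality": "Y"}}
for rule in rules:
    apply_rule(record, output, rule)
print(output)
```

Here the first rule redacts `locality` from the `location` group because the record is flagged as restricted, and the second rule sets a static top-level value because the input Locality is non-blank.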
- emu_to_json.json = Records as JSON objects, with EMu-fields/data as key/value pairs.
- emu_prepped.xml (optional) = XML with EMu column-names as xml-tags instead of xml-attributes
- xml_log_YYYY-MM-DD.txt = Log of successful or failed output.
- email notifications (currently requires mutt to send email, or gmail-only recipients)
- Log-messages comprise the email-body
- Output files are zipped and attached