Draft: Data Control Reports
What information do we want to capture and report to data control? Largely this falls into two groups:
- If a datapoint doesn't make it through the indexing pipeline.
- If metadata isn't valid or complete enough to meet discovery expectations.
For point 1, how this is captured & reported:
- internal logging & monitoring directly out of the end-to-end indexing pipeline, going to the developer team plus certain stakeholders. The goal here is to capture any errors (functional or data) from processing, then act on them (perhaps triaged by devs, perhaps by a subset of people on Gryphon*, perhaps elsewhere). Sources include the following (a sketch of a shared error format follows this list):
- logs from Symphony retrieval + processing
- logs from Traject pipeline itself
- logs from Solr loading
- results from running the test suite (if/when run as part of an indexing process)
- any possibly related Honeybadger notifications.
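
One way to tie these separate logs together is a shared, structured error event that every stage emits in the same shape. Below is a minimal sketch in Python; this is hypothetical — the stage names, event fields, and helper are illustrative, not the pipeline's actual log schema:

```python
import json
import logging
from datetime import datetime, timezone

# Hypothetical shared logger for all pipeline stages (Symphony retrieval,
# Traject, Solr loading), so errors can be aggregated into one report.
logger = logging.getLogger("indexing_pipeline")
logging.basicConfig(level=logging.INFO)

def report_pipeline_error(stage, record_id, error_type, message):
    """Emit one JSON-formatted error event for a pipeline stage."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "stage": stage,            # e.g. "symphony", "traject", "solr"
        "record_id": record_id,    # catalog key of the affected record
        "error_type": error_type,  # "functional" or "data"
        "message": message,
    }
    logger.error(json.dumps(event))

# Example: a record that failed to load into Solr.
report_pipeline_error("solr", "a1234567", "functional",
                      "Solr returned HTTP 400 for this document")
```

Events in a common shape like this could then be triaged into the data-control report regardless of which stage produced them.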
For point 2, how this is captured & reported: define a spec for what data needs to appear in MARC records for discovery purposes. This isn't validating a MARC record against the MARC standard, but validating that we have the data that fits our discovery needs.
- For starters, the baseline data requirements (a sketch of these checks appears after this list):
- subfield code is missing
- duplicate fields that are non-repeatable (question: declare which fields are non-repeatable)
- required fields: LDR, 001, title, ...? (this may already be covered by Symphony)
- Iterate on adding new specifications:
- check facet fields (more nice-to-have than required);
- call number checks (being worked on);
- other reported issues from above or other sources, added as data checks as needed.
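
To make the baseline checks concrete, here is a minimal sketch using Python and pymarc. The pipeline itself is Traject/Ruby, so this is illustrative only; the non-repeatable and required tag lists are placeholders the spec would need to declare, and the 245 $a check stands in for the more general "subfield code is missing" test:

```python
from pymarc import MARCReader

# Placeholder lists; the real spec would declare these authoritatively.
NON_REPEATABLE = {"001", "008", "245"}
REQUIRED = {"001", "245"}

def baseline_errors(record):
    """Return a list of discovery-spec problems for one MARC record."""
    errors = []
    if not str(record.leader).strip():
        errors.append("missing LDR")
    counts = {}
    for field in record.get_fields():          # no args = all fields
        counts[field.tag] = counts.get(field.tag, 0) + 1
    for tag in sorted(REQUIRED - counts.keys()):
        errors.append(f"missing required field {tag}")
    for tag in sorted(NON_REPEATABLE):
        if counts.get(tag, 0) > 1:
            errors.append(f"non-repeatable field {tag} occurs {counts[tag]} times")
    # Example subfield-level check: a 245 with no $a isn't usable as a title.
    for f245 in record.get_fields("245"):
        if not f245.get_subfields("a"):
            errors.append("245 has no subfield $a")
    return errors

with open("records.mrc", "rb") as fh:          # file path is illustrative
    reader = MARCReader(fh)
    for record in reader:
        if record is None:                     # pymarc yields None on a bad record
            print("unparseable record:", reader.current_exception)
            continue
        f001 = record.get_fields("001")
        rid = f001[0].value() if f001 else "?"
        for problem in baseline_errors(record):
            print(rid, problem)
```

Records flagged by checks like these would feed the same data-control report as the pipeline errors from point 1.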