Skip to content

Commit

Permalink
added initial version of CLARIN presentation
Browse files Browse the repository at this point in the history
  • Loading branch information
proycon committed Oct 7, 2024
1 parent 866cd45 commit cb42850
Show file tree
Hide file tree
Showing 11 changed files with 128 additions and 0 deletions.
3 changes: 3 additions & 0 deletions papers/Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -3,3 +3,6 @@ tooldiscovery.pdf:
biber tooldiscovery
pdflatex tooldiscovery
pdflatex tooldiscovery

presentation.pdf:
pandoc -t beamer presentation.md -o $@
Binary file added papers/archslide1.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added papers/archslide2.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added papers/archslide3.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
125 changes: 125 additions & 0 deletions papers/presentation.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,125 @@
---
title: FAIR Tool Discovery -- An automated software metadata harvesting pipeline for CLARIAH
author: Maarten van Gompel, KNAW Humanities Cluster
date: October 7th, 2024
---

## Introduction

* For scholars it is important to find and identify tools suitable for their research.
* Users must be able to make an informed enough judgement
* Many software catalogues exist, but:
* Information is often outdated
* manually curated once, version information, provenance, broken links
* Information is often incomplete
* maintenance status, documentation
* Information is duplicated and not aligned
* multiple metadata schemes

## Introduction (2)

* We developed a pipeline for **automatic** software metadata harvesting.
* We **periodically** harvest software metadata from the source
* This is not another software catalogue system; we aim to feed data to
various catalogues

## Bottom-up harvesting from the source (1)

![](archslide1.png){ height=40% }\

* Software metadata is kept at the source
1. Metadata lives alongside the source code in a version controlled repository (e.g. git) and published in a forge (e.g GitHub).
2. (SaaS) metadata provided directly by web endpoints

**Advantages:** Full authorship with the developers, versioned, no duplication, no intermediaries

## Bottom-up harvesting from the source (2)

![](archslide2.png){ height=40% }\

* A minimal manually-curated *source registry* contains all repositories/endpoints to harvest
* Our harvester periodically checks all of these.
* For each, it finds and collects various existing and heterogeneous software metadata schemes
* Reuse what developers already use, avoid any duplication of metadata.
* (e.g `pyproject.toml`, `project.json`, `pom.xml`, badges in `README.md`, `LICENSE`, `AUTHORS`)

## Bottom-up harvesting from the source (3)

![](screenshot_metadata.png){ height=70% }\

Many instances of metadata can be found!

## Metadata conversion to codemeta

![](archslide3.png)

* Existing heterogeneous software metadata schemes are converted into a uniform linked-open-data representation
* CodeMeta and schema.org as our base vocabulary
* Additional vocabularies on top (e.g. spdx, repostatus.org)
* The result of the conversion is a single `codemeta.json` file (JSON-LD)
* Developers may provide a complete `codemeta.json` or an incremental `codemeta-harvest.json`

## Publishing in catalogues

![The full architecture](architecture.png)

**Use-case:** CLARIAH (114 source repositories, 34 web endpoints)

## Publishing in catalogues (1)

![Our own catalogue frontend](screenshot.png)

## Publishing in catalogues (2)

![Our own catalogue frontend](screenshot2.png)

## Publishing in catalogues (3)

![CLARIAH's INEO frontend](screenshot3.png)

## Challenges

* Consolidating distinct vocabularies
* Extra domain-specific vocabulary
* e.g. TaDiRaH for research activities
* NWO Research Domains
* Technology Readiness Level
* Validation & Curation

## Validation & Curation (1)

* How to allow curation/collaboration?
* Forges like GitHub have solid contributor mechanisms in-place
* How to assure metadata quality?
* Harvested metadata is evaluated against a schema (SHACL) that tests
metadata fields (completeness, accuracy).
* The schema is made on the basis of carefully formulated
*software metadata requirements*
* The result of the evaluation is an automated compliance report
and a compliance score on a 0-5 scale.
* The report and score itself become part of the metadata.
* Set a threshold for propagation to catalogues

## Validation & Curation (2): Metadata Requirements

![Excerpt of the CLARIAH metadata requirements](screenshot_requirements.png)

## Validation & Curation (3): Validation report

![An automatically generated validation report](screenshot_report.png)

## Validation & Curation (4): Findings

* Human compliance is the biggest hurdle
* Hard to get developers to provide extra metadata
* Metadata compliance ranking in CLARIAH (114 tools):
* 5 stars: 2%
* 4 stars: 23%
* 3 stars: 45%
* 2 stars: 7%
* 1 stars: 19%
* 0 stars: 4%

## Questions?


Binary file added papers/screenshot.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added papers/screenshot2.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added papers/screenshot3.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added papers/screenshot_metadata.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added papers/screenshot_report.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added papers/screenshot_requirements.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.

0 comments on commit cb42850

Please sign in to comment.