Metadata utilities #473

jl-wynen · 2023-12-04T09:30:02Z

Overview

We need to handle metadata in several places, and some of it is common to many techniques and workflows. In particular, many workflows will involve SciCat metadata. There are also technique-specific file formats that encode metadata, e.g., NXcanSAS, ORSO, CIF. Each uses different names and layouts even though they share many fields.

To provide common ground and unify metadata handling (at least to an extent), I propose adding common classes and maybe functions that

let users specify metadata,
parse input metadata (mainly SciCat) into a common form,
write output metadata to various formats.

ScippNeutron seems to me like a good place to put these utilities as it is used by all ess* packages that implement the actual workflows. We could of course also use a separate package but that seems a little overkill.

Suggested implementation

We can use Pydantic models to encode and validate metadata. For example, the following are commonly used:

import pydantic


class Author(pydantic.BaseModel):
    name: str  # free form even though some file formats require a specific format
    email: pydantic.EmailStr | None = None
    orcid: ORCID | None = None  # custom type, already implemented in Scitacean
    affiliation: str | None = None
    role: str | None = None  # CIF has a fixed set of allowed roles
    corresponding: bool = False  # aka 'contact'; formatted differently in some formats
    owner: bool = True  # owner of the data
   

class Beamline(pydantic.BaseModel):
    name: str
    facility: str | None = None
    site: str | None = None
    version: str | None = None  # or 'revision'?

And maybe

class Software(pydantic.BaseModel):
    name: str
    version: str
    url: str | None = None  # can be a DOI; or should there be a separate field?

class ProcessLog(pydantic.BaseModel):
    software: list[Software]
    steps: list[str] | None = None
    graph: str | None = None  # not neatly possible with most output formats
    log: str | None = None  # anything logged by the workflow; can be too long to embed in output files

The Beamline model could of course be extended to include technical info about the beamline / instrument such as flight path length, choppers, sample holder, etc. But the fields above are a minimal set of pure metadata that should be common to all large-scale facilities.

Example

jane = Author(
    name="Jane Doe",
    email="[email protected]",
    orcid="https://orcid.org/0000-0000-0000-0000",
    affiliation="Paul Scherrer Institut",
    corresponding=True,
)
amor = Beamline(
    name="Amor",
    facility="SINQ",
    site="Paul Scherrer Institut",
)
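The orcid field is given as a URL in the example, so a custom type would presumably normalise and validate the identifier. The following is a minimal sketch of what such a check could look like. It is not Scitacean's implementation; the function names are made up for illustration. It relies only on the check-digit scheme that ORCID documents (ISO 7064 MOD 11-2).

```python
import re

# Accept a bare iD or an orcid.org URL; four groups of four characters,
# where the very last character may be 'X'.
_ORCID_PATTERN = re.compile(
    r"(?:https?://orcid\.org/)?(\d{4})-(\d{4})-(\d{4})-(\d{3}[\dX])"
)


def orcid_check_digit(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for digit in base_digits:
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def normalize_orcid(value: str) -> str:
    """Return the ORCID iD as an https URL, or raise ValueError (hypothetical helper)."""
    match = _ORCID_PATTERN.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"Not an ORCID iD: {value!r}")
    digits = "".join(match.groups())
    if orcid_check_digit(digits[:15]) != digits[15]:
        raise ValueError(f"Invalid ORCID check digit: {value!r}")
    return "https://orcid.org/" + "-".join(match.groups())
```

A function like this could be hooked into a Pydantic model as a field validator so that all output writers see one canonical form.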


SimonHeybrock · 2023-12-04T09:45:10Z

Do you have in mind that the classes you suggest would also be used as domain types for workflows written with Sciline? Or are they generally not specific enough?

jl-wynen · 2023-12-04T09:58:59Z

They could be used as domain types. That would make it easy for the output providers to request them. (Possibly from params, because some metadata cannot be extracted from the input, e.g., Author.)

nitrosx · 2023-12-06T13:42:53Z

I like the idea, and we need more brainstorming.
Would you load the metadata together with the metadata if possible?
Would you then map the fields of your types to each specific platform (aka SciCat) or file format (e.g., ORSO)?

jl-wynen · 2023-12-06T13:54:07Z

Would you load the metadata together with the metadata if possible?

Do you mean 'load with the data'? We do this anyway to an extent with NeXus. We will likely have metadata providers for Sciline that extract metadata from the input (NeXus, SciCat).

Would you then map the fields of your types to each specific platform (aka SciCat) or file format (e.g., ORSO)?

Yes, that is the idea! Basically, these types would be a sort of intermediate representation that can be constructed from different sources (SciCat, NeXus, user input, ...) and can be converted to specific representations (SciCat, ORSO, CIF, ...). So we can mix and match parsers and representers without implementing each combination separately.
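To make the 'intermediate representation' idea concrete, here is a rough sketch. A plain dataclass stands in for the proposed Pydantic model so that the example has no dependencies. All function names and all input / output keys are assumptions made for illustration; they have not been checked against the actual SciCat, ORSO, or CIF schemas.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Author:
    """Reduced stand-in for the proposed Author model."""

    name: str
    email: str | None = None
    affiliation: str | None = None


def author_from_scicat_like(fields: dict) -> Author:
    # Hypothetical SciCat-style keys, for illustration only.
    return Author(name=fields["owner"], email=fields.get("contactEmail"))


def author_from_user(fields: dict) -> Author:
    # User input is assumed to already use the common field names.
    return Author(**fields)


def author_to_orso_like(author: Author) -> dict:
    # Hypothetical ORSO-style layout: a flat mapping.
    return {
        "name": author.name,
        "affiliation": author.affiliation,
        "contact": author.email,
    }


def author_to_cif_like(author: Author) -> list[tuple[str, str]]:
    # Hypothetical CIF-style layout: (tag, value) pairs, omitting unset fields.
    pairs = [("_audit_author.name", author.name)]
    if author.email is not None:
        pairs.append(("_audit_author.email", author.email))
    return pairs


# N parsers + M writers cover all N x M combinations.
PARSERS = {"scicat": author_from_scicat_like, "user": author_from_user}
WRITERS = {"orso": author_to_orso_like, "cif": author_to_cif_like}


def convert(source: str, target: str, fields: dict):
    """Parse 'fields' from 'source' into an Author and write it as 'target'."""
    return WRITERS[target](PARSERS[source](fields))
```

Adding a new source or a new output format then means writing one function and registering it, instead of one converter per pair of formats.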

jl-wynen · 2024-02-06T12:49:11Z

Judging by scipp/essreflectometry#27 (comment), we may also want some general input file tracking tools. Given that the sources for input data can vary (local file, pooch file, SciCat), it makes sense to have a way of getting common info such as

file name
checksum
creation time
storage location / mode of access

This probably factors into selecting inputs as well. See, e.g., scipp/essreflectometry#25
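As a rough sketch of what such a common record could look like for the local-file case: the class and function names below are hypothetical, and the modification time is used as a stand-in for the creation time because the latter is not available portably from the file system. Files from pooch or SciCat would need their own functions that fill in the same record.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path


@dataclass(frozen=True)
class InputFileInfo:
    """Common info about an input file, independent of where it came from."""

    name: str
    checksum: str  # hex digest
    checksum_algorithm: str
    time: datetime  # modification time here; see the note above
    location: str
    access: str  # e.g., "local", "pooch", "scicat"


def describe_local_file(path: str | Path, algorithm: str = "sha256") -> InputFileInfo:
    """Collect the common info for a file on the local file system."""
    path = Path(path)
    digest = hashlib.new(algorithm)
    with path.open("rb") as f:
        # Read in chunks so that large files do not need to fit into memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return InputFileInfo(
        name=path.name,
        checksum=digest.hexdigest(),
        checksum_algorithm=algorithm,
        time=datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc),
        location=str(path.resolve()),
        access="local",
    )
```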

jl-wynen · 2024-02-07T14:12:36Z

Made an overview of metadata in the file formats we use: Common.Metadata.Schemas.md

jl-wynen added the enhancement label (New feature or request) Dec 4, 2023

jl-wynen mentioned this issue Dec 7, 2023

Add CIF writer #477

Merged

SimonHeybrock mentioned this issue Jan 31, 2024

Re-implement Orso file writing scipp/essreflectometry#6

Closed

jl-wynen mentioned this issue Feb 5, 2024

Reimplement ORSO filewriter scipp/essreflectometry#27

Merged

jl-wynen mentioned this issue Sep 5, 2024

Add a builder class for CIF files #548

Merged
