Metadata utilities #473

Open
jl-wynen opened this issue Dec 4, 2023 · 6 comments
Labels
enhancement New feature or request


@jl-wynen
Member

jl-wynen commented Dec 4, 2023

Overview

We need to handle metadata in several places, and some of that metadata is common to many techniques and workflows. In particular, we may/will have SciCat metadata in many workflows. But there are also technique-specific file formats that encode metadata, e.g., NXcanSAS, ORSO, CIF. They each use different names and layouts even though they have many fields in common.

To provide common ground and unify metadata handling (at least to an extent), I propose adding common classes and maybe functions that

  • let users specify metadata,
  • parse input metadata (mainly SciCat) into a common form,
  • write output metadata to various formats.

ScippNeutron seems to me like a good place to put these utilities, as it is used by all ess* packages that implement the actual workflows. We could of course also use a separate package, but that seems like overkill.

Suggested implementation

We can use Pydantic models to encode and validate metadata. For example, the following are commonly used:

import pydantic


class Author(pydantic.BaseModel):
    name: str  # free form even though some file formats require a specific format
    email: pydantic.EmailStr | None = None
    orcid: ORCID | None = None  # custom type, already implemented in Scitacean
    affiliation: str | None = None
    role: str | None = None  # CIF has a fixed set of allowed roles
    corresponding: bool = False  # aka 'contact'; formatted differently in some formats
    owner: bool = True  # owner of the data


class Beamline(pydantic.BaseModel):
    name: str
    facility: str | None = None
    site: str | None = None
    version: str | None = None  # or 'revision'?

And maybe

class Software(pydantic.BaseModel):
    name: str
    version: str
    url: str | None = None  # can be a DOI; or should there be a separate field?

class ProcessLog(pydantic.BaseModel):
    software: list[Software]
    steps: list[str] | None = None
    graph: str | None = None  # not neatly possible with most output formats
    log: str | None = None  # anything logged by the workflow; can be too long to embed in output files

The Beamline model could of course be extended to include technical info about the beamline / instrument such as flight path length, choppers, sample holder, etc. But the fields above are a minimal set of strictly metadata that should be common to all large-scale facilities.

Example

jane = Author(
    name="Jane Doe",
    email="[email protected]",
    orcid="https://orcid.org/0000-0000-0000-0000",
    affiliation="Paul Scherrer Institut",
    corresponding=True,
)
amor = Beamline(
    name="Amor",
    facility="SINQ",
    site="Paul Scherrer Institut",
)
@jl-wynen jl-wynen added the enhancement New feature or request label Dec 4, 2023
@SimonHeybrock
Member

Do you have in mind that the classes you suggest would also be used as domain types for workflows written with Sciline? Or are they generally not specific enough?

@jl-wynen
Member Author

jl-wynen commented Dec 4, 2023

They could be used as domain types. That would make it easy for the output providers to request them. (Possibly from params, because some metadata cannot be extracted from the input, e.g., Author.)
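
As a rough sketch of what "requesting them" could look like: Sciline providers are plain functions whose inputs and outputs are identified by their type hints, so an output provider could simply declare the metadata types as arguments. The example below uses stdlib dataclasses as stand-ins for the proposed Pydantic models so that it runs without dependencies. `OutputHeader` and `output_header` are hypothetical names, and the function is called directly rather than through a pipeline.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Author:
    """Stand-in for the proposed Author model (subset of fields)."""

    name: str
    corresponding: bool = False


@dataclass(frozen=True)
class Beamline:
    """Stand-in for the proposed Beamline model (subset of fields)."""

    name: str
    facility: str | None = None


@dataclass(frozen=True)
class OutputHeader:
    """Hypothetical result type produced by an output provider."""

    text: str


def output_header(author: Author, beamline: Beamline) -> OutputHeader:
    """Provider-style function: requests metadata purely via its type hints."""
    contact = " (contact)" if author.corresponding else ""
    return OutputHeader(text=f"{author.name}{contact} @ {beamline.name}")


# Author cannot be extracted from the input file, so it is passed in by hand
# here; in a workflow it would be supplied as a parameter.
header = output_header(
    Author(name="Jane Doe", corresponding=True),
    Beamline(name="Amor", facility="SINQ"),
)
print(header.text)
```

In an actual workflow the function would be registered as a provider and `Author` set as a parameter instead of being passed manually.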

@nitrosx

nitrosx commented Dec 6, 2023

I like the idea, and we need more brainstorming.
Would you load the metadata together with the metadata if possible?
Would you then map the fields of your types to each specific platform (aka SciCat) or file format (example ORSO)?

@jl-wynen
Member Author

jl-wynen commented Dec 6, 2023

Would you load the metadata together with the metadata if possible?

Do you mean 'load with the data'? We do this anyway to an extent with NeXus. We will likely have metadata providers for Sciline that extract metadata from the input (NeXus, SciCat).

Would you then map the fields of your types to each specific platform (aka SciCat) or file format (example ORSO)?

Yes, that is the idea! Basically, these types would be a sort of intermediate representation that can be constructed from different sources (SciCat, NeXus, user input, ...) and can be converted to specific representations (SciCat, ORSO, CIF, ...). So we can mix and match parsers and representers without implementing each combination separately.
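
A minimal sketch of the intermediate-representation idea, assuming stdlib dataclasses in place of the Pydantic models so it runs standalone. All function names are hypothetical, and the SciCat, ORSO, and CIF field names are simplified for illustration and not checked against the respective schemas. The point is only that N parsers and M representers meet at one type instead of needing N×M converters.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Author:
    """Intermediate representation; stand-in for the proposed model."""

    name: str
    email: str | None = None
    affiliation: str | None = None
    corresponding: bool = False


# --- parsers: source -> intermediate representation ---
def author_from_scicat(dataset: dict[str, Any]) -> Author:
    # Assumed field names of a SciCat dataset record.
    return Author(name=dataset["owner"], email=dataset.get("ownerEmail"))


def author_from_user_input(name: str, **fields: Any) -> Author:
    return Author(name=name, **fields)


# --- representers: intermediate representation -> output format ---
def author_to_orso(author: Author) -> dict[str, Any]:
    # Loosely modelled on an ORSO person entry.
    return {
        "name": author.name,
        "affiliation": author.affiliation,
        "contact": author.email,
    }


def author_to_cif(author: Author) -> str:
    # Simplified; real CIF output needs proper quoting and loops.
    prefix = "_audit_contact_author" if author.corresponding else "_audit_author"
    return f"{prefix}.name '{author.name}'"


# Any parser can be combined with any representer.
scicat_record = {"owner": "Jane Doe", "ownerEmail": "jane.doe@example.org"}
author = author_from_scicat(scicat_record)
orso = author_to_orso(author)
cif = author_to_cif(author)
```

Adding a new source or output format then means writing one function against `Author`, without touching the existing ones.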

@jl-wynen
Member Author

jl-wynen commented Feb 6, 2024

Judging by scipp/essreflectometry#27 (comment), we may also want some general input file tracking tools. Given that the sources for input data can vary (local file, pooch file, SciCat), it makes sense to have a way of getting common info such as

  • file name
  • checksum
  • creation time
  • storage location / mode of access
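
The items above could be collected in a small record type. The following is only a sketch under assumptions: `InputFile` and `input_file_from_local_path` are hypothetical names, only the local-file case is shown, and the modification time is used as an approximation because a true creation time is not available portably from the file system.

```python
from __future__ import annotations

import hashlib
import tempfile
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path


@dataclass(frozen=True)
class InputFile:
    """Hypothetical record of one input file, independent of its source."""

    name: str
    checksum: str  # hex digest
    checksum_algorithm: str
    creation_time: datetime
    location: str  # e.g., local path, URL, or SciCat PID
    access_mode: str  # e.g., 'local', 'pooch', 'scicat'


def input_file_from_local_path(path: Path, algorithm: str = "sha256") -> InputFile:
    """Build a record for a file on the local file system."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as f:
        # Read in 1 MiB chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    # Assumption: mtime stands in for the creation time.
    mtime = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
    return InputFile(
        name=path.name,
        checksum=digest.hexdigest(),
        checksum_algorithm=algorithm,
        creation_time=mtime,
        location=str(path.resolve()),
        access_mode="local",
    )


# Usage with a throwaway file standing in for real input data.
with tempfile.TemporaryDirectory() as tmp:
    example = Path(tmp) / "run_0001.nxs"
    example.write_bytes(b"not a real NeXus file")
    record = input_file_from_local_path(example)
```

Equivalent constructors for pooch or SciCat sources would fill the same fields from their own metadata, e.g., taking the checksum from the registry or catalogue instead of recomputing it.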

This probably factors into selecting inputs as well. See, e.g., scipp/essreflectometry#25

@jl-wynen
Member Author

jl-wynen commented Feb 7, 2024

Made an overview of metadata in the file formats we use: Common.Metadata.Schemas.md
