Collections endpoint #386

Open: wants to merge 15 commits into base: develop (changes from 10 commits)

85 changes: 85 additions & 0 deletions optimade.rst
@@ -1088,6 +1088,7 @@ Example:
},
"available_endpoints": [
"structures",
"collections",
"calculations",
"info",
"links"
@@ -2808,6 +2809,90 @@ structure\_features

- A structure having implicit atoms and using assemblies: :val:`["assemblies", "implicit_atoms"]`

Collections Entries
-------------------

Collections entries are used to define a set of entries of any type.
For example, a collection of Structure entries can be used to indicate that they are conceptually related, such as structures representing aluminium unit cells with point defects.
The set of entries that belong to the collection is defined using relationships from the collection to each entry (OPTIMADE relationships are defined in `Relationships`_).
A collection can contain other collections.
Furthermore, it is suggested that implementations add database-specific properties for any additional metadata they want to store about the collections.
An OPTIMADE response representing a collection with all referenced entries included via the JSON API field :field:`included` (or equivalent in other response formats) can be used as a universal format for storage or transfer of a subset of (or all) data in an OPTIMADE database.
Member

I have some concerns about pagination of included values, in the case of e.g., 100,000 structures in the same collection. Do we need to worry about that? included is only an optional field at the moment.

Contributor

@rartino rartino Jul 6, 2022

@ml-evs

included is mentioned here only as a suggestion of the potential use of a collection as an export format - in which case the whole idea would be to put everything you want to export (e.g., all 100'000 structures) in the same stream. Nothing is said here that indicates mandatory support for included?

I thought we intended for clients to generally just get the list of ids from the relationship and then request entry data by further queries using the endpoint + id format. (Or, for efficiency when there are many, perhaps via filter=id=X OR id=Y OR id=Z OR ...)
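
The client-side flow described above (read the ids from the collection's relationships, then query the matching endpoint with an `id` filter) could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names `member_ids` and `id_filter_urls`, the `chunk_size` parameter, and the base URL are hypothetical and not part of this PR. Chunking is one possible way to keep query strings short when a collection has many members.

```python
from urllib.parse import urlencode


def member_ids(collection: dict, entry_type: str) -> list[str]:
    """Return the ids of one entry type listed in a collection's relationships."""
    rel = collection.get("relationships", {}).get(entry_type, {})
    return [item["id"] for item in rel.get("data", [])]


def id_filter_urls(base_url: str, entry_type: str, ids: list[str],
                   chunk_size: int = 50) -> list[str]:
    """Build follow-up query URLs of the form filter=id="X" OR id="Y" OR ...

    chunk_size is a hypothetical client-side knob, not something the
    specification defines.
    """
    urls = []
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        expr = " OR ".join(f'id="{i}"' for i in chunk)
        urls.append(f"{base_url}/{entry_type}?{urlencode({'filter': expr})}")
    return urls
```

A client would call `member_ids(collection, "structures")` on the `data` object of a `/collections/<id>` response and then fetch each URL returned by `id_filter_urls`.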

Member

> included is mentioned here only as a suggestion of the potential use of a collection as an export format - in which case the whole idea would be to put everything you want to export (e.g., all 100'000 structures) in the same stream. Nothing is said here that indicates mandatory support for included?
>
> I thought we intended for clients to generally just get the list of ids from the relationship and then request entry data by further queries using the endpoint + id format. (Or, for efficiency when there are many, perhaps via filter=id=X OR id=Y OR id=Z OR ...)

I understand that this is the intention, but I am a bit nervous that we have a field that can now grow unboundedly in a response if requested (or even if not, you cannot disable relationships from the response, I don't think?). I guess you could argue the same for a 1 million site structure object but here it feels well within the design that even the list of IDs could be very large.

I think the larger comment now addressed in #420 would be the best mechanism around this (if we are going to support it anyway).

Contributor

> I am a bit nervous that we have a field that can now grow unboundedly in a response if requested (or even if not, you cannot disable relationships from the response, I don't think?)

Are you talking about included or relationships now? Both of them can grow to arbitrary size, although a collection would have to be extremely large before relationships becomes unduly large.

Possibly repeating myself a bit, but to be clear: I don't see a problem with included. Implementations probably should avoid it except as recommended in OPTIMADE for references, unless a client somehow explicitly requests it (which we don't have a standard mechanism for yet). If an implementation decides to include included anyway, while simultaneously having "unboundedly large" relationships, it would be silly to not implement a limit on the number of entries included this way.

The situation is more tricky with huge relationships. I think JSON:API silently is built on the assumption that the list of IDs for all relationships of a resource must be small enough to handle in a single request. Sure, one can use the articles/1/relationships/comments syntax to get something paginated, but how does one know in the first place which JSON:API relationship keys to fetch without first fetching the unboundedly large articles/1?

Hence, I think we have to look at this list of IDs as a single "datum" where our default OPTIMADE JSON:API output format isn't equipped to handle arbitrarily large data. This is echoed by our recommendation for other output formats to simply encode relationships alongside other data.

If we are concerned about this limitation, I don't see any other way to address it than to implement an alternative output format that can handle pagination on individual properties, including the relationships.

Member

> Possibly repeating myself a bit, but to be clear: I don't see a problem with included. Implementations probably should avoid it except as recommended in OPTIMADE for references, unless a client somehow explicitly requests it (which we don't have a standard mechanism for yet). If an implementation decides to include included anyway, while simultaneously having "unboundedly large" relationships, it would be silly to not implement a limit on the number of entries included this way.

Understood

> The situation is more tricky with huge relationships. I think JSON:API silently is built on the assumption that the list of IDs for all relationships of a resource must be small enough to handle in a single request.

I think my concern is that some of the intended use cases for collections might cross this boundary already (would 10,000/100,000 IDs to structures that define a training set break this?) I'm also not sure that relationships can be excluded from the request using response_fields, so you can't even hit /collections to browse/filter them without getting these potentially large responses. I understand that this is already the case with e.g., COD's mythical 100k atom structure, but at least you could choose which fields you wanted to be returned!

> Sure, one can use the articles/1/relationships/comments syntax to get something paginated, but how does one know in the first place which JSON:API relationship keys to fetch without first fetching the unboundedly large articles/1?

I'm leaning towards this being the correct approach. The relationships can be included as a self link to articles/1/relationships/comments rather than as a data key, which I think solves your problem. Perhaps we could say something like: "It is RECOMMENDED that implementations use self links instead of explicit relationships for collections with a number of entries significantly in excess of the implementation's page limit."

> If we are concerned about this limitation, I don't see any other way to address it than to implement an alternative output format that can handle pagination on individual properties, including the relationships.

Let's see how the discussion in #419 goes...

Contributor

> The relationships can be included as a self link to articles/1/relationships/comments rather than as a data key, which I think solves your problem. Perhaps we could say something like: "It is RECOMMENDED that implementations use self links instead of explicit relationships for collections with a number of entries significantly in excess of the implementation's page limit."

You seem to be right: excluding the data key if it contains too many entries, and instead relying on a link that returns a paginated result set of the entries, is a straightforward solution. So, if we try to follow the JSON API examples, I think this link should point to, e.g., /collections/42/structures. In reference to the ongoing discussion in #420, this would thus be the first "required" use of third-level URLs.
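
A server-side sketch of the idea discussed in this thread (emit explicit `data` for small collections, and only a link to a paginated listing for large ones) might look like the following. Everything here is an assumption made for illustration: the function name `relationship_object`, the plain `page_limit` threshold, and the choice of the JSON:API `related` link member are hypothetical. The thread does not settle which link member, or which exact cut-off, the specification would use.

```python
def relationship_object(collection_id: str, entry_type: str,
                        member_ids: list[str], page_limit: int) -> dict:
    """Build one relationship object for a collection.

    Assumption: the link follows the /collections/<id>/<type> pattern
    suggested in the discussion; "related" is the JSON:API link member
    chosen here for illustration only.
    """
    rel = {"links": {"related": f"/collections/{collection_id}/{entry_type}"}}
    # Only list members explicitly when they fit within one page;
    # otherwise clients are expected to follow the link and paginate.
    if len(member_ids) <= page_limit:
        rel["data"] = [{"type": entry_type, "id": i} for i in member_ids]
    return rel
```

With this shape, a client browsing `/collections` never receives an unbounded id list; it sees either a short `data` array or a link it can page through.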


The following example shows how relationships from a collection to other entries define the contents of the collection:

.. code:: jsonc

   {
     "data": {
       "type": "collections",
       "id": "example.com:collections:42",
       "attributes": {
         "name": "Results set for vacancies in FCC Al",
         "category": "results_set"
       },
       "relationships": {
         "structures": {
           "data": [
             { "type": "structures", "id": "example.com:structures:4711" },
             { "type": "structures", "id": "example.com:structures:4712" },
             { "type": "structures", "id": "example.com:structures:4713" }
           ]
         },
         "calculations": {
           "data": [
             { "type": "calculations", "id": "example.com:calculations:1899" }
           ]
         }
       }
     }
   }

Collections entries have the properties described above in section `Properties Used by Multiple Entry Types`_, as well as the following additional properties:

name
~~~~

- **Description**: A name for the collection
- **Type**: String
- **Requirements/Conventions**:

  - **Support**: OPTIONAL support in implementations, i.e., MAY be :val:`null`.
  - **Query**: Support for queries on this property is OPTIONAL.

- **Examples**:

  - :val:`"Results set for vacancies in FCC Al"`

description
~~~~~~~~~~~

- **Description**: A longer text that describes the collection
- **Type**: String
- **Requirements/Conventions**:

  - **Support**: OPTIONAL support in implementations, i.e., MAY be :val:`null`.
  - **Query**: Support for queries on this property is OPTIONAL.

- **Examples**:

  - :val:`"This collection contains structures used in an investigation into point defects in Al"`

category
~~~~~~~~

- **Description**: A free-form text categorizing the collection.
  It is suggested that individual collections with similar purposes are assigned the same category to aid browsing and searching.
- **Type**: String
- **Requirements/Conventions**:

  - **Support**: OPTIONAL support in implementations, i.e., MAY be :val:`null`.
  - **Query**: Support for queries on this property is OPTIONAL.

- **Examples**:

  - :val:`"results_set"`

Calculations Entries
--------------------
