-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ranged properties #452
Add ranged properties #452
Conversation
…e complicated when multiple dimensions are involved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great to see a PR of this, thanks for the work with writing this up.
I find the formatting a bit strange. Most of the text is in section 3 for general API conventions, where all other sections just write out explanations of API features in regular text. This PR adds something formatted as a property definition, even though it is not. And the list of fields is also not formatted as similar lists elsewhere in the specification. I try to explain in my review comments, but perhaps also look at how the other sections explaining similar features elsewhere in the specification is formatted?
Another question is whether we should hold this back until v1.2 is out. I lean towards "yes". Any thoughts @ml-evs?
optimade.rst
Outdated
Likewise, the server can use paging to return the property in multiple parts. | ||
This can be useful for properties that are so large that it can be inconvenient to return them in a single response. | ||
In that case a link is provided, as described in `JSON Response Schema: Common Fields`_ under the `links.next` field, from which the remainder of the requested data can be retrieved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not understand how paging works on ranged properties. The JSON:API field links.next
typically pages in lists of entries when the client fetch more than one entry, e.g., /structures
. Is the idea that you can only page a ranged property this way when requesting a single entry, e.g., /structures/142
? I think this may conflict with JSON:API's definition of links.next
. At least it needs to be described better how this works and what limitations apply.
My suggestion is to, at this point, not allow paging of range properties, only retrieval of ranges via OPTIMADE-specific query parameters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A single entry with the data of a trajectory can easily be gigabytes in size.
Returning all that data in one response would give a long response time.
So we need a way for the server to limit the size of a response. (Perhaps we could create some kind of streaming response, but I am not sure how the different JSON tools will handle this. We may need to create our own JSON parser.)
The server can put an extra field in the next link that specifies to which extent the different properties have already been sent. They can slice the property in the same way the client does.
There is, however, no need to define how servers should do this as this only applies to their own server, so they can do what they want as long as the client gets the correct data.
I could add a field to the response to make it clearer what the server has returned. e.g. which range of values has been returned, so it is easier for the client to reconstruct the entry.
We could add a field with a name different from next to the links section, but I am not sure if that is any clearer.
Maybe paging is not the right word to use here. I will adjust the description.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry if I was unclear: I did not mean to question that range properties need to be fetched in parts.
I suppose you can refer to any mechanism for fetching partial data as "paging". However, in JSON:API paging (or actually, the specification uses "pagination") via the top-level "links object" is described as for servers that choose to "limit the number of resources returned" when returning a collection of resources as a result. I do not see a JSON:API-conforming way to "extend" that top-level links object the way I think you suggest (?)
Hence, I thought we were going to only allow fetching ranged data in parts via query parameters specifying ranges. However, I think you are right that some form of JSON:API links
-object type paging would be better as the MANDATORY baseline mechanism. That way all the new query parameters for ranged data can be made OPTIONAL, which is perfect for us who want to support MD data but have no use case for arbitrary indexing into frames (and thus do not want to put any effort into supporting that).
So, we just need to specify a place for that range-specific a JSON:API links object. But, where? Again, I don't think it can go in the top-level links
object. So, what about inside the ranged property metadata dictionary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be a bit strange to place the links in the ranged property metadata dictionary. So far I considered the trajectory as one resource, so I would expect that the next link applies to the whole resource and not just one property. So I would suggest placing the link a level higher at the level of the entire trajectory resource.
I am not sure why you suggested placing the link with the individual property.
I guess it is possible to turn the ranged properties into resources on their own. In that case the trajectory entry would include links to these resources like we do for the references.
As long as the server does not support the Include parameter, this could work as well. In that case, the client would need to retrieve the data for each individual property with a separate request. And if a property is too large, we could place the links to retrieve the rest of the property within the property attributes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think the new query parameter can be OPTIONAL. You still need a way to indicate which ranged_properties need to be returned.
Added suggestion from @rartino Co-authored-by: Rickard Armiento <[email protected]>
optimade.rst
Outdated
- **Type**: dictionary with keys: | ||
|
||
- :property:`serialization_format`: string (REQUIRED) | ||
- :property:`values`: List of any data type (OPTIONAL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, do I understand correctly that the present design is for the member values
in the "metadata dictionary" to also contain the actual data values? (So, in a sense, "metadata" is mislabeling it?)
So, correlated with the thoughts above about pagination: how about making responses always contain two fields for ranged properties, one metadata field with only metadata, and one for the data. That way, even the initial response that contain a ranged property can pass the first page of data the way it would if this was just a regular (non-ranged) property. This means that if the ranged property happens to be small, the client gets all the data it needs in the first response without having to do anything else. Only in the case all data does not fit in one response does the client need to engage with the specifics of accessing ranged properties. This will actually greatly help collections (#386): I felt we were going to end up with an ugly design to force clients to fiddle with range access for all collections, when in many cases of smaller collections this would be unnecessary.
So, I propose:
- Split out
values
into its own field; i.e., edit the initial text to say that for ranged properties two fields are returned:
<property_name>_ranged_meta
: a dictionary with all these metadata fields.
<property_name>_ranged_data
: a list of values of any type.
The server MAY support querying on either or both.
For the field <fieldname>_ranged_meta
all the fields you defined here are kept with the same meaning, but two new ones are needed (and perhaps these should be the only ones MANDATORY):
- 'links': a JSON:API object on the format described in the JSON:API specification section "Pagination", which is used for pagination of the ranged property.
- 'more_data_available': boolean. True if the data does not fit in a single page and this is not the final page.
The field <fieldname>_ranged_data
would then be defined as:
- If no range-selecting query parameters are given, it contains the first page of data (which is all of the data if and only if
more_data_available
isfalse
in<fieldname>_ranged_meta
. If<fieldname>_ranged_meta
istrue
, then further pages of data can be fetched by the client via the URL innext
in<fieldname>_ranged_meta
. - If any of the range-selecting query parameters defined below are given, then data matching that range selection is returned. The server MAY support paging also of the requested range, or MAY return an error if the requested range is too big to fit in a single response.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, do I understand correctly that the present design is for the member values in the "metadata dictionary" to also contain the actual data values? (So, in a sense, "metadata" is mislabeling it?)
Yes, the dictionary would also contain data when the property is listed within the property_range
query parameter so the term "metadata dictionary" is misleading.
how about making responses always contain two fields for ranged properties, one metadata field with only metadata, and one for the data.
I do like that all the data is grouped together in a single field. We could also make a separate sub dictionary to group the metadata separate from the data. But overall, it is not that important how the fields are organized.
That way, even the initial response that contain a ranged property can pass the first page of data the way it would if this was just a regular (non-ranged) property.
I am not sure, I like your proposal to return the first “page” of data of a ranged property by default.
The first request will typically be for a client to check which entries are present. It can then select the most interesting entries. And retrieve the additional data for these entries.
With your proposal, the number of entries that a server can return in a single response would be much smaller, as most of the response would consist of the first “pages” of properties.
If a user wants the behaviour you described, they could simply use the property_range
query parameter to specify that they want the first x values of a property returned.
If a client would retrieve only a single property, it would make more sense to already provide some of the data, but having a different behaviour for the single entry endpoint could also be confusing/complicate things.
I felt we were going to end up with an ugly design to force clients to fiddle with range access for all collections, when in many cases of smaller collections this would be unnecessary.
Another solution could perhaps be to define the collection field flexibly so that you could have a regular and a ranged version of this property. I already wrote that if the part after the ranged prefix is the name of a regular property, it should meet all the requirements of that property.
'links': a JSON:API object on the format described in the JSON:API specification section "Pagination", which is used for pagination of the ranged property.
In the JSON API v1.0 links is a reserved key word, so we would have to give this field a different name. (or drop support for json API 1.0) see section 5.2.3 of the JSON API specification.
It is quite a big change you are suggesting. Instead of retrieving the data per entry, you want to retrieve data per property. A bit as if each property is a resource in itself.
If any of the range-selecting query parameters defined below are given, then data matching that range selection is returned. The server MAY support paging also of the requested range, or MAY return an error if the requested range is too big to fit in a single response.
I do not think returning errors to the user is very nice when the user can’t know in advance that something is wrong.
So we should somehow specify how much data the server can return, and then the client should construct a request that fits between these bounds.
But this is difficult as the size of the properties differs from entry to entry and is unknown to the client.
So I do think that the server should implement pagination as it has all the data needed to do so while the client has not.
We should probably also think about creating streaming responses, so sending back large data items is less odd to the client. Now we first prepare the response and then send it. But for large responses this would be odd as the user does not get any data back for a while. If we do this, we can send larger responses, and we would not need ranged properties as often. But this is probably more of an issue for the optimade python tools rather than a change in the optimade specification.
Sara and Gian-Marco have asked me to prioritize writing a tutorial for the old trajectory proposal, so it may take a while before I get back to this topic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the dictionary would also contain data when the property is listed within the property_range query parameter so the term "metadata dictionary" is misleading.
Where does the text of this PR presently say this, i.e., that the behavior of value
depends on the query parameter property_range
? What I see in the text is: "The property values
MUST be present when serialization_format
is not set to "linear"
."
I do like that all the data is grouped together in a single field. We could also make a separate sub dictionary to group the metadata separate from the data. But overall, it is not that important how the fields are organized.
I think there is a point in separating data and metadata, especially when we consider the possibility of searching for data. But I agree that whether the data goes inside or outside the dictionary is not the most crucial design decision. Perhaps we can discuss it on the web meeting.
I am not sure, I like your proposal to return the first “page” of data of a ranged property by default. The first request will typically be for a client to check which entries are present. It can then select the most interesting entries. And retrieve the additional data for these entries.
With your proposal, the number of entries that a server can return in a single response would be much smaller, as most of the response would consist of the first “pages” of properties.
The first page doesn't need to be very large, it is up to the implementation - it could be, say, 20 values. If the client wants the data it asks for the field (and should then be happy to also get the data), if the client does not want the data it should not ask for the field.
If a user wants the behaviour you described, they could simply use the property_range query parameter to specify that they want the first x values of a property returned.
I'm trying to move this in the direction where implementations that do not support random access of the data does not have to implement property_range
at all, which is fairly complex to parse the way it is currently defined.
I hadn't even considered that the way you've written this is meant to support a range selection over every entry that matches a filter. I do not think implementations should be forced to support that even if they do support range selection over a single entry.
In the JSON API v1.0 links is a reserved key word, so we would have to give this field a different name. (or drop support for json API 1.0) see section 5.2.3 of the JSON API specification.
Can you cite the part of the JSON:API specification that you read to say one cannot have links
as a key in a dictionary inside attributes
?
It is quite a big change you are suggesting. Instead of retrieving the data per entry, you want to retrieve data per property. A bit as if each property is a resource in itself.
No, that is not what I meant, sorry if I was unclear. This was meant to be a small adjustment in behavior - the URL in links is meant to give you a single entry response with the next page of data for every property using the same range id as the one your are paging in. You can implement it by a URL that adds "property_ranges" selecting the next range (but you don't have to). I see now that I had missed how you want to allow multidimensional indexing with multiple range_ids; one would in that case have to work out how paging works in the multidimensional case. (But, do we need to support multidimensional ranges?)
So we should somehow specify how much data the server can return, and then the client should construct a request that fits between these bounds.
Well, as you also conclude, that cannot generally be done given the degrees of freedom supported by serialization_format
. So, sure, if you want to specify that IF property_range
is supported and the client asks for more results than the server can/wants to return, then the server also MUST allow paging those ranged responses with the paging mechanism I proposed, I'd be ok with that. As long as there is an option not to support property_range
at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where does the text of this PR presently say this, i.e., that the behavior of value depends on the query parameter.
On line 526/527 it is mentioned that the values are only returned when the property is listed in the property_ranges query parameter.
I have added an extra line to the description of the values field to make it a bit clearer.
I'm trying to move this in the direction where implementations that do not support random access of the data does not have to implement property_range at all, which is fairly complex to parse the way it is currently defined.
In that case, the client would still need a way to specify which fields it wants to have. So I think you would still need the property_range query parameter.
How would you make it easier to parse ?
I hadn't even considered that the way you've written this is meant to support a range selection over every entry that matches a filter. I do not think implementations should be forced to support that even if they do support range selection over a single entry.
Most of the code for the single entry endpoint and the multiple entry endpoint is the same. So I do not think this makes the implementation more difficult. It may actually be more work to create different behaviour for the single and multiple entry points.
Can you cite the part of the JSON:API specification that you read to say one cannot have links as a key in a dictionary inside attributes?
“Complex data structures involving JSON objects and arrays are allowed as attribute values. However, any object that constitutes or is contained in an attribute MUST NOT contain a relationships or links member, as those members are reserved by this specification for future use.”
No, that is not what I meant, sorry if I was unclear. This was meant to be a small adjustment in behaviour - the URL in links is meant to give you a single entry response with the next page of data for every property using the same range id as the one your are paging in. You can implement it by a URL that adds "property_ranges" selecting the next range (but you don't have to).
So, if I understand it correctly, you want to give each resource/entry its own next link to retrieve the remainder of this entry, and use the standard JSON API next link to point to the next entry that matches the filter.
I do not use the range_id for retrieving the data.
You use the name of the field in the query parameter property_ranges. So you can define a separate range for each property.
I see now that I had missed how you want to allow multidimensional indexing with multiple range_ids; one would in that case have to work out how paging works in the multidimensional case. (But, do we need to support multidimensional ranges?)
The current proposal already supports paging in multiple dimensions. (The range_ids have nothing to do with indexing. It only shows that two properties use the same range, so that values with the same index belong to each other.)
I mostly started out thinking how I could make sure that the proposal was compatible with a future extension to multiple dimensions. And when I had figured that out, I thought I could just as well write it out in the specification.
I could leave out the references to multiple dimensions, although in that case some people may wonder why there are some extra pairs of square brackets needed.
Some people have also asked me whether it would be possible to retrieve the cartesian_site_positions only for specific atoms. The current specification does not yet allow that, but I think the property_ranges query parameter could be extended to allow specific indexes. We could, for example, make a convention that the list that now contains the first last and step parameters could be replaced with a list of the indexes, placed between “(“ and “)”, for which data should be retrieved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm trying to move this in the direction where implementations that do not support random access of the data does not have to implement property_range at all, which is fairly complex to parse the way it is currently defined.
In that case, the client would still need a way to specify which fields it wants to have. So I think you would still need the property_range query parameter.
I don't get what you mean here, sorry. Which fields to return in a response is requested via response_fields
?
How would you make it easier to parse?
I'm trying to propose to make it so that the single mandatory feature to get all data in a ranged property is via a next-like paging system, which wouldn't require interpreting property_range
at all.
Can you cite the part of the JSON:API specification that you read to say one cannot have links as a key in a dictionary inside attributes?
“Complex data structures involving JSON objects and arrays are allowed as attribute values. However, any object that constitutes or is contained in an attribute MUST NOT contain a relationships or links member, as those members are reserved by this specification for future use.”
Oh, wow, I had missed this completely. It is a really strange rule to have for attribute data at arbitrary depth, and it seems they removed it in v1.1. Well, ok then, you are right that we would have to name that inner field something different than links
. Perhaps just place a URL in a member named next
.
So, if I understand it correctly, you want to give each resource/entry its own next link to retrieve the remainder of this entry, and use the standard JSON API next link to point to the next entry that matches the filter. I do not use the range_id for retrieving the data. You use the name of the field in the query parameter property_ranges. So you can define a separate range for each property.
No, the way you describe it does not sound right. Lets me try again.
The standard JSON API next link would work as it always has: it pages over a range of returned entries. This is an absolute necessity to adhere to JSON:API.
A ranged property would, among its metadata, communicate a "special" next link. Following that next link would give you that entry (that whole entry with all attributes/properties as requested by response_fields
) but where the "data" part of the ranged property contains the next range of values. This is the most straightforward way (and thus easiest to implement) I can imagine to communicate paged ranged data values within a fully valid OPTIMADE/JSON:API response.
Maybe the most clear way to express it is that this "feature" is meant to be completely compatible with your design of property_ranges
. Someone who has implemented property_ranges
gets precisely the right thing by simply constructing that special next
link by adding a property_ranges
parameter that specifies the next page of data.
Some people have also asked me whether it would be possible to retrieve the cartesian_site_positions only for specific atoms. The current specification does not yet allow that, but I think the property_ranges query parameter could be extended to allow specific indexes.
Why would this extension belong in property_ranges
? If so, one could only use it for ranged properties? It seems to me it belongs in response_fields
or as its own query parameter, where one could imagine various ways to extend it to ask for specific subsets of a property.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't get what you mean here, sorry. Which fields to return in a response is requested via response_fields?
In the discussion on the trajectory endpoint, it was concluded that it would be confusing to determine whether the values should be returned based on whether the property was present in the response_fields. If the whole property (metadata + data) is present under a single field, there would be no way to specify with the response_fields parameter that you only want the metadata.
So instead we decided to trigger returning the data by the presence of the query parameters first_frame, last_frame and frame_step.
It therefore made sense to me to do the same here with the property_ranges query parameter.
I now understand why you wanted to split the data and metadata into two separate fields. That way, you could add the data field to the response_fields to get the data. And not return it otherwise.
If we however add by default a next link to the metadata, it would not be necessary to have a separate field for the data.
No, the way you describe it does not sound right. Let me try again.
The standard JSON API next link would work as it always has: it pages over a range of returned entries. This is an absolute necessity to adhere to JSON:API.
A ranged property would, among its metadata, communicate a "special" next link. Following that next link would give you that entry (that whole entry with all attributes/properties as requested by response_fields) but where the "data" part of the ranged property contains the next range of values. This is the most straightforward way (and thus easiest to implement) I can imagine to communicate paged ranged data values within a fully valid OPTIMADE/JSON:API response.
Maybe the most clear way to express it is that this "feature" is meant to be completely compatible with your design of property_ranges. Someone who has implemented property_ranges gets precisely the right thing by simply constructing that special next link by adding a property_ranges parameter that specifies the next page of data.
Ok, I think I understand you now. I’ll adjust in the proposal.
Why would this extension belong in property_ranges? If so, one could only use it for ranged properties? It seems to me it belongs in response_fields or as its own query parameter, where one could imagine various ways to extend it to ask for specific subsets of a property.
The cartesian_site_positions field is an example of multidimensional data.
In a trajectory the first dimension would cover the frames, the second dimension the particles and the third the dimensions(x, y, z).
In a PDB file, the atoms belonging to the protein usually come first, while the solvent atoms come later. So one could make a query, selecting only a part of the range in the second dimension, so that only the positions of the atoms of the protein and not those of the solvent would be returned.
Co-authored-by: Antanas Vaitkus <[email protected]>
…JPBergsma/OPTIMADE into JPBergsma/Add_ranged_properties # Conflicts: # optimade.rst
optimade.rst
Outdated
The third value specifies the step size. | ||
A list consists of a pair of square brackets ("[", ASCII 91(0x5B)) and ("]", ASCII 93(0x5D)) enclosing a number of values separated by a comma (",", ASCII 91(0x5B)) | ||
Ranges can be specified for multiple properties by separating them with a comma. | ||
Databases MUST return the :property:`values` and :property:`indexes` field belonging to properties listed and SHOULD use the ranges in this query parameter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some reformulation for clarity:
Databases MUST return the :property:`values` and :property:`indexes` field belonging to properties listed and SHOULD use the ranges in this query parameter. | |
When a client includes `property_ranges` in a request, the response MUST include the :property:`values` and :property:`indexes` field belonging to properties listed and SHOULD use the ranges in this query parameter. |
but the SHOULD here makes me nervous. I think supporting this parameter should be optional, but if you support it, don't you have to follow the indexing request?
Co-authored-by: Rickard Armiento <[email protected]>
I apologize for feature-creep on already long on-going work; but, after some more thinking I realize we have a number of properties already defined that would benefit from the ability to be "ranged", not the least, I think with some, actually quite minor, adjustment of this PR we can allow any property to be ranged on an entry-by-entry basis. If people agree this would be useful, I think this is how it could be done:
The above seems to me as a minor change to the PR with major utility going forward. |
62c6686
to
139c70e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comment (submitting because I've started to get errors about that other changes have been made)
optimade.rst
Outdated
@@ -455,15 +455,14 @@ Ranged properties are used for properties that are too large to be returned by d | |||
- **Requirements/Conventions**: | |||
|
|||
- **Support**: OPTIONAL support in implementations. | |||
- A ranged property can be recognized by the presence of the :field:`range` in the metadata of the property, i.e. in the field: <property_name>_meta. | |||
- A ranged property can be recognized by the presence of the :field:`range` in the metadata of the property, i.e. in the field: :field:`<property_name>_meta`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If all fields are technically "ranged", I think the differentiation here has to be whether there is something along a field ranged/more_data_available
present and true
, meaning not all data is communicated in the current response. It must be allowed to have a ranged
dictionary even if that is not true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I guess you can make all fields ranged by setting the number of indexable dimensions to zero and return the whole property in one go, so the field behaves the same as the current fields. So you could still add a ranged subfield to the metadata even if a property does not support any of the new features.
There is already a more_data_available field that can be used to indicate whether the server has retuned all the data.
By setting the number of indexable_dimensions to 0, you as a server could also indicate to the client that it is not possible to perform slicing on the property. But perhaps it is cleaner to add a boolean to the metadata to define whether the supports that the client can request a subrange to retrieve.
Does this answer your question/remark ?
optimade.rst
Outdated
A list with an identifier for each dimension of the property. If two properties have the same range_id for a dimension it means that the values at an index along this dimension belong to each other. | ||
For example, if both the :property:`energy` and :property:`cartesian_site_positions` of a trajectory have the same :field:`range_ids` it indicates that the energy at an index x(in the dimension labeled by this range_ids) belongs to the cartesian_site_positions at the same index x. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I understand now what you meant with "correlated" a few edits ago. I think we need to find a term for this, but one that hopefully is more clear. Maybe "synchronized"?
But, is your idea that one can get different ranges from different properties even for dimensions that are "synchronized"? I'd propose that one cannot. If asking ranges by range_id, all "syncronized" quantities would indeed be synchronized. (But otherwise "syncronized" is maybe not the best term)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm honestly not sure if my version is better - consider it a suggestion that you can accept or rework into your version for clarity.
A list with an identifier for each dimension of the property. If two properties have the same range_id for a dimension it means that the values at an index along this dimension belong to each other. | |
For example, if both the :property:`energy` and :property:`cartesian_site_positions` of a trajectory have the same :field:`range_ids` it indicates that the energy at an index x(in the dimension labeled by this range_ids) belongs to the cartesian_site_positions at the same index x. | |
A list with an identifier for each dimension of the property. | |
If any dimension in two or more properties share the same :field:`range_id` those dimensions should be thought of as the same dimension. | |
For example, if both the :property:`energy` and :property:`cartesian_site_positions` of a molecular dynamics trajectory share a :field:`range_id` of :val:`frame`, it should be interpreted as them providing, for each frame, synchronized values of the coordinates and the energy. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me, synchronous gives the impression that both are happening at the same time.
It however also applies to cases where the dimension is something other than time, so It is probably not the best word.
I am not entirely sure though what you mean with:
But, is your idea that one can get different ranges from different properties even for dimensions that are "synchronized"? I'd propose that one cannot. If asking ranges by range_id, all "syncronized" quantities would indeed be synchronized. (But otherwise "syncronized" is maybe not the best term)
If properties have the same range_id in a dimension, the values for those properties at the same index should belong to each other.
optimade.rst
Outdated
- :field:`range_ids`: list of strings. | ||
|
||
A list with an identifier for each dimension of the property. | ||
If two properties have the same range_id for a dimension, it means that the values at an index along this dimension belong to each other. | ||
For example, if both the :property:`energy` and :property:`cartesian_site_positions` of a trajectory have the same :field:`range_ids` it indicates that the energy at an index x(in the dimension labelled by this range_ids) belongs to the cartesian_site_positions at the same index x. | ||
SHOULD be a queryable property with support for all mandatory filter features. | ||
|
||
If the :field:`<property_name>` contains data, the following properties MUST be present or SHOULD NOT be present, depending on the value of the :property:`serialization_format`. Querying is not relevant for these properties and SHOULD NOT be supported. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should range_id
really be a per-entry metadata? Is that perhaps something we define as part of the property definition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have not yet been able to come up with an example where defining range_id
in the property definition would cause more problems than at the single entry level. So I think it should be possible to define it in the property definition.
I have updated the specification with the points we discussed during the online meeting.(with @rartino) I now use parentheses instead of square brackets as parethesis are already mentioned in the Grammar (I do not really know anything about this grammar so I do not know how hard or easy it is to make changes.) I have added a per entry next field as we no longer specify ranges per property but per dimension. The text can probably still be fine tuned a bit more. but for now I would be happy if we can agree on the specification sufficiently well that it could be implemented. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all the work put into this.
I am generally very supportive of this version. I especially like how the property range parameter is much more readable now.
But, I would:
-
Appreciate someone else also takes a careful look to see if the present version makes sense also to them (since this is a result of you and I discussing).
-
It would be great if someone implement at least a minimal subset of this before it is merged into a version of OPTIMADE we release, since I expect little details on how things are defined may be much more clear then. I'm certainty noticing that effect when working with the property definitions now.
optimade.rst
Outdated
The total number of values in the property. | ||
SHOULD be a queryable property with support for all mandatory filter features. | ||
|
||
- :field:`next`: a `JSON API links object <http://jsonapi.org/format/1.0/#document-links>`__. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've had this trouble elsewhere in the specification and keep running into it. The reference to a JSON API links object in a section that is meant to be data-format agnostic doesn't really work. If someone is trying to represent the output data in HDF5 (apparently someone already implemented that...), what does a "JSON API links object" mean inside HDF5?
So, the easy solution (suggested below) is to skip the extra degrees of freedom provided by JSON API links objects when we have links appearing in data-agnostic output, and just say that it is a string with a URL. However, given how often we are running into this issue, I'm starting to wonder if we rather should extend the base OPTIMADE data types with URL, so that the JSON API output format section generally can say that when the OPTIMADE specification says that something is "a link", you are meant to put a JSON API links object there.
- :field:`next`: a `JSON API links object <http://jsonapi.org/format/1.0/#document-links>`__. | |
- :field:`next`: String. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we define URLs in many places as a JSON API links object
, like the home page field for providers in the meta field, for example. So it seems logical to do it here as well. I will change the description a bit, so it is more like the others.
I however do think that the values we assign are often not JSON API links objects. But instead, values that would qualify as members for JSON API links objects. The value of such a member can also be a string, so it should not be any more problematic to store as the dictionaries we currently have.
Should I make a separate PR to change JSON API links objects
to something like a valid member for a JSON API links object
elsewhere in the specification ?
Ps. I have written code which outputs the response in the hdf5 format. You can test it with https://optimade-trajectory-demo.matcloud.xyz/v1/structures?response_format=hdf5. Unfortunately, h5py does not support storing a list of dicts, so I have to convert this to a dict of dicts.
optimade.rst
Outdated
|
||
In addition to these fields in the metadata, entries which support accessing data via the :query-param:`property\_ranges` query parameter SHOULD support per entry :field:`next` and :field:`more_data_availble` field to enable returning all the | ||
|
||
- :field:`next`: a `JSON API links object <http://jsonapi.org/format/1.0/#document-links>`__. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- :field:`next`: a `JSON API links object <http://jsonapi.org/format/1.0/#document-links>`__. | |
- :field:`next`: String. |
Co-authored-by: Rickard Armiento <[email protected]>
I guess I should implement the meta property first, although that should not be that much work. |
- :field:`stop`: integer. | ||
The index of the last value of this property in this dimension. | ||
|
||
- :field:`contains_null`: boolean |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should "contains_null" not be a more generic metadata property that could also apply to other non-ranged fields ?
Ps. This would also allow querying on whether a property contains_null, now that I have placed in the description that support for queries on the fields under "range" is optional.
…of ranged properties is Optional.
…planation for property_ranges query parameter.
The adding of the partial_data mechanism PR#467 has made this PR largely obsolete, so I will therefore close it for now. |
In This PR, I make a suggestion for how large “ranged” properties could be shared with OPTIMADE.
See the discussion in Issue#428
It also describes how this could be done for multidimensional properties.
I have now placed the main body of the text in section 3.9, I am however not sure if this is the best location. Feel free to suggest another location in the text.
I have now used a prefix to indicate that a property is a ranged property. This however creates a precedent, as it is the first time we use a prefix in this manner.If you prefer, I can also add a field that lists all the properties that are ranged properties, or mandate that this is specified at the info endpoint.Do we want to add a field in the property definition that a field is a ranged field, (also in case we choose to use a
prefix or aseparate property in the question before this one)?Now I have put a SHOULD level queryability not defined the queryability of all the dictionary fields except the values and indexes field.( I have not set all query ability to OPTIONAL)I do not think it is very important that these fields are queryable, so we could perhaps also make querying OPTIONAL.
Now the values can only depend linearly on the index. Do we want to define a more general field which could contain information on how to calculate the value from the index? This would be considerably more complex and probably warrants a PR of its own.Do we want to have a negative requirement on the range_id, i.e. If the range_id is not defined, a property must not have the same range as another property?For now, I removed properties average, min, max, and set as this becomes more complicated with multiple dimensions and I did not want to make this PR too big / complicated.
This can be addressed in a separate PR once this one gets merged.
Do we want to make the values for slicing optional?
The default values would then be [1,last,1]
So [] would return all the values, [100] would return all the values from the hundredth one onward and [30,50] would return every, value belonging to index 30 till 50.
Do we want to support multiple ranges for a single property? e.g. _ranged_property_name[1,100,2][101,200,1]