Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ranged properties #452
Add ranged properties #452
Changes from 10 commits
c18cf0c
7dc50f2
6afcf9c
1a55c2a
b597a67
37db878
a832751
16aba07
b1d69a8
edbfc25
14de45d
906db81
1feb4a9
a96dffe
15f599c
73905dc
c6834f3
b0cc94c
0cee1e6
d7c8a9c
d1e8d74
139c70e
916d6f2
ee3651e
1f794b6
169a1f4
f513596
65b8ad1
94db38c
033ea11
3762d30
8332567
a87b301
7e9b4f4
4b298c5
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, do I understand correctly that the present design is for the member
values
in the "metadata dictionary" to also contain the actual data values? (So, in a sense, "metadata" is mislabeling it?)So, correlated with the thoughts above about pagination: how about making responses always contain two fields for ranged properties, one metadata field with only metadata, and one for the data. That way, even the initial response that contain a ranged property can pass the first page of data the way it would if this was just a regular (non-ranged) property. This means that if the ranged property happens to be small, the client gets all the data it needs in the first response without having to do anything else. Only in the case all data does not fit in one response does the client need to engage with the specifics of accessing ranged properties. This will actually greatly help collections (#386): I felt we were going to end up with an ugly design to force clients to fiddle with range access for all collections, when in many cases of smaller collections this would be unnecessary.
So, I propose:
values
into its own field; i.e., edit the initial text to say that for ranged properties two fields are returned:<property_name>_ranged_meta
: a dictionary with all these metadata fields.<property_name>_ranged_data
: a list of values of any type.The server MAY support querying on either or both.
For the field
<fieldname>_ranged_meta
all the fields you defined here are kept with the same meaning, but two new ones are needed (and perhaps these should be the only ones MANDATORY):The field
<fieldname>_ranged_data
would then be defined as:more_data_available
isfalse
in<fieldname>_ranged_meta
. If<fieldname>_ranged_meta
istrue
, then further pages of data can be fetched by the client via the URL innext
in<fieldname>_ranged_meta
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the dictionary would also contain data when the property is listed within the
property_range
query parameter so the term "metadata dictionary" is misleading.I do like that all the data is grouped together in a single field. We could also make a separate sub dictionary to group the metadata separate from the data. But overall, it is not that important how the fields are organized.
I am not sure, I like your proposal to return the first “page” of data of a ranged property by default.
The first request will typically be for a client to check which entries are present. It can then select the most interesting entries. And retrieve the additional data for these entries.
With your proposal, the number of entries that a server can return in a single response would be much smaller, as most of the response would consist of the first “pages” of properties.
If a user wants the behaviour you described, they could simply use the
property_range
query parameter to specify that they want the first x values of a property returned.If a client would retrieve only a single property, it would make more sense to already provide some of the data, but having a different behaviour for the single entry endpoint could also be confusing/complicate things.
Another solution could perhaps be to define the collection field flexibly so that you could have a regular and a ranged version of this property. I already wrote that if the part after the ranged prefix is the name of a regular property, it should meet all the requirements of that property.
In the JSON API v1.0 links is a reserved key word, so we would have to give this field a different name. (or drop support for json API 1.0) see section 5.2.3 of the JSON API specification.
It is quite a big change you are suggesting. Instead of retrieving the data per entry, you want to retrieve data per property. A bit as if each property is a resource in itself.
I do not think returning errors to the user is very nice when the user can’t know in advance that something is wrong.
So we should somehow specify how much data the server can return, and then the client should construct a request that fits between these bounds.
But this is difficult as the size of the properties differs from entry to entry and is unknown to the client.
So I do think that the server should implement pagination as it has all the data needed to do so while the client has not.
We should probably also think about creating streaming responses, so sending back large data items is less odd to the client. Now we first prepare the response and then send it. But for large responses this would be odd as the user does not get any data back for a while. If we do this, we can send larger responses, and we would not need ranged properties as often. But this is probably more of an issue for the optimade python tools rather than a change in the optimade specification.
Sara and Gian-Marco have asked me to prioritize writing a tutorial for the old trajectory proposal, so it may take a while before I get back to this topic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where does the text of this PR presently say this, i.e., that the behavior of
value
depends on the query parameterproperty_range
? What I see in the text is: "The propertyvalues
MUST be present whenserialization_format
is not set to"linear"
."I think there is a point in separating data and metadata, especially when we consider the possibility of searching for data. But I agree that whether the data goes inside or outside the dictionary is not the most crucial design decision. Perhaps we can discuss it on the web meeting.
The first page doesn't need to be very large, it is up to the implementation - it could be, say, 20 values. If the client wants the data it asks for the field (and should then be happy to also get the data), if the client does not want the data it should not ask for the field.
I'm trying to move this in the direction where implementations that do not support random access of the data does not have to implement
property_range
at all, which is fairly complex to parse the way it is currently defined.I hadn't even considered that the way you've written this is meant to support a range selection over every entry that matches a filter. I do not think implementations should be forced to support that even if they do support range selection over a single entry.
Can you cite the part of the JSON:API specification that you read to say one cannot have
links
as a key in a dictionary insideattributes
?No, that is not what I meant, sorry if I was unclear. This was meant to be a small adjustment in behavior - the URL in links is meant to give you a single entry response with the next page of data for every property using the same range id as the one your are paging in. You can implement it by a URL that adds "property_ranges" selecting the next range (but you don't have to). I see now that I had missed how you want to allow multidimensional indexing with multiple range_ids; one would in that case have to work out how paging works in the multidimensional case. (But, do we need to support multidimensional ranges?)
Well, as you also conclude, that cannot generally be done given the degrees of freedom supported by
serialization_format
. So, sure, if you want to specify that IFproperty_range
is supported and the client asks for more results than the server can/wants to return, then the server also MUST allow paging those ranged responses with the paging mechanism I proposed, I'd be ok with that. As long as there is an option not to supportproperty_range
at all.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On line 526/527 it is mentioned that the values are only returned when the property is listed in the property_ranges query parameter.
I have added an extra line to the description of the values field to make it a bit clearer.
In that case, the client would still need a way to specify which fields it wants to have. So I think you would still need the property_range query parameter.
How would you make it easier to parse ?
Most of the code for the single entry endpoint and the multiple entry endpoint is the same. So I do not think this makes the implementation more difficult. It may actually be more work to create different behaviour for the single and multiple entry points.
“Complex data structures involving JSON objects and arrays are allowed as attribute values. However, any object that constitutes or is contained in an attribute MUST NOT contain a relationships or links member, as those members are reserved by this specification for future use.”
So, if I understand it correctly, you want to give each resource/entry its own next link to retrieve the remainder of this entry, and use the standard JSON API next link to point to the next entry that matches the filter.
I do not use the range_id for retrieving the data.
You use the name of the field in the query parameter property_ranges. So you can define a separate range for each property.
The current proposal already supports paging in multiple dimensions. (The range_ids have nothing to do with indexing. It only shows that two properties use the same range, so that values with the same index belong to each other.)
I mostly started out thinking how I could make sure that the proposal was compatible with a future extension to multiple dimensions. And when I had figured that out, I thought I could just as well write it out in the specification.
I could leave out the references to multiple dimensions, although in that case some people may wonder why there are some extra pairs of square brackets needed.
Some people have also asked me whether it would be possible to retrieve the cartesian_site_positions only for specific atoms. The current specification does not yet allow that, but I think the property_ranges query parameter could be extended to allow specific indexes. We could, for example, make a convention that the list that now contains the first last and step parameters could be replaced with a list of the indexes, placed between “(“ and “)”, for which data should be retrieved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't get what you mean here, sorry. Which fields to return in a response is requested via
response_fields
?I'm trying to propose to make it so that the single mandatory feature to get all data in a ranged property is via a next-like paging system, which wouldn't require interpreting
property_range
at all.Oh, wow, I had missed this completely. It is a really strange rule to have for attribute data at arbitrary depth, and it seems they removed it in v1.1. Well, ok then, you are right that we would have to name that inner field something different than
links
. Perhaps just place a URL in a member namednext
.No, the way you describe it does not sound right. Lets me try again.
The standard JSON API next link would work as it always has: it pages over a range of returned entries. This is an absolute necessity to adhere to JSON:API.
A ranged property would, among its metadata, communicate a "special" next link. Following that next link would give you that entry (that whole entry with all attributes/properties as requested by
response_fields
) but where the "data" part of the ranged property contains the next range of values. This is the most straightforward way (and thus easiest to implement) I can imagine to communicate paged ranged data values within a fully valid OPTIMADE/JSON:API response.Maybe the most clear way to express it is that this "feature" is meant to be completely compatible with your design of
property_ranges
. Someone who has implementedproperty_ranges
gets precisely the right thing by simply constructing that specialnext
link by adding aproperty_ranges
parameter that specifies the next page of data.Why would this extension belong in
property_ranges
? If so, one could only use it for ranged properties? It seems to me it belongs inresponse_fields
or as its own query parameter, where one could imagine various ways to extend it to ask for specific subsets of a property.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the discussion on the trajectory endpoint, it was concluded that it would be confusing to determine whether the values should be returned based on whether the property was present in the response_fields. If the whole property (metadata + data) is present under a single field, there would be no way to specify with the response_fields parameter that you only want the metadata.
So instead we decided to trigger returning the data by the presence of the query parameters first_frame, last_frame and frame_step.
It therefore made sense to me to do the same here with the property_ranges query parameter.
I now understand why you wanted to split the data and metadata into two separate fields. That way, you could add the data field to the response_fields to get the data. And not return it otherwise.
If we however add by default a next link to the metadata, it would not be necessary to have a separate field for the data.
Ok, I think I understand you now. I’ll adjust in the proposal.
The cartesian_site_positions field is an example of multidimensional data.
In a trajectory the first dimension would cover the frames, the second dimension the particles and the third the dimensions(x, y, z).
In a PDB file, the atoms belonging to the protein usually come first, while the solvent atoms come later. So one could make a query, selecting only a part of the range in the second dimension, so that only the positions of the atoms of the protein and not those of the solvent would be returned.