Sharing large data objects #428
Comments
Thanks for putting this together. I realize from what you wrote that I like "ranged" (e.g., "ranged properties") as a keyword for all this.
If we go with this, the specification should probably say that these prefixes go after the database-specific prefix if present, e.g.,
This is an interesting extension of the design. We were already discussing in #377 what to return for "ranged properties" when no range was provided. So, I would suggest putting all the meta-data needed to handle these properties into those responses, instead of having separate properties as you suggest above, e.g., for "ranges" and "index groups".
So, very much in line with your own design in #377: the user first just sends a standard OPTIMADE query. Some of the members you get back somehow indicate that they are ranged, and directly in the response you get all the meta-data you need to query over those ranges. Based on that, a subsequent query with added OPTIMADE range parameters now gives you a response where these keys instead map to actual data. Note: this is meant to reflect your own design in #377 as closely as possible.
So, something like this. A first regular query to the trajectory endpoint gives:
(Note how I have implicitly suggested here that […]) Looking at this response, you may send a follow-up query with, e.g., the added REST query parameters:
This leads to very sensible syntax for querying either on the meta-data or ranged data.
Finally, I suggest taking a look at whether you can cut down the options of the
The bit of remaining data in "index_groups" could indeed be moved to the individual properties. We would only duplicate the length of the range, which is not much data compared to the whole property.
Yes, that is correct. In the example you give, "range_id" seems to be a good name for what I called "index group".
The query parameters you suggest for retrieving ranges (range_frames_start and range_frames_count) are different from the ranges query parameter I suggested. Is there a reason you prefer these?
I was not planning to support queries on the data of the properties themselves, as this would often be far too much information to process. We could add it in a future PR if there is demand for it.
Regarding the "serialization_format", we could perhaps drop constant, as I am not sure it adds much compared to the normal way we declare variables. You could argue that constant is the default for a property and that only when it is not constant do you turn it into a ranged property. You could also use the linear option and set step_size_linear to zero, which effectively also gives a constant value.
I'm not sure I follow this - if you mean that one should do something different from what I wrote as an example, can you edit the example?
Apologies, I missed that you meant that as the query parameter syntax. While your syntax is more compact, it is more complex and will require actual parsing on the server side; maybe even a grammar... Is there any real benefit in trying to overload the query parameters this way other than to save URL space? By splitting it up into multiple query parameters, it is also easier to specify that, e.g., […] Furthermore, if we want to standardize […]
The complexity overhead for the specification in including this as an optional feature is minimal. It is going to be a single sentence. It also makes sense to allow it, when thinking of possible uses of ranged data outside of MD frames.
These are good - they apply to the whole data, and thus IMO should go in the response for the property key when not requesting a range. That makes it very easy to add your own database-specific aggregated properties. The only thing I don't see is a way to ask for, e.g., the average over a sub-range of data, but I'm OK with that feature not being supported.
Good. But in your opinion, all the others are needed? I'll take a look through them to see if I agree. As I said, if someone can map one into another cheaply on-the-fly, then both are not necessary.
Here, I have placed what I think should be returned for the ranged fields when the fields are not specified with the range parameter.
The way I formulated it now probably does require a grammar update. The advantage over the method you suggested (and that of the original trajectory proposal) is that you can specify the range per property, and that it also specifies for which properties you want to retrieve all the data. In the old design, we would have to do multiple queries if we wanted to use multiple different ranges. Each time, listing the appropriate fields under the
I am not sure what exactly you mean by "range_frames_stepsize". I did think about what to do in case of sparse data when implementing the original trajectory proposal.
As long as it is fully optional, we could add it, although I wonder whether it will be implemented.
Yes, that is the idea. My view is that OPTIMADE is for data retrieval and not analysis, so calculating averages over parts of the range does not seem like a core task of OPTIMADE to me.
Now that I think of it, we could also drop "explicit". You could use "explicit_regular_sparse" instead, and then specify the
In the metadata PR #463 that was recently merged, we currently do not allow metadata that applies to the whole entry.
@JPBergsma IMO the specification is good as it is in this regard.
I was just wondering whether the fields in partial_data_links should have property definitions, as you wrote you did not yet know where these should go for fields directly under meta.
Ok, then we can leave it like this.
I do not see why this issue cannot be closed with the merging of #467? If I don't hear anything here I will close it.
During the OPTIMADE meeting of 21-10-2022 and in the discussion on the trajectory endpoint #377, the idea was raised to have a more universal way of sharing large data objects that may be too large to comfortably fit in a single JSON response. This would require a way to return such properties across multiple responses and to specify which parts a user wants to retrieve.
Below I have given a description of how we could implement this. In some ways it is still quite minimalistic, and we could perhaps also think about how we would like to present datasets/tables in general.
I look forward to your feedback, and once we agree on the rough outline sketched here, I will write a proposal with all the details to add it to the OPTIMADE specification.
Marking
The user should somehow know which properties of an entry can be retrieved in pieces, i.e. are indexable.
One option is to have a field to list all the properties that can be indexed, for example:
indexable_properties: [cartesian_site_positions, _exmpl_kinetic_energy, _exmpl_potential_energy]
Another option would be to use a common prefix as suggested by @rartino for all properties, for example:
series_cartesian_site_positions
or indexed_cartesian_site_positions
Edit: Would such prefixes not interfere with the database-specific prefixes?
What are your opinions on this?
Indexing
In general, we cannot assume that all the properties use the same index, as was the case for the trajectories, where the frame number was the common index. Two datasets may be present in one entry: for example, a trajectory from a Monte Carlo simulation, which uses the Monte Carlo step as the index, and a set of radial distribution functions, where the index corresponds to a distance from the central atom.
Therefore, we should have a mechanism to specify for each property which range should be retrieved.
One way to do this would be to have a ranges query parameter:
ranges = cartesian_site_positions[1, 2000, 10], _exmpl_kinetic_energy[1, 2000], _exmpl_potential_energy[1, 2000]
The first value is the index of the first value that should be returned, the second value is the index of the last value that should be returned, and the third value is the step size.
We could also allow a more concise notation that accepts multiple names per range, such as:
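To make the three values concrete, here is a minimal sketch of which indexes such a range would select. It assumes that indexes are 1-based and that the last index is inclusive; neither is settled in this proposal, and `selected_indexes` is just an illustrative name.

```python
def selected_indexes(first, last, step=1):
    """Return the indexes selected by a [first, last, step] range.

    Assumption (not fixed by the proposal): indexes are 1-based and
    `last` is inclusive. A missing step is taken to be 1.
    """
    if first < 1 or last < first or step < 1:
        raise ValueError("expected 1 <= first <= last and step >= 1")
    return list(range(first, last + 1, step))


# cartesian_site_positions[1, 2000, 10] from the example above
frames = selected_indexes(1, 2000, 10)
print(frames[:3], frames[-1], len(frames))  # [1, 11, 21] 1991 200
```

Under these assumptions the example range returns 200 values, the last one at index 1991.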
ranges=cartesian_site_positions[1, 2000, 10], (_exmpl_kinetic_energy, _exmpl_potential_energy, _exmpl_constraint_force)[1, 2000]
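Since this syntax needs real parsing on the server side, here is a rough sketch of what that could look like. It is not a grammar proposal: `parse_ranges` is a hypothetical helper, it only accepts integers, and it assumes a step size of 1 when the third value is omitted.

```python
import re

# One item: a single name, or a parenthesised comma-separated group of names,
# followed by [first, last] or [first, last, step], then a comma or the end.
_ITEM = re.compile(
    r"\s*(?:\((?P<group>[^()]*)\)|(?P<name>\w+))\s*"
    r"\[(?P<bounds>[^\[\]]*)\]\s*(?:,|$)"
)


def parse_ranges(value):
    """Parse a `ranges` value into {property_name: (first, last, step)}."""
    result = {}
    pos = 0
    while pos < len(value):
        match = _ITEM.match(value, pos)
        if match is None:
            raise ValueError(f"cannot parse ranges at position {pos}")
        bounds = [int(part) for part in match["bounds"].split(",")]
        if len(bounds) == 2:
            bounds.append(1)  # assumed default step size
        if len(bounds) != 3:
            raise ValueError("a range needs two or three integers")
        group = match["group"]
        names = group.split(",") if group is not None else [match["name"]]
        for name in names:
            result[name.strip()] = tuple(bounds)
        pos = match.end()
    return result


query = (
    "cartesian_site_positions[1, 2000, 10], "
    "(_exmpl_kinetic_energy, _exmpl_potential_energy, _exmpl_constraint_force)[1, 2000]"
)
for name, bounds in parse_ranges(query).items():
    print(name, bounds)
```

A regular expression is enough for this flat syntax; anything nested would need a proper grammar.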
Index_groups
To know whether the index of one property corresponds to the index of another property, we need to know whether both properties share the same index.
We therefore need to define an index for each group of correlated properties.
For example:
Index_groups: [{"name": "trajectory", "n_indexes": 20000}, {"name": "radial_distribution_function", "n_indexes": 1000}]
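As an illustration of how a client or server could use this, the sketch below checks whether two properties share an index and whether a requested range fits inside the group. The property-to-group mapping, the property name `_exmpl_rdf`, and both helper functions are made up for this example; indexes are again assumed to be 1-based and inclusive.

```python
index_groups = [
    {"name": "trajectory", "n_indexes": 20000},
    {"name": "radial_distribution_function", "n_indexes": 1000},
]

# Hypothetical mapping from each property to the index group it uses.
property_index_group = {
    "cartesian_site_positions": "trajectory",
    "_exmpl_kinetic_energy": "trajectory",
    "_exmpl_rdf": "radial_distribution_function",
}


def share_index(prop_a, prop_b):
    """True if values of both properties at the same index belong together."""
    return property_index_group[prop_a] == property_index_group[prop_b]


def check_range(prop, first, last):
    """Raise ValueError if [first, last] falls outside the property's index group."""
    sizes = {group["name"]: group["n_indexes"] for group in index_groups}
    n_indexes = sizes[property_index_group[prop]]
    if not 1 <= first <= last <= n_indexes:
        raise ValueError(f"{prop}: range must lie within 1..{n_indexes}")


print(share_index("cartesian_site_positions", "_exmpl_kinetic_energy"))  # True
print(share_index("cartesian_site_positions", "_exmpl_rdf"))  # False
check_range("cartesian_site_positions", 1, 2000)  # passes silently
```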
Property descriptors
Similar to how we defined properties in the trajectory proposal, we could define each indexable property as a dictionary.
Here, we could reuse many of the properties already defined for the trajectory endpoint in PR #377 to describe how the values map onto the index. For brevity, I have not included all the details, but they can be found in PR #377.
serialization_format: the serialization format of the property; which of the fields below apply depends on its value.
offset_linear: If frame_serialization_format is set to "linear", this property gives the value at frame 1. Example: 1.5
step_size_linear: If frame_serialization_format is set to "linear", this value gives the change in the value of the property per unit of frame number. E.g., if at frame 3 the value of the property is 0.6 and step_size_linear = 0.2, then at frame 4 the value of the property will be 0.8. Example: 0.0005
offset_sparse: If frame_serialization_format is set to "explicit_regular_sparse", this property gives the frame number to which the first value belongs. Example: 100
step_size_sparse: If frame_serialization_format is set to "explicit_regular_sparse", this value indicates that a value is defined every step_size_sparse frames. Example: 100
sparse_frames: If frame_serialization_format is set to "explicit_custom_sparse", this field holds the frames to which the values in the values field belong. Example: [0,20,78,345]
values: The format of this field depends on the property and on the frame_serialization_format parameter. It is only returned when the variable name is present in "ranges".
In addition to these properties already defined in the trajectory proposal, we would also need to define which "index" the property uses.
index_group: This property indicates which index belongs to this property. It matches one and only one of the names in Index_groups.
For example: "index_group": "radial_distribution_function"
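The sketch below shows how a client might interpret these descriptors. It uses the key "serialization_format" for the format field and assumes that frames are numbered from 1 for the linear format and that positions in values are counted from 0; the function names and the example dictionaries are illustrative only.

```python
def linear_value(descriptor, frame):
    """Value at `frame` for the "linear" format (frame 1 holds offset_linear)."""
    return descriptor["offset_linear"] + (frame - 1) * descriptor["step_size_linear"]


def frame_of_value(descriptor, position):
    """Frame number of the value stored at 0-based `position` in `values`."""
    fmt = descriptor["serialization_format"]
    if fmt == "explicit_regular_sparse":
        return descriptor["offset_sparse"] + position * descriptor["step_size_sparse"]
    if fmt == "explicit_custom_sparse":
        return descriptor["sparse_frames"][position]
    raise ValueError(f"format {fmt!r} does not store explicit frames")


# Example descriptors built from the example values listed above.
time = {"serialization_format": "linear", "offset_linear": 1.5, "step_size_linear": 0.0005}
regular = {"serialization_format": "explicit_regular_sparse", "offset_sparse": 100, "step_size_sparse": 100}
custom = {"serialization_format": "explicit_custom_sparse", "sparse_frames": [0, 20, 78, 345]}

print(linear_value(time, 1))       # 1.5
print(frame_of_value(regular, 2))  # 300
print(frame_of_value(custom, 3))   # 345
```

With step_size_linear set to 0, `linear_value` returns offset_linear for every frame, which is how the linear format can stand in for a constant one.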
Sub properties for Querying
It would take too much time to perform queries on all the values of a large property. Therefore, a number of sub properties can be defined which can be used to perform queries and make selections. Whether it makes sense to have these properties depends on the context and variable type, so I think they should all be optional.
Such properties are: