Sharing large data objects #428

JPBergsma · 2022-12-01T19:43:53Z

During the OPTIMADE meeting of 21-10-2022 and in the discussion on the trajectory endpoint #377, the idea was raised to have a more universal way of sharing large data objects that may be too large to comfortably fit in a single JSON response. This would require a way to return such properties across multiple responses and to specify which parts a user wants to retrieve.

Below I have given a description of how we could implement this. In some ways it is still quite minimalistic, and we could perhaps also think about how we would like to present datasets/tables in general.
I look forward to your feedback. Once we agree on the rough outline sketched here, I will write a proposal with all the details to add it to the OPTIMADE specification.

Marking

The user should somehow know which properties of an entry can be retrieved in pieces, i.e. are indexable.
One option is to have a field to list all the properties that can be indexed, for example:
indexable_properties: [cartesian_site_positions, _exmpl_kinetic_energy, _exmpl_potential_energy]

Another option would be to use a common prefix as suggested by @rartino for all properties, for example: series_cartesian_site_positions or indexed_cartesian_site_positions
Edit: Would such prefixes not interfere with the database-specific prefixes?

What are your opinions on this?

Indexing

In general, we cannot assume that all properties use the same index, as was the case for trajectories, where the frame number was the common index. Two datasets may be present in one entry: for example, a trajectory from a Monte Carlo simulation that uses the Monte Carlo step as the index, and a set of radial distribution functions where the index corresponds to a distance from the central atom.

Therefore, we should have a mechanism to specify for each property which range should be retrieved.
One way to do this would be to have a ranges query parameter:

ranges = cartesian_site_positions[1, 2000, 10], _exmpl_kinetic_energy[1, 2000], _exmpl_potential_energy[1, 2000]

The 1st value would be the index of the first value that should be returned, the 2nd value the index of the last value that should be returned and the 3rd value would be the step size.

We could also allow a more concise writing style where we allow multiple names per range such as:

ranges=cartesian_site_positions[1, 2000, 10], (_exmpl_kinetic_energy, _exmpl_potential_energy, _exmpl_constraint_force)[1, 2000]
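To make the proposed syntax concrete, here is a minimal sketch of how a server might parse such a `ranges` value. It is only an illustration under assumptions: the function name `parse_ranges` and the regular expression are hypothetical, nothing like this is part of the OPTIMADE specification, and a missing step size is assumed to mean a step of 1.

```python
import re
from typing import Dict, Tuple

# One item is a bare property name or a parenthesised, comma-separated group
# of names, followed by [first, last] or [first, last, step].
_ITEM = re.compile(
    r"\s*(?:\((?P<group>[^)]*)\)|(?P<name>\w+))\s*"
    r"\[\s*(?P<first>\d+)\s*,\s*(?P<last>\d+)\s*(?:,\s*(?P<step>\d+)\s*)?\]\s*"
)


def parse_ranges(value: str) -> Dict[str, Tuple[int, int, int]]:
    """Map each property name to (first, last, step); step defaults to 1 (assumption)."""
    result: Dict[str, Tuple[int, int, int]] = {}
    pos = 0
    while pos < len(value):
        match = _ITEM.match(value, pos)
        if match is None:
            raise ValueError(f"cannot parse ranges at position {pos}: {value[pos:]!r}")
        if match["name"]:
            names = [match["name"]]
        else:
            names = [part.strip() for part in match["group"].split(",")]
        if not all(re.fullmatch(r"\w+", name) for name in names):
            raise ValueError(f"invalid property name in {match.group(0)!r}")
        bounds = (int(match["first"]), int(match["last"]), int(match["step"] or 1))
        for name in names:
            result[name] = bounds
        pos = match.end()
        if pos < len(value):
            # Items must be separated by a comma.
            if value[pos] != ",":
                raise ValueError(f"expected ',' at position {pos}")
            pos += 1
    return result
```

For example, `parse_ranges("cartesian_site_positions[1, 2000, 10], (_exmpl_kinetic_energy, _exmpl_potential_energy)[1, 2000]")` yields one `(first, last, step)` tuple per property name, with the grouped names sharing the same tuple.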

Index_groups

To know whether the index of one property corresponds to the index of another property, we need to know whether both properties share the same index.
We therefore need to define an index for each group of correlated properties.
For example:

Index_groups: [{"name": "trajectory", "n_indexes": 20000}, {"name": "radial_distribution_function", "n_indexes": 1000}]

Property descriptors

Similar to how we defined properties in the trajectory proposal, we could define each indexable property as a dictionary.
Here, we could reuse many of the properties already defined for the trajectory endpoint in PR#377 to describe how the values belong to the index. For brevity, I have not included all the details, but they can be found in PR#377.

serialization_format:
- Description: To improve the compactness of the data, there are several ways to indicate which frame a value belongs to. The method used is specified by :property:serialization_format.
offset_linear:
- Description: If :property:frame_serialization_format is set to :val:"linear", this property gives the value at frame 1.
- Examples:
- :val:1.5
step_size_linear:
- Description: If :property:frame_serialization_format is set to :val:"linear", this value gives the change in the value of the property per unit of frame number.
  E.g., if at frame 3 the value of the property is 0.6 and :property:step_size_linear = 0.2, then at frame 4 the value of the property will be 0.8.
- Examples:
- :val:0.0005
offset_sparse:
- Description: If :property:frame_serialization_format is set to :val:"explicit_regular_sparse", this property gives the frame number to which the first value belongs.
- Examples:
- :val:100
step_size_sparse:
- Description: If :property:frame_serialization_format is set to :val:"explicit_regular_sparse", this value indicates that a value is defined every step_size_sparse frames.
- Examples:
- :val:100
sparse_frames:
- Description: If :property:frame_serialization_format is set to :val:"explicit_custom_sparse", this field holds the frames to which the values in the value field belong.
- Examples:
- :val:[0,20,78,345]
values:
- Description: The values belonging to this property.
  The format of this field depends on the property and on the :property:frame_serialization_format parameter. It is only returned when the variable name is present in "ranges".

In addition to these properties already defined in the trajectory proposal, we would also need to define which "index" the property uses.

index_group:
This property indicates which index the property uses. It matches exactly one of the names listed in Index_groups.
For example: "index_group": "radial_distribution_function"
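As an illustration of how a client could interpret these descriptors, the sketch below expands each serialization format into (frame, value) pairs. It is a hedged reading of the descriptions above, not the definition from PR#377: the function name `iter_frame_values` is hypothetical, the format is read from a `serialization_format` key (the text uses both that name and `frame_serialization_format`), and the number of the first frame is passed in as `first_frame` because the text does not settle whether counting starts at 0 or 1.

```python
from typing import Any, Dict, Iterator, Tuple


def iter_frame_values(
    prop: Dict[str, Any], n_frames: int, first_frame: int = 0
) -> Iterator[Tuple[int, Any]]:
    """Yield (frame, value) pairs for one property descriptor."""
    fmt = prop["serialization_format"]
    if fmt == "linear":
        # Assumption: offset_linear is the value at the first frame and the
        # value changes by step_size_linear per frame.
        for i in range(n_frames):
            yield first_frame + i, prop["offset_linear"] + i * prop["step_size_linear"]
    elif fmt == "explicit_regular_sparse":
        # The i-th stored value belongs to frame offset_sparse + i * step_size_sparse.
        for i, value in enumerate(prop["values"]):
            yield prop["offset_sparse"] + i * prop["step_size_sparse"], value
    elif fmt == "explicit_custom_sparse":
        # Frames are listed explicitly, one per stored value.
        yield from zip(prop["sparse_frames"], prop["values"])
    else:
        raise ValueError(f"unsupported serialization format: {fmt!r}")
```

With this reading, a descriptor using `explicit_custom_sparse` with `sparse_frames` of `[0, 20, 78, 345]` and four values simply pairs each value with its listed frame.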

Sub properties for Querying

It would take too much time to perform queries on all the values of a large property. Therefore, a number of sub properties can be defined which can be used to perform queries and make selections. Whether it makes sense to have these properties depends on the context and variable type, so I think they should all be optional.

Such properties are:

average i.e. the average value of a numerical property.
min i.e. the minimum value of a numerical property.
max i.e. the maximum value of a numerical property.
set i.e. a set of all the values the property takes.
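A minimal sketch of how a provider could precompute these sub properties for a numerical property is given below. The function name `summarize` and the choice to return the set as a sorted list (so that it can be serialized to JSON) are assumptions made for this illustration.

```python
from typing import Any, Dict, Sequence


def summarize(values: Sequence[float]) -> Dict[str, Any]:
    """Compute the optional aggregate sub properties for a numerical property."""
    if not values:
        raise ValueError("cannot summarize an empty property")
    return {
        "average": sum(values) / len(values),
        "min": min(values),
        "max": max(values),
        # JSON has no set type, so the distinct values are returned as a sorted list.
        "set": sorted(set(values)),
    }
```

A query could then be answered from these few numbers, for instance by selecting entries whose `max` exceeds a threshold, without touching the full list of values.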

rartino · 2022-12-05T08:38:50Z

Thanks for putting this together. I realize from what you wrote that I like "ranged" (e.g., "ranged properties") as a keyword for all this.

> Another option would be to use a common prefix as suggested by @rartino for all [indexable] properties. [...] Would such prefixes not interfere with the database specific prefixes?

If we go with this, the specification should probably say that these prefixes go after the database-specific prefix if present, e.g., _exmpl_ranged_cartesian_site_positions.

> In general, we cannot assume that all the properties use the same index

This is an interesting extension of the design.

We were already discussing in #377 what to return for "ranged properties" when no range was provided. So, I would suggest to throw all the meta-data needed to handle these properties into those responses, instead of having separate properties as you suggest above, e.g., for "ranges" and "index groups".

So, very much in line with your own design in #377: the user first just sends a standard OPTIMADE query. Some of the members one gets back somehow indicate that they are ranged, and directly in the response you get all the meta-data you need to query over those ranges. Based on that, a subsequent query with added OPTIMADE range parameters now gives you a response where these keys instead map to actual data. Note: this is meant to reflect your own design in #377 as closely as possible.

So, something like this. A first regular query to the trajectory endpoint gives:

...
"ranged_cartesian_site_positions": {
  "range_id": "frames",
  "serialization_format": "linear",
  "offset_linear": 0,
  "step_size_linear": 1.5,
  "count": 4711
},
"ranged_species_at_sites": {
  "range_id": "frames",
  "frame_serialization_format": "explicit_custom_sparse",
  "sparse_frames": [0, 4, 5, 9],
  "values": [273, 273, 293, 293],
  "count": 4711
},
...

(Note how I implicitly have suggested here that range_id is a string identifier that connects multiple properties to the same range, i.e., your "index groups").

Looking at this response, you may, e.g., send a follow-up query with the added REST query parameters range_frames_start=12&range_frames_count=42, which would give you, e.g.:

...
"ranged_cartesian_site_positions": [
  ... actual data for frame 12 to 53...
],
"ranged_species_at_sites": [
  ... actual data for frame 12 to 53...
]
...
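To illustrate how these per-range query parameters could be interpreted, here is a small sketch that turns a query string into inclusive frame bounds. The helper name `requested_ranges` is hypothetical, and the parameter pattern `range_<range_id>_start` / `range_<range_id>_count` is taken from the example above rather than from any finalized specification text.

```python
import re
from typing import Dict, Tuple
from urllib.parse import parse_qs

# Matches e.g. "range_frames_start" -> range_id "frames", kind "start".
_RANGE_PARAM = re.compile(r"range_(?P<range_id>\w+)_(?P<kind>start|count)")


def requested_ranges(query: str) -> Dict[str, Tuple[int, int]]:
    """Return {range_id: (first, last)} with inclusive bounds."""
    raw: Dict[str, Dict[str, int]] = {}
    for key, values in parse_qs(query).items():
        match = _RANGE_PARAM.fullmatch(key)
        if match:
            # If a parameter is repeated, the last occurrence wins (assumption).
            raw.setdefault(match["range_id"], {})[match["kind"]] = int(values[-1])
    return {
        range_id: (params["start"], params["start"] + params["count"] - 1)
        for range_id, params in raw.items()
        if {"start", "count"} <= params.keys()
    }
```

For the query above, start 12 and count 42 give the inclusive bounds 12 and 53, which matches the "frame 12 to 53" in the example response. Other query parameters, such as `filter`, are ignored by this helper.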

This leads to very sensible syntax for querying either on the meta-data or ranged data.

For meta-data, just use the subfield query syntax and no range parameters, e.g.:
```
filter=_exmpl_ranged_my_long_intlist.count>96
```
to get all entries where _exmpl_ranged_my_long_intlist contains more than 96 values.
For ranged data, just use the list query syntax along with a range parameter for the range you want to query on:
```
filter=_exmpl_ranged_my_long_intlist HAS 42&range_intlist_start=1&range_intlist_count=1000
```
to get all entries where the number 42 occurs in the first 1000 values.

Finally, I suggest to take a look if you can cut down the options of the serialization_format. It seems to me there is some overlap in what they can provide, and interoperability is going to take a hit if there are too many ways of returning data. If one serialization format can be re-mapped into another cheaply on-the-fly as the endpoint returns data, then there really is no need for both IMO.

JPBergsma · 2022-12-05T17:05:48Z

> We were already discussing in #377 what to return for "ranged properties" when no range was provided. So, I would suggest to throw all the meta-data needed to handle these properties into those responses, instead of having separate properties as you suggest above, e.g., for "ranges" and "index groups".

The bit of remaining data in "index_groups" could indeed be moved to the individual properties. We would only duplicate the length of the range, which is not much data compared to the whole property.

> So, very much in line with your own design in #377: the user first just sends a standard OPTIMADE query. Some of the members one gets back somehow indicate that they are ranged, and directly in the response you get all the meta-data you need to query over those ranges. Based on that, a subsequent query with added OPTIMADE range parameters now gives you a response where these keys instead map to actual data. Note: this is meant to reflect your own design in #377 as closely as possible.

Yes, that is correct.

In the example you give, the ranged_species_at_sites property already has the fields values and sparse_frames in the first response. Those are however the fields that contain the main data, so I would only return those when specifically requested.

"range_id" seems to be a good name for what I called "index group".

The query parameters you suggest for retrieving ranges (range_frames_start and range_frames_count) are different from the ranges query parameter I suggested. Is there a reason you prefer these?
The ranges query parameter I proposed has the advantage that you can retrieve multiple ranges in one request.

I was not planning to support queries on the data of the properties themselves, as this would often be way too much information to process. We could add it in a future PR if there is a demand for it.
That is also why I thought about adding the aggregated properties (min, max, average, and set).

Considering the "serialization_format", we could perhaps drop constant, as I am not sure whether it adds that much compared to the normal way we declare variables. You could argue that constant is the default value for a property and that only when it is not constant you turn it into a ranged property. You could also use the linear option and set the step_size_linear to zero, so you effectively also have a constant value.

rartino · 2022-12-06T11:43:37Z

> In the example you give, the ranged_species_at_sites property already has the fields values and sparse_frames in the first response. Those are however the fields that contain the main data, so I would only return those when specifically requested.

I'm not sure I follow this - if you mean that one should do something different from what I wrote as an example, can you edit the example?

> The query parameters you suggest for retrieving ranges (range_frames_start and range_frames_count) are different from the ranges query parameter I suggested. Is there a reason you prefer these?
> The ranges query parameter I proposed has the advantage that you can retrieve multiple ranges in one request.

Apologies, I missed how you meant that as the query parameter syntax. While your syntax is more compact, it is more complex and will require actual parsing on the server side; maybe even a grammar... Is there any real benefit in trying to overload the query parameters this way other than to save URL space? By splitting it up into multiple query parameters, it is also easier to specify that, e.g., range_frames_stepsize is allowed to be supported but not required. (I really like to design optional features so they lead to zero additional overhead for someone who does not want to bother with them.)

Furthermore, if we want to standardize range_frames_stepsize (like this, or with your syntax), we need to very carefully spell out how this interacts with the various ways data can be sparsely represented. Have you already thought about these things?

> I was not planning to support queries on the data of the properties themselves, as this would often be way too much information to process. We could add it in a future PR if there is a demand for it.

The complexity overhead for the specification in including this as an optional feature is minimal. It is going to be a single sentence. It also makes sense to allow it, when thinking of possible uses of ranged data outside of MD frames.

> That is also why I thought about adding the aggregated properties (min, max, average, and set).

These are good - they apply to the whole data, and thus IMO should go in the response for the property key when not requesting a range. That makes it very easy to add your own database-specific aggregated properties. The only thing I don't see is a way to ask for, e.g., the average over a sub-range of data, but I'm OK with that feature not being supported.

> Considering the "serialization_format", we could perhaps drop constant, as I am not sure whether it adds that much compared to the normal way we declare variables.

Good. But in your opinion, all the others are needed? I'll take a look through them to see if I agree. As I said, if someone can map one into another cheaply on-the-fly, then both are not necessary.

JPBergsma · 2022-12-06T18:24:04Z

> I'm not sure I follow this - if you mean that one should do something different from what I wrote as an example, can you edit the example?

Here, I have placed what I think should be returned for the ranged fields when the fields are not specified with the range parameter.

...
"ranged_cartesian_site_positions": {
  "range_id": "frames",
  "serialization_format": "linear",
  "offset_linear": 0,
  "step_size_linear": 1.5,
  "count": 4711
},
"ranged_species_at_sites": {
  "range_id": "frames",
  "frame_serialization_format": "explicit_custom_sparse",
  "count": 4711
},
...

> Apologies, I missed how you meant that as the query parameter syntax. While your syntax is more compact, it is more complex and will require actual parsing on the server side; maybe even a grammar... Is there any real benefit in trying to overload the query parameters this way other than to save URL space? By splitting it up into multiple query parameters, it is also easier to specify that, e.g., range_frames_stepsize is allowed to be supported but not required. (I really like to design optional features so they lead to zero additional overhead for someone who does not want to bother with them.)

The way I formulated it now probably does require a grammar update. The advantage over the method you suggested (and the one in the original trajectory proposal) is that you can specify the range per property, and that it also specifies for which properties you want to retrieve all the data. In the old design, we would have to do multiple queries if we wanted to use multiple different ranges, each time listing the appropriate fields under the response_fields parameter.

> Furthermore, if we want to standardize range_frames_stepsize (like this, or with your syntax), we need to very carefully spell out how this interacts with the various ways data can be sparsely represented. Have you already thought about these things?

I am not sure what you mean exactly with "range_frames_stepsize".

I did think about what to do in case of sparse data when implementing the original trajectory proposal.
If no step size was specified (only a first and/or last frame) I would only return a value for those indexes for which a value has been defined. In case the step size parameter was present, I would return a value for each index specified by the step size parameter and None if no value had been defined for that index.
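The behaviour described above can be sketched as follows. This is only an illustration: the function name `select_sparse` and the representation of sparse data as a dictionary from index to value are assumptions, not part of the trajectory proposal.

```python
from typing import Any, Dict, List, Optional, Tuple


def select_sparse(
    defined: Dict[int, Any], first: int, last: int, step: Optional[int] = None
) -> List[Tuple[int, Any]]:
    """Select values from sparse data between first and last (inclusive)."""
    if step is None:
        # No step size: return only the indexes for which a value is defined.
        return [(i, defined[i]) for i in sorted(defined) if first <= i <= last]
    # Step size given: return one entry per requested index, None where undefined.
    return [(i, defined.get(i)) for i in range(first, last + 1, step)]
```

With values defined at indexes 0, 4, 5 and 9, requesting indexes 0 to 6 without a step size returns the three defined values, while the same request with a step size of 2 returns four entries, two of which are None.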

> The complexity overhead for the specification in including this as an optional feature is minimal. It is going to be a single sentence. It also makes sense to allow it, when thinking of possible uses of ranged data outside of MD frames.

As long as it is fully optional, we could add it, although I wonder whether it will be implemented.

> These are good - they apply to the whole data, and thus IMO should go in the response for the property key when not requesting a range. That makes it very easy to add your own database-specific aggregated properties. The only thing I don't see is a way to ask for, e.g., the average over a sub-range of data, but I'm OK with that feature not being supported.

Yes, that is the idea. My idea is that OPTIMADE is for data retrieval and not analysis, so calculating the averages over parts of the range does not seem like a core task of OPTIMADE to me.
Although I do know that some people actually deposited their trajectories into the BioExcel database because they also do some analysis.

> Good. But in your opinion, all the others are needed? I'll take a look through them to see if I agree. As I said, if someone can map one into another cheaply on-the-fly, then both are not necessary.

Now that I think of it, we could also drop "explicit". You could use "explicit_regular_sparse" instead, and then specify the step_size_sparse as 1 and the offset_sparse as 0.
(I think that if we have fewer options we could make the names a bit simpler too.)
That leaves us with three options, each of which would include significantly more data than the previous one. Therefore, I think three methods would be optimal: "linear" (the value is defined by a linear equation), "regular" (the values are defined at fixed intervals), and "custom" (the index is explicitly specified for each value).
In principle, we could still drop "regular", but that would double the size for some properties.
In theory, we could even make a more general equation option instead of linear, but that may require a whole new discussion on how to define equations unless we can find an already existing standard.

JPBergsma · 2023-06-19T14:41:24Z

In the metadata PR #463 that was recently merged, we currently do not allow metadata that applies to the whole entry.
Yet in the partial data PR #467 we have the field partial_data_links, which applies to the entry as a whole. Therefore, I am wondering whether I should rewrite the definition of partial_data_links so that it uses the per-property metadata fields.

rartino · 2023-06-19T15:35:06Z

@JPBergsma IMO the specification is good as it is in this regard. partial_data_links is completely defined as the channel in which partial data is communicated via JSON:API-formatted responses, in parity with how we define property_metadata. It does not need to be seen as "metadata that applies to the whole entry" (and arguably it very much isn't; just like property_metadata it has separate keys for each property.) Give these parts of the specification another read as they presently are expressed in the develop branch and see if you agree.

JPBergsma · 2023-06-20T22:26:41Z

I was just wondering whether the fields in partial_data_links should have property definitions, as you wrote you did not yet know where these should go for fields directly under meta.

rartino · 2023-06-21T00:35:22Z

partial_data_links doesn't need a property definition. It can only deliver partial data links in a format that is fixed and already fully documented in the specification. (Which should go in the static part of the JSON:API response schema, but that is not the same thing.)

JPBergsma · 2023-06-21T11:24:42Z

Ok, then we can leave it like this.

rartino · 2024-01-09T14:53:42Z

I do not see why this issue cannot be closed with the merging of #467? If I don't hear anything here I will close it.

ml-evs · 2024-03-25T13:13:11Z

> I do not see why this issue cannot be closed with the merging of #467? If I don't hear anything here I will close it.

I concur! Closed by #467

JPBergsma added the topic/property-standardization, status/has-concrete-suggestion, and PR/requires-discussion labels Dec 1, 2022

rartino mentioned this issue Dec 19, 2022: Collections endpoint #386 (Open)

JPBergsma linked a pull request Jan 11, 2023 that will close this issue: Add ranged properties #452 (Closed)

rartino mentioned this issue Jun 11, 2023: The road to trajectories #469 (Open, 5 tasks)

ml-evs closed this as completed Mar 25, 2024
