Partial data #467

Merged
merged 62 commits into from
Jun 16, 2023
Changes from 17 commits
Commits
62 commits
70f6d9b
Add header for parial data appendix
rartino Jun 8, 2023
9a5a36f
First paragraph of partial data appendix
rartino Jun 8, 2023
0680b2f
Adding a JSON-API response example and to partial data examples.
sauliusg Jun 8, 2023
4063e13
Updating the partial response examples.
sauliusg Jun 8, 2023
1a7230f
A format of partial data URLs agreed with Giovanni.
sauliusg Jun 8, 2023
14b9e9d
Removing scaffold comments.
sauliusg Jun 8, 2023
104ed78
Fixinhg the formatting: removing trailing blanks, unfolding text lines.
sauliusg Jun 8, 2023
9fa3b29
Updating the partial data examples to be consistent with the new
sauliusg Jun 8, 2023
6ec5a11
Checking spelling, updating the ".words.lst" file.
sauliusg Jun 8, 2023
92e7c12
Full text of partial data format appendix
rartino Jun 8, 2023
cc3da46
Merge branch 'partial_data' of https://github.com/rartino/OPTIMADE in…
sauliusg Jun 8, 2023
7a1f684
Slight changes in the text.
sauliusg Jun 8, 2023
0245999
Apply suggestions from review
rartino Jun 8, 2023
06d6444
Apply suggestions from review
rartino Jun 8, 2023
3e5fa16
Delete trailing whitespace
rartino Jun 8, 2023
7eddd27
Fix descriptio of the data -> meta fields in the JSON response format
rartino Jun 8, 2023
3e1f04c
Fixing the "next" link definition.
sauliusg Jun 9, 2023
05eacc2
Update optimade.rst
rartino Jun 10, 2023
beaaeef
Apply suggestions from review
rartino Jun 10, 2023
50c355e
Update based on review
rartino Jun 10, 2023
7a92260
Revert unneseccary change to .words.lst
rartino Jun 11, 2023
8f4db09
Apply suggestions from review
rartino Jun 12, 2023
16d60f6
Slightly change the format of the markers
rartino Jun 12, 2023
e109706
Improve clarity for when number of lines does not match response_range
rartino Jun 12, 2023
34bdf2a
Remove trailing whitespace
rartino Jun 12, 2023
961f5b7
Apply suggestions from review
rartino Jun 14, 2023
874bd52
Apply suggestions from review
rartino Jun 15, 2023
7b314af
Add a key to the header to identify the format as OPTIMADE partial data
rartino Jun 15, 2023
6faf8db
Remove trailing whitespace
rartino Jun 15, 2023
316df78
Clarify handling of missing items in partial data
rartino Jun 15, 2023
b080cf2
Change markers to be more detectable in stream
rartino Jun 15, 2023
bd93804
Change markers to be more detectable in stream
rartino Jun 15, 2023
10bc845
Change markers to be more detectable in stream
rartino Jun 15, 2023
39d9ae5
Change format to representation to avoid a clash in terms and fieldnames
rartino Jun 15, 2023
2a24c1a
Enable for efficient parsing of responses a server knows has no refer…
rartino Jun 15, 2023
9d9e26e
Change format to representation to avoid a clash in terms and fieldnames
rartino Jun 15, 2023
ff5a27c
Rename partial_data_url and url to link to better conform to JSON API…
rartino Jun 15, 2023
8ae1928
Rename partial_data_url and url to link to better conform to JSON API…
rartino Jun 15, 2023
d8a11cb
Rename partial_data_url and url to link to better conform to JSON API…
rartino Jun 15, 2023
11900c5
Rename partial_data_url and url to link to better conform to JSON API…
rartino Jun 15, 2023
1b4093e
Remove trailing whitespace
rartino Jun 15, 2023
496b6ca
Change representation to layout to not confuse with URL representatio…
rartino Jun 15, 2023
4d906a2
Remove accidental leftover text.
rartino Jun 15, 2023
b6ab3ae
Fix segment incorrectly placed
rartino Jun 15, 2023
ee4c1e3
Fix braces in partial data examples
rartino Jun 15, 2023
1b0d1a6
Make returned_range RECOMMENDED and move a sentence that had ended up…
rartino Jun 15, 2023
1b9c607
Fix whitespace
rartino Jun 15, 2023
562d651
Improve formulation about partial data URLs
rartino Jun 15, 2023
498d169
Slightly adjust wording
rartino Jun 15, 2023
e5e6046
Slightly adjust wording
rartino Jun 15, 2023
4906c4f
Slightly adjust wording
rartino Jun 15, 2023
864450d
Slightly adjust wording
rartino Jun 15, 2023
e574106
Minor reformulations
rartino Jun 15, 2023
336ef21
Minor reformulations
rartino Jun 15, 2023
93ee583
Rearrange some text to be more logical
rartino Jun 15, 2023
edf4f25
Clarify optimade-partial-data/format field futureproofing
rartino Jun 15, 2023
5b13315
Minor reformulations and adjustments
rartino Jun 15, 2023
2cfe8c0
Allow an inline item_schema in addition to the link
rartino Jun 15, 2023
4e9fb4d
Fix missing quotation marks
rartino Jun 15, 2023
b50d93d
Minor language corrections from review
rartino Jun 16, 2023
dfc24d4
Add sentence about implementations decision on what is partial data
rartino Jun 16, 2023
a0aa533
Merge branch 'develop' into partial_data
rartino Jun 16, 2023
8 changes: 6 additions & 2 deletions .words.lst
@@ -1,4 +1,4 @@
personal_ws-1.1 en 205
personal_ws-1.1 en 209
ABNF
ACM
Aa
@@ -86,6 +86,7 @@ bandgap
bd
booktitle
boolean
bzip
calc
cartesian
checksums
@@ -115,18 +116,21 @@ exclusiveMinimum
exmpl
fieldname
firstname
hdf
howpublished
href
html
http
hydrogens
hydroperoxide
implementers
incrementing
internaldb
javascript
json
jsonapi
jsonc
jsonlines
kvak
lastname
libc
@@ -203,4 +207,4 @@ xy
yacc
zeo
zeolites
�ngstr�m
ångström
177 changes: 176 additions & 1 deletion optimade.rst
@@ -593,6 +593,23 @@ Every response SHOULD contain the following fields, and MUST contain at least :f
- **data**: The schema of this value varies by endpoint, it can be either a *single* `JSON API resource object <http://jsonapi.org/format/1.0/#document-resource-objects>`__ or a *list* of JSON API resource objects.
Every resource object needs the :field:`type` and :field:`id` fields, and its attributes (described in section `API Endpoints`_) need to be in a dictionary corresponding to the :field:`attributes` field.

The :field:`data` field MAY also contain a :field:`meta` field with the following keys:

- **property_metadata**: an object containing per-entry and per-property metadata.
The keys are the names of the fields in :field:`attributes` for which metadata is available.
The values belonging to these keys are dictionaries containing the relevant metadata fields.

- **partial_data_urls**: an object used to list URLs which can be used to fetch data that has been omitted from the :field:`data` part of the response.
The keys are the names of the fields in :field:`attributes` for which partial data URLs are available.
Each value is a list of items that MUST have the following keys:

- **format**: String.
A name of the format provided via this URL.
One of the items SHOULD be "json lines", which refers to the format in `OPTIMADE JSON lines partial data format`_.

- **url**: String.
The URL from which the data can be fetched.
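
As an illustration only, the lookup described above could be sketched in Python as follows. The function name ``partial_data_url`` and the sample resource object are hypothetical and not part of the specification; the sketch assumes the :field:`meta` field sits inside the resource object as described in the text.

```python
def partial_data_url(resource, field, preferred_format="json lines"):
    """Return the partial data URL for `field` in the preferred format, or None.

    `resource` is a JSON API resource object already decoded into a dict.
    The helper name and signature are assumptions made for this sketch.
    """
    urls = resource.get("meta", {}).get("partial_data_urls", {})
    for item in urls.get(field, []):
        # Each item MUST carry both "format" and "url".
        if item["format"] == preferred_format:
            return item["url"]
    return None


# Hypothetical resource object used only to exercise the helper.
example_resource = {
    "type": "structures",
    "id": "2345678",
    "attributes": {"a": None},
    "meta": {
        "partial_data_urls": {
            "a": [
                {"format": "json lines", "url": "https://example.db.org/a/jsonlines"},
                {"format": "hdf5", "url": "https://example.db.org/a/hdf5"},
            ]
        }
    },
}

print(partial_data_url(example_resource, "a"))
```

A client would typically call such a helper only for attributes whose value in the response is :val:`null`.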

The response MAY also return resources related to the primary data in the field:

- **links**: `JSON API links <http://jsonapi.org/format/1.0/#document-links>`__ is REQUIRED for implementing pagination.
@@ -915,7 +932,8 @@ OPTIONALLY it can also contain the following fields:

- **self**: the entry's URL

- **meta**: a `JSON API meta object <https://jsonapi.org/format/1.0/#document-meta>`__ that contains non-standard meta-information about the object.
- **meta**: a `JSON API meta object <https://jsonapi.org/format/1.0/#document-meta>`__ that is used to communicate metadata.
See `JSON Response Schema: Common Fields`_ for more information about this field.
Comment on lines +984 to +985
Contributor

Should this not come from PR #463?

Suggested change
- **meta**: a `JSON API meta object <https://jsonapi.org/format/1.0/#document-meta>`__ that is used to communicate metadata.
See `JSON Response Schema: Common Fields`_ for more information about this field.

Contributor Author

I think we need to define the meta field here to hold the "partial_data_links" key? Otherwise this PR would be inconsistent.

Contributor

I think I came across another commit which removed part of the definition of the metadata fields. So it looked like you forgot this piece, which is why I mentioned it.
Either both should be in or both should be out.

Contributor Author

Earlier I indeed removed a segment here that defined the property_metadata subkey of meta, which I agree belong better in #463. But, the segment you have marked now defines the meta superkey we need for the partial_data_links subkey.

I'm confused over what you are asking for. Are you saying we absolutely should not mention meta with a link to 'JSON Response Schema: Common Fields' that defines meta -> partial_data_links; despite that with this PR that key is an absolutely vital part of the 'Entry Listing JSON Response Schema'?


- **relationships**: a dictionary containing references to other entries according to the description in section `Relationships`_ encoded as `JSON API Relationships <https://jsonapi.org/format/1.0/#document-resource-object-relationships>`__.
The OPTIONAL human-readable description of the relationship MAY be provided in the :field:`description` field inside the :field:`meta` dictionary of the JSON API resource identifier object.
@@ -3200,6 +3218,163 @@ Relationships with files may be used to relate an entry with any number of :entr
Appendices
==========

OPTIMADE JSON lines partial data format
---------------------------------------
The OPTIMADE JSON lines partial data format is a lightweight format for transmitting property data that are too large to fit in a single OPTIMADE response.
The format is based on `JSON Lines <https://jsonlines.org/>`__, which allows for streaming handling of large datasets.

To communicate a property using this format, the usual OPTIMADE response gives the value :val:`null` for the property.
Furthermore, a URL is given which can be used to fetch the missing data.
For responses that use the JSON response format, a subfield :field:`partial_data_urls` of the resource object metadata field :field:`meta` is used; see `JSON Response Schema: Common Fields`_.

.. _slice object:

To aid the definition of the "json lines" format below, we first define a "slice object" to be a JSON object describing slices of arrays.
The dictionary has the following OPTIONAL fields:

- :field:`"start"`: Integer.
The slice starts at the value with the given index (inclusive).
The default is 0, i.e., the value at the start of the array.
- :field:`"stop"`: Integer.
The slice ends at the value with the given index (inclusive).
If omitted, the end of the slice is not specified.
If the end of the slice is not specified when used to express the values included in a response, the client has to count the number of items to know the end.
If the slice refers to a requested range of items, omitting :field:`stop` has the same meaning as specifying the last index of the array.
- :field:`"step"`: Integer.
The absolute difference in index between two subsequent values that are included in the slice.
The default is 1, i.e., every value in the range indicated by :field:`start` and :field:`stop` is included in the slice.
For example, a value of 2 denotes a slice of every second value in the array.
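
As a minimal sketch of how a client might interpret a slice object, the following Python function expands one into the list of indices it covers. The name ``slice_indices`` and the ``array_length`` parameter are assumptions made for this illustration; the specification itself does not define such a function.

```python
def slice_indices(slice_obj, array_length):
    """Return the indices selected by a slice object, as a list.

    `array_length` is only needed when "stop" is omitted, in which case the
    slice is taken to run to the last index of the array (the meaning given
    for a requested range). This helper is hypothetical.
    """
    start = slice_obj.get("start", 0)  # default: first value of the array
    step = slice_obj.get("step", 1)    # default: every value
    stop = slice_obj.get("stop", array_length - 1)
    # "stop" is inclusive, unlike the end of Python's built-in range().
    return list(range(start, stop + 1, step))


print(slice_indices({"start": 10, "stop": 20, "step": 2}, 100))
# prints [10, 12, 14, 16, 18, 20]
```

Note the ``stop + 1``: because both ends of a slice object are inclusive, a direct translation to a Python ``range`` or list slice has to extend the end by one.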

Furthermore, we also define the following special markers:

- The "end-of-data--marker" is this exact JSON: :val:`[["end"], ""]`.
- A "reference-marker" is this exact JSON: :val:`[["ref"], "URL"]`, where :val:`"URL"` is to be replaced with a URL being referenced.
- A "next-marker" is this exact JSON: :val:`[["next"], "URL"]`, where :val:`"URL"` is to be replaced with the target URL for the next link.

These JSON markers have been deliberately designed as lists with items of mixed data types, and thus cannot be encountered inside the actual data of an OPTIMADE property.
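
Because the markers have a fixed shape, a client can recognize them with a simple structural test. The following Python sketch is one possible way to do so; the function name ``classify_line`` and its return convention are hypothetical choices for this illustration.

```python
import json


def classify_line(line):
    """Classify one line of a partial data response.

    Returns a pair (kind, payload) where kind is "end", "ref", "next" or
    "data". For markers the payload is the URL string (empty for "end");
    for data it is the decoded JSON value. Hypothetical helper.
    """
    value = json.loads(line)
    is_marker = (
        isinstance(value, list)
        and len(value) == 2
        and isinstance(value[0], list)      # first item: a one-element list
        and len(value[0]) == 1
        and value[0][0] in ("end", "ref", "next")
        and isinstance(value[1], str)       # second item: a string
    )
    if is_marker:
        return value[0][0], value[1]
    return "data", value


print(classify_line('[["next"], "https://example.db.org/value4"]'))
print(classify_line('[[10,20,21], [30,40,50]]'))
```

The test relies on the mixed types (a list followed by a string) that the text above gives as the reason a marker cannot be confused with property data.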

The full response MUST be valid `json lines <https://jsonlines.org/>`__ that adheres to the format:

- The first line is a header object (defined below).
- The following lines are data lines adhering to the formats described below.
- The final line is either an end-of-data--marker (indicating that there is no more data to be given), or a next-marker indicating that more data is available, which can be obtained by retrieving data from the provided URL.

The first line MUST be a JSON object providing header information.
The header object MUST contain the key:

- :field:`"format"`: String.
A string either equal to :val:`"dense"` or :val:`"sparse"` to indicate whether the returned format is dense or sparse.

The header object MAY also contain the key:

- :field:`"returned_ranges"`: Array of Object.
For dense data and sparse data of one dimensional list properties, the array contains a single element which is a `slice object`_ representing the range of data present in the response.
Contributor

Suggested change
For dense data and sparse data of one dimensional list properties, the array contains a single element which is a `slice object`_ representing the range of data present in the response.
For dense data, and sparse data of one dimensional list properties, the array contains a single element which is a `slice object`_ representing the range of data present in the response.

Or do you mean that, if dense, this still applies only for 1D data? Then to be clear I would write "For one dimensional list properties (both dense data and sparse data), ..." to avoid confusion. actually reading again I think this second case is what you mean?

Once the client has encountered an end-of-data--marker, any data not covered by any of the encountered slices are to be assigned the value :val:`null`.
If the field :field:`"format"` is :val:`"dense"` and :field:`"returned_ranges"` is omitted, then the client MUST assume that the data is a continuous range of data from the start of the array up to the number of elements given until reaching the end-of-data--marker or next-marker.
In the specific case of a hierarchy of list properties represented as a sparse multi-dimensional array, if the field :field:`"returned_ranges"` is given, it MUST contain one slice object per dimension of the multi-dimensional array, representing slices for each dimension that cover the data given in the response.

The format of data lines of the response (i.e., all lines except the first and the last) depends on whether the header object specifies the format as :val:`"dense"` or :val:`"sparse"`.

- **Dense format:** In the dense partial data format, each data line reproduces one list item in the OPTIMADE list property being transmitted in JSON format.
If OPTIMADE list properties are embedded inside the item, they can either be included in full or replaced with a reference-marker.
If a list is replaced by a reference-marker, the client MAY use the provided URL to obtain the list items, which are then also provided in the JSON lines partial data format.

- **Sparse format for one-dimensional list:** When the response sparsely communicates items for a one-dimensional OPTIMADE list property, each data line contains a JSON array in the format:

- The first item is the index of the item provided.
- The second item is a JSON representation of the item, in the same format as the lines in the dense format.
In the same way as for the dense format, reference-markers are allowed for data that does not fit in the response.

- **Sparse format for multi-dimensional lists:** Specifically for the case that the OPTIMADE property represents a series of directly hierarchically embedded lists, the server MAY represent them using a sparse multi-dimensional format.
In this case, each data line contains a JSON array in the format of:

- All items except the last item are coordinates providing indices in the embedded dimensions in the order of outermost to innermost.
- The last item is a JSON representation of the item at those coordinates, in the same format as the lines in the dense format.
In the same way as for the dense format, reference-markers are allowed for data that does not fit in the response.
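
To make the overall layout concrete, the following Python sketch reads a single response (header line, data lines, final marker) into a dictionary keyed by index. It is an illustration under stated assumptions, not a reference implementation: the names ``parse_partial_data`` and ``as_marker`` are hypothetical, reference-markers inside items are left unresolved, and following a next-marker is left to the caller.

```python
import json


def as_marker(value):
    """Return (kind, url) if `value` has the marker shape, else None."""
    if (isinstance(value, list) and len(value) == 2
            and isinstance(value[0], list) and len(value[0]) == 1
            and value[0][0] in ("end", "ref", "next")
            and isinstance(value[1], str)):
        return value[0][0], value[1]
    return None


def parse_partial_data(text):
    """Parse one partial data response; return (header, items, next_url).

    `items` maps an integer index (dense and one-dimensional sparse data) or
    a tuple of indices (multi-dimensional sparse data) to the item given.
    """
    lines = [json.loads(raw) for raw in text.splitlines() if raw.strip()]
    header, body, last = lines[0], lines[1:-1], as_marker(lines[-1])
    if last is None or last[0] not in ("end", "next"):
        raise ValueError("final line must be an end-of-data or next marker")
    items = {}
    if header["format"] == "dense":
        # Without "returned_ranges", data starts at index 0 with step 1.
        first_range = (header.get("returned_ranges") or [{}])[0]
        start = first_range.get("start", 0)
        step = first_range.get("step", 1)
        for n, item in enumerate(body):
            items[start + n * step] = item
    else:  # "sparse": every line is [index..., item]
        for row in body:
            *coords, item = row
            key = coords[0] if len(coords) == 1 else tuple(coords)
            items[key] = item
    next_url = last[1] if last[0] == "next" else None
    return header, items, next_url


dense = "\n".join([
    '{"format": "dense", "returned_ranges": [{"start": 10, "stop": 20, "step": 2}]}',
    "123",
    "345",
    "-12.6",
    '[["next"], "https://example.db.org/value4"]',
])
header, items, next_url = parse_partial_data(dense)
print(items)     # prints {10: 123, 12: 345, 14: -12.6}
print(next_url)  # prints https://example.db.org/value4
```

Reading the whole response into memory keeps the sketch short; a streaming client would instead process each line as it arrives and stop at the first end-of-data or next marker.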


Examples
--------

An example of an OPTIMADE JSON-API response that contains a link to a partial data protocol URL:

.. code:: json

     {
       "data": {
         "type": "structures",
         "id": "2345678",
         "attributes": {
           "a": null
         },
         "meta": {
           "partial_data_urls": {
             "a": [
               {
                 "format": "plain-jsonlines",
                 "url": "https://example.db.org/assets/partial_values/structures/2345678/a/default_format"
               },
               {
                 "format": "bzip2-jsonlines",
                 "url": "https://example.db.org/assets/partial_values/structures/2345678/a/bzip2_format"
               },
               {
                 "format": "hdf5",
                 "url": "https://example.db.org/assets/partial_values/structures/2345678/a/hdf5"
               }
             ]
           },
           "property_metadata": {
             "a": {}
           }
         }
       }
     }

An example of a dense response for partial array data with scalar values:

.. code:: json

     {"format": "dense", "returned_ranges": [{"start": 10, "stop": 20, "step": 2}]}
     123
     345
     -12.6
     [["next"], "https://example.db.org/value4"]

An example of a dense response for partial array data with multidimensional array values:

.. code:: json

     {"format": "dense", "returned_ranges": [{"start": 10, "stop": 20, "step": 2}]}
     [[10,20,21], [30,40,50]]
     [["ref"], "https://example.db.org/value2"]
     [[11, 110], [["ref"], "https://example.db.org/value3"], [550, 333]]
     [["next"], "https://example.db.org/value4"]

An example of a sparse response for partial array data with aggregated dimensions, where each value is a one-dimensional array:

.. code:: json

     {"format": "sparse"}
     [3,5,19, [10,20,21,30]]
     [30,15,9, [["ref"], "https://example.db.org/value1"]]
     [["next"], "https://example.db.org/"]

An example of a sparse response for partial array data with aggregated dimensions and scalar values:

.. code:: json

     {"format": "sparse"}
     [3,5,19, 10]
     [30,15,9, 31]
     [["next"], "https://example.db.org/"]

An example of a sparse response for partial array data with aggregated dimensions, where each value is a multidimensional array:

.. code:: json

     {"format": "sparse"}
     [3,5,19, [ [10,20,21], [30,40,50] ]]
     [3,7,19, [["ref"], "https://example.db.org/value2"]]
     [4,5,19, [ [11, 110], [["ref"], "https://example.db.org/value3"], [550, 333] ]]
     [["end"], ""]

The Filter Language EBNF Grammar
--------------------------------
