v2.0 ADC API doc changes (#752)
* Update facet docs

As per #617

* Removal/deprecation of is and not operators

* New release notes file for ADC API

Added deprecation of is and not.

* Error codes, repository loading changes

As per #431 and #487

* Add 408 and 413 errors

* Added 408 and 413 errors

* Add docs for AA/nt case discussion

As per #528

* Update data loading recommendation

* Remove docs about deprecated not operator

* Update to array query docs.

* Typo fix
bcorrie authored Feb 20, 2024
1 parent 7b1db3c commit 4cbad02
Showing 5 changed files with 344 additions and 58 deletions.
32 changes: 30 additions & 2 deletions docs/api/adc_api_overview.rst
@@ -49,13 +49,41 @@ to be followed.
requests.
* If an API endpoint returns a field, then the content of that field
in the JSON and TSV response must be equivalent.

* For those fields that contain Amino Acid or Nucleotide strings, the case for the
characters (upper or lower case) is not stated in the specification. Repository
implementations should expect upper or lower case queries for these fields. Repositories
may want to enforce internal characteristics for these fields (e.g. AA are always upper case,
nt are always lower case) to facilitate efficient storage and searching. Because case is not
stated, repositories can return amino acid and nucleotide sequences using the case utilized
internally.
* Relevant HTTP error codes should be returned on error conditions. HTTP 408
(timeout) should be used if the API does not complete an operation because of an
internal time limit, and HTTP 413 (Content Too Large) should be returned when either
max_size or max_query_size is exceeded.
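
The last two principles can be sketched in Python. This is a minimal illustration and not part of the specification: the function names, the ``kind`` argument, and the upper-case AA / lower-case nt convention (taken from the example in the text) are all assumptions.

```python
from typing import Optional


def normalize_sequence(value: str, kind: str) -> str:
    """Map a query value to the case the repository stores internally.

    Assumed internal convention (hypothetical, taken from the example in
    the text): amino acids are upper case, nucleotides are lower case.
    """
    return value.upper() if kind == "aa" else value.lower()


def error_status(timed_out: bool,
                 requested_size: int, max_size: int,
                 query_bytes: int, max_query_size: int) -> Optional[int]:
    """Pick the HTTP status for the two error conditions named above."""
    # 413 when either the requested result size or the query size is too large.
    if requested_size > max_size or query_bytes > max_query_size:
        return 413
    # 408 when an internal time limit stopped the operation.
    if timed_out:
        return 408
    # No error condition from this list applies.
    return None
```

Normalizing the query value before searching lets a repository accept either case while storing only one.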

Repository operation principles
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Research groups that are running repositories as part of the AIRR Data Commons should,
to the best of their ability, ensure that their repository uptime is maintained and that
repository queries on fields that have the adc_query_support attribute set are completed in a timely manner.

In order to maximize scientific reproducibility and data provenance, it is recommended that
data stewards/data curators avoid releasing partially loaded data into the AIRR Data Commons.
When loading a study it is recommended that all data from a specific AIRR Schema object
(e.g. Rearrangement, Clone, Cell) be loaded and then made accessible
in the ADC as a single package, rather than having the repository accessible in the ADC
while the data is being loaded.
Piecemeal data loading of data for a specific schema object (e.g. Rearrangement) for a
study in a production repository will result in queries returning different results as
searches are made over time. This can confuse consumers of the data,
complicate data provenance, and hamper scientific reproducibility.

Authentication
~~~~~~~~~~~~~~

The ADC API currently does not define an authentication method. Future
versions of the API will provide an authentication method so data
versions of the API may provide an authentication method so data
repositories can support query and download of controlled-access data.


15 changes: 15 additions & 0 deletions docs/api/adc_api_release_notes.rst
@@ -0,0 +1,15 @@
AIRR Data Commons API Release Notes
================================================================================

Version 2.0: June 2024
--------------------------------------------------------------------------------

**Version 2.0 ADC API release.**

General ADC API Changes:

1. Operators ``is`` and ``not`` have been deprecated. These operators, based on the GDC API,
are non-intuitive in that the ``not`` operator is not the boolean ``not`` operator. These operators
were functionally equivalent to the more aptly named ``is missing`` and ``is not missing`` operators,
which remain and should be used instead.
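
For clients that still hold filters written with the deprecated operators, the rename can be applied mechanically. The sketch below is only an illustration: ``migrate_filter`` and ``DEPRECATED_OPS`` are hypothetical names, and it assumes filters are plain dictionaries with ``op`` and ``content`` keys as in the operator examples.

```python
# Deprecated operator -> equivalent operator that remains in the API.
DEPRECATED_OPS = {"is": "is missing", "not": "is not missing"}


def migrate_filter(node: dict) -> dict:
    """Return a copy of a filter tree with deprecated operators renamed."""
    op = DEPRECATED_OPS.get(node["op"], node["op"])
    content = node["content"]
    # "and" / "or" carry a list of sub-filters; recurse into each of them.
    if isinstance(content, list):
        content = [migrate_filter(child) for child in content]
    return {"op": op, "content": content}


old = {"op": "or", "content": [
    {"op": "not", "content": {"field": "sample.tissue"}},
    {"op": "<", "content": {"field": "sample.cell_number", "value": 1000}},
]}
new = migrate_filter(old)
```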

115 changes: 59 additions & 56 deletions docs/api/adc_api_requests.rst
@@ -253,21 +253,11 @@ The following operators are support by the ADC API.
- n/a
- field is missing or is null
- {"op":"is missing","content":{"field":"sample.tissue"}}
* - is
- field
- n/a
- identical to "is missing" operator, provided for GDC compatibility
- {"op":"is","content":{"field":"sample.tissue"}}
* - is not missing
- field
- n/a
- field is not missing and is not null
- {"op":"is not missing","content":{"field":"sample.tissue"}}
* - not
- field
- n/a
- identical to "is not missing" operator, provided for GDC compatibility
- {"op":"not","content":{"field":"sample.tissue"}}
* - in
- field, multiple values in a list
- array of string, number, or integer
@@ -294,10 +284,6 @@ The following operators are support by the ADC API.
- logical OR
- {"op":"or","content":[ |br| {"op":"<","content":{"field":"sample.cell_number","value":1000}}, |br| {"op":"is missing","content":{"field":"sample.tissue"}}, |br| {"op":"exclude","content":{"field":"subject.organism.id","value":["9606","10090"]}} |br| ]}

Note that the ``not`` operator is different from a logical NOT
operator, and the logical NOT is not needed as the other operators
provide negation.

The ``field`` operand specifies a fully qualified property name in the AIRR
Data Model. Fully qualified AIRR properties are either a JSON/YAML base type (``string``, ``number``,
``integer``, or ``boolean``) or an array of one of these base types (some AIRR fields are arrays
@@ -307,45 +293,6 @@ The Fields section below describes the available queryable fields.
The ``value`` operand specifies one or more values when evaluating the
operator for the ``field`` operand.

*Queries Against Arrays*

A number of fields in the AIRR Data Model are arrays, such as
``study.keywords_study`` which is an array of strings or
``subject.diagnosis`` which is an array of ``Diagnosis`` objects. A
query operator on an array field will apply that operator to each
entry in the array to decide if the query filter is satisfied. The
behavior is different for various operators. For operators such as
``=`` and ``in``, the filter behaves like the Boolean ``OR`` over the
array entries, that is if **any** array entry evaluates to true then
the query filter is satisfied. For operators such as ``!=`` and
``exclude``, the filter behaves like the Boolean ``AND`` over the
array entries, that is **all** array entries must evaluate to true for
the query filter to be satisfied.

For complex queries over arrays, it is necessary to decompose a complex query into
more than one query. For example, consider the following subject:

.. code-block:: bash

  * Subject
    * diagnosis
      (Diagnosis record 1)
      * disease_diagnosis: "rheumatoid arthritis"
      * disease_length: "20 years"
      (Diagnosis record 2)
      * disease_diagnosis: "pancreatic ductal adenocarcinoma"
      * disease_length: "6 months"

If the required end result is to find all disease diagnoses of "pancreatic ductal adenocarcinoma"
that have a disease length of over 10 years, searching for ``disease_diagnosis = pancreatic ductal adenocarcinoma`` and ``disease_length > 10``
will result in the above Subject being returned, even though the subject has not had pancreatic ductal adenocarcinoma for more than 10 years.
This is because there is a diagnosis of pancreatic ductal adenocarcinoma and a disease length
of more than 10 years, but from a different diagnosis. This is a correct response to the query, but it does not return the desired outcome.

In order to achieve the desired outcome, it is necessary to search for one of the conditions (e.g. ``disease_diagnosis = pancreatic ductal adenocarcinoma``),
compile a list of ``repertoire_ids`` that meet that condition, and then search for the second condition (e.g. ``disease_length > 10``)
across those ``repertoire_ids``.

*Examples*

@@ -582,17 +529,20 @@ subjects each with two IGH repertoires.
]
}
Note: ADC API facet requests differ from those in the GDC API on which the ADC API is based. In the ADC
API, a facet count may be requested on a field that is also being filtered, whereas in the GDC API
filters on the faceted field are ignored (see `Genomic Data Commons (GDC) API Facets`_ restriction #2).
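
As a rough sketch of the note above, the following Python builds a request body that filters and facets on the same field. It assumes the body is a JSON object with ``filters`` and ``facets`` keys; the helper name ``facet_request`` and the subject identifiers are hypothetical.

```python
import json
from typing import Optional


def facet_request(facet_field: str, filters: Optional[dict] = None) -> str:
    """Serialize a facet request body, optionally with a filter."""
    body = {"facets": facet_field}
    if filters is not None:
        body["filters"] = filters
    return json.dumps(body)


# Filter and facet on the same field; per the note above, the ADC API
# applies the filter rather than ignoring it for the faceted field.
same_field_filter = {
    "op": "in",
    "content": {
        "field": "subject.subject_id",
        "value": ["subject_a", "subject_b"],  # hypothetical identifiers
    },
}
request_body = facet_request("subject.subject_id", same_field_filter)
```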

Queries on Nested Information
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Queries on Nested Information (Arrays)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As stated above, in general API response data will have been
flattened by the query handler. However, there are several instances in
which properties within the top-level entities are arrays of objects,
which cannot be flattened because all the information will be expected
to be present in the response. Therefore, in these cases, the data that is
queried and potentially returned will be nested. In addition, while the
array of object is obvious from the AIRR Schema, they array component
array of objects is obvious from the AIRR Schema, the array component (index)
does **not** appear in the hierarchical property names used by the API.
Note that this does not create any collisions as the schema allows the
existence of multiple properties with the same designation.
@@ -623,3 +573,56 @@ to exhibit "local" behavior, as is easier to implement on the
client-side, where it would require joining the result sets of the
queries for each of the properties individually.

*An example query against arrays*

A number of fields in the AIRR Data Model are arrays, such as
``study.keywords_study`` which is an array of strings or
``subject.diagnosis`` which is an array of ``Diagnosis`` objects. A
query operator on an array field will apply that operator to each
entry in the array to decide if the query filter is satisfied. The
behavior is different for various operators. For operators such as
``=`` and ``in``, the filter behaves like the Boolean ``OR`` over the
array entries, that is if **any** array entry evaluates to true then
the query filter is satisfied. For operators such as ``!=`` and
``exclude``, the filter behaves like the Boolean ``AND`` over the
array entries, that is **all** array entries must evaluate to true for
the query filter to be satisfied.
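
The any/all behavior described above can be sketched as follows. This is an illustration only, not a reference implementation: ``matches_array`` is a hypothetical helper, and the keyword values are example data.

```python
from typing import Any, List


def matches_array(op: str, entries: List[Any], value: Any) -> bool:
    """Evaluate one operator against every entry of an array field."""
    # Positive operators: satisfied if ANY entry matches (Boolean OR).
    if op == "=":
        return any(entry == value for entry in entries)
    if op == "in":
        return any(entry in value for entry in entries)
    # Negative operators: satisfied only if ALL entries pass (Boolean AND).
    if op == "!=":
        return all(entry != value for entry in entries)
    if op == "exclude":
        return all(entry not in value for entry in entries)
    raise ValueError(f"unsupported operator: {op}")


keywords = ["contains_ig", "contains_tr"]
any_match = matches_array("=", keywords, "contains_ig")        # one entry matches
all_differ = matches_array("!=", keywords, "contains_ig")      # one entry equals the value
```

Here ``any_match`` is true because one entry equals the value, while ``all_differ`` is false for the same reason.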

Given the example diagnosis structure:

.. code-block:: bash

  * Subject
    * diagnosis
      (Diagnosis record 1)
      * disease_diagnosis: "rheumatoid arthritis"
      * disease_length: "20 years"
      (Diagnosis record 2)
      * disease_diagnosis: "pancreatic ductal adenocarcinoma"
      * disease_length: "6 months"

A query of ``disease_diagnosis = pancreatic ductal adenocarcinoma`` and ``disease_length > 10``
will result in the above Subject being returned, even though the subject has not had pancreatic
ductal adenocarcinoma for more than 10 years. This is because each of the predicates in the query
is true for the above subject. That is, the subject has a ``disease_diagnosis = pancreatic ductal adenocarcinoma``
and a ``disease_length > 10``, although in different diagnosis records. It is not possible to restrict both
predicates to the same diagnosis record using the current implementation of the ADC API.

This query would produce the desired outcome only if there was a single
diagnosis record for the subject, as given below.

.. code-block:: bash

  * Subject
    * diagnosis
      (Diagnosis record 1)
      * disease_diagnosis: "pancreatic ductal adenocarcinoma"
      * disease_length: "20 years"

If there is more than one diagnosis, it is necessary to search for one of the criteria
(e.g. ``disease_diagnosis = pancreatic ductal adenocarcinoma``), download the resulting data, and determine
whether the other criterion is true for that diagnosis record for that subject.
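
The difference between the API behavior and the client-side check can be sketched in Python using the two-record subject from the example. All function names are hypothetical, and the ``years`` parser for strings such as "20 years" and "6 months" is an assumption made only for this illustration.

```python
def years(length: str) -> float:
    """Convert '20 years' or '6 months' to a number of years (assumed format)."""
    amount, unit = length.split()
    return float(amount) / 12 if unit.startswith("month") else float(amount)


subject = {"diagnosis": [
    {"disease_diagnosis": "rheumatoid arthritis", "disease_length": "20 years"},
    {"disease_diagnosis": "pancreatic ductal adenocarcinoma", "disease_length": "6 months"},
]}


def api_style_match(subj: dict, disease: str, min_years: float) -> bool:
    """Each predicate is checked over the whole array independently."""
    records = subj["diagnosis"]
    return (any(r["disease_diagnosis"] == disease for r in records)
            and any(years(r["disease_length"]) > min_years for r in records))


def same_record_match(subj: dict, disease: str, min_years: float) -> bool:
    """Client-side check: both predicates must hold in one diagnosis record."""
    return any(r["disease_diagnosis"] == disease
               and years(r["disease_length"]) > min_years
               for r in subj["diagnosis"])


pdac = "pancreatic ductal adenocarcinoma"
returned_by_query = api_style_match(subject, pdac, 10)   # subject is returned
truly_matches = same_record_match(subject, pdac, 10)     # but no single record matches
```

A client would run the first criterion through the API, download the results, and then apply a check like ``same_record_match`` locally.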

A planned extension to solve this issue is being developed.

.. _`Genomic Data Commons (GDC) API Facets`: https://docs.gdc.cancer.gov/API/Users_Guide/Search_and_Retrieval/#facets