From d765ef9313200206eee38133cb6d7b6dc9d5db9e Mon Sep 17 00:00:00 2001
From: David Steinberg
Date: Mon, 23 May 2016 15:27:19 -0700
Subject: [PATCH] Replace mentions of Avro with pb

@calbach's edits
@jeromekelleher's edits
---
 INSTALL.rst                         |  14 +--
 doc/README.rst                      |  24 -----
 doc/source/api/apidesign_intro.rst  |   6 +-
 doc/source/appendix/avro_intro.rst  |  72 --------------
 doc/source/appendix/json_intro.rst  | 140 ++++------------------------
 doc/source/appendix/proto_intro.rst |  54 +++++++++++
 doc/source/intro.rst                |  34 +++----
 7 files changed, 99 insertions(+), 245 deletions(-)
 delete mode 100644 doc/source/appendix/avro_intro.rst
 create mode 100644 doc/source/appendix/proto_intro.rst

diff --git a/INSTALL.rst b/INSTALL.rst
index 21a4e5c6..60ed37f0 100644
--- a/INSTALL.rst
+++ b/INSTALL.rst
@@ -4,7 +4,7 @@ Installing the GA4GH Schemas
 The schemas are documents (text files) that formally describe the
 messages that pass between GA4GH reference servers and clients, which
 we also refer to collectively as "the API." The schemas are written in a
-language called `Avro `__.
+language called `Protocol Buffers `__.
 
 We use the schemas in a couple of different ways:
 
@@ -14,19 +14,21 @@ We use the schemas in a couple of different ways:
 Generating Source Code
 @@@@@@@@@@@@@@@@@@@@@@
 
-(To be written.)
+::
+
+    $ cd src/main/proto && protoc --python_out=. ga4gh/*
 
 Installing the Documentation Tools and Generating Documentation
 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
 
-We use a tool called Sphinx to generate the documentation from Avro
-input files.
+We use a tool called Sphinx to generate the documentation from Protocol
+Buffers input files.
 
 Install prerequisites
 #####################
 
-To use the Sphinx/Avro documentation generator, you must install some
-software packages it requires.
+To use the Sphinx/Protocol Buffers documentation generator, you must
+install some software packages it requires.
 
 Maven
 $$$$$
diff --git a/doc/README.rst b/doc/README.rst
index b6f6d9fd..6eb7664e 100644
--- a/doc/README.rst
+++ b/doc/README.rst
@@ -23,28 +23,6 @@ prerequisites and, for the moment, you'll have to ferret those out
 yourself.)
 
 
-Building Process
-@@@@@@@@@@@@@@@@
-
-The current doc flow is roughly as follows::
-
-    avdl ----1----> avpr ----2----> rst -|
-                                         | ----3----> html
-    rst -|
-
-    |- doc/source/schema/Makefile -|    |-- sphinx --|
-    |------ top-level Makefile ('make docs') --------|
-
-* 1 = avro-tools, downloaded on demand; requires java
-* 2 = avpr2rest.py, a custom script in tools/sphinx/
-* 3 = sphinx-build, part of the sphinx package
-
-.. warning:: Because we cannot currently run step 1 at Read the Docs,
-   it is imperative that developers type `make docs-schema`
-   at the top level if avdl files are updated, and then
-   commit the changed rst files.
-
-
 Documentation tips
 @@@@@@@@@@@@@@@@@@
@@ -54,5 +32,3 @@ Documents are written in `ReStructured Text
   schemas.
 - Abbreviations are stored in ``epilog.rst``.
-- Reference avro elements with ``:avro:key``.
-
diff --git a/doc/source/api/apidesign_intro.rst b/doc/source/api/apidesign_intro.rst
index 6db792ed..67aa8267 100644
--- a/doc/source/api/apidesign_intro.rst
+++ b/doc/source/api/apidesign_intro.rst
@@ -103,9 +103,9 @@ Unresolved Issues
 
 * What is the definition of the wire protocol? HTTP 1.0? Is HTTP 1.1
   chunked encoding allowed? What is the specification for the
-  generate JSON for a given an Avro schema?
+  generated JSON for a given Protocol Buffers schema?
 
-* What is the role of Avro? Is it for documentation-only or for use
-  as an IDL?
+* What is the role of Protocol Buffers? Is it for documentation-only
+  or for use as an IDL?
 
 * Need overall object relationship diagram.
diff --git a/doc/source/appendix/avro_intro.rst b/doc/source/appendix/avro_intro.rst
deleted file mode 100644
index e0cade4d..00000000
--- a/doc/source/appendix/avro_intro.rst
+++ /dev/null
@@ -1,72 +0,0 @@
-.. _avro:
-
-*******************
-Apache Avro
-*******************
-
-Apache Avro is a data serialization ecosystem, comparable to Google's Protocol Buffers.
-
--------------------
-What does the GA4GH web API take from Avro?
--------------------
-
-The GA4GH web API uses the Avro IDL (aka AVDL) language and JSON serialization labraries.
-
-The GA4GH web API presents a simple HTTP(S) and JSON interface to clients. It does **not** use Avro's binary serialization format, or Avro's built-in client/server networking and RPC features.
-
-------------------
-How does the GA4GH web API use Avro schemas?
-------------------
-
-GA4GH web API objects, including both the data objects actually exchanged and the control messages requesting and returning those objects, are defined in the Avro IDL language, AVDL.
-
-The `full documentation for the AVDL language is abvailable here `_. Bear in mind that the Avro IDL comes with an entire ecosystem; the GA4GH web APIs do not use most of it.
-
-------------------
-How does the GA4GH Web API use AVDL?
-------------------
-
-The GA4GH web API schemas are broken up into multiple AVDL files, which reference each other. Each file defines a number of types (mostly Avro Records, with a smattering of Avro Enums), grouped into a "protocol" (which is somewhat of a misnomer) of types defining a facet of the API. Mostly, the files come in pairs: a normal AVDL file defining the types representing actual data, and a "methods" AVDL file defining the control messages to be sent back and forth to query and exchange the representational types, and the URLs associated with various operations.
-
-Each type has a leading comment documenting its purpose, and each field in the type has a description. These are included in the automatically generated API documentation.
-
-Here is an example of an AVDL definition from, in this case defining a genomic `Position` type which is used across the API::
-
-    /**
-    A `Position` is an unoriented base in some `Reference`. A `Position` is
-    represented by a `Reference` name, and a base number on that `Reference`
-    (0-based).
-    */
-    record Position {
-        /**
-        The name of the `Reference` on which the `Position` is located.
-        */
-        string referenceName;
-
-        /**
-        The 0-based offset from the start of the forward strand for that `Reference`.
-        Genomic positions are non-negative integers less than `Reference` length.
-        */
-        long position;
-
-        /**
-        Strand the position is associated with.
-        */
-        Strand strand;
-    }
-
-This is a "record", which contains three fields. All of the fields are required to be filled in, and all of the fields can only hold objects of a particular single type. (In cases where this is not desired, see the AVDL documentation on unions). The last field holds a `Strand` object, which is defined elsewhere in the file.
-
-~~~~~~~~~~~~~~~~~~
-A note on unions and optional fields
-~~~~~~~~~~~~~~~~~~
-
-Any field which is optional should be defined as a ``union``, and given a default value of ``null``. Note that ``null`` should always be first in the union, since it is the type of the default value.
-
-The Avro JSON libraries serialize union types strangely, so the GA4GH API schemas have been specifically designed never to include union types that would trigger this behavior. The upshot of this is that the **only** legal union type is ``union``. Unions with multiple non-``null`` types are not allowed.
-
-.. todo::
-    * How much of the AVDL tutorial do we want in here?
-    * Document/show an example for methods (request and response pairing pattern)
-    * Talk about how we manually specify that some things land in URLs
-
diff --git a/doc/source/appendix/json_intro.rst b/doc/source/appendix/json_intro.rst
index bf8025f7..12110a6b 100644
--- a/doc/source/appendix/json_intro.rst
+++ b/doc/source/appendix/json_intro.rst
@@ -6,13 +6,13 @@ The JSON Format
 
 JSON, or JavaScript Object Notation, is officially defined `here `_. It is the standard data interchange format for web APIs.
 
-The GA4GH Web API uses a JSON wire protocol, exchanging JSON representations of the objects defined in its AVDL schemas. More information on the AVDL schemas is available in :ref:`avro`; basically, the AVDL type definitions say what attributes any given JSON object ought to have, and what ought to be stored in each of them.
+The GA4GH Web API uses a JSON wire protocol, exchanging JSON representations of the objects defined in its Protocol Buffers schemas. More information on the schemas is available in :ref:`proto`; basically, the Protocol Buffers type definitions say what attributes any given JSON object ought to have, and what ought to be stored in each of them.
 
 ------------------------
 GA4GH JSON Serialization
 ------------------------
 
-The GA4GH web APIs use Avro IDL to define their schemas, and use the associated Avro JSON serialization libraries. Since the schemas use a restricted subset of AVDL types (see `A note on unions`_ below), the serialized JSON format is fairly standard. This means that standard non-Avro JSON serialization and deserialization libraries (like, for example, the Python ``json`` module) can be used to serialize and deserialize GA4GH JSON messages in an idiomatic way.
+The GA4GH web APIs use the Protocol Buffers IDL to define their schemas, and use the associated Google Protocol Buffers JSON serialization libraries. Notice that the Protocol Buffers IDL uses snake case for field names, while the on-the-wire JSON uses camel case.
 
 ---------------------
 Serialization example
@@ -20,19 +20,19 @@ Serialization example
 
 For example, here is the schema definition for Variants (with comments removed)::
 
-    record Variant {
-        string id;
-        string variantSetId;
-        array<string> names = [];
-        union { null, long } created = null;
-        union { null, long } updated = null;
-        string referenceName;
-        long start;
-        long end;
-        string referenceBases;
-        array<string> alternateBases = [];
-        map<array<string>> info = {};
-        array<Call> calls = [];
+    message Variant {
+        string id = 1;
+        string variant_set_id = 2;
+        repeated string names = 3;
+        int64 created = 4;
+        int64 updated = 5;
+        string reference_name = 6;
+        int64 start = 7;
+        int64 end = 8;
+        string reference_bases = 9;
+        repeated string alternate_bases = 10;
+        map<string, google.protobuf.ListValue> info = 11;
+        repeated Call calls = 12;
     }
 
 Here is a serialized variant in JSON. It's a bit of an edge case in some respects::
@@ -61,114 +61,8 @@ Here is a serialized variant in JSON. It's a bit of an edge case in some respect
 
 Things to notice:
 
   * A serialized record contains no explicit information about its type.
-  * Arrays are serialized as JSON arrays.
+  * "repeated" types are serialized as JSON arrays.
   * Maps are serialized as JSON objects.
-  * Records are also serialized as JSON objects.
+  * Messages are also serialized as JSON objects.
   * Enums (not shown here) are serialized as JSON strings.
-  * Nulls are serialized as JSON nulls.
-  * Fields with default values may be omitted (see the lack of an ``updated`` or ``calls``) as a way of serializing their default values.
-  * Unions of ``null`` and a non-``null`` type are serialized as either ``null`` or the serialized non-null value. No other kinds of unions are present or permitted.
-
------------------------
-A note on unions
------------------------
-
-As noted above, a field with union type serialized in GA4GH JSON looks no different from a field of any other type: you just put the field name and its recursively serialized value. In order for the Avro JSON libraries to support this, it is necessary that AVDL ``union`` types union together only ``null`` and a single non-``null`` type. If there were two or more non-``null`` types, the Avro libraries would need to include additional type information to say which to use when deserializing. Since we prohibit those unions, however, API clients and alternative server implementations never need to worry about this additional type information or its syntax. They can just handle "normal" JSON.
-
-.. todo::
-    * add example of Python decoder output
-    * create a python class, if necessary
-
------------------------
-Wire protocol example
------------------------
-
-This is from the `ga4gh server example`_.
-
-.. _ga4gh server example: http://ga4gh-reference-implementation.readthedocs.org/en/stable/demo.html#demo
-
-To get information from the readgroupsets on a server, create a JSON format request::
-
-    {
-        "datasetIds":[],
-        "name":null
-    }
-
-.. note::
-    What is this actually asking?
-
-To send this to the server, we need to create a HTTP request which tells the server what type of
-data to expect (JSON format, in this case)
-In our test case, we have a server running at \http://localhost:8000
-
-Since we want to query the readgroupsets, we'll have to make that part of the URL
-
-.. note::
-    * How do we know it's v0.5.1?
-    * where is the readgroupsets/search part documented or defined?
-
-To create a command line request, we can use `cURL `_::
-
-    curl --data '{"datasetIds":[], "name":null}' --header 'Content-Type: application/json' http://localhost:8000/v0.5.1/readgroupsets/search
-
-The server returns::
-
-    {
-    "nextPageToken": null,
-    "readGroupSets": [{
-        "readGroups": [{
-            "info": {},
-            "updated": 1432287597662,
-            "predictedInsertSize": null,
-            "description": null,
-            "created": 1432287597662,
-            "programs": [],
-            "sampleId": null,
-            "experiment": null,
-            "referenceSetId": null,
-            "id":
-            "low-coverage:HG00533.mapped.ILLUMINA.bwa.CHS.low_coverage.20120522",
-            "datasetId": null,
-            "name":
-            "low-coverage:HG00533.mapped.ILLUMINA.bwa.CHS.low_coverage.20120522"
-        },
-        { "info": {},
-            "updated": 1432287793946,
-            "predictedInsertSize": null,
-            "description": null,
-            "created": 1432287793946,
-            "programs": [],
-            "sampleId": null,
-            "experiment": null,
-            "referenceSetId": null,
-            "id":
-            "low-coverage:HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522",
-            "datasetId": null,
-            "name":
-            "low-coverage:HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522"
-        },
-        { "info": {},
-            "updated": 1432287793946,
-            "predictedInsertSize": null,
-            "description": null,
-            "created": 1432287793946,
-            "programs": [],
-            "sampleId": null,
-            "experiment": null,
-            "referenceSetId": null,
-            "id":
-            "low-coverage:HG00534.mapped.ILLUMINA.bwa.CHS.low_coverage.20120522",
-            "datasetId": null,
-            "name":
-            "low-coverage:HG00534.mapped.ILLUMINA.bwa.CHS.low_coverage.20120522"
-        }],
-        "id":
-        "low-coverage",
-        "datasetId": null,
-        "name": null
-        }
-    ]
-    }
-
-
diff --git a/doc/source/appendix/proto_intro.rst b/doc/source/appendix/proto_intro.rst
new file mode 100644
index 00000000..7865b2a2
--- /dev/null
+++ b/doc/source/appendix/proto_intro.rst
@@ -0,0 +1,54 @@
+.. _proto:
+
+***********************
+Google Protocol Buffers
+***********************
+
+Protocol Buffers is a data serialization ecosystem developed by Google, comparable to Apache Avro.
+
+-------------------------------------------------------
+What does the GA4GH web API take from Protocol Buffers?
+-------------------------------------------------------
+
+The GA4GH web API uses the Google Protocol Buffers language and JSON serialization libraries.
+
+The GA4GH web API presents a simple HTTP(S) and JSON interface to clients. It does **not** use the Protocol Buffers binary serialization format.
+
+-------------------------------------------------------
+How does the GA4GH web API use Protocol Buffer schemas?
+-------------------------------------------------------
+
+GA4GH web API objects, including both the data objects actually exchanged and the control messages requesting and returning those objects, are defined in Protocol Buffers.
+
+The full documentation for the Protocol Buffers language can be found `here `_.
+
+------------------------------------------------
+How does the GA4GH Web API use Protocol Buffers?
+------------------------------------------------
+
+The GA4GH web API schemas are broken up into multiple proto files, which reference each other. Each file defines a number of message types, grouped into a "protocol" that defines a facet of the API. Mostly, the files come in pairs: a normal proto file defining the types representing actual data, and a "methods" proto file defining the control messages to be sent back and forth to query and exchange the representational types, and the URLs associated with various operations.
+
+Each type has a leading comment documenting its purpose, and each field in the type has a description.
+These are included in the automatically generated API documentation.
+
+Here is an example of a proto definition, in this case defining a genomic `Position` type which is used across the API::
+
+    message Position {
+        // The name of the `Reference` on which the `Position` is located.
+        string reference_name = 1;
+
+        // The 0-based offset from the start of the forward strand for that
+        // `Reference`. Genomic positions are non-negative integers less than
+        // `Reference` length.
+        int64 position = 2;
+
+        // Strand the position is associated with.
+        Strand strand = 3;
+    }
+
+This is a "message", which contains three fields. Each field can only hold a value of a single declared type; in proto3 there are no required fields, and a field that is not explicitly set takes its type's default value. The last field holds a `Strand` object, which is defined elsewhere in the file.
+
+.. todo::
+    * How much of the Protocol Buffers tutorial do we want in here?
+    * Document/show an example for methods (request and response pairing pattern)
+    * Talk about how we manually specify that some things land in URLs
+
diff --git a/doc/source/intro.rst b/doc/source/intro.rst
index d090caa8..3e2f61b6 100644
--- a/doc/source/intro.rst
+++ b/doc/source/intro.rst
@@ -53,29 +53,29 @@ define the types of things that API clients and servers exchange:
 requests for data, server responses, error messages, and objects
 actually representing pieces of genomics data.
 
-The schemas are written in Avro Interface Description Language
-(extension .avdl). For more details on Avro and how it is used in the
-GA4GH APIs, see :ref:`avro`.
+The schemas are written in the Protocol Buffers interface definition
+language (extension .proto). For more details on Protocol Buffers
+and how it is used in the GA4GH APIs, see :ref:`proto`.
 
 Here is an example schema definition for a Variant (with comments
 removed)::
 
-    record Variant {
-        string id;
-        string variantSetId;
-        array<string> names = [];
-        union { null, long } created = null;
-        union { null, long } updated = null;
-        string referenceName;
-        long start;
-        long end;
-        string referenceBases;
-        array<string> alternateBases = [];
-        map<array<string>> info = {};
-        array<Call> calls = [];
+    message Variant {
+        string id = 1;
+        string variant_set_id = 2;
+        repeated string names = 3;
+        int64 created = 4;
+        int64 updated = 5;
+        string reference_name = 6;
+        int64 start = 7;
+        int64 end = 8;
+        string reference_bases = 9;
+        repeated string alternate_bases = 10;
+        map<string, google.protobuf.ListValue> info = 11;
+        repeated Call calls = 12;
     }
 
 On the wire, the GA4GH web API takes the form of a client and a
 server exchanging JSON-serialized objects over HTTP or HTTPS. For
 more details on JSON, including how the GA4GH web API serializes and
-deserializes Avro-specified objects in JSON, see :ref:`json`.
+deserializes Protocol Buffers objects in JSON, see :ref:`json`.
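
A minimal sketch of the JSON mapping described in the json_intro.rst and intro.rst changes above, assuming the ``protobuf`` Python package and a ``variants_pb2`` module generated by the ``protoc --python_out`` command added to INSTALL.rst; the module name and the field values here are illustrative only, not part of the schemas::

    # A minimal sketch, not part of the GA4GH schemas: assumes `pip install protobuf`
    # and a `variants_pb2` module generated by protoc from the GA4GH variants schema.
    from google.protobuf import json_format

    import variants_pb2  # hypothetical generated module

    variant = variants_pb2.Variant(
        id="example-variant",
        variant_set_id="example-variant-set",
        reference_name="chr1",
        start=1000,
        end=1001,
        reference_bases="A",
        alternate_bases=["G"],
    )

    # Snake-case proto fields appear as camel-case JSON keys on the wire,
    # e.g. variant_set_id -> "variantSetId".
    as_json = json_format.MessageToJson(variant)
    print(as_json)

    # JSON received from a server parses back into a message.
    parsed = json_format.Parse(as_json, variants_pb2.Variant())
    assert parsed.variant_set_id == "example-variant-set"

Serializing the message this way yields keys such as ``variantSetId`` and ``referenceBases``, the camel-case form that the json_intro.rst text describes as the on-the-wire protocol.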