This repository has been archived by the owner on Oct 28, 2022. It is now read-only.

Merge pull request #624 from david4096/protobuf_docs
Replace mentions of Avro with pb
david4096 committed May 24, 2016
2 parents 0ff95ff + d765ef9 commit 8cc34de
Showing 7 changed files with 99 additions and 245 deletions.
14 changes: 8 additions & 6 deletions INSTALL.rst
@@ -4,7 +4,7 @@ Installing the GA4GH Schemas
The schemas are documents (text files) that formally describe the
messages that pass between GA4GH reference servers and clients, which we
also refer to collectively as "the API." The schemas are written in a
language called `Avro <http://avro.apache.org>`__.
language called `Protocol Buffers <https://developers.google.com/protocol-buffers/>`__.

We use the schemas in a couple of different ways:

@@ -14,19 +14,21 @@ We use the schemas in a couple of different ways:
Generating Source Code
@@@@@@@@@@@@@@@@@@@@@@

(To be written.)
::

    $ cd src/main/proto && protoc --python_out=. ga4gh/*
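
The same invocation can be assembled from Python, for example inside a build script. This is only a sketch: ``build_protoc_command`` is a hypothetical helper, and it assumes the ``.proto`` files sit directly under a ``ga4gh`` directory inside the proto root, as the command above suggests. It builds the argument list but does not run ``protoc``.

```python
from pathlib import Path


def build_protoc_command(proto_root, package_dir="ga4gh", python_out="."):
    """Return the argv list for compiling every .proto file in package_dir.

    Hypothetical helper: paths are returned relative to proto_root, so the
    command is meant to be run with proto_root as the working directory.
    """
    root = Path(proto_root)
    # Sort the file names so the command is deterministic across runs.
    proto_files = sorted(
        p.relative_to(root).as_posix()
        for p in (root / package_dir).glob("*.proto")
    )
    if not proto_files:
        raise FileNotFoundError(f"no .proto files found in {root / package_dir}")
    return ["protoc", f"--python_out={python_out}", *proto_files]
```

The resulting list could then be passed to ``subprocess.run(..., cwd=proto_root, check=True)``, which requires ``protoc`` to be installed and on the ``PATH``.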

Installing the Documentation Tools and Generating Documentation
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

We use a tool called Sphinx to generate the documentation from Avro
input files.
We use a tool called Sphinx to generate the documentation from Protocol
Buffers input files.

Install prerequisites
#####################

To use the Sphinx/Avro documentation generator, you must install some
software packages it requires.
To use the Sphinx/Protocol Buffers documentation generator, you must
install some software packages it requires.

Maven
$$$$$
24 changes: 0 additions & 24 deletions doc/README.rst
@@ -23,28 +23,6 @@ prerequisites and, for the moment, you'll have to ferret those out
yourself.)


Building Process
@@@@@@@@@@@@@@@@

The current doc flow is roughly as follows::

avdl ----1----> avpr ----2----> rst -|
| ----3----> html
rst -|

|- doc/source/schema/Makefile -| |-- sphinx --|
|------ top-level Makefile ('make docs') --------|
* 1 = avro-tools, downloaded on demand; requires java
* 2 = avpr2rest.py, a custom script in tools/sphinx/
* 3 = sphinx-build, part of the sphinx package

.. warning:: Because we cannot currently run step 1 at Read the Docs,
it is imperative that developers type `make docs-schema`
at the top level if avdl files are updated, and then
commit the changed rst files.


Documentation tips
@@@@@@@@@@@@@@@@@@

@@ -54,5 +32,3 @@ Documents are written in `ReStructured Text
schemas.

- Abbreviations are stored in ``epilog.rst``.
- Reference avro elements with ``:avro:key``.

6 changes: 3 additions & 3 deletions doc/source/api/apidesign_intro.rst
@@ -103,9 +103,9 @@ Unresolved Issues

* What is the definition of the wire protocol? HTTP 1.0? Is HTTP 1.1
chunked encoding allowed? What is the specification for the
generate JSON for a given an Avro schema?
generated JSON for a given Protocol Buffers schema?

* What is the role of Avro? Is it for documentation-only or for use
as an IDL?
* What is the role of Protocol Buffers? Is it for documentation only,
or for use as an IDL?

* Need overall object relationship diagram.
72 changes: 0 additions & 72 deletions doc/source/appendix/avro_intro.rst

This file was deleted.

140 changes: 17 additions & 123 deletions doc/source/appendix/json_intro.rst
@@ -6,33 +6,33 @@ The JSON Format

JSON, or JavaScript Object Notation, is officially defined `here <http://json.org/example>`_. It is the standard data interchange format for web APIs.

The GA4GH Web API uses a JSON wire protocol, exchanging JSON representations of the objects defined in its AVDL schemas. More information on the AVDL schemas is available in :ref:`avro`; basically, the AVDL type definitions say what attributes any given JSON object ought to have, and what ought to be stored in each of them.
The GA4GH Web API uses a JSON wire protocol, exchanging JSON representations of the objects defined in its Protocol Buffers schemas. More information on the schemas is available in :ref:`proto`; basically, the Protocol Buffers type definitions say what attributes any given JSON object ought to have, and what ought to be stored in each of them.

------------------------
GA4GH JSON Serialization
------------------------

The GA4GH web APIs use Avro IDL to define their schemas, and use the associated Avro JSON serialization libraries. Since the schemas use a restricted subset of AVDL types (see `A note on unions`_ below), the serialized JSON format is fairly standard. This means that standard non-Avro JSON serialization and deserialization libraries (like, for example, the Python ``json`` module) can be used to serialize and deserialize GA4GH JSON messages in an idiomatic way.
The GA4GH web APIs use the Protocol Buffers IDL to define their schemas, and use the associated Google Protocol Buffers JSON serialization libraries. Note that field names in the Protocol Buffers IDL are written in snake case, while the same fields appear in camel case on the wire.
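
The naming difference can be illustrated with a small Python sketch. ``to_json_name`` is a hypothetical helper, not part of the Protocol Buffers library; it only approximates the lowerCamelCase convention by dropping each underscore and capitalizing the character that follows it.

```python
def to_json_name(field_name):
    """Convert a snake_case field name to lowerCamelCase.

    Hypothetical helper for illustration; real clients should rely on the
    Protocol Buffers JSON libraries rather than renaming fields by hand.
    """
    out = []
    capitalize_next = False
    for ch in field_name:
        if ch == "_":
            # Drop the underscore and remember to capitalize what follows.
            capitalize_next = True
        elif capitalize_next:
            out.append(ch.upper())
            capitalize_next = False
        else:
            out.append(ch)
    return "".join(out)


for name in ("id", "variant_set_id", "reference_bases"):
    print(name, "->", to_json_name(name))
```

Under this convention, a schema field such as ``variant_set_id`` is expected to appear as ``variantSetId`` in the JSON exchanged with a server.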

---------------------
Serialization example
---------------------

For example, here is the schema definition for Variants (with comments removed)::

    record Variant {
      string id;
      string variantSetId;
      array<string> names = [];
      union { null, long } created = null;
      union { null, long } updated = null;
      string referenceName;
      long start;
      long end;
      string referenceBases;
      array<string> alternateBases = [];
      map<array<string>> info = {};
      array<Call> calls = [];
    message Variant {
      string id = 1;
      string variant_set_id = 2;
      repeated string names = 3;
      int64 created = 4;
      int64 updated = 5;
      string reference_name = 6;
      int64 start = 7;
      int64 end = 8;
      string reference_bases = 9;
      repeated string alternate_bases = 10;
      map<string, google.protobuf.ListValue> info = 11;
      repeated Call calls = 12;
    }

Here is a serialized variant in JSON. It's a bit of an edge case in some respects::
@@ -61,114 +61,8 @@ Here is a serialized variant in JSON. It's a bit of an edge case in some respect

Things to notice:
* A serialized record contains no explicit information about its type.
* Arrays are serialized as JSON arrays.
* "repeated" types are serialized as JSON arrays.
* Maps are serialized as JSON objects.
* Records are also serialized as JSON objects.
* Messages are also serialized as JSON objects.
* Enums (not shown here) are serialized as JSON strings.
* Nulls are serialized as JSON nulls.
* Fields with default values may be omitted (see the lack of an ``updated`` or ``calls``) as a way of serializing their default values.
* Unions of ``null`` and a non-``null`` type are serialized as either ``null`` or the serialized non-null value. No other kinds of unions are present or permitted.
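
The points above can be sketched with the standard Python ``json`` module. The serialized variant here is a trimmed, hypothetical example (it is not the one from this document), and ``load_variant`` and ``CONTAINER_DEFAULTS`` are illustrative names. The sketch assumes that an omitted repeated or map field should be read as an empty list or object.

```python
import copy
import json

# Hypothetical serialized Variant; "calls" and "info" are omitted on purpose.
serialized = """
{
  "id": "v1",
  "variantSetId": "vs1",
  "names": ["rs123"],
  "referenceName": "1",
  "referenceBases": "A",
  "alternateBases": ["T"]
}
"""

# Assumed defaults for container fields that a sender may leave out.
CONTAINER_DEFAULTS = {"names": [], "alternateBases": [], "info": {}, "calls": []}


def load_variant(text):
    """Parse a JSON variant and fill in omitted container fields."""
    variant = json.loads(text)
    for key, default in CONTAINER_DEFAULTS.items():
        # Copy the default so callers never share one mutable list or dict.
        variant.setdefault(key, copy.deepcopy(default))
    return variant


variant = load_variant(serialized)
print(type(variant["alternateBases"]).__name__)  # repeated field as a list
print(variant["calls"])                          # omitted field filled in
```

Nothing in the parsed dictionary records that it is a ``Variant``; the caller has to know which message type to expect from the endpoint it queried.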

-----------------------
A note on unions
-----------------------

As noted above, a field with union type serialized in GA4GH JSON looks no different from a field of any other type: you just put the field name and its recursively serialized value. In order for the Avro JSON libraries to support this, it is necessary that AVDL ``union`` types union together only ``null`` and a single non-``null`` type. If there were two or more non-``null`` types, the Avro libraries would need to include additional type information to say which to use when deserializing. Since we prohibit those unions, however, API clients and alternative server implementations never need to worry about this additional type information or its syntax. They can just handle "normal" JSON.

.. todo::
* add example of Python decoder output
* create a python class, if necessary

-----------------------
Wire protocol example
-----------------------

This is from the `ga4gh server example`_.

.. _ga4gh server example: http://ga4gh-reference-implementation.readthedocs.org/en/stable/demo.html#demo

To get information from the readgroupsets on a server, create a JSON format request::

{
"datasetIds":[],
"name":null
}

.. note::
What is this actually asking?

To send this to the server, we need to create a HTTP request which tells the server what type of
data to expect (JSON format, in this case)
In our test case, we have a server running at \http://localhost:8000

Since we want to query the readgroupsets, we'll have to make that part of the URL

.. note::
* How do we know it's v0.5.1?
* where is the readgroupsets/search part documented or defined?

To create a command line request, we can use `cURL <http://curl.haxx.se/>`_::

curl --data '{"datasetIds":[], "name":null}' --header 'Content-Type: application/json' http://localhost:8000/v0.5.1/readgroupsets/search

The server returns::

{
"nextPageToken": null,
"readGroupSets": [{
"readGroups": [{
"info": {},
"updated": 1432287597662,
"predictedInsertSize": null,
"description": null,
"created": 1432287597662,
"programs": [],
"sampleId": null,
"experiment": null,
"referenceSetId": null,
"id":
"low-coverage:HG00533.mapped.ILLUMINA.bwa.CHS.low_coverage.20120522",
"datasetId": null,
"name":
"low-coverage:HG00533.mapped.ILLUMINA.bwa.CHS.low_coverage.20120522"
},
{ "info": {},
"updated": 1432287793946,
"predictedInsertSize": null,
"description": null,
"created": 1432287793946,
"programs": [],
"sampleId": null,
"experiment": null,
"referenceSetId": null,
"id":
"low-coverage:HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522",
"datasetId": null,
"name":
"low-coverage:HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522"
},
{ "info": {},
"updated": 1432287793946,
"predictedInsertSize": null,
"description": null,
"created": 1432287793946,
"programs": [],
"sampleId": null,
"experiment": null,
"referenceSetId": null,
"id":
"low-coverage:HG00534.mapped.ILLUMINA.bwa.CHS.low_coverage.20120522",
"datasetId": null,
"name":
"low-coverage:HG00534.mapped.ILLUMINA.bwa.CHS.low_coverage.20120522"
}],
"id":
"low-coverage",
"datasetId": null,
"name": null
}
]
}



54 changes: 54 additions & 0 deletions doc/source/appendix/proto_intro.rst
@@ -0,0 +1,54 @@
.. _proto:

***********************
Google Protocol Buffers
***********************

Google Protocol Buffers is a data serialization ecosystem, comparable to Apache Avro.

-------------------------------------------------------
What does the GA4GH web API take from Protocol Buffers?
-------------------------------------------------------

The GA4GH web API uses the Google Protocol Buffers language and JSON serialization libraries.

The GA4GH web API presents a simple HTTP(S) and JSON interface to clients. It does **not** use the Protocol Buffers binary serialization format.

-------------------------------------------------------
How does the GA4GH web API use Protocol Buffer schemas?
-------------------------------------------------------

GA4GH web API objects, including both the data objects actually exchanged and the control messages requesting and returning those objects, are defined in Protocol Buffers.

The full documentation for the Protocol Buffers language can be found `here <https://developers.google.com/protocol-buffers/docs/proto3>`_.

------------------------------------------------
How does the GA4GH Web API use Protocol Buffers?
------------------------------------------------

The GA4GH web API schemas are broken up into multiple proto files, which reference each other. Each file defines a number of message types, grouped into a "protocol" that defines a facet of the API. Mostly, the files come in pairs: a normal proto file defining the types representing actual data, and a "methods" proto file defining the control messages to be sent back and forth to query and exchange the representational types, and the URLs associated with various operations.

Each type has a leading comment documenting its purpose, and each field in the type has a description. These are included in the automatically generated API documentation.

Here is an example of a proto definition, in this case defining a genomic ``Position`` type which is used across the API::

    message Position {
      // The name of the `Reference` on which the `Position` is located.
      string reference_name = 1;

      // The 0-based offset from the start of the forward strand for that
      // `Reference`. Genomic positions are non-negative integers less than
      // `Reference` length.
      int64 position = 2;

      // Strand the position is associated with.
      Strand strand = 3;
    }

This is a "message", which contains three fields. Each field can only hold values of a single declared type. In proto3, the language version documented at the link above, fields are not marked as required; a field that is left unset takes the default value for its type. The last field holds a ``Strand`` value, which is defined elsewhere in the file.
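
As a rough illustration of how such a message relates to the JSON seen by clients, the following Python sketch mirrors ``Position`` with a dataclass. It is not generated code: the ``Strand`` members are hypothetical placeholders (the real enum is defined in the schemas), and ``position_to_json`` is an illustrative helper. It assumes the proto3 JSON mapping, in which field names become camel case, enums are written as strings, and 64-bit integers are written as decimal strings.

```python
import json
from dataclasses import dataclass
from enum import Enum


class Strand(Enum):
    # Hypothetical members; the real values live in the GA4GH schemas.
    STRAND_UNSPECIFIED = 0
    POS_STRAND = 1
    NEG_STRAND = 2


@dataclass
class Position:
    # Defaults follow proto3: empty string, zero, and the zero-valued enum.
    reference_name: str = ""
    position: int = 0
    strand: Strand = Strand.STRAND_UNSPECIFIED


def position_to_json(pos):
    """Serialize a Position using camel-case keys."""
    return json.dumps(
        {
            "referenceName": pos.reference_name,
            # The proto3 JSON mapping writes int64 values as decimal strings.
            "position": str(pos.position),
            # Enums are serialized by name.
            "strand": pos.strand.name,
        },
        sort_keys=True,
    )


print(position_to_json(Position("chr1", 42, Strand.POS_STRAND)))
```

In practice the classes produced by ``protoc`` and the Protocol Buffers JSON libraries would take the place of this hand-written mapping.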

.. todo::
* How much of the Protocol Buffers tutorial do we want in here?
* Document/show an example for methods (request and response pairing pattern)
* Talk about how we manually specify that some things land in URLs
