This repository has been archived by the owner on Oct 28, 2022. It is now read-only.

Commit 5ccf897

Merge pull request #547 from ga4gh/protobuf
Convert avdl -> proto3
david4096 committed Jun 2, 2016
2 parents a12e3fd + 8cc34de commit 5ccf897
Showing 39 changed files with 1,982 additions and 2,898 deletions.
50 changes: 4 additions & 46 deletions CONTRIBUTING.rst
@@ -118,54 +118,12 @@ Syntax Style and Conventions

The current code conventions for the source files are as follows:

- Follow the `protocol buffers style guide
<https://developers.google.com/protocol-buffers/docs/style>`__
- Use two-space indentation, and no tabs.
- Hard-wrap code to 80 characters per line.
- Use ``UpperCamelCase`` for object or record names.
- Use ``lowerCamelCase`` for attribute or method names.
- Use ``CONSTANT_CASE`` for global and constant values.
- Comments:

- Comments should be indented at the same level as the surrounding
code.
- Comments should precede the code that they make a comment on.
Documentation comments will not work otherwise.
- Documentation comments, which are intended to be processed by
avrodoc and displayed in the user-facing API documentation, must use
the ``/** ... */`` style, and must not have a leading ``*`` on each
internal line:

::

/**
This documentation comment will be
processed correctly by avrodoc.
*/

::

/**
* This documentation comment will have a
* bullet point at the start of every line
* when processed by avrodoc.
*/

- Block and multi-line non-documentation comments, intended for schema
developers only, must use the ``/* ... */`` style.

::

/*
This multi-line comment will not appear in the
avrodoc documentation and is intended for
schema developers.
*/

- All multi-line comments should have the comment text at the same
indent level as the comment delimiters.
- One-line non-documentation comments, intended for schema developers
only, must use the ``// ...`` style.
- Comments may use `reStructuredText
<http://docutils.sourceforge.net/rst.html>`__ markup.

Documentation
@@@@@@@@@@@@@
14 changes: 8 additions & 6 deletions INSTALL.rst
@@ -4,7 +4,7 @@ Installing the GA4GH Schemas
The schemas are documents (text files) that formally describe the
messages that pass between GA4GH reference servers and clients, which we
also refer to collectively as "the API." The schemas are written in a
- language called `Avro <http://avro.apache.org>`__.
+ language called `Protocol Buffers <https://developers.google.com/protocol-buffers/>`__.

We use the schemas in a couple of different ways:

@@ -14,19 +14,21 @@ We use the schemas in a couple of different ways:
Generating Source Code
@@@@@@@@@@@@@@@@@@@@@@

- (To be written.)
+ ::
+
+     $ cd src/main/proto && protoc --python_out=. ga4gh/*
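
The command above writes a ``*_pb2.py`` Python module for each schema file. As
a rough sketch of how the generated code might be used (the exact module names
depend on the ``.proto`` file layout and are assumptions here)::

    # Assumes protoc produced ga4gh/variants_pb2.py from ga4gh/variants.proto.
    from ga4gh import variants_pb2

    variant = variants_pb2.Variant()
    variant.reference_name = "chr1"
    variant.start = 10177

    # Messages round-trip through the compact protobuf binary format.
    data = variant.SerializeToString()
    same_variant = variants_pb2.Variant.FromString(data)
    assert same_variant.start == 10177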

Installing the Documentation Tools and Generating Documentation
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

- We use a tool called Sphinx to generate the documentation from Avro
- input files.
+ We use a tool called Sphinx to generate the documentation from Protocol
+ Buffers input files.

Install prerequisites
#####################

- To use the Sphinx/Avro documentation generator, you must install some
- software packages it requires.
+ To use the Sphinx/Protocol Buffers documentation generator, you must
+ install some software packages it requires.

Maven
$$$$$
4 changes: 2 additions & 2 deletions README.rst
@@ -31,8 +31,8 @@ primary data collected from sequencing machines.
The team will deliver:

#. Data model. An abstract, mathematically complete and precise model of
- the data that is manipulated by the API. See the `Avro
- directory <src/main/resources/avro>`__ for our in-progress work on
+ the data that is manipulated by the API. See the `Proto
+ directory <src/main/proto>`__ for our in-progress work on
defining v0.5 of the data model.

#. API Specification. A human-readable document introducing and
24 changes: 0 additions & 24 deletions doc/README.rst
@@ -23,28 +23,6 @@ prerequisites and, for the moment, you'll have to ferret those out
yourself.)


Building Process
@@@@@@@@@@@@@@@@

The current doc flow is roughly as follows::

    avdl ----1----> avpr ----2----> rst -|
                                         |----3----> html
                                  rst ---|

    |-- doc/source/schema/Makefile --|   |-- sphinx --|
    |------ top-level Makefile ('make docs') ---------|

* 1 = avro-tools, downloaded on demand; requires java
* 2 = avpr2rest.py, a custom script in tools/sphinx/
* 3 = sphinx-build, part of the sphinx package

.. warning:: Because we cannot currently run step 1 at Read the Docs,
it is imperative that developers type `make docs-schema`
at the top level if avdl files are updated, and then
commit the changed rst files.


Documentation tips
@@@@@@@@@@@@@@@@@@

@@ -54,5 +32,3 @@ Documents are written in `ReStructured Text
schemas.

- Abbreviations are stored in ``epilog.rst``.
- Reference avro elements with ``:avro:key``.

6 changes: 3 additions & 3 deletions doc/source/api/apidesign_intro.rst
@@ -103,9 +103,9 @@ Unresolved Issues

* What is the definition of the wire protocol? HTTP 1.0? Is HTTP 1.1
chunked encoding allowed? What is the specification for the
- generate JSON for a given an Avro schema?
+ generated JSON for a given Protocol Buffers schema?

- * What is the role of Avro? Is it for documentation-only or for use
-   as an IDL?
+ * What is the role of Protocol Buffers? Is it for documentation only
+   or for use as an IDL?

* Need overall object relationship diagram.
72 changes: 0 additions & 72 deletions doc/source/appendix/avro_intro.rst

This file was deleted.

140 changes: 17 additions & 123 deletions doc/source/appendix/json_intro.rst
@@ -6,33 +6,33 @@ The JSON Format

JSON, or JavaScript Object Notation, is officially defined `here <http://json.org/example>`_. It is the standard data interchange format for web APIs.

- The GA4GH Web API uses a JSON wire protocol, exchanging JSON representations of the objects defined in its AVDL schemas. More information on the AVDL schemas is available in :ref:`avro`; basically, the AVDL type definitions say what attributes any given JSON object ought to have, and what ought to be stored in each of them.
+ The GA4GH Web API uses a JSON wire protocol, exchanging JSON representations of the objects defined in its Protocol Buffers schemas. More information on the schemas is available in :ref:`proto`; basically, the Protocol Buffers type definitions say what attributes any given JSON object ought to have, and what ought to be stored in each of them.

------------------------
GA4GH JSON Serialization
------------------------

- The GA4GH web APIs use Avro IDL to define their schemas, and use the associated Avro JSON serialization libraries. Since the schemas use a restricted subset of AVDL types (see `A note on unions`_ below), the serialized JSON format is fairly standard. This means that standard non-Avro JSON serialization and deserialization libraries (like, for example, the Python ``json`` module) can be used to serialize and deserialize GA4GH JSON messages in an idiomatic way.
+ The GA4GH web APIs use Protocol Buffers IDL to define their schemas, and use the associated Google Protocol Buffers JSON serialization libraries. Notice that the Protocol Buffers IDL uses snake case, while the on-the-wire protocol is in camel case.

---------------------
Serialization example
---------------------

For example, here is the schema definition for Variants (with comments removed)::

-    record Variant {
-      string id;
-      string variantSetId;
-      array<string> names = [];
-      union { null, long } created = null;
-      union { null, long } updated = null;
-      string referenceName;
-      long start;
-      long end;
-      string referenceBases;
-      array<string> alternateBases = [];
-      map<array<string>> info = {};
-      array<Call> calls = [];
-    }
+    message Variant {
+      string id = 1;
+      string variant_set_id = 2;
+      repeated string names = 3;
+      int64 created = 4;
+      int64 updated = 5;
+      string reference_name = 6;
+      int64 start = 7;
+      int64 end = 8;
+      string reference_bases = 9;
+      repeated string alternate_bases = 10;
+      map<string, google.protobuf.ListValue> info = 11;
+      repeated Call calls = 12;
+    }
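
Under the proto3 JSON mapping, the ``snake_case`` fields above appear in
``camelCase`` on the wire, as noted earlier. A minimal serialization sketch
using the official protobuf Python library (the ``variants_pb2`` module name
is an assumption about the generated code layout)::

    from google.protobuf import json_format
    from ga4gh import variants_pb2  # hypothetical generated module

    variant = variants_pb2.Variant(
        id="example-variant",
        variant_set_id="a-variant-set",
        reference_name="chr1",
        start=10177,
        end=10178,
    )

    # MessageToJson renames snake_case fields to camelCase; note that
    # proto3 renders int64 values as JSON strings.
    print(json_format.MessageToJson(variant))
    # {
    #   "id": "example-variant",
    #   "variantSetId": "a-variant-set",
    #   "referenceName": "chr1",
    #   "start": "10177",
    #   "end": "10178"
    # }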

Here is a serialized variant in JSON. It's a bit of an edge case in some respects::
@@ -61,114 +61,8 @@

Things to notice:
* A serialized record contains no explicit information about its type.
- * Arrays are serialized as JSON arrays.
+ * "repeated" types are serialized as JSON arrays.
* Maps are serialized as JSON objects.
- * Records are also serialized as JSON objects.
+ * Messages are also serialized as JSON objects.
* Enums (not shown here) are serialized as JSON strings.
* Nulls are serialized as JSON nulls.
* Fields with default values may be omitted (note the missing ``updated`` and ``calls`` here); omitting a field is equivalent to serializing its default value.
* Unions of ``null`` and a non-``null`` type are serialized as either ``null`` or the serialized non-null value. No other kinds of unions are present or permitted.
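
Because the wire format is plain JSON, standard JSON tooling interoperates
with the protobuf ``json_format`` parser. A small deserialization sketch under
the same assumed ``variants_pb2`` module::

    import json

    from google.protobuf import json_format
    from ga4gh import variants_pb2  # hypothetical generated module

    wire_json = '{"id": "v1", "variantSetId": "vs1", "start": "10177"}'

    # A standard library can inspect or pre-process the payload...
    payload = json.loads(wire_json)
    assert "variantSetId" in payload

    # ...and json_format.Parse maps the camelCase keys back onto the
    # snake_case message fields.
    variant = json_format.Parse(wire_json, variants_pb2.Variant())
    assert variant.variant_set_id == "vs1"
    assert variant.start == 10177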

-----------------------
A note on unions
-----------------------

As noted above, a field with union type serialized in GA4GH JSON looks no different from a field of any other type: you just put the field name and its recursively serialized value. In order for the Avro JSON libraries to support this, it is necessary that AVDL ``union`` types union together only ``null`` and a single non-``null`` type. If there were two or more non-``null`` types, the Avro libraries would need to include additional type information to say which to use when deserializing. Since we prohibit those unions, however, API clients and alternative server implementations never need to worry about this additional type information or its syntax. They can just handle "normal" JSON.

.. todo::
* add example of Python decoder output
* create a python class, if necessary

-----------------------
Wire protocol example
-----------------------

This is from the `ga4gh server example`_.

.. _ga4gh server example: http://ga4gh-reference-implementation.readthedocs.org/en/stable/demo.html#demo

To get information from the readgroupsets on a server, create a JSON format request::

{
"datasetIds":[],
"name":null
}

.. note::
What is this actually asking?

To send this to the server, we need to create an HTTP request which tells the
server what type of data to expect (JSON format, in this case). In our test
case, we have a server running at \http://localhost:8000.

Since we want to query the readgroupsets, we'll have to make that part of the URL.

.. note::
* How do we know it's v0.5.1?
* where is the readgroupsets/search part documented or defined?

To create a command line request, we can use `cURL <http://curl.haxx.se/>`_::

curl --data '{"datasetIds":[], "name":null}' --header 'Content-Type: application/json' http://localhost:8000/v0.5.1/readgroupsets/search
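
The same request can be made from Python; a short sketch using the
third-party ``requests`` package (server URL and API version exactly as in
the cURL call above)::

    import requests

    # Passing json= serializes the body and sets the
    # Content-Type: application/json header automatically.
    response = requests.post(
        "http://localhost:8000/v0.5.1/readgroupsets/search",
        json={"datasetIds": [], "name": None},
    )
    print(response.json())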

The server returns::

{
"nextPageToken": null,
"readGroupSets": [{
"readGroups": [{
"info": {},
"updated": 1432287597662,
"predictedInsertSize": null,
"description": null,
"created": 1432287597662,
"programs": [],
"sampleId": null,
"experiment": null,
"referenceSetId": null,
"id":
"low-coverage:HG00533.mapped.ILLUMINA.bwa.CHS.low_coverage.20120522",
"datasetId": null,
"name":
"low-coverage:HG00533.mapped.ILLUMINA.bwa.CHS.low_coverage.20120522"
},
{ "info": {},
"updated": 1432287793946,
"predictedInsertSize": null,
"description": null,
"created": 1432287793946,
"programs": [],
"sampleId": null,
"experiment": null,
"referenceSetId": null,
"id":
"low-coverage:HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522",
"datasetId": null,
"name":
"low-coverage:HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522"
},
{ "info": {},
"updated": 1432287793946,
"predictedInsertSize": null,
"description": null,
"created": 1432287793946,
"programs": [],
"sampleId": null,
"experiment": null,
"referenceSetId": null,
"id":
"low-coverage:HG00534.mapped.ILLUMINA.bwa.CHS.low_coverage.20120522",
"datasetId": null,
"name":
"low-coverage:HG00534.mapped.ILLUMINA.bwa.CHS.low_coverage.20120522"
}],
"id":
"low-coverage",
"datasetId": null,
"name": null
}
]
}



