This repository has been archived by the owner on Oct 28, 2022. It is now read-only.

Commit 5ccf897

Merge pull request #547 from ga4gh/protobuf
Convert avdl -> proto3
david4096 committed Jun 2, 2016
2 parents a12e3fd + 8cc34de commit 5ccf897
Showing 39 changed files with 1,982 additions and 2,898 deletions.
50 changes: 4 additions & 46 deletions CONTRIBUTING.rst
@@ -118,54 +118,12 @@ Syntax Style and Conventions

The current code conventions for the source files are as follows:

- Follow the `protocol buffers style guide
<https://developers.google.com/protocol-buffers/docs/style>`__
- Use two-space indentation, and no tabs.
- Hard-wrap code to 80 characters per line.
- Use ``UpperCamelCase`` for object or record names.
- Use ``lowerCamelCase`` for attribute or method names.
- Use ``CONSTANT_CASE`` for global and constant values.
- Comments:

- Comments should be indented at the same level as the surrounding
code.
- Comments should precede the code that they make a comment on.
Documentation comments will not work otherwise.
- Documentation comments, which are intended to be processed by
avrodoc and displayed in the user-facing API documentation, must use
the ``/** ... */`` style, and must not have a leading ``*`` on each
internal line:

::

/**
This documentation comment will be
processed correctly by avrodoc.
*/

::

/**
* This documentation comment will have a
* bullet point at the start of every line
* when processed by avrodoc.
*/

- Block and multi-line non-documentation comments, intended for schema
developers only, must use the ``/* ... */`` style.

::

/*
This multi-line comment will not appear in the
avrodoc documentation and is intended for
schema developers.
*/

- All multi-line comments should have the comment text at the same
indent level as the comment delimiters.
- One-line non-documentation comments, intended for schema developers
only, must use the ``// ...`` style.
- Comments may use `reStructuredText
<http://docutils.sourceforge.net/rst.html>`__ markup.

Documentation
@@@@@@@@@@@@@
14 changes: 8 additions & 6 deletions INSTALL.rst
@@ -4,7 +4,7 @@ Installing the GA4GH Schemas
The schemas are documents (text files) that formally describe the
messages that pass between GA4GH reference servers and clients, which we
also refer to collectively as "the API." The schemas are written in a
- language called `Avro <http://avro.apache.org>`__.
+ language called `Protocol Buffers <https://developers.google.com/protocol-buffers/>`__.

We use the schemas in a couple of different ways:

@@ -14,19 +14,21 @@ We use the schemas in a couple of different ways:
Generating Source Code
@@@@@@@@@@@@@@@@@@@@@@

- (To be written.)
+ ::
+
+     $ cd src/main/proto && protoc --python_out=. ga4gh/*
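
The command above writes a ``*_pb2.py`` Python module for each schema file. As
a rough sketch of how the generated code might be used (the exact module names
depend on the ``.proto`` file layout and are assumptions here)::

    # Assumes protoc produced ga4gh/variants_pb2.py from ga4gh/variants.proto.
    from ga4gh import variants_pb2

    variant = variants_pb2.Variant()
    variant.reference_name = "chr1"
    variant.start = 10177

    # Messages round-trip through the compact protobuf binary format.
    data = variant.SerializeToString()
    same_variant = variants_pb2.Variant.FromString(data)
    assert same_variant.start == 10177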

Installing the Documentation Tools and Generating Documentation
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

- We use a tool called Sphinx to generate the documentation from Avro
- input files.
+ We use a tool called Sphinx to generate the documentation from Protocol
+ Buffers input files.

Install prerequisites
#####################

- To use the Sphinx/Avro documentation generator, you must install some
- software packages it requires.
+ To use the Sphinx/Protocol Buffers documentation generator, you must
+ install some software packages it requires.

Maven
$$$$$
4 changes: 2 additions & 2 deletions README.rst
@@ -31,8 +31,8 @@ primary data collected from sequencing machines.
The team will deliver:

#. Data model. An abstract, mathematically complete and precise model of
- the data that is manipulated by the API. See the `Avro
- directory <src/main/resources/avro>`__ for our in-progress work on
+ the data that is manipulated by the API. See the `Proto
+ directory <src/main/proto>`__ for our in-progress work on
defining v0.5 of the data model.

#. API Specification. A human-readable document introducing and
24 changes: 0 additions & 24 deletions doc/README.rst
@@ -23,28 +23,6 @@ prerequisites and, for the moment, you'll have to ferret those out
yourself.)


Building Process
@@@@@@@@@@@@@@@@

The current doc flow is roughly as follows::

    avdl ----1----> avpr ----2----> rst -|
                                         |----3----> html
                                  rst ---|

    |-- doc/source/schema/Makefile --|   |-- sphinx --|
    |------ top-level Makefile ('make docs') ---------|

* 1 = avro-tools, downloaded on demand; requires java
* 2 = avpr2rest.py, a custom script in tools/sphinx/
* 3 = sphinx-build, part of the sphinx package

.. warning:: Because we cannot currently run step 1 at Read the Docs,
it is imperative that developers type `make docs-schema`
at the top level if avdl files are updated, and then
commit the changed rst files.


Documentation tips
@@@@@@@@@@@@@@@@@@

@@ -54,5 +32,3 @@ Documents are written in `ReStructured Text
schemas.

- Abbreviations are stored in ``epilog.rst``.
- Reference avro elements with ``:avro:key``.

6 changes: 3 additions & 3 deletions doc/source/api/apidesign_intro.rst
@@ -103,9 +103,9 @@ Unresolved Issues

* What is the definition of the wire protocol? HTTP 1.0? Is HTTP 1.1
chunked encoding allowed? What is the specification for the
- generate JSON for a given an Avro schema?
+ generated JSON for a given Protocol Buffers schema?

- * What is the role of Avro? Is it for documentation-only or for use
-   as an IDL?
+ * What is the role of Protocol Buffers? Is it for documentation only
+   or for use as an IDL?

* Need overall object relationship diagram.
72 changes: 0 additions & 72 deletions doc/source/appendix/avro_intro.rst

This file was deleted.

140 changes: 17 additions & 123 deletions doc/source/appendix/json_intro.rst
@@ -6,33 +6,33 @@ The JSON Format

JSON, or JavaScript Object Notation, is officially defined `here <http://json.org/example>`_. It is the standard data interchange format for web APIs.

- The GA4GH Web API uses a JSON wire protocol, exchanging JSON representations of the objects defined in its AVDL schemas. More information on the AVDL schemas is available in :ref:`avro`; basically, the AVDL type definitions say what attributes any given JSON object ought to have, and what ought to be stored in each of them.
+ The GA4GH Web API uses a JSON wire protocol, exchanging JSON representations of the objects defined in its Protocol Buffers schemas. More information on the schemas is available in :ref:`proto`; basically, the Protocol Buffers type definitions say what attributes any given JSON object ought to have, and what ought to be stored in each of them.

------------------------
GA4GH JSON Serialization
------------------------

- The GA4GH web APIs use Avro IDL to define their schemas, and use the associated Avro JSON serialization libraries. Since the schemas use a restricted subset of AVDL types (see `A note on unions`_ below), the serialized JSON format is fairly standard. This means that standard non-Avro JSON serialization and deserialization libraries (like, for example, the Python ``json`` module) can be used to serialize and deserialize GA4GH JSON messages in an idiomatic way.
+ The GA4GH web APIs use Protocol Buffers IDL to define their schemas, and use the associated Google Protocol Buffers JSON serialization libraries. Notice that the Protocol Buffers IDL uses snake case, while the on-the-wire protocol is in camel case.

---------------------
Serialization example
---------------------

For example, here is the schema definition for Variants (with comments removed)::

-    record Variant {
-      string id;
-      string variantSetId;
-      array<string> names = [];
-      union { null, long } created = null;
-      union { null, long } updated = null;
-      string referenceName;
-      long start;
-      long end;
-      string referenceBases;
-      array<string> alternateBases = [];
-      map<array<string>> info = {};
-      array<Call> calls = [];
-    }
+    message Variant {
+      string id = 1;
+      string variant_set_id = 2;
+      repeated string names = 3;
+      int64 created = 4;
+      int64 updated = 5;
+      string reference_name = 6;
+      int64 start = 7;
+      int64 end = 8;
+      string reference_bases = 9;
+      repeated string alternate_bases = 10;
+      map<string, google.protobuf.ListValue> info = 11;
+      repeated Call calls = 12;
+    }
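
Under the proto3 JSON mapping, the ``snake_case`` fields above appear in
``camelCase`` on the wire, as noted earlier. A minimal serialization sketch
using the official protobuf Python library (the ``variants_pb2`` module name
is an assumption about the generated code layout)::

    from google.protobuf import json_format
    from ga4gh import variants_pb2  # hypothetical generated module

    variant = variants_pb2.Variant(
        id="example-variant",
        variant_set_id="a-variant-set",
        reference_name="chr1",
        start=10177,
        end=10178,
    )

    # MessageToJson renames snake_case fields to camelCase; note that
    # proto3 renders int64 values as JSON strings.
    print(json_format.MessageToJson(variant))
    # {
    #   "id": "example-variant",
    #   "variantSetId": "a-variant-set",
    #   "referenceName": "chr1",
    #   "start": "10177",
    #   "end": "10178"
    # }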

Here is a serialized variant in JSON. It's a bit of an edge case in some respects::
@@ -61,114 +61,8 @@

Things to notice:
* A serialized record contains no explicit information about its type.
- * Arrays are serialized as JSON arrays.
+ * "repeated" types are serialized as JSON arrays.
* Maps are serialized as JSON objects.
- * Records are also serialized as JSON objects.
+ * Messages are also serialized as JSON objects.
* Enums (not shown here) are serialized as JSON strings.
* Nulls are serialized as JSON nulls.
* Fields with default values may be omitted (note the missing ``updated`` and ``calls`` here); omitting a field is equivalent to serializing its default value.
* Unions of ``null`` and a non-``null`` type are serialized as either ``null`` or the serialized non-null value. No other kinds of unions are present or permitted.
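
Because the wire format is plain JSON, standard JSON tooling interoperates
with the protobuf ``json_format`` parser. A small deserialization sketch under
the same assumed ``variants_pb2`` module::

    import json

    from google.protobuf import json_format
    from ga4gh import variants_pb2  # hypothetical generated module

    wire_json = '{"id": "v1", "variantSetId": "vs1", "start": "10177"}'

    # A standard library can inspect or pre-process the payload...
    payload = json.loads(wire_json)
    assert "variantSetId" in payload

    # ...and json_format.Parse maps the camelCase keys back onto the
    # snake_case message fields.
    variant = json_format.Parse(wire_json, variants_pb2.Variant())
    assert variant.variant_set_id == "vs1"
    assert variant.start == 10177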

-----------------------
A note on unions
-----------------------

As noted above, a field with union type serialized in GA4GH JSON looks no different from a field of any other type: you just put the field name and its recursively serialized value. In order for the Avro JSON libraries to support this, it is necessary that AVDL ``union`` types union together only ``null`` and a single non-``null`` type. If there were two or more non-``null`` types, the Avro libraries would need to include additional type information to say which to use when deserializing. Since we prohibit those unions, however, API clients and alternative server implementations never need to worry about this additional type information or its syntax. They can just handle "normal" JSON.

.. todo::
* add example of Python decoder output
* create a python class, if necessary

-----------------------
Wire protocol example
-----------------------

This is from the `ga4gh server example`_.

.. _ga4gh server example: http://ga4gh-reference-implementation.readthedocs.org/en/stable/demo.html#demo

To get information from the readgroupsets on a server, create a JSON format request::

{
"datasetIds":[],
"name":null
}

.. note::
What is this actually asking?

To send this to the server, we need to create an HTTP request which tells the
server what type of data to expect (JSON format, in this case). In our test
case, we have a server running at \http://localhost:8000.

Since we want to query the readgroupsets, we'll have to make that part of the URL.

.. note::
* How do we know it's v0.5.1?
* where is the readgroupsets/search part documented or defined?

To create a command line request, we can use `cURL <http://curl.haxx.se/>`_::

curl --data '{"datasetIds":[], "name":null}' --header 'Content-Type: application/json' http://localhost:8000/v0.5.1/readgroupsets/search
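
The same request can be made from Python; a short sketch using the
third-party ``requests`` package (server URL and API version exactly as in
the cURL call above)::

    import requests

    # Passing json= serializes the body and sets the
    # Content-Type: application/json header automatically.
    response = requests.post(
        "http://localhost:8000/v0.5.1/readgroupsets/search",
        json={"datasetIds": [], "name": None},
    )
    print(response.json())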

The server returns::

{
"nextPageToken": null,
"readGroupSets": [{
"readGroups": [{
"info": {},
"updated": 1432287597662,
"predictedInsertSize": null,
"description": null,
"created": 1432287597662,
"programs": [],
"sampleId": null,
"experiment": null,
"referenceSetId": null,
"id":
"low-coverage:HG00533.mapped.ILLUMINA.bwa.CHS.low_coverage.20120522",
"datasetId": null,
"name":
"low-coverage:HG00533.mapped.ILLUMINA.bwa.CHS.low_coverage.20120522"
},
{ "info": {},
"updated": 1432287793946,
"predictedInsertSize": null,
"description": null,
"created": 1432287793946,
"programs": [],
"sampleId": null,
"experiment": null,
"referenceSetId": null,
"id":
"low-coverage:HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522",
"datasetId": null,
"name":
"low-coverage:HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522"
},
{ "info": {},
"updated": 1432287793946,
"predictedInsertSize": null,
"description": null,
"created": 1432287793946,
"programs": [],
"sampleId": null,
"experiment": null,
"referenceSetId": null,
"id":
"low-coverage:HG00534.mapped.ILLUMINA.bwa.CHS.low_coverage.20120522",
"datasetId": null,
"name":
"low-coverage:HG00534.mapped.ILLUMINA.bwa.CHS.low_coverage.20120522"
}],
"id":
"low-coverage",
"datasetId": null,
"name": null
}
]
}



