Improved build, added license, file converters, module framework and README.
massie authored and Matt Massie committed Apr 21, 2013
1 parent b4563d6 commit 138748e
Showing 19 changed files with 1,404 additions and 43 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -1,2 +1,4 @@
.idea
*.iml
target
adam*.jar
308 changes: 308 additions & 0 deletions LICENSE.txt

Large diffs are not rendered by default.

147 changes: 146 additions & 1 deletion README.md
@@ -1,4 +1,149 @@
ADAM
====

Avro Datafile for Alignment/Mapping (ADAM)
*[Avro](http://avro.apache.org/) Datafile for Alignment/Mapping (ADAM)*

# Introduction

ADAM is a file format as well as a lightweight framework for genome analysis.

## The ADAM framework

ADAM is written to be modular. To create a new module, simply extend the
[AdamModule](src/main/java/edu/berkeley/amplab/adam/modules/AdamModule.java) abstract
class and define your module options using [args4j](http://args4j.kohsuke.org/). The
[CountReads](src/main/java/edu/berkeley/amplab/adam/modules/CountReads.java) class
is a good simple example to look at. Add your module to the `AdamMain` class and it
will appear in the module list. In the future, ADAM will support dynamically loaded
modules.
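
The module pattern described above might look roughly like the following. This is a minimal sketch and not the project's code: `SketchModule`, `CountArgsModule`, and `ModuleRegistry` are hypothetical stand-ins for `AdamModule`, a concrete module, and the module list kept in `AdamMain`. Options are passed as a plain array rather than parsed with args4j so that the example has no dependencies.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Small demo: register one module, print the listing, and run it.
class ModuleSketchDemo {
    public static void main(String[] args) {
        ModuleRegistry registry = new ModuleRegistry();
        registry.register(new CountArgsModule());
        System.out.print(registry.listing());
        registry.find("count_args").run(new String[] {"-input", "reads.bam"});
    }
}

// Hypothetical stand-in for the AdamModule abstract class; the real class's
// methods are not shown in this commit and may differ.
abstract class SketchModule {
    abstract String name();                 // name used on the command line
    abstract String description();          // one-line text for the module list
    abstract int run(String[] moduleArgs);  // returns a process-style exit code
}

// A trivial module that only reports how many options it was given.
class CountArgsModule extends SketchModule {
    int lastCount = -1;

    String name() { return "count_args"; }

    String description() { return "Counts the options passed to the module"; }

    int run(String[] moduleArgs) {
        lastCount = moduleArgs.length;
        System.out.println(lastCount + " option(s) received");
        return 0;
    }
}

// Hypothetical registry playing the role the README assigns to AdamMain.
class ModuleRegistry {
    private final Map<String, SketchModule> modules = new LinkedHashMap<>();

    void register(SketchModule module) { modules.put(module.name(), module); }

    // Returns null when no module with that name was registered.
    SketchModule find(String name) { return modules.get(name); }

    // One line per registered module, in registration order.
    String listing() {
        StringBuilder sb = new StringBuilder();
        for (SketchModule m : modules.values()) {
            sb.append(m.name()).append(" : ").append(m.description()).append('\n');
        }
        return sb.toString();
    }
}
```

Keeping the registry as an explicit list mirrors the README's note that modules are added to `AdamMain` by hand for now; dynamic loading would replace the `register` calls.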

## ADAM File Format

The ADAM file format is an improvement on the SAM or BAM file formats in a number of ways:

1. The ADAM file format is easily splittable for distributed processing with Hadoop
2. The ADAM file format is completely self-contained. Each read includes the reference
information.
3. The ADAM file format is [defined in the Avro IDL](src/main/resources/avro/protocol.avdl),
which makes it easy to create implementations in many different programming languages. This schema
is stored in the header of each ADAM file to ensure the data is self-descriptive.
4. The ADAM file format is compact. It holds more information about each read (e.g. reference
name, reference length) while still being about the same size as a BAM file. You can, of course, increase
the compression level to make an ADAM file smaller than a BAM file at the cost of encoding time.
5. The ADAM file has all the information needed to encode the data later as a SAM/BAM file if needed.
The entire SAM header is stored in the Avro meta-data with key `sam.header`.
6. The ADAM file format can be viewed in human-readable form as JSON using Avro tools.

# Getting Started

## Installation

You will need to have [Maven](http://maven.apache.org/) installed in order to build this project.
You will need to have [Hadoop](http://hadoop.apache.org/) or
[CDH](http://www.cloudera.com/content/cloudera/en/products/cdh.html) installed in order to run it.

```
$ git clone git@github.com:massie/adam.git
$ cd adam
$ mvn package
```

Maven will create a self-executing jar, e.g. adam-X.Y.jar, in the project root that is ready to be
used with Hadoop.

## Running ADAM

To see all the available ADAM modules, run the following command:

```
$ bin/hadoop jar adam-X.Y.jar
```

You will receive a listing of all modules and how to launch them. The command-line syntax to
run a module is:

```
$ bin/hadoop jar adam-X.Y.jar [generic Hadoop options] moduleName [module options]
```

For example, to convert a SAM/BAM file to an ADAM file and upload it on the fly, you
would use a command line similar to the following:

```
$ bin/hadoop jar /workspace/adam/adam-0.1-SNAPSHOT.jar \
-conf ~/.whirr/testcluster/hadoop-site.xml \
convert -input NA12878_chr20.bam -output /user/matt/NA12878_chr20.avro
```

This will convert the `NA12878_chr20.bam` file and send it directly to `/user/matt/NA12878_chr20.avro`.
To see all the options for the `convert` module, run it without any options, e.g.

```
$ bin/hadoop jar adam-X.Y.jar convert
```
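
The command-line syntax described in this section implies a simple split: once the generic Hadoop options have been consumed, the first remaining argument names the module and everything after it belongs to that module. The sketch below illustrates only that split. `CommandLineSplit` is a hypothetical helper, and it assumes the generic options were already stripped (in Hadoop tools this is commonly done with `GenericOptionsParser`; whether ADAM does it that way is not shown in this commit).

```java
import java.util.Arrays;

// Hypothetical helper: separates the module name from the module's own options.
class CommandLineSplit {
    final String moduleName;       // null when no module was named
    final String[] moduleOptions;  // never null, possibly empty

    CommandLineSplit(String moduleName, String[] moduleOptions) {
        this.moduleName = moduleName;
        this.moduleOptions = moduleOptions;
    }

    // `remaining` is assumed to hold the arguments left over after the
    // generic Hadoop options (such as -conf) have been removed.
    static CommandLineSplit split(String[] remaining) {
        if (remaining.length == 0) {
            // No module named: a caller could print the module listing here.
            return new CommandLineSplit(null, new String[0]);
        }
        String[] options = Arrays.copyOfRange(remaining, 1, remaining.length);
        return new CommandLineSplit(remaining[0], options);
    }

    public static void main(String[] args) {
        CommandLineSplit s = split(new String[] {
            "convert", "-input", "NA12878_chr20.bam",
            "-output", "/user/matt/NA12878_chr20.avro"});
        System.out.println("module  = " + s.moduleName);
        System.out.println("options = " + Arrays.toString(s.moduleOptions));
    }
}
```

An empty `moduleOptions` array corresponds to the case above where a module is run without options and responds by printing its usage.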

## A step-by-step example

This example will show you how to convert a BAM file to an ADAM file and then count the number of reads.

First, we need to convert the BAM file to an ADAM file and upload it to our Hadoop cluster.

```
$ bin/hadoop jar /workspace/adam/adam-0.1-SNAPSHOT.jar \
-conf ~/.whirr/testcluster/hadoop-site.xml \
convert -input NA12878_chr20.bam -output /user/matt/NA12878_chr20.avro
```
Note that you can also set the `HADOOP_CONF_DIR` environment variable instead of passing the `-conf` generic option.

ADAM will provide feedback about the reference being converted as well as the locus. When it finishes,
you should see a message similar to `X secs to convert Y reads`.

Now that your ADAM file is stored in Hadoop, you can run analysis on it. Let's count the number
of reads per reference in the ADAM file using the `count_reads` module.

```
$ bin/hadoop jar /workspace/adam/adam-0.1-SNAPSHOT.jar \
-conf ~/.whirr/testcluster/hadoop-site.xml \
count_reads -input /user/matt/NA12878_chr20.avro -output /user/matt/results
```

The `results` directory will contain the output of the reducer, e.g.

```
$ bin/hadoop fs -ls /user/matt/results
/user/matt/results/_SUCCESS
/user/matt/results/part-00000.avro
```

Let's look at the content of the results.

```
$ bin/hadoop fs -get /user/matt/results/part-00000.avro .
$ avrotools tojson part-00000.avro
{"key":"chr20","value":51554029}
```

This ADAM file had 51554029 reads on a single reference `chr20` (chromosome 20). Note that `avrotools` is
included with the [Apache Avro](http://avro.apache.org/) distribution.

The results are stored as an Avro file to make it easy to use as input to another job.
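
Conceptually, counting reads per reference is a group-and-count over reference names. The following is a minimal in-memory sketch of that idea, not the MapReduce job that `count_reads` runs: `CountReadsSketch` and its input list are hypothetical, and the printed lines only imitate the key/value shape of the JSON shown above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of "count reads per reference".
class CountReadsSketch {
    // Takes the reference name of each read and returns reads per reference,
    // sorted by reference name.
    static Map<String, Long> countByReference(List<String> referenceNames) {
        Map<String, Long> counts = new TreeMap<>();
        for (String ref : referenceNames) {
            counts.merge(ref, 1L, Long::sum);  // add 1 to this reference's count
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up input: four reads, three of them on chr20.
        List<String> refs = Arrays.asList("chr20", "chr20", "chr1", "chr20");
        for (Map.Entry<String, Long> e : countByReference(refs).entrySet()) {
            System.out.println(
                "{\"key\":\"" + e.getKey() + "\",\"value\":" + e.getValue() + "}");
        }
    }
}
```

In a distributed setting the same computation would be split so that each mapper emits a reference name per read and the reducer sums the counts, which is consistent with the single reducer output file listed earlier.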

# License

ADAM is released under an [Apache 2.0 license](LICENSE.txt).

# Future Work

If you're interested in helping with this project, here are things to do. Feel free to fork away and send
me a pull request.

* Add ability to run GATK walkers inside modules (I have a good idea how to do this. Prototyping now.).
* Write tests
* Support dynamically loaded modules
* Possibly support side-loading reference information
* Processing of optional attributes

# Support

Feel free to contact me directly if you have any questions about ADAM. My email address is `[email protected]`.

83 changes: 74 additions & 9 deletions pom.xml
@@ -4,19 +4,31 @@
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>

<groupId>bamavro</groupId>
<artifactId>bamavro</artifactId>
<version>1.0-SNAPSHOT</version>
<groupId>edu.berkeley.amplab</groupId>
<artifactId>adam</artifactId>
<version>0.1-SNAPSHOT</version>
<packaging>jar</packaging>

<properties>
<avro.version>1.7.4</avro.version>
<hadoop.version>1.1.2</hadoop.version>
</properties>

<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.1</version>
<configuration>
<source>1.6</source>
<target>1.6</target>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.avro</groupId>
<artifactId>avro-maven-plugin</artifactId>
<version>${avro.version}</version>
<executions>
<execution>
<id>schemas</id>
@@ -27,8 +39,9 @@
<goal>idl-protocol</goal>
</goals>
<configuration>
<sourceDirectory>${project.basedir}/src/main/resources/edu/berkeley/amplab/bamavro/avro</sourceDirectory>
<outputDirectory>${project.basedir}/src/main/java/edu/berkeley/amplab/bamavro/avro</outputDirectory>
<sourceDirectory>${project.basedir}/src/main/resources/avro
</sourceDirectory>
<outputDirectory>${project.basedir}/src/main/java</outputDirectory>
<!--
<testSourceDirectory>${project.basedir}/src/test/avro/</testSourceDirectory>
<testOutputDirectory>${project.basedir}/src/test/java/</testOutputDirectory>
@@ -37,20 +50,72 @@
</execution>
</executions>
</plugin>
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<version>2.2.1</version>
<executions>
<execution>
<id>job</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
<configuration>
<descriptors>
<descriptor>src/main/assembly/job.xml</descriptor>
</descriptors>
<finalName>adam-${project.version}</finalName>
<outputDirectory>${project.build.directory}/../</outputDirectory>
<appendAssemblyId>false</appendAssemblyId>
<attach>false</attach>
<archive>
<manifest>
<mainClass>edu.berkeley.amplab.adam.AdamMain</mainClass>
</manifest>
</archive>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>

<dependencies>
<!-- Picard -->
<dependency>
<groupId>org.utgenome.thirdparty</groupId>
<artifactId>picard</artifactId>
<version>1.86.0</version>
</dependency>
<!-- Avro -->
<dependency>
<groupId>org.apache.avro</groupId>
<artifactId>avro</artifactId>
<version>${avro.version}</version>
</dependency>
<dependency>
<groupId>org.utgenome.thirdparty</groupId>
<artifactId>picard</artifactId>
<version>1.86.0</version>
<groupId>org.apache.avro</groupId>
<artifactId>avro-mapred</artifactId>
<version>${avro.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-core</artifactId>
<version>1.1.0</version>
</dependency>
<dependency>
<groupId>args4j</groupId>
<artifactId>args4j</artifactId>
<version>2.0.23</version>
</dependency>

<!--
<dependency>
<groupId>commons-cli</groupId>
<artifactId>commons-cli</artifactId>
<version>1.2</version>
</dependency>
-->
</dependencies>

</project>
32 changes: 32 additions & 0 deletions src/main/assembly/job.xml
@@ -0,0 +1,32 @@
<assembly xmlns="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2 http://maven.apache.org/xsd/assembly-1.1.2.xsd">
<id>job</id>
<formats>
<format>jar</format>
</formats>
<includeBaseDirectory>false</includeBaseDirectory>
<dependencySets>
<dependencySet>
<useTransitiveFiltering>true</useTransitiveFiltering>
<unpack>true</unpack>
<unpackOptions>
<excludes>
<exclude>META-INF/LICENSE</exclude> <!-- Exclude to avoid clash with directory of the same name -->
</excludes>
</unpackOptions>
<scope>runtime</scope>
<outputDirectory>/</outputDirectory>
<useProjectArtifact>false</useProjectArtifact>
<excludes>
<exclude>org.apache.hadoop:*</exclude>
</excludes>
</dependencySet>
</dependencySets>
<fileSets>
<fileSet>
<directory>target/classes</directory>
<outputDirectory>/</outputDirectory>
</fileSet>
</fileSets>
</assembly>
1 change: 1 addition & 0 deletions src/main/java/edu/berkeley/amplab/adam/.gitignore
@@ -0,0 +1 @@
avro