Skip to content
Eric T. Dawson edited this page Sep 21, 2016 · 3 revisions

Basic operations

It is useful to think of vg as a toolbox for manipulating the internal variation graph representation. Almost any goal can be expressed in terms of operations on the graph; while developing this way of thinking initially requires some mental gymnastics, in the end it makes the user's life simpler by abstracting their problem out of the biological context and into a more general one. For example, variant calling can be expressed as the series of construct, index, map, and call operations, with inputs and outputs along the way.

        vg construct -r reference.fa -v variants.vcf.gz > graph.vg # Construct
        vg index -x graph.xg -g graph.gcsa -k 11 graph.vg          # Index
        vg map -f reads.fq -g graph.gcsa -x reads.xg > mapped.gam  # Map
        vg index -d mapped.gam.index -N mapped.gam                                     # Index (but a different kind)
        vg genotype graph.vg mapped.gam.index > calls.vg           # Genotype (a type of call operation)

Let's walk through a basic vg pipeline like the one above step-by-step to help illustrate what all it can do.

First things first: build your graph

There are two main ways to build a graph in vg. If you have a reference genome, you should generally start with vg construct. If you don't have a reference genome but do have some polished long reads, you could try a progressive assembly using vg msga. Two short command line examples follow.

        vg construct -r reference.fa -v variants.vcf.gz > graph.vg
        vg msga -f polished_contigs.fa > graph.vg

At its most basic, construct takes only a reference genome with the -r flag. This generates a linear graph, completely analogous to the linear reference used for input.

        vg construct -r reference.fa > ref.vg # a 'flat' graph, in vg parlance. One that contains no bubbles.

While all of the operations vg will operate on this graph, it wouldn't be particularly interesting (bwa could do the same things). What we would like to do is use the prior information from a population (e.g. as variant calls in a VCF) as a prior for mapping our reads. We hope that by doing so we'll get better mappings and better rediscovery of known variants. We can create such a graph like so if we have a VCF containing calls to the same reference we use as input:

      # We need to gzip compress and tabix-index our VCF, like so:
      # bgzip variants.vcf && tabix variants.vcf.gz
      vg construct -r reference.fa -v variants.vcf.gz > graph.vg

We'll save the details of msga for another wiki page, but in short we can imagine it doing the following steps:

  1. Make a 'flat' graph of the first read.
  2. Align the second read to this graph.
  3. Incorporate new paths introduced by the second read into the graph.
  4. Reindex the graph and continue the process with reads 3, 4, 5...N.

We've now seen how to create graphs, both flat ones and those containing reference variation. Now we'll talk about indexing, an essential step if we want to map reads.

Index your graph

Indexing a graph is directly analogous to indexing a linear reference with bwa index; we need to create ways to efficiently look up entities in the graph if we want to map reads.

There are two main kinds of index used by vg map, one used by vg genotype, and an old (semi-deprecated) type that we'll discuss.

The xg index

xg is a library that can represent the same information in a variation graph - nodes, edges, and paths - but in much less space. This allows us to fit most or even all of the graph structure in memory, speeding up the process of mapping reads. It also allows us to efficiently query paths in the graph, for example in the gPBWT where we may want to retrieve haplotypes (which are stored as paths). We need the xg index for mapping, and we construct it with the following command:

            vg index -x <name>.xg graph.vg

The GCSA2 index

GCSA2 stands for "Generalized Compressed Suffix Array", but you can think of it as a graph BWT index like that used by bwa. It allows fast lookup of the locations in the graph where specific subsequences are present - for example, if I wanted to know where the sequence "GATTACA" was in my graph, I could query the GCSA2 index, and it would provide me the location of the "GATTACA" sequence. The GCSA2 index requires a kmer size parameter as it's based on a pruned de Bruijn graph. We can construct a GCSA2 index like so:

            vg index -g index.gcsa -k 11 graph.vg

The GAM index

The GAM index allows us to find which alignments in our GAM map to specific nodes/edges and vice versa. We need this index for genotyping, as we perform this query in that command's source code. Here's how we make one:

           vg index -d mapped.gam.index -N reads.gam

This will produce an index call reads.gam.index, which shows up on your file system as a directory.

The old vg disk-backed index

Originally, vg used a RocksDB index to do what the xg and GCSA2 indices do. You could make one like this:

          vg index graph.vg

We don't use this anymore because it's slow and we have the high-performance xg/GCSA2 combo, but it still exists, and it still pops up in some places.

Mapping reads

Okay, we've now created a graph and generated our indices for mapping reads, so let's map some. There are a few ways to map reads, so we'll briefly review those before jumping in to recommended usage.

The old way of mapping used a kmer-based mapping scheme. This is pretty old-school and isn't as accurate as modern approaches based on maximal exact matches (a.k.a. MEMs). As such, this mapper has been more or less deprecated in favor of using MEMs, and we'll give examples using only MEMs.

To map reads with MEMs, we call vg map without passing a kmer size. Single end reads are passed with a single -f; paired ends require a -f for each file containing reads, like so:

          ## Single end reads
          vg map -x index.xg -g index.gcsa -f singles.fq > reads.gam
          ## Paired end reads
          vg map -x index.xg -g index.gcsa -f reads1.fq -f reads2.fq > reads.gam

The mappings are output in GAM format, an analogue of BAM for graphs.

Calling variants with call and genotype

There are two variant callers in vg, one relying on a samtools pileup style structure and a more advanced one that behaves more like FreeBayes, termed genotype. call deserves its own wiki page, so we'll leave that for later. We'll talk about the genotyper here. The genotyper relies on the GAM index and the graph itself. It works by inserting all paths implied by reads into the graph and then looking for bubbles, then doing some logic to determine which side(s) of the bubble the reads support. You can run it like this:

         vg genotype -v graph.vg mapped.gam.index/ > calls.vcf

This will barf a bunch of info to the terminal that relates to support for individual calls. It will also produce a VCF, but the coordinates won't be sorted, so you might want to sort them with vcflib.

At this point, you've gone from a reference genome, some population variation and a set of reads to genotyped calls.