-
Notifications
You must be signed in to change notification settings - Fork 0
Basic Operations
It is useful to think of vg
as a toolbox for manipulating the internal variation graph representation. Almost any goal can be expressed in terms of operations on the graph; while developing this way of thinking initially requires some mental gymnastics, in the end it makes the user's life simpler by abstracting their problem out of the biological context and into a more general one. For example, variant calling can be expressed as the series of construct
, index
, map
, and call
operations, with inputs and outputs along the way.
vg construct -r reference.fa -v variants.vcf.gz > graph.vg # Construct
vg index -x graph.xg -g graph.gcsa -k 11 graph.vg # Index
vg map -f reads.fq -g graph.gcsa -x reads.xg > mapped.gam # Map
vg index -d mapped.gam.index -N mapped.gam # Index (but a different kind)
vg genotype graph.vg mapped.gam.index > calls.vg # Genotype (a type of call operation)
Let's walk through a basic vg
pipeline like the one above step-by-step to help illustrate what all it can do.
There are two main ways to build a graph in vg. If you have a reference genome, you should generally start with vg construct
. If you don't have a reference genome but do have some polished long reads, you could try a progressive assembly using vg msga
. Two short command line examples follow.
vg construct -r reference.fa -v variants.vcf.gz > graph.vg
vg msga -f polished_contigs.fa > graph.vg
At its most basic, construct
takes only a reference genome with the -r
flag. This generates a linear graph, completely analogous to the linear reference used for input.
vg construct -r reference.fa > ref.vg # a 'flat' graph, in vg parlance. One that contains no bubbles.
While all of the operations vg
will operate on this graph, it wouldn't be particularly interesting (bwa
could do the same things). What we would like to do is use the prior information from a population (e.g. as variant calls in a VCF) as a prior for mapping our reads. We hope that by doing so we'll get better mappings and better rediscovery of known variants. We can create such a graph like so if we have a VCF containing calls to the same reference we use as input:
# We need to gzip compress and tabix-index our VCF, like so:
# bgzip variants.vcf && tabix variants.vcf.gz
vg construct -r reference.fa -v variants.vcf.gz > graph.vg
We'll save the details of msga
for another wiki page, but in short we can imagine it doing the following steps:
- Make a 'flat' graph of the first read.
- Align the second read to this graph.
- Incorporate new paths introduced by the second read into the graph.
- Reindex the graph and continue the process with reads 3, 4, 5...N.
We've now seen how to create graphs, both flat ones and those containing reference variation. Now we'll talk about indexing, an essential step if we want to map reads.
Indexing a graph is directly analogous to indexing a linear reference with bwa index
; we need to create ways to efficiently look up entities in the graph if we want to map reads.
There are two main kinds of index used by vg map
, one used by vg genotype
, and an old (semi-deprecated) type that we'll discuss.
xg
is a library that can represent the same information in a variation graph - nodes, edges, and paths - but in much less space. This allows us to fit most or even all of the graph structure in memory, speeding up the process of mapping reads. It also allows us to efficiently query paths in the graph, for example in the gPBWT where we may want to retrieve haplotypes (which are stored as paths). We need the xg index for mapping, and we construct it with the following command:
vg index -x <name>.xg graph.vg
GCSA2 stands for "Generalized Compressed Suffix Array", but you can think of it as a graph BWT index like that used by bwa
. It allows fast lookup of the locations in the graph where specific subsequences are present - for example, if I wanted to know where the sequence "GATTACA" was in my graph, I could query the GCSA2 index, and it would provide me the location of the "GATTACA" sequence. The GCSA2 index requires a kmer size parameter as it's based on a pruned de Bruijn graph. We can construct a GCSA2 index like so:
vg index -g index.gcsa -k 11 graph.vg
The GAM index allows us to find which alignments in our GAM map to specific nodes/edges and vice versa. We need this index for genotyping, as we perform this query in that command's source code. Here's how we make one:
vg index -d mapped.gam.index -N reads.gam
This will produce an index call reads.gam.index
, which shows up on your file system as a directory.
Originally, vg used a RocksDB index to do what the xg and GCSA2 indices do. You could make one like this:
vg index graph.vg
We don't use this anymore because it's slow and we have the high-performance xg/GCSA2 combo, but it still exists, and it still pops up in some places.
Okay, we've now created a graph and generated our indices for mapping reads, so let's map some. There are a few ways to map reads, so we'll briefly review those before jumping in to recommended usage.
The old way of mapping used a kmer-based mapping scheme. This is pretty old-school and isn't as accurate as modern approaches based on maximal exact matches (a.k.a. MEMs). As such, this mapper has been more or less deprecated in favor of using MEMs, and we'll give examples using only MEMs.
To map reads with MEMs, we call vg map
without passing a kmer size. Single end reads are passed with a single -f
; paired ends require a -f
for each file containing reads, like so:
## Single end reads
vg map -x index.xg -g index.gcsa -f singles.fq > reads.gam
## Paired end reads
vg map -x index.xg -g index.gcsa -f reads1.fq -f reads2.fq > reads.gam
The mappings are output in GAM format, an analogue of BAM for graphs.
There are two variant callers in vg
, one relying on a samtools pileup
style structure and a more advanced one that behaves more like FreeBayes, termed genotype
. call
deserves its own wiki page, so we'll leave that for later. We'll talk about the genotyper here. The genotyper relies on the GAM index and the graph itself. It works by inserting all paths implied by reads into the graph and then looking for bubbles, then doing some logic to determine which side(s) of the bubble the reads support. You can run it like this:
vg genotype -v graph.vg mapped.gam.index/ > calls.vcf
This will barf a bunch of info to the terminal that relates to support for individual calls. It will also produce a VCF, but the coordinates won't be sorted, so you might want to sort them with vcflib
.
At this point, you've gone from a reference genome, some population variation and a set of reads to genotyped calls.