# Creating vg graphs
The following is a copy (with some changes) of this guide in the vg GitHub wiki, and is included here as a guide to creating graphs that can be used with Graph Peak Caller:
You should have a reference genome (fasta format) and a VCF relative to this reference. Make sure the chromosome names in your VCF match the chromosome names used in the reference genome (e.g. not "chr5" in the reference and "5" in the VCF). Things get a bit easier if all your chromosomes are numbered ("5" and not "chr5"), since this makes it easier to use bash for loops to iterate over the chromosomes.
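One way to check for name mismatches is to compare the two name lists directly. This is a quick sketch, assuming samtools is available to index the reference and that the VCF is bgzip-compressed:

```
# List chromosome names in the reference (creates your_reference.fa.fai)
samtools faidx your_reference.fa
cut -f1 your_reference.fa.fai | sort > ref_chroms.txt

# List chromosome names actually used in the VCF records
zcat your_variants.vcf.gz | grep -v '^#' | cut -f1 | sort -u > vcf_chroms.txt

# Names printed here occur in one file but not the other
comm -3 ref_chroms.txt vcf_chroms.txt
```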
In this guide we assume you have 22 chromosomes in addition to the X and Y chromosomes. Thus, we are using bash for loops of the form `for i in $(seq 1 22; echo X; echo Y)`. If you are e.g. making a graph for Arabidopsis thaliana with chromosomes 1 to 5, you will need to change this to `for i in $(seq 1 5)`.
Note that making a graph using the human 1000 Genomes data will require approximately 1.7 TB of disk space and 3 days of running time on 40 cores for the GCSA index alone. The xg index will require a system with at least 200 GB of RAM. Making a graph for A. thaliana will be considerably faster and will require much less disk space (approximately 500 GB) and less CPU time.
With `vg construct`, we take a VCF file and the reference sequence it was based on and generate a variation graph suitable for use by other tools in vg. In the construction process, the VCF+reference combination is interpreted as a directed acyclic sequence graph. Variants in the input are decomposed using pairwise alignment between the reference and alternate alleles, so that the input is interpreted as if it had been passed through `vcfallelicprimitives`.
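As a toy illustration of this decomposition (hypothetical coordinates, not from any real dataset), a complex substitution is broken into its primitive parts:

```
# Hypothetical input record: a 4 bp complex substitution
#CHROM  POS  ID  REF   ALT
1       100  .   ATGC  AGGT

# After decomposition into allelic primitives, it is treated as
# two independent SNPs embedded in the shared reference sequence:
1       101  .   T     G
1       103  .   C     T
```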
The construction process can be distributed, as we'll construct the whole genome graph one chromosome at a time. We specify which chromosome we want to work on by giving `vg construct` the `-R`/`--region` parameter, which limits the constructed graph to a particular reference range. Any samtools-format region specifier will work, but here we'll use whole chromosomes. Presently, it's advisable to break the construction apart due to the memory requirements of vg: it is optimized for high-performance mutable graphs, and stores them in memory in a very non-succinct way.
Now we make a file `$i.vg` for each chromosome `$i`:
```
ref=your_reference.fa
vars=your_variants.vcf.gz
echo constructing graph
(seq 1 22; echo X; echo Y) | parallel -j 24 "time vg construct -C -R {} -r $ref -v $vars -t 1 -m 32 >{}.vg"
```
This step should take about half an hour when using 24 cores for a human reference genome with the 1000 Genomes VCF. The result will be around 4.5G of `.vg` files, which contain compressed protobuf (v3) messages defined by the vg format specification, with a chunked on-disk format implemented by `stream.hpp`.
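To spot-check one of the resulting files, you can convert it to GFA and look at the first few lines, or print basic size statistics (a quick sanity check, not part of the pipeline):

```
# S lines are nodes, L lines are edges, P lines are embedded paths
vg view 1.vg | head -n 20

# Node and edge counts
vg stats -z 1.vg
```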
Next, we'll need to generate a joint id space across the graphs, since they were generated without coordination. The node id space that results from `vg construct` will be partially ordered, based on the fact that a reference + VCF will generate a DAG. There are other ways to work around this (such as giving each graph a big node id space up front), but this seems to be the simplest solution. It will take another hour or so:
```
vg ids -j $(for i in $(seq 1 22; echo X; echo Y); do echo $i.vg; done)
```
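If you want to verify that the id spaces no longer overlap, you can print the node id range per chromosome. This is an optional check that assumes your vg build supports the `-r`/`--node-id-range` option of `vg stats`:

```
# Each chromosome should now cover a distinct, non-overlapping id range
for i in $(seq 1 22; echo X; echo Y); do
    echo -n "$i: "
    vg stats -r $i.vg
done
```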
Note that we're passing multiple `.vg` files to `vg ids`. These are treated as if they define one big graph. You can pass multiple sub-graphs to `vg kmers`, `vg ids`, and `vg index`. Each graph will be loaded and processed serially, limiting the maximum memory usage to the size of the largest graph.
Now that the `.vg` files contain a single graph with a coherent node id system, we can convert them into a single database that contains both the graph itself and the kmers of the graph.
For a human reference graph, you will need 60G of disk to store this index pair, but several terabytes (1.7T) to build it, plus a lot of I/O bandwidth. Both indexing processes will require a large amount of memory (up to ~200G for xg), and you may wish to distribute parts of the GCSA indexing over a cluster to complete in a reasonable amount of wallclock time.
We build the xg index in one step:
```
vg index -x wg.xg $(for i in $(seq 1 22; echo X; echo Y); do echo $i.vg; done)
```
It should take about an hour and a half. The resulting index is 30G, and takes up about the same amount of space on disk and in memory. It also provides near constant-time queries for entities of interest in the graph.
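Once built, the xg index supports fast positional queries. As a quick usage sketch (assuming the path names in the graph match your chromosome names):

```
# Extract the subgraph covering a 1 kb window on chromosome 1
vg find -x wg.xg -p 1:100000-101000 > region.vg
vg view region.vg | head
```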
The GCSA index requires graph simplification. We mask out high-complexity regions from the graph and replace them with paths corresponding to the reference sequence.
```
for chr in $(seq 1 22; echo X; echo Y); do
    vg prune -r $chr.vg > $chr.pruned.vg
done
```
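If you want to see how much of each graph the pruning removed, you can compare node and edge counts before and after (an optional check, again assuming `vg stats -z` is available in your build):

```
# Pruning should reduce the node and edge counts in high-complexity regions
vg stats -z 1.vg
vg stats -z 1.pruned.vg
```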
Although it masks regions of the graph, the complexity reduction does not destroy the id space. As such, we can use the pruned graphs to build an index for the whole genome graph.
```
vg index -g wg.gcsa $(for i in $(seq 1 22; echo X; echo Y); do echo $i.pruned.vg; done)
```
For a human reference graph, this takes about 32 hours on a 32 core system. The resulting GCSA2 index will be named `wg.gcsa`, and the LCP array will be stored as `wg.gcsa.lcp`. In this case we've done 3 rounds of prefix doubling (16 to 32, 32 to 64, and 64 to 128), so the index will allow queries of up to length 128. We need many terabytes of I/O to build the index, but only around 100G of RAM at peak. Important: the location of the temporary files created for this process is specified using the `TMPDIR` environment variable. Make sure it is set to a volume with a couple of terabytes of free space.
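For example, before running the GCSA indexing step you might point `TMPDIR` at a suitably large scratch volume (the path here is a placeholder for your own system):

```
# Use a scratch volume with plenty of free space for the GCSA2 temporaries
export TMPDIR=/scratch/tmp
mkdir -p "$TMPDIR"
df -h "$TMPDIR"   # confirm a couple of terabytes are free
```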