A modularized version of the program PopIns2 for population-scale detection of non-reference sequence variants.
Popins4snake is a program consisting of several functions. The functions are designed to be chained into a workflow, together with calls to standard bioinformatics programs (samtools, bwa, ...) and bash commands.
The recommended way of running popins4snake is using the Snakemake workflow PopinSnake.
Before installing, make sure your system meets all of the following requirements:
Requirement | Tested with |
---|---|
64-bit POSIX-compliant operating system | Ubuntu 20.04, CentOS Linux 7.6 |
C++14 capable compiler | g++ vers. 4.9.2, 5.5.0, 7.2.0, 9.4.0 |
CMake | >= 2.8.12 (available through Conda) |
For the default settings of popins4snake a Bifrost installation with MAX_KMER_SIZE=64 is required (see below). Presently, the conda package of Bifrost does not meet this requirement. Therefore, Bifrost is included as a submodule in this repository.
CMake is required for installing Bifrost.
The SeqAn header library is included in this repository and comes with the git clone. There is no need for a manual installation.
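Before starting the installation, it can help to confirm that the build tools are available. The following is a minimal sketch and not part of popins4snake: `check_tool` is a hypothetical helper, and `git` and `make` are checked only because the installation steps in this README use them.

```shell
#!/bin/sh
# Hypothetical helper: report whether a program is available on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
    fi
}

# Tools used by the installation steps in this README.
for tool in g++ cmake make git; do
    check_tool "$tool"
done

# Print the versions to compare against the "Tested with" column.
if command -v g++ >/dev/null 2>&1; then g++ --version | head -n 1; fi
if command -v cmake >/dev/null 2>&1; then cmake --version | head -n 1; fi

# A 64-bit system typically reports x86_64 or aarch64 here.
uname -m
```

A `missing:` line points to a tool that should be installed first, for example CMake through Conda.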
First, clone the repository with the `--recursive` flag:
git clone --recursive https://gitlab.informatik.hu-berlin.de/fonda_a6/popins4snake.git
Next, compile and install Bifrost with `MAX_KMER_SIZE=64`. You can install it either globally on your system or locally in your home directory.
Here we describe how to install it locally in the folder `external/bifrost/local`.
This is the location where the popins4snake Makefile looks for it by default.
cd external/bifrost && mkdir build && cd build
mkdir ../local
cmake .. -DCMAKE_INSTALL_PREFIX=../local -DMAX_KMER_SIZE=64
make
make install
Now, you can compile popins4snake:
cd popins4snake
mkdir build
make
After the compilation with `make`, you should see the binary `popins4snake` in the cloned directory.
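A quick sanity check after the build can be sketched as follows. It only tests the two locations named in the steps above (the local Bifrost prefix and the binary); `check_install` is a hypothetical helper, and the files inside the Bifrost prefix are deliberately not checked because their names are not stated here.

```shell
# Hypothetical helper: check the two locations named in the installation steps.
# Argument: path to the cloned repository (defaults to the current directory).
check_install() {
    repo="${1:-.}"
    status=0
    if [ -d "$repo/external/bifrost/local" ]; then
        echo "ok: Bifrost install prefix exists"
    else
        echo "missing: $repo/external/bifrost/local"
        status=1
    fi
    # The binary must be an executable file, not a directory of the same name.
    if [ -x "$repo/popins4snake" ] && [ ! -d "$repo/popins4snake" ]; then
        echo "ok: popins4snake binary exists"
    else
        echo "missing: $repo/popins4snake"
        status=1
    fi
    return "$status"
}

check_install . || echo "installation looks incomplete"
```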
The PopIns2 Wiki gathers known issues that might occur during installation or runtime.
To get an overview of the functions offered in popins4snake, run `./popins4snake -h` after installation.
To display the help page of an individual popins4snake function, type `./popins4snake <command> --help`.
The former will print something similar to this:
=====================================================================
A modularized version of the program PopIns2
for population-scale detection of non-reference sequence variants
=====================================================================
SYNOPSIS
./popins4snake COMMAND [OPTIONS]
COMMAND
crop-unmapped Extract unmapped and poorly aligned reads from a BAM file.
merge-bams Merge two name-sorted BAM files of the same sample and set mate information of now paired reads.
merge-contigs Merge sets of contigs into supercontigs using a colored compacted de Bruijn Graph.
find-locations Find insertion locations of (super-)contigs per sample.
merge-locations Merge insertion locations from all samples into one file.
place-refalign Find positions of (super-)contigs by aligning contig ends to the reference genome.
place-splitalign Find positions of (super-)contigs by split-read alignment (per sample).
place-finish Combine (super-)contig positions found by split-read alignment from all samples.
genotype Determine genotypes of all insertions in a sample.
VERSION
0.1.0-a52d4f5, Date: 2022-08-25 14:42:31
Try `./popins4snake COMMAND --help' for more information on each command.
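To read the help pages of all commands in one go, a small loop is enough. This is a sketch: the command names are copied from the help output above, while the `POPINS` variable and the fallback message are illustrative assumptions.

```shell
# Path to the binary; adjust if it is not in the current directory.
POPINS="${POPINS:-./popins4snake}"

# Command names copied from the help output above.
COMMANDS="crop-unmapped merge-bams merge-contigs find-locations merge-locations place-refalign place-splitalign place-finish genotype"

for cmd in $COMMANDS; do
    if [ -x "$POPINS" ] && [ ! -d "$POPINS" ]; then
        "$POPINS" "$cmd" --help
    else
        # Binary not found: only show what would be run.
        echo "would run: $POPINS $cmd --help"
    fi
done
```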
Below we explain each function in detail: how it compares to the original PopIns2, how it is used in the PopinSnake workflow, and which parameters can be customized in the config files before running the workflow.
popins4snake crop-unmapped [OPTIONS] sample.bam
The `crop-unmapped` command identifies reads without a high-quality alignment to the reference genome. The input BAM file must be indexed, i.e. the file `sample.bam.bai` is expected to exist.
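The index requirement can be checked before calling `crop-unmapped`. The sketch below uses a hypothetical helper, `ensure_bam_index`, which creates the index with `samtools index` only if it is missing; it assumes samtools is installed and that the index sits next to the BAM file.

```shell
# Hypothetical helper: make sure <bam>.bai exists next to the BAM file.
ensure_bam_index() {
    bam="$1"
    if [ -f "$bam.bai" ]; then
        echo "index present: $bam.bai"
    elif command -v samtools >/dev/null 2>&1; then
        # samtools index writes <bam>.bai by default.
        samtools index "$bam"
    else
        echo "index missing and samtools not found: $bam" >&2
        return 1
    fi
}

# Example (not executed here):
#   ensure_bam_index sample.bam && ./popins4snake crop-unmapped sample.bam
```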
Originally part of the `assemble` function of PopIns and PopIns2, `crop-unmapped` is now an independent function in the workflow. The unmapped reads are sorted with samtools, filtered by read quality with SICKLE, and then assembled.
`crop-unmapped` now also provides its own quality filtering through the options `--min-qual` and `--min-read-len`, but SICKLE can still be called directly from the workflow.
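A call that uses the built-in quality filtering might look like the sketch below. The two option names come from the text above; the values `20` and `60` are placeholders rather than documented defaults, and the space-separated `--option value` syntax is an assumption. Check `./popins4snake crop-unmapped --help` for the actual usage.

```shell
# Placeholder thresholds, chosen only for illustration.
MIN_QUAL=20        # assumed: minimum base quality
MIN_READ_LEN=60    # assumed: minimum read length after filtering

# Assemble the command line; it is printed here instead of executed.
CMD="./popins4snake crop-unmapped --min-qual $MIN_QUAL --min-read-len $MIN_READ_LEN sample.bam"
echo "$CMD"
```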
Workflow Configuration
Through the config file, users can select their desired quality filtering method and choose their preferred assembler, MINIA or VELVET, for the assembly that follows the quality filtering step.
popins4snake merge-contigs [OPTIONS]
The `merge-contigs` command builds a colored and compacted de Bruijn graph (ccdbg) of all contigs of all samples in a given source directory. For general usage, see the PopIns2 merge function.
Workflow Configuration
In the Snakemake workflow, users can set the k-mer size used by the algorithm. The function also supports multi-threading for running on a cluster; set the number of threads in `cluster_config.yaml`.
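When choosing a k-mer size, keep the Bifrost build in mind. The sketch below assumes Bifrost's rule that the largest usable k is `MAX_KMER_SIZE - 1`, so a build with `MAX_KMER_SIZE=64` supports k up to 63; `kmer_size_ok` is a hypothetical helper, not a popins4snake function.

```shell
MAX_KMER_SIZE=64   # value used when compiling Bifrost during installation

# Hypothetical helper: succeed if k can be used with this Bifrost build.
kmer_size_ok() {
    [ "$1" -ge 1 ] && [ "$1" -lt "$MAX_KMER_SIZE" ]
}

for k in 31 63 64; do
    if kmer_size_ok "$k"; then
        echo "k=$k: supported"
    else
        echo "k=$k: needs a larger MAX_KMER_SIZE"
    fi
done
```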
popins4snake merge-bams [OPTIONS] input1.bam input2.bam
Originally part of the contigmap function in PopIns and PopIns2 and now used in the contigmap module of the workflow, `merge-bams` merges the mapped and sorted files produced by BWA and SAMtools. This process anchors both ends of each read pair, ensuring that pairs with one end aligned to the reference genome and the other end aligned to the supercontigs are brought together.
The functions below, `find-locations`, `merge-locations`, and the `place` functions, are part of the position module in the workflow. Because the workflow now supports optional contamination removal, some intermediate files differ depending on the configuration. These functions were therefore adjusted to accept files from which contamination has been removed and which were aligned to alternative references during the cleaning steps.
popins4snake find-locations [OPTIONS] SAMPLE_ID
This function anchors the aligned read pairs to the reference and determines, for each sample, the position of the read pairs on the genome.
popins4snake merge-locations [OPTIONS]
This function combines the detected locations of read pairs from all input samples in one file.
popins4snake place-refalign [OPTIONS]
popins4snake place-splitalign [OPTIONS] SAMPLE_ID
popins4snake place-finish [OPTIONS]
In brief, the place commands attempt to anchor the supercontigs to the samples. First, all potential anchor locations from all samples are collected. Then prefixes/suffixes of the supercontigs are aligned to all collected locations. For successful alignments, records are written to a VCF file. In the second step, all remaining locations are split-aligned per sample. Finally, all locations from all successful split-alignments are combined and added to the VCF file.
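The order of the position module can be summarized in a short dry-run sketch. It only prints the commands: per-sample steps are repeated for each sample ID, while the other steps run once. The sample IDs are hypothetical, `plan_position_module` is an illustrative helper, and all `[OPTIONS]` are omitted because they depend on the workflow configuration.

```shell
# Hypothetical helper: print the position-module commands in execution order.
# Arguments: one or more sample IDs.
plan_position_module() {
    for s in "$@"; do
        echo "popins4snake find-locations $s"      # once per sample
    done
    echo "popins4snake merge-locations"            # once, across all samples
    echo "popins4snake place-refalign"             # once, across all samples
    for s in "$@"; do
        echo "popins4snake place-splitalign $s"    # once per sample
    done
    echo "popins4snake place-finish"               # once, across all samples
}

plan_position_module sampleA sampleB
```

In the PopinSnake workflow, Snakemake takes care of this ordering; the sketch is only meant to make the dependencies between the commands visible.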
Workflow Configuration
In the workflow, users can set the value of the `--readlength` parameter for `place-refalign` and `place-splitalign` in `snake_config.yaml`.
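If the read length of a data set is not known, it can be estimated from the alignments. The sketch below assumes that `--readlength` refers to the length of the sequencing reads; `max_read_length` is a hypothetical helper that reads SAM records from standard input and prints the longest sequence it sees.

```shell
# Hypothetical helper: read SAM records on stdin and print the longest
# read sequence length (column 10 of a SAM record holds the sequence).
max_read_length() {
    awk -F '\t' '!/^@/ && $10 != "*" { n = length($10); if (n > max) max = n }
                 END { print max + 0 }'
}

# Example with samtools (not executed here):
#   samtools view sample.bam | head -n 10000 | max_read_length
```

Looking at only the first records, as in the example, is a shortcut that assumes reads of uniform length.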
popins4snake genotype [OPTIONS] SAMPLE_ID
The genotype command generates alleles (ALT) of the supercontigs with some flanking reference genome sequence. Then, the reads of a sample are aligned to ALT and to the reference genome around the breakpoint (REF). The ratio of alignments to ALT and REF determines a genotype quality and a final genotype prediction per variant per sample. Combined with the BCFtools sort and merge functions, these steps complete the genotype module of the workflow.
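The BCFtools part of this module can be sketched as a dry run. The file names and sample IDs below are hypothetical and the workflow's actual rules may differ; the sketch only illustrates that `bcftools merge` expects compressed, indexed input, which is why each per-sample file is sorted and indexed first.

```shell
# Hypothetical sample IDs; the per-sample VCF names are assumptions as well.
SAMPLES="sampleA sampleB"

# Print the bcftools commands in the order they would run.
plan_vcf_merge() {
    merged_inputs=""
    for s in $SAMPLES; do
        echo "bcftools sort -O z -o $s.sorted.vcf.gz $s.vcf"
        echo "bcftools index $s.sorted.vcf.gz"
        merged_inputs="$merged_inputs $s.sorted.vcf.gz"
    done
    echo "bcftools merge -O z -o all_samples.vcf.gz$merged_inputs"
}

plan_vcf_merge
```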
Krannich T., White W. T. J., Niehus S., Holley G., Halldórsson B. V., Kehr B. (2022). Population-scale detection of non-reference sequence variants using colored de Bruijn graphs. Bioinformatics, 38(3):604–611.
Kehr B., Helgadóttir A., Melsted P., Jónsson H., Helgason H., Jónasdóttir Að., Jónasdóttir As., Sigurðsson Á., Gylfason A., Halldórsson G. H., Kristmundsdóttir S., Þorgeirsson G., Ólafsson Í., Holm H., Þorsteinsdóttir U., Sulem P., Helgason A., Guðbjartsson D. F., Halldórsson B. V., Stefánsson K. (2017). Diversity in non-repetitive human sequences not found in the reference genome. Nature Genetics, 49(4):588–593.
Kehr B., Melsted P., Halldórsson B. V. (2016). PopIns: population-scale detection of novel sequence insertions. Bioinformatics, 32(7):961–967.