A tool for calculation of pileup mappability for any genome of interest.

License: MIT

Pupmapper: A Pileup Mappability Calculator

Motivation

The pileup mappability metric can be used to quickly identify regions where variant calling with short-read WGS data may be more difficult. pupmapper was created to let users quickly convert k-mer mappability scores to pileup mappability.

The first step of the pupmapper pipeline is to calculate k-mer uniqueness scores using the GenMap software. pupmapper then summarizes the pileup mappability of each genomic position using the k-mer uniqueness of all k-mers overlapping that position.

How is pileup mappability calculated from individual k-mer uniqueness/mappability scores?

[Figure: PmapFig]

The pileup mappability of a position is the mean k-mer mappability of all k-mers overlapping that position.

A pileup mappability score of 1 indicates that all k-mers overlapping a position are unique within the genome (under the user-defined parameters of uniqueness).

Pileup mappability is useful because it gives a sense of the uniqueness of all possible reads (of a defined length) that could align to a given position.
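The definition above can be sketched in a few lines of Python. This is a minimal illustration, not pupmapper's actual implementation: the function name `pileup_mappability` is hypothetical, the input is assumed to hold one score per k-mer start position (so a sequence of length n has n - k + 1 scores), and positions near the sequence ends are assumed to be averaged over the fewer k-mers that overlap them.

```python
def pileup_mappability(kmer_map, k):
    """Mean k-mer mappability over all k-mers overlapping each position.

    kmer_map[j] is the mappability of the k-mer starting at 0-based
    position j (assumed layout: one entry per k-mer start position).
    """
    n_kmers = len(kmer_map)
    seq_len = n_kmers + k - 1
    scores = []
    for pos in range(seq_len):
        # k-mers starting in [pos - k + 1, pos] overlap this position;
        # clip the window to the k-mers that actually exist.
        first = max(0, pos - k + 1)
        last = min(pos, n_kmers - 1)
        window = kmer_map[first:last + 1]
        scores.append(sum(window) / len(window))
    return scores


# Toy input: four 2-mers, the last two of which are not unique.
print(pileup_mappability([1.0, 1.0, 0.5, 0.5], k=2))
# [1.0, 1.0, 0.75, 0.5, 0.5]
```

In the toy input, the third position is covered by one unique k-mer and one k-mer with mappability 0.5, so its pileup mappability is 0.75.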

Useful reading for k-mer mappability and pileup mappability:

Derrien T, et al. (2012). Fast Computation and Applications of Genome Mappability. PLOS ONE 7(1): e30377. https://doi.org/10.1371/journal.pone.0030377

Pockrandt C, et al. (2020). GenMap: ultra-fast computation of genome mappability. Bioinformatics, Volume 36, Issue 12, June 2020, Pages 3687–3692. https://doi.org/10.1093/bioinformatics/btaa222

Lee H, Schatz MC. (2012). Genomic dark matter: the reliability of short read mapping illustrated by the genome mappability score. Bioinformatics, Volume 28, Issue 16, August 2012, Pages 2097–2105. https://doi.org/10.1093/bioinformatics/bts330

Installation

You will need to install the pupmapper Python package and make sure that the genmap software is installed and available on your $PATH environment variable.
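One quick way to confirm that genmap is reachable before running the pipeline is to look it up on the PATH from Python. The sketch below uses only the standard library. The helper name `genmap_available` is hypothetical, and the executable is assumed to be named `genmap`.

```python
import shutil


def genmap_available(executable="genmap"):
    """Return True if the executable can be found on the current PATH."""
    return shutil.which(executable) is not None


if __name__ == "__main__":
    print("genmap found" if genmap_available() else "genmap not found on PATH")
```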

Install locally

pupmapper can be installed by cloning this repository and installing with pip.

git clone git@github.com:maxgmarin/pupmapper.git

cd pupmapper

pip install . 

pip

pip install pupmapper

conda

🚧 Check back soon 🚧

Basic usage

1) run_all - Run the full pipeline starting with an input genome

pupmapper run_all -i Input.Genome.fasta -o output_directory/ -k 50 -e 1

The above command will first use genmap to calculate k-mer mappability scores for the input genome and then calculate pileup mappability scores.

Arguments:

-i, --in_genome_fa: Input genome FASTA file.
-o, --outdir: Directory for output files.
-k, --kmer_len: K-mer length (e.g., 50 bp).
-e, --errors: Number of allowed mismatches in k-mer mapping.
-g, --gff: (Optional) Input genome annotations in GFF format.
--save-numpy: (Optional) Save results as compressed numpy arrays.
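If you want to call the pipeline from a script, the arguments listed above can be assembled into a command and passed to `subprocess`. This is a sketch that only wraps the documented command-line flags. The helper names `build_run_all_cmd` and `run_all` are hypothetical and are not part of the pupmapper package.

```python
import subprocess


def build_run_all_cmd(genome_fa, outdir, kmer_len, errors, gff=None, save_numpy=False):
    """Assemble a `pupmapper run_all` command from the documented flags."""
    cmd = [
        "pupmapper", "run_all",
        "-i", str(genome_fa),
        "-o", str(outdir),
        "-k", str(kmer_len),
        "-e", str(errors),
    ]
    if gff is not None:
        cmd += ["-g", str(gff)]       # optional GFF annotations
    if save_numpy:
        cmd.append("--save-numpy")    # optional compressed numpy output
    return cmd


def run_all(*args, **kwargs):
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_run_all_cmd(*args, **kwargs), check=True)


print(" ".join(build_run_all_cmd("Input.Genome.fasta", "output_directory/", 50, 1)))
```

Building the command as a list rather than a single string avoids shell quoting issues when paths contain spaces.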

Analyzing included test sequence

If you wish to run pupmapper on a small test sequence (15 bp), you can run the following commands:

cd tests/data/Genmap_Ex1/

pupmapper run_all -i Ex1.genome.fasta -o Ex1_OutputDir -k 4 -e 0

This command will analyze the pileup mappability of the test sequence with a k-mer size of 4 bp and a maximum of 0 mismatches (K=4, E=0).
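To build intuition for what K=4, E=0 means, the sketch below scores each 4-mer of a toy sequence as the inverse of its number of exact occurrences. It is a simplified stand-in for GenMap, not its algorithm: it assumes forward-strand, exact matching only, the function name `exact_kmer_mappability` is hypothetical, and the 15 bp sequence is made up for illustration (it is not the contents of Ex1.genome.fasta).

```python
from collections import Counter


def exact_kmer_mappability(seq, k):
    """Score each k-mer as 1 / (number of exact occurrences in seq).

    Simplification: forward strand only, zero mismatches (E=0).
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    return [1.0 / counts[kmer] for kmer in kmers]


toy_seq = "ACGTACGTTTGCAAG"  # hypothetical 15 bp sequence
scores = exact_kmer_mappability(toy_seq, k=4)
print(scores)
# ACGT occurs at positions 0 and 4, so those two 4-mers score 0.5;
# the other ten 4-mers are unique and score 1.0.
```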

Full usage

pupmapper run_all --help
usage: pupmapper run_all [-h] -i IN_GENOME_FA -o OUTDIR -k KMER_LEN -e ERRORS [-g GFF] [--save-numpy]

optional arguments:
  -h, --help            show this help message and exit
  -i IN_GENOME_FA, --in_genome_fa IN_GENOME_FA
                        Input genome fasta file (.fasta)
  -o OUTDIR, --outdir OUTDIR
                        Directory for all outputs of k-mer and pileup mappability processing.
  -k KMER_LEN, --kmer_len KMER_LEN
                        k-mer length (bp) used to generate the k-mer mappability values
  -e ERRORS, --errors ERRORS
                        Number of errors (mismatches) allowed in Genmap's k-mer mappability calculation
  -g GFF, --gff GFF     GFF formatted genome annotations for input genome (.gff) (Optional)
  --save-numpy          If enabled, all pileup mappability scores will be output as compressed numpy arrays (.npz).
