
PHI (Pangenome-based Haplotype Inference)

Getting Started

Prerequisites

Before using PHI, please ensure that Miniforge is installed: Miniforge Installation Guide. This package installer is used to install a few dependencies, such as VG and samtools. To run PHI, you also need a Gurobi license. You can get a free academic license here. Download the gurobi.lic file and save it in your home directory.
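A quick preflight check for the two prerequisites above could look like the following sketch. It is not part of PHI: the function names are hypothetical, and it assumes that Miniforge exposes a `conda` executable on your PATH and that the license is either in your home directory or pointed to by Gurobi's `GRB_LICENSE_FILE` environment variable.

```python
import os
import shutil
from pathlib import Path


def find_gurobi_license(home=None, env=None):
    """Return the path of a Gurobi license file, or None if none is found."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    candidates = []
    # Gurobi honours GRB_LICENSE_FILE when it is set.
    if env.get("GRB_LICENSE_FILE"):
        candidates.append(Path(env["GRB_LICENSE_FILE"]))
    # Location recommended in this README: the home directory.
    candidates.append(home / "gurobi.lic")
    for path in candidates:
        if path.is_file():
            return path
    return None


def missing_prerequisites(home=None, env=None, which=shutil.which):
    """List the prerequisites that could not be located (hypothetical helper)."""
    missing = []
    if which("conda") is None:  # assumption: Miniforge puts `conda` on PATH
        missing.append("conda (Miniforge)")
    if find_gurobi_license(home, env) is None:
        missing.append("gurobi.lic")
    return missing


if __name__ == "__main__":
    problems = missing_prerequisites()
    print("OK" if not problems else "Missing: " + ", ".join(problems))
```

An empty list means both prerequisites were found; the check does not validate that the license itself is current.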

Get PHI

git clone https://github.com/at-cg/PHI
cd PHI
# Install dependencies (Miniforge is required)
./Installdeps
export PATH="$(pwd)/extra/bin:$PATH"
export LD_LIBRARY_PATH="$(pwd)/extra/lib:$LD_LIBRARY_PATH"
make

# test run 
./PHI -t32 -g test/MHC_4.gfa.gz -r test/CHM13_reads.fq.gz -o CHM13.fa

# test run with VCF file as input
./vcf2gfa.py -v test/MHC_4.vcf.gz -r test/MHC-CHM13.0.fa.gz | bgzip > test/MHC_4_vcf.gfa.gz
./PHI -t32 -g test/MHC_4_vcf.gfa.gz -r test/CHM13_reads.fq.gz -o CHM13.fa

Adding Binary and Library Paths to .bashrc

To make the extra/bin and extra/lib directories available in every terminal session, append the exports to your ~/.bashrc. Run these commands from the PHI directory so that $(pwd) is expanded to its absolute path when the lines are written, rather than to whatever directory a new shell starts in.

# Add extra/bin and extra/lib to .bashrc (run from the PHI directory)
echo "export PATH=\"$(pwd)/extra/bin:\$PATH\"" >> ~/.bashrc
echo "export LD_LIBRARY_PATH=\"$(pwd)/extra/lib:\$LD_LIBRARY_PATH\"" >> ~/.bashrc
source ~/.bashrc

Introduction

PHI is a pangenome-based genotyping method. It estimates a complete haplotype sequence from low-coverage sequencing data (short reads or long reads) of a haploid genome. Users should provide a pangenome graph reference in one of two formats:

  • Graph Format (GFA v1.1): A sequence-graph representation of the pangenome. The graph should be acyclic.
  • Variant Call Format (VCF): A list of multi-sample, multi-allelic phased variants along with a reference genome.
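Since the GFA input is expected to be acyclic, it can be useful to check a graph before running PHI. The sketch below is not part of PHI and its function names are hypothetical. As a simplifying assumption, it treats every `L` line as a directed edge from the first segment to the second and ignores orientations, so it is only a rough check for graphs whose links are all in forward orientation.

```python
import gzip
from collections import defaultdict, deque


def read_gfa_links(lines):
    """Collect segment names (S lines) and directed links (L lines) from GFA text."""
    nodes, edges = set(), []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":
            nodes.add(fields[1])
        elif fields[0] == "L":
            # L <from> <from_orient> <to> <to_orient> <overlap>
            src, dst = fields[1], fields[3]
            nodes.update((src, dst))
            edges.append((src, dst))
    return nodes, edges


def is_acyclic(nodes, edges):
    """Kahn's algorithm: the graph is acyclic iff every node can be peeled off."""
    indegree = {node: 0 for node in nodes}
    successors = defaultdict(list)
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1
    queue = deque(node for node, degree in indegree.items() if degree == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    return visited == len(nodes)


def gfa_file_is_acyclic(path):
    """Convenience wrapper that accepts plain or gzip-compressed GFA files."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return is_acyclic(*read_gfa_links(handle))
```

For example, `gfa_file_is_acyclic("test/MHC_4.gfa.gz")` would run the check on the test graph shipped with the repository.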

The output of PHI is the haplotype sequence (FASTA) spelled by the optimal inferred path in the graph. PHI identifies a path in the pangenome graph that maximizes the matches between the path and read k-mers while minimizing recombination events (haplotype switches) along the path. We formulate this as an integer program and solve it to optimality with the Gurobi optimizer. Details of the formulations are described in our paper.
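To build intuition for the trade-off between k-mer matches and recombination events, here is a toy dynamic program. It is not PHI's integer program and does not use Gurobi: it assumes a simplified model in which the graph is cut into consecutive blocks, `match_scores[h][i]` (a hypothetical input) is the number of read k-mer matches for haplotype `h` in block `i`, and each haplotype switch costs a fixed, hypothetical `switch_penalty`.

```python
def best_mosaic(match_scores, switch_penalty):
    """Pick one haplotype per block, maximizing matches minus switch penalties.

    match_scores[h][i] is the match count of haplotype h in block i.
    Returns (total_score, list_of_chosen_haplotype_indices).
    """
    n_haps = len(match_scores)
    n_blocks = len(match_scores[0])
    # best[h] = best score of any mosaic ending in haplotype h at the current block
    best = [match_scores[h][0] for h in range(n_haps)]
    back = []  # back[i - 1][h] = haplotype used in block i - 1 when block i uses h
    for i in range(1, n_blocks):
        top = max(range(n_haps), key=lambda h: best[h])
        new_best, pointers = [], []
        for h in range(n_haps):
            stay = best[h]
            switch = best[top] - switch_penalty
            if stay >= switch:
                prev, score = h, stay
            else:
                prev, score = top, switch
            new_best.append(score + match_scores[h][i])
            pointers.append(prev)
        best = new_best
        back.append(pointers)
    # Trace back from the best final haplotype.
    h = max(range(n_haps), key=lambda k: best[k])
    total = best[h]
    path = [h]
    for pointers in reversed(back):
        h = pointers[h]
        path.append(h)
    path.reverse()
    return total, path


if __name__ == "__main__":
    scores = [[5, 5, 0, 0],
              [0, 0, 5, 5]]
    print(best_mosaic(scores, switch_penalty=3))  # → (17, [0, 0, 1, 1])
```

With a small penalty the best mosaic switches haplotypes once to collect matches from both; raising the penalty above the gain from switching keeps the path on a single haplotype.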

Results

We benchmarked PHI (v1.0) using short-read datasets sampled from MHC sequences of five haplotypes (APD, DBB, MANN, QBL, and SSTO). These data were generated by Houwaart et al. (2022). The datasets were downsampled to coverages ranging from 0.1x to 10x. We built a pangenome graph of 49 complete MHC sequences using Minigraph-Cactus. To assess the accuracy of PHI, we computed the edit distance between the inferred haplotype sequences and the MHC sequences from Houwaart et al., which were determined using de novo assembly and curation.
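For reference, the edit distance used as the accuracy metric counts the minimum number of single-base substitutions, insertions, and deletions needed to turn one sequence into the other. The sketch below is a textbook dynamic program, not the benchmarking code: it takes quadratic time, so it only illustrates the metric on short strings and would be impractical for full-length MHC sequences.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b, using two rows of memory."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string along the row to save memory
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion from a
                               current[j - 1] + 1,       # insertion into a
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(edit_distance("ACGT", "AGT"))  # → 1
```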

Edit distance between ground-truth haplotype sequences and the sequences estimated by different tools (PHI, VG, and PanGenie). Lower edit distance implies higher accuracy. PHI provides an advantage over existing methods on low-coverage inputs.

PHI implements two integer programs, referred to as ILP and IQP. Both solve the same problem but differ in runtime and memory usage: IQP is generally faster but requires more memory. Users can select between the two with a command-line argument (see ./PHI -h).

Performance comparison between ILP and IQP.

The scripts to reproduce the results are available here.

Future Work

  • Add support for diploid genome estimation.
  • Scale to pangenome graphs with a larger number of genomes.

Publications