Artifact corresponding to the PanGraph-DB framework for performing complex multi-pangenome analyses using a graph database system.

jpjarnoux/PanGraph-DB

PanGraph-DB: A Graph Database Framework for Complex Multi-Pangenome Analyses

About the project

PanGraph-DB is a data-centric pipeline that operates on a unified graph dataset built from multiple pangenome graphs, as computed by the PPanGGOLiN framework, and uses the Neo4j graph database to perform complex analyses. These analyses are expressed as queries in Neo4j's Cypher query language. The methodology is nonetheless system-agnostic and can be reproduced with any other graph database system.
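As a rough illustration of what running a Cypher query from Python can look like, the sketch below counts nodes per label. It is not one of the workload queries from this project. The query is deliberately schema-agnostic, so it makes no assumption about PanGraph-DB's node labels. The function name `count_nodes_by_label`, the connection URI, and the credentials are hypothetical and only serve the example.

```python
from collections import Counter

# Schema-agnostic Cypher: one row per distinct label combination.
LABEL_COUNT_QUERY = "MATCH (n) RETURN labels(n) AS labels, count(*) AS total"


def count_nodes_by_label(graph):
    """Return a Counter mapping each node label to its number of nodes.

    `graph` is any object exposing a py2neo-style `run(cypher)` method
    whose records can be indexed by column name.
    """
    counts = Counter()
    for record in graph.run(LABEL_COUNT_QUERY):
        # A node with several labels is counted once under each of them.
        for label in record["labels"]:
            counts[label] += record["total"]
    return counts


def main():
    # Imported here so the helper above can be used without py2neo.
    from py2neo import Graph

    # Assumed connection settings; adjust to your local DBMS.
    graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))
    for label, total in count_nodes_by_label(graph).most_common():
        print(f"{label}: {total}")
```

Calling `main()` against a loaded database prints one line per label, which is a quick way to check that the import produced the expected kinds of nodes.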

This repository includes a Jupyter notebook describing performance and scalability experiments run on datasets of up to 10 pangenomes, with HDF5 sizes ranging from about 200 to 1800 MB, using a workload of 10 queries (available here).

| Pangenome | # of genes | # of genomes | # of families | # of raw edges | # of RGPs | # of spots | # of modules | HDF5 size (MB) |
|---|---|---|---|---|---|---|---|---|
| Acinetobacter baumannii | 1 044 515 | 285 | 14 400 | 30 147 | 9 764 | 364 | 609 | 616 |
| Enterobacter bugandensis | 526 062 | 118 | 18 143 | 23 734 | 3 424 | 326 | 250 | 212 |
| Enterobacter cloacae | 651 827 | 137 | 22 953 | 32 270 | 6 083 | 292 | 526 | 358 |
| Enterobacter hormaechei | 739 490 | 159 | 18 166 | 29 798 | 5 744 | 280 | 742 | 415 |
| Enterobacter kobei | 705 811 | 150 | 20 836 | 29 311 | 5 740 | 181 | 535 | 386 |
| Enterobacter roggenkampii | 978 031 | 210 | 26 080 | 40 459 | 8 807 | 319 | 712 | 537 |
| Enterococcus faecium | 570 257 | 207 | 7 889 | 18 627 | 6 195 | 189 | 318 | 301 |
| Klebsiella pneumoniae | 3 100 409 | 600 | 29 139 | 61 865 | 25 014 | 529 | 1 167 | 1 800 |
| Pseudomonas aeruginosa | 1 892 646 | 313 | 23 699 | 42 084 | 10 706 | 543 | 909 | 1 200 |
| Staphylococcus aureus | 1 686 977 | 638 | 7 017 | 18 047 | 11 869 | 268 | 203 | 991 |
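To get a feel for the overall scale of the dataset, the following sketch aggregates a few columns of the table above. The figures are copied from the table; the names `PANGENOMES` and `summarize` are only illustrative and do not come from the repository.

```python
# (species, number of genes, number of genomes, HDF5 size in MB),
# copied from the table above.
PANGENOMES = [
    ("Acinetobacter baumannii", 1_044_515, 285, 616),
    ("Enterobacter bugandensis", 526_062, 118, 212),
    ("Enterobacter cloacae", 651_827, 137, 358),
    ("Enterobacter hormaechei", 739_490, 159, 415),
    ("Enterobacter kobei", 705_811, 150, 386),
    ("Enterobacter roggenkampii", 978_031, 210, 537),
    ("Enterococcus faecium", 570_257, 207, 301),
    ("Klebsiella pneumoniae", 3_100_409, 600, 1_800),
    ("Pseudomonas aeruginosa", 1_892_646, 313, 1_200),
    ("Staphylococcus aureus", 1_686_977, 638, 991),
]


def summarize(rows):
    """Aggregate genome counts and HDF5 sizes over all pangenomes."""
    genomes = [row[2] for row in rows]
    sizes = [row[3] for row in rows]
    return {
        "pangenomes": len(rows),
        "total_genomes": sum(genomes),
        "min_size_mb": min(sizes),
        "max_size_mb": max(sizes),
        "total_size_mb": sum(sizes),
    }


def genes_per_genome(rows):
    """Map each species to its average number of genes per genome."""
    return {name: genes / genomes for name, genes, genomes, _ in rows}
```

Under these figures, `summarize(PANGENOMES)` reports 10 pangenomes covering 2817 genomes, with HDF5 files between 212 and 1800 MB and 6816 MB in total.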

Authors

  • Jérôme Arnoux, Genoscope/LABGeM - CEA, CNRS, Paris Saclay University
  • Angela Bonifati, Liris CNRS, Lyon 1 University
  • Alexandra Calteau, Genoscope/LABGeM - CEA, CNRS, Paris Saclay University
  • Stefania Dumbrava, SAMOVAR/Inst. Polytechnique de Paris, ENSIIE
  • Guillaume Gautreau, MetaGenoPolis, University Paris-Saclay, INRAE, MGP

Dependencies

We list all required dependencies below.

Use pip to install:

  • dict2graph==2.0.0
  • graphio==0.4.0

Use conda to install:

  • ppanggolin==1.2.74
  • pyhmmer==0.6.3
  • py2neo==2021.2.3
  • rgi==6.0.1
  • genome_updater==0.5.1

Neo4j:

  • A local DBMS running Neo4j version 4.4.11
  • APOC 4.4.0.10 or later
  • Optional: Neo4j Desktop 1.5.0
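Once the environment is set up, a quick way to confirm that the pinned Python packages are the ones actually installed is to compare them against the package metadata. This is a minimal sketch: it covers only the four packages that are assumed to be visible to `importlib.metadata` under the distribution names used above, and the helper names `installed_version` and `find_mismatches` are hypothetical.

```python
from importlib import metadata

# Python packages pinned in the dependency lists above.
# Assumption: each is installed under exactly this distribution name.
PINNED = {
    "dict2graph": "2.0.0",
    "graphio": "0.4.0",
    "pyhmmer": "0.6.3",
    "py2neo": "2021.2.3",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Return {name: (expected, found)} for every package that differs."""
    mismatches = {}
    for name, expected in pinned.items():
        found = lookup(name)
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches
```

Running `find_mismatches(PINNED)` inside the activated environment returns an empty dictionary when all four versions match, and otherwise lists the expected and found version for each differing package.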

Dataset

The original dataset is available here.

Running the project

To begin, note that you must have an empty Neo4j DBMS (version 4.4.11) running and reachable, with the APOC plugin installed (version 4.4.0.10 or later).
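These preconditions can be checked from Python before running the notebook. The sketch below assumes a py2neo `Graph` object (or anything with a compatible `evaluate(cypher)` method) and assumes that both version strings are dot-separated integers. The helper names `parse_version` and `check_dbms` are hypothetical. If APOC is not installed, the `apoc.version()` call is expected to raise an error rather than return a value.

```python
REQUIRED_NEO4J = "4.4.11"
MIN_APOC = "4.4.0.10"


def parse_version(text):
    """Turn '4.4.0.10' into (4, 4, 0, 10) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def check_dbms(graph):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []

    # The target database must not contain any node yet.
    node_count = graph.evaluate("MATCH (n) RETURN count(n)")
    if node_count != 0:
        problems.append(f"database is not empty ({node_count} nodes)")

    # dbms.components() reports the version of the running Neo4j kernel.
    neo4j_version = graph.evaluate(
        "CALL dbms.components() YIELD versions RETURN versions[0]"
    )
    if neo4j_version != REQUIRED_NEO4J:
        problems.append(f"Neo4j is {neo4j_version}, expected {REQUIRED_NEO4J}")

    # apoc.version() is provided by the APOC plugin itself.
    apoc_version = graph.evaluate("RETURN apoc.version()")
    if parse_version(apoc_version) < parse_version(MIN_APOC):
        problems.append(f"APOC is {apoc_version}, expected at least {MIN_APOC}")

    return problems
```

A typical use would be `check_dbms(Graph("bolt://localhost:7687", auth=("neo4j", "password")))` with `Graph` imported from py2neo, where the URI and credentials are placeholders for your own settings.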

To run the PangenomeGraph.ipynb notebook, you first need to install some packages.

These are listed in the conda environment file conda-env.yml.

The in-development version of PPanGGOLiN is required for feature and pangenome compatibility.

To create the conda environment and register it as a Jupyter kernel, copy and paste the following commands into your terminal:

git clone https://github.com/labgem/PanGraph-DB.git
conda update -n base -c defaults conda -y
conda env create --file PanGraph-DB/conda-env.yml
conda init bash
conda activate pangraph
git clone -b release1.3 https://github.com/labgem/PPanGGOLiN.git
pip install PPanGGOLiN/.
pip install --user ipykernel
python -m ipykernel install --user --name=pangraph
jupyter notebook --notebook-dir=./PanGraph-DB

Next, run all the cells to obtain the corresponding results. Before running them, change the data path in the notebook so that it points to your local copy of the dataset.
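Before launching a long run, it can help to confirm that the data path really points at the pangenome files. The sketch below assumes the pangenomes are stored as HDF5 files with a `.h5` extension somewhere under the data directory. Both that layout and the helper name `find_pangenome_files` are assumptions for illustration, not something the notebook is known to require.

```python
from pathlib import Path


def find_pangenome_files(data_dir):
    """Return all .h5 files under data_dir, sorted by path.

    Raises FileNotFoundError when data_dir is missing or not a directory,
    so a wrong path fails immediately instead of midway through a run.
    """
    root = Path(data_dir).expanduser()
    if not root.is_dir():
        raise FileNotFoundError(
            f"Data path does not exist or is not a directory: {root}"
        )
    # Search recursively, since each pangenome may sit in its own folder.
    return sorted(root.rglob("*.h5"))
```

For example, `find_pangenome_files("~/data/pangenomes")` (a placeholder path) returns the list of files found, and an empty list is a hint that the path or the assumed extension is wrong.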
