Merge branch 'update_doc' of github.com:labgem/PPanGGOLiN into update…

…_doc
labgem · Nov 27, 2023 · 54f9fb4 · 54f9fb4
2 parents 148d9a0 + 8026a2b
commit 54f9fb4
Showing 12 changed files with 412 additions and 26 deletions.
diff --git a/docs/user/PangenomeAnalyses/pangenomeAnalyses.md b/docs/user/PangenomeAnalyses/pangenomeAnalyses.md
@@ -1,15 +1,34 @@
 # Pangenome analyses
 
-```{include} ./pangenomeBuild.md
+## Workflow
+```{include} ./pangenomeWorkflow.md
 ```
 
-## Pangenome outputs
+## Annotation
+
+```{include} ./pangenomeAnnotation.md
+```
+
+(clustering)=
+## Compute pangenome gene families
+```{include} ./pangenomeCluster.md
+```
+
+## Graph
+```{include} ./pangenomeGraph.md
+```
+
+## Partition
+```{include} ./pangenomePartition.md
+```
 
+(pan-output)=
+## Pangenome outputs
 ```{include} ./pangenomeStat.md
 ```
 
 ```{include} ./pangenomeFigures.md
 ```
 
-```{include} ./pangenomeGraph.md
+```{include} ./pangenomeGraphOut.md
 ```
diff --git a/docs/user/PangenomeAnalyses/pangenomeAnnotation.md b/docs/user/PangenomeAnalyses/pangenomeAnnotation.md
@@ -0,0 +1,71 @@
+(annot-fasta)=
+### Annotate fasta file
+
+As an input file, you can provide a list of .fasta files. 
+If you do so, the provided genomes will be annotated using the following tools: 
+
+- [Pyrodigal](https://pyrodigal.readthedocs.io/en/stable/index.html), which is based on Prodigal, to annotate the CDS,
+- [ARAGORN](http://www.ansikte.se/ARAGORN/) to annotate the tRNA,
+- [Infernal](http://eddylab.org/infernal/) coupled with HMMs of the bacterial and archaeal rRNAs downloaded from [RFAM](https://rfam.xfam.org/) to annotate the rRNA.
+
+To run this part of the pipeline, you must create an ORGANISM_FASTA_LIST, which is a tab-separated file organised as follows:
+
+1. The first column contains a unique organism name
+2. The second column contains the path to the associated FASTA file
+3. Circular contig identifiers are indicated in the following columns
+4. Each line represents an organism
+
+You can check [this example](https://github.com/labgem/PPanGGOLiN/blob/master/testingDataset/organisms.fasta.list)
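The tab-separated layout described above (organism name, FASTA path, then optional circular contig identifiers) can be generated with a few lines of Python. This is a minimal sketch, not part of PPanGGOLiN: the helper name `build_fasta_list`, the `*.fasta` file pattern and the choice of the file stem as organism name are assumptions made for illustration.

```python
from pathlib import Path


def build_fasta_list(fasta_dir, out_path, circular=None):
    """Write one tab-separated line per genome found in fasta_dir.

    Columns: organism name, path to the FASTA file, then any circular
    contig identifiers. `circular` maps organism name -> list of contig IDs.
    Assumption: the file stem is a unique organism name.
    """
    circular = circular or {}
    lines = []
    for fasta in sorted(Path(fasta_dir).glob("*.fasta")):
        name = fasta.stem
        fields = [name, str(fasta)] + list(circular.get(name, []))
        lines.append("\t".join(fields))
    # One organism per line, each line terminated by a newline.
    Path(out_path).write_text("".join(line + "\n" for line in lines))
    return lines
```

For instance, calling `build_fasta_list("genomes", "organisms.fasta.list")` on a hypothetical `genomes` directory would produce a file that can be passed to `ppanggolin annotate --fasta`.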
+
+To run the annotation part, you can use this minimal command:
+
+```
+ppanggolin annotate --fasta ORGANISM_FASTA_LIST
+```
+
+#### Use a different genetic code in my annotation step
+To annotate the genomes, you can easily change the translation table (or genetic code) used by Pyrodigal just by giving the corresponding number as described [here](https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi).
+
+#### Force the Prodigal procedure
+Prodigal can predict genes in [single/normal mode](https://github.com/hyattpd/prodigal/wiki/gene-prediction-modes#normal-mode), which includes a training step on your genomes, or in [meta/anonymous mode](https://github.com/hyattpd/prodigal/wiki/gene-prediction-modes#anonymous-mode), which uses pre-calculated training files. 
+As recommended in the Prodigal documentation: "Anonymous mode should be used on metagenomic data sets, or on sequences too short to provide good training data."
+By default, PPanGGOLiN will choose the mode according to the contig length.
+It is possible to force the procedure with the option `-p, --prodigal_procedure`.
+The option accepts only the **single** or **meta** keyword, corresponding to the Prodigal procedure name.
+
+#### Customize the RNA annotation
+If you don't want to predict the RNA (and so don't use Infernal and ARAGORN), you can add the option `--norna` to your command.
+Otherwise, by default, the CDS overlapping any RNA gene will be deleted, as they are usually false positive calls.
+You can prevent this filtering by using the `--allow_overlap` option.
+
+Moreover, when you are working with archaeal genomes, you can use the option `--kingdom archaea` to tell Infernal which model to use to annotate RNA. 
+
+### Use annotated file as pangenome base
+
+You can also provide your annotation files.
+They can be .gff3, .gbk or .gbff files, or a mix of them, and should be provided through a list in a tab-separated file like [this example](https://github.com/labgem/PPanGGOLiN/blob/master/testingDataset/organisms.gbff.list).
+
+```{note}
+Using your own annotations is especially recommended if you already have functional annotations of your genomes,
+ as they will be added to the pangenome.
+```
+
+You can provide them using the following command : 
+
+```
+ppanggolin annotate --anno ORGANISM_ANNOTATION_LIST
+```
+
+#### How to deal with annotation files without sequences
+
+If your annotation files do not have the genome sequence in them, 
+you can use both options at the same time (to have both the gene annotations and the gene sequences) as such : 
+
+```
+ppanggolin annotate --anno ORGANISM_ANNOTATION_LIST --fasta ORGANISM_FASTA_LIST
+```
+
+#### Take the pseudogenes into account for pangenome analyses
+
+By default, PPanGGOLiN will not take pseudogenes into account. However, they can be interesting in some contexts.
+It is therefore possible to add the pseudogenes to the pangenome with the option `--use_pseudo`.
diff --git a/docs/user/PangenomeAnalyses/pangenomeBuild.md b/docs/user/PangenomeAnalyses/pangenomeBuild.md
diff --git a/docs/user/PangenomeAnalyses/pangenomeCluster.md b/docs/user/PangenomeAnalyses/pangenomeCluster.md
@@ -0,0 +1,60 @@
+### Cluster genes into gene families
+
+Once we have annotated genomes, we need to compare their genes to know which ones are similar, and to build gene families from this information. 
+
+If you provided .fasta files or annotation files with gene sequences in them, clustering can be run directly by providing the .h5 file that was generated, as such : 
+
+```
+ppanggolin cluster -p pangenome.h5
+```
+
+PPanGGOLiN will call [MMseqs2](https://github.com/soedinglab/MMseqs2) to cluster all the protein sequences; by default, the clustering step searches for connected components. 
+You can tune its parameters using `--identity` (default 0.8) and `--coverage` (default 0.8). 
+You can use other clustering algorithms of MMseqs2 by using `--mode` (default 1). 
+Both protein sequences have to be covered by at least the proportion indicated by `--coverage`.
+
+#### How to customize MMSeqs2 clustering
+```{attention}
+Not all MMseqs2 options are available in PPanGGOLiN. For a complete view of the MMseqs2 options, take a look at their documentation. You can also provide your custom clustering as described in the [next part](#read-clustering).
+```
+
+[//]: # (TODO complete this part)
+
+(read-clustering)=
+### Providing your gene families
+
+If you do not want to use MMseqs2 and prefer to provide your own clusters (or gene families), you can do so only if you provided the annotations in the first step. 
+In the case of gff3 files, the 'ID' field in the 9th column is expected as a gene id. 
+In the case of gbff or gbk files, the 'locus_tag' is used as a gene id, except with files coming from MaGe or from SEED, where the id provided in the 'db_xref' field is used.
+
+You will need to provide a .tsv file. 
+The first column indicates the cluster id, and the second column indicates a unique gene id that is used in the annotation files. 
+There is a single gene id per line.
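Before passing such a file to `--clusters`, it can be worth checking that it matches the two-column layout described above. The following is a minimal sketch using only the Python standard library; the function name `read_clusters` and the toy identifiers are hypothetical, and the uniqueness check reflects the statement that each gene id is unique rather than any check PPanGGOLiN itself is known to perform.

```python
import csv
import io
from collections import defaultdict


def read_clusters(handle):
    """Group gene ids by cluster id from a two-column tab-separated file."""
    families = defaultdict(list)
    seen = set()
    for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not row:
            continue  # skip empty lines
        if len(row) < 2:
            raise ValueError(f"line {line_no}: expected cluster id and gene id")
        cluster_id, gene_id = row[0], row[1]
        if gene_id in seen:
            raise ValueError(f"line {line_no}: gene id {gene_id!r} appears twice")
        seen.add(gene_id)
        families[cluster_id].append(gene_id)
    return dict(families)


# Toy input: two families, three genes (identifiers are made up).
example = "fam1\tgeneA\nfam1\tgeneB\nfam2\tgeneC\n"
print(read_clusters(io.StringIO(example)))
```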
+
+You can do that through the command line : 
+
+`ppanggolin cluster -p pangenome.h5 --clusters MY_CLUSTERS_FILE`
+
+An example of what MY_CLUSTERS_FILE should look like is provided [here](https://github.com/labgem/PPanGGOLiN/blob/master/testingDataset/clusters.tsv)
+
+
+### Defragmentation
+
+We noticed that most of the cloud genes in the pangenome are fragments of 'shell' or 'persistent' genes, and so not informative on the pangenome's diversity. 
+We added another workflow to reduce the number of gene families and reduce the computational load by trying to associate fragments to their original gene families.
+It adds a step to the clustering described previously. 
+It will compare the representative protein sequences of all the gene families using the same identity threshold as the first step. 
+It will also use the same coverage threshold, but only the shorter of the two protein sequences has to be covered by at least the value indicated by `--coverage`.
+
+After that, we build a similarity graph where the edges are the hits given by the comparison, and the nodes are the original gene families. 
+Then we iterate on all nodes and compare them to their neighbors. 
+If the neighbor of a node is more numerous (has more members in the cluster it represents) and its representative sequence is longer, that node (and all the genes associated) is associated with the neighbor. 
+The genes associated with this node are defined as 'fragments' of the gene family represented by the longer and more numerous neighboring node.
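The rule described above can be sketched in a few lines of Python. This is an illustration of the stated criterion only, not PPanGGOLiN's implementation: the function name `defragment`, the input dictionaries, and the way one neighbor is picked when several qualify (largest, then longest, then by name) are assumptions, and chains of reassignments are not resolved here.

```python
def defragment(sizes, lengths, edges):
    """Map each 'fragment' family to the neighboring family that absorbs it.

    sizes:   family -> number of genes in the family
    lengths: family -> length of its representative sequence
    edges:   iterable of (family_a, family_b) similarity hits
    """
    neighbors = {family: set() for family in sizes}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    assigned = {}
    for family in sizes:
        # A neighbor qualifies if it has more members AND a longer representative.
        candidates = [
            other
            for other in neighbors[family]
            if sizes[other] > sizes[family] and lengths[other] > lengths[family]
        ]
        if candidates:
            # Assumed tie-breaking rule: largest, then longest, then by name.
            assigned[family] = max(
                candidates, key=lambda o: (sizes[o], lengths[o], o)
            )
    return assigned


# Toy graph: B is smaller and shorter than A, so B is treated as a fragment of A.
sizes = {"A": 10, "B": 2, "C": 3}
lengths = {"A": 300, "B": 100, "C": 400}
print(defragment(sizes, lengths, [("A", "B"), ("A", "C")]))  # {'B': 'A'}
```

In this toy example, C is not reassigned: A has more members, but its representative sequence is shorter than C's, so only one of the two conditions holds.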
+
+To avoid using it, you can run the following:
+
+```
+ppanggolin cluster -p pangenome.h5 --no_defrag
+```
+
+In any case and whichever pipeline you use, in the end, the gene families will be saved in the 'pangenome.h5' given as input.
diff --git a/docs/user/PangenomeAnalyses/pangenomeFigures.md b/docs/user/PangenomeAnalyses/pangenomeFigures.md
@@ -1,9 +1,50 @@
 ### Pangenome figures output
 
 #### U-shape plot
+A U-shaped plot is a figure presenting the number of families (y axis) per number of organisms (x axis).
+It is a .html file that can be opened with any browser and with which you can interact, zoom, move around, mouseover to see numbers in more detail, and you can save what you are seeing as a .png image file.
+
+It can be generated using the 'draw' subcommand as such : 
+
+`ppanggolin draw -p pangenome.h5 --ucurve`
 
 #### tile plot
 
+A tile plot is a heatmap representing the gene families (y axis) in the organisms (x axis) making up your pangenome. The tiles on the graph will be colored if the gene family is present in an organism and uncolored if absent. The gene families are ordered by partition, and the genomes are ordered by a hierarchical clustering based on their shared gene families (basically, two genomes that are close together in terms of gene family composition will be close together on the figure).
+
+This plot is quite helpful to observe potential structures in your pangenome, and can also help you to identify possible outliers. You can interact with it: mousing over a tile in the plot will show you the gene identifier(s), the gene family and the organism that correspond to the tile.
+
+If you build your pangenome using the 'workflow' subcommand and you have more than 500 organisms, only the 'shell' and the 'persistent' partitions will be drawn, leaving out the 'cloud' as the figure tends to be too heavy for a browser to open it otherwise.
+
+It can be generated using the 'draw' subcommand as such : 
+
+`ppanggolin draw -p pangenome.h5 --tile_plot`
+
+and if you do not want the 'cloud' gene families as it is a lot of data and can be hard to open with a browser sometimes, you can use the following option : 
+
+`ppanggolin draw -p pangenome.h5 --tile_plot --nocloud`
+
 #### Rarefaction curve
+This figure is not drawn by default in the 'workflow' subcommand as it requires a lot of computations. It represents the evolution of the number of gene families for each partition as you add more genomes to the pangenome. It has been used a lot in the literature as an indicator of the diversity that you are missing with your dataset on your taxonomic group. The idea is that if at some point when you keep adding genomes to your pangenome you do not add any more gene families, you might have access to your entire taxonomic group's diversity. On the contrary if you are still adding a lot of genes you may be still missing a lot of gene families. 
+
+There are 8 partitions represented. For each of the partitions, there are multiple representations of the observed data. You can find the observed means, medians, 1st and 3rd quartiles of the number of gene families per number of genomes used. You can also find the fit of the data by Heaps' law, which is usually used to represent this evolution of the diversity in terms of gene families in each of the partitions.
+
+It can be generated using the 'rarefaction' subcommand, which is dedicated to drawing this graph, as such : 
+
+`ppanggolin rarefaction -p pangenome.h5`
+
+A lot of options can be used with this subcommand to tune your rarefaction curves, most of them are the same as with the `partition` workflow.
+The following 3 are related to the rarefaction alone:
+
+- `--depth` defines the number of samples drawn for each number of organisms (default 30)
+- `--min` defines the minimal number of organisms in a sample (default 1)
+- `--max` defines the maximal number of organisms in a sample (default 100)
+
+So for example the following command:
+`ppanggolin rarefaction -p pangenome.h5 --min 5 --max 50 --depth 30`
+
+will draw a rarefaction curve with sample sizes between 5 and 50 (between 5 and 50 genomes will be used), and with 30 samples at each point (so 30 samples of 5 genomes, 30 samples of 6 genomes ... up to 50 genomes).
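To get a feel for the amount of computation this implies, the number of samples can be worked out directly. The sketch below assumes, as the example suggests, that every integer sample size between `--min` and `--max` is used; the function name `rarefaction_samples` is hypothetical.

```python
def rarefaction_samples(min_size, max_size, depth):
    """Return (number of distinct sample sizes, total number of samples)."""
    sizes = range(min_size, max_size + 1)  # every size from min to max, inclusive
    return len(sizes), len(sizes) * depth


# --min 5 --max 50 --depth 30
print(rarefaction_samples(5, 50, 30))  # (46, 1380)
```

Under that assumption, the command above draws 46 sample sizes with 30 samples each, that is 1380 samples to partition, which illustrates why this figure is not drawn by default.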
+
+#### ProkSee
 
-#### ProkSee
+[//]: # (TODO after merge with split command)
diff --git a/docs/user/PangenomeAnalyses/pangenomeGraph.md b/docs/user/PangenomeAnalyses/pangenomeGraph.md
@@ -1,6 +1,16 @@
-### Pangenome graph output
+To partition a pangenome graph, you first need to build that pangenome graph.
+This can be done through the `graph` subcommand.
+This will take a pangenome .h5 file as input and compute edges to link gene families together based on the genomic neighborhood.
+The graph is constructed using the following subcommand :
 
-#### Gephi
+```
+ppanggolin graph -p pangenome.h5
+```
 
+This subcommand has only a single other option, which is `-r` or `--remove_high_copy_number`.
+If used, it will remove the gene families that are too duplicated in your genomes.
+This is useful if you want to visualize your pangenome afterward and want to remove the biggest hubs to have a clearer view.
+It can also be used to limit the influence of highly duplicated genes, such as transposases or ABC transporters, in the partition step.
 
-#### JSON
+
+The resulting pangenome graph is saved in the pangenome.h5 file given as input.
diff --git a/docs/user/PangenomeAnalyses/pangenomeGraphOut.md b/docs/user/PangenomeAnalyses/pangenomeGraphOut.md
@@ -0,0 +1,34 @@
+### Pangenome graph output
+
+The graph can be written to the .gexf and the _light.gexf files. The _light.gexf file will contain the gene families as nodes and the edges between gene families describing their relationship. The .gexf file will contain the same thing, but will also include more information about each gene and each relation between gene families. 
+We have made two different files representing the same graph because, while the non-light file is exhaustive, it can be very heavy to manipulate and most of the information in it is not of interest to everyone. The _light.gexf file should be the one you use to manipulate the pangenome graph most of the time.
+
+They can be manipulated and visualised with [Gephi](https://gephi.org/), which we have tested extensively, or potentially with any other software or library that can read gexf files, such as [networkx](https://networkx.github.io/documentation/stable/index.html) or [gexf-js](https://github.com/raphv/gexf-js), among others. 
+
+Using Gephi, the layout can be tuned as illustrated below:
+
+![Gephi layout](../../_static/gephi.gif)
+
+We advise using the Gephi "Force Atlas 2" algorithm to compute the graph layout with "Stronger Gravity: on" and "scaling: 4000", but don't hesitate to tinker with the layout parameters.
+
+In the _light.gexf file: 
+The nodes will contain the number of genes belonging to the gene family, the most common gene name (if you provided annotations), the most common product name (if you provided annotations), the partitions it belongs to, its average and median size in nucleotides, and the number of organisms that have this gene family.
+
+The edges contain the number of times they are present in the pangenome.
+
+In addition to this, the non-light .gexf file will contain all the information about the genes belonging to each gene family (their names, their product strings and their sizes) and all the information about the neighborhood relationships of each pair of genes described through the edges.
+
+The light gexf can be generated using the 'write' subcommand as such : 
+
+`ppanggolin write -p pangenome.h5 --light_gexf`
+
+while the gexf file can be generated as such : 
+
+`ppanggolin write -p pangenome.h5 --gexf`
+
+#### JSON
+The json's file content corresponds to the .gexf file content, but in json rather than gexf file format. It follows the 'node-link' format as shown in [this example](https://observablehq.com/@d3/force-directed-graph) in javascript, or as used in the [networkx](https://networkx.github.io/documentation/stable/reference/readwrite/json_graph.html) python library and it should be usable with both [D3js](https://d3js.org/) and [networkx](https://networkx.github.io/documentation/stable/index.html), or any other software or library that supports this format.
+
+The json can be generated using the 'write' subcommand as such : 
+
+`ppanggolin write -p pangenome.h5 --json`
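As a rough illustration of what the 'node-link' layout allows, the sketch below counts the number of edges attached to each node using only the Python standard library. The key names (`nodes`, `links`, `id`, `source`, `target`) follow the usual node-link convention used by D3 and networkx and are an assumption here; the exact attributes written by PPanGGOLiN are not shown in this document, and the toy graph is made up.

```python
import json
from collections import Counter


def node_degrees(node_link_text):
    """Count how many links touch each node of a node-link JSON graph."""
    graph = json.loads(node_link_text)
    degrees = Counter({node["id"]: 0 for node in graph["nodes"]})
    for link in graph["links"]:
        degrees[link["source"]] += 1
        degrees[link["target"]] += 1
    return dict(degrees)


# Toy node-link document with three gene families and two edges.
toy = json.dumps(
    {
        "nodes": [{"id": "famA"}, {"id": "famB"}, {"id": "famC"}],
        "links": [
            {"source": "famA", "target": "famB"},
            {"source": "famB", "target": "famC"},
        ],
    }
)
print(node_degrees(toy))  # {'famA': 1, 'famB': 2, 'famC': 1}
```

With a real file, the text would be read from the JSON written by the `write` subcommand, after checking that its keys match the ones assumed above.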
diff --git a/docs/user/PangenomeAnalyses/pangenomePartition.md b/docs/user/PangenomeAnalyses/pangenomePartition.md
@@ -0,0 +1,35 @@
+
+This is the step that will assign gene families to the 'persistent', 'shell', or 'cloud' partitions. 
+
+
+The 'persistent' partition will group genes that are present throughout the entire species. 
+They will be essential genes, genes required for important metabolic pathways and genes that define the metabolic and biosynthetic capabilities of the taxonomic group.
+
+The 'shell' partition groups genes that are present in only some individuals. 
+Those are often genes that were acquired through horizontal gene transfers and encode for functions involved in environmental adaptations, pathogenicity, virulence or encoding secondary metabolites for example.
+
+The 'cloud' partition groups genes that are very rare in the pangenome and found in one, or very few, individuals. 
+Most of these genes were found to be phage-related. 
+They probably all were acquired through horizontal gene transfers. 
+Antibiotic resistance genes were often found to be belonging to the cloud genome, as well as plasmid genes.
+
+Partitioning can be run through the following subcommand: 
+
+`ppanggolin partition -p pangenome.h5`
+
+It also has quite a few options. 
+Most of them are not self-explanatory. 
+If you want to know what they do, you should read the PPanGGOLiN paper (you can read it [here](https://journals.plos.org/ploscompbiol/article?rev=2&id=10.1371/journal.pcbi.1007732)) where the statistical methods used are thoroughly described.
+
+The one parameter that might be of importance is the '-K', or '--nb_of_partitions' parameter. 
+This will define the number of classes used to partition the pangenome. 
+This may be of use if you expect to have well-defined subpopulations in your pangenome, and you know exactly how many. 
+If not, that number is detected automatically through an ICL criterion. 
+The idea is that the most present partition will be 'persistent', the least present will be 'cloud', and all the others will be 'shell'. 
+The number of partitions corresponding to the shell will be the number of expected subpopulations in your pangenome. 
+(So if you expect 5 subpopulations, you could use -K 7). 
+
+
+In most cases, you should let the statistical criterion used by PPanGGOLiN find the optimal number of partitions for you.
+
+All the results will be added to the given 'pangenome.h5' input file.