Merge branch 'split_write_command' of https://github.com/labgem/PPanGGOLiN into split_write_command
labgem · Nov 13, 2023 · af41159
2 parents 983043c + 04cfa65
commit af41159
Showing 16 changed files with 202 additions and 32 deletions.
diff --git a/docs/_static/proksee_exemple_A_baumannii_AYE.png b/docs/_static/proksee_exemple_A_baumannii_AYE.png
diff --git a/docs/_static/proksee_metadata_example.png b/docs/_static/proksee_metadata_example.png
diff --git a/docs/user/Flat/RGP.md b/docs/user/Flat/RGP.md
@@ -2,7 +2,7 @@
 This file is a tsv file that lists all of the detected Regions of Genome Plasticity. This requires to have run the RGP detection analysis by either using the `panrgp` command or the `rgp` command.
 
 It can be written with the following command:
-`ppanggolin write -p pangenome.h5 --regions`
+`ppanggolin write_pangenome -p pangenome.h5 --regions`
 
 The file has the following format :
 
@@ -21,7 +21,7 @@ The file has the following format :
 This is a tsv file with two column. It links the spots of 'summarize_spots' with the RGPs of 'plastic_regions'.
 
 It is written with the following command:
-`ppanggolin write -p pangenome.h5 --spots`
+`ppanggolin write_pangenome -p pangenome.h5 --spots`
 
 |column|description|
 |------|------------|
@@ -33,7 +33,7 @@ It is written with the following command:
 This is a tsv file that will associate each spot with multiple metrics that can indicate the dynamic of the spot.
 
 It is written with the following command:
-`ppanggolin write -p pangenome.h5 --spots`
+`ppanggolin write_pangenome -p pangenome.h5 --spots`
 
 |column| description|
 |-------|------------|
@@ -49,7 +49,7 @@ It is written with the following command:
 #### Borders
 
 Each spot has at least one set of gene families bordering them. To write the list of gene families bordering a spot, you need to use the following option:
-`ppanggolin write -p pangenome.h5 --borders`
+`ppanggolin write_pangenome -p pangenome.h5 --borders`
 
 It will write a .tsv file with 4 columns:
 

diff --git a/docs/user/Flat/dupplication.md b/docs/user/Flat/dupplication.md
@@ -3,6 +3,6 @@ This file lists the gene families, their duplication ratio, their mean presence
 
 It can be generated using the 'write' subcommand as such : 
 
-`ppanggolin write -p pangenome.h5 --stats`
+`ppanggolin write_pangenome -p pangenome.h5 --stats`
 
 This command will also generate the 'organisms_statistics.tsv' file.
diff --git a/docs/user/Flat/fam2gen.md b/docs/user/Flat/fam2gen.md
@@ -4,4 +4,4 @@ It is basically a three-column file listing the gene family name in the first co
 
 You can obtain it as such :  
 
-`ppanggolin write -p pangenome.h5 --families_tsv`
+`ppanggolin write_pangenome -p pangenome.h5 --families_tsv`
diff --git a/docs/user/Flat/genomes_fasta.md b/docs/user/Flat/genomes_fasta.md
@@ -0,0 +1,11 @@
+<!-- ### Adding Fasta Sequences into GFF and proksee JSON map Files -->
+
+PPanGGOLiN can incorporate fasta sequences into GFF files and Proksee JSON map files. With the sequences included, Proksee tools that rely on DNA sequences become usable, for example to build GC% and GC skew profiles or to run blast searches.
+
+
+Since PPanGGOLiN does not retain genomic sequences, it is necessary to provide the original genomic files used to construct the pangenome through either the `--anno` or `--fasta` argument. These arguments mirror those used in workflow commands (`workflow`, `all`, `panrgp`, `panmodule`) and the `annotate` command.
+
+- `--anno`: This option requires a tab-separated file containing organism names and the corresponding GFF/GBFF filepaths of their annotations. If `--anno` is utilized, GFF files should include fasta sequences.
+
+- `--fasta`: Use this option with a tab-separated file that lists organism names alongside the filepaths of their genomic sequences in fasta format.
+
diff --git a/docs/user/Flat/genomes_metadata.md b/docs/user/Flat/genomes_metadata.md
@@ -0,0 +1,56 @@
+<!-- ### Incorporating Metadata into Tables, GFF, and Proksee Files -->
+
+You can inject metadata, previously added with the `metadata` command, into genome outputs using the `--add_metadata` parameter. When users add metadata, they specify the source of this metadata. These metadata sources can be selectively included using the `--metadata_sources` parameter. By default, all sources are added when the `--add_metadata` flag is specified.
+
+#### Metadata in GFF Files
+
+Metadata is integrated into the attributes column of the GFF file. The patterns for adding metadata are as follows:
+
+- In CDS lines, metadata associated with genes follows this pattern: `gene_<source>_<key>=<value>`. Gene family metadata follows a similar pattern: `family_<source>_<key>=<value>`.
+- In the contig lines of type `region` describing the contig, genome metadata is added with the pattern: `genome_<source>_<key>=<value>`, and contig metadata is added with: `contig_<source>_<key>=<value>`.
+- In RGP lines, metadata is added using the pattern: `rgp_<source>_<key>=<value>`.
+
+For example, suppose the following metadata is associated with the gene family DYB08_RS16060 under the source `pfam`:
+
+```tsv
+families	accession	type	description
+DYB08_RS16060	PF18894	domain	This entry represents a probable metallopeptidase domain found in a variety of phage and bacterial proteomes.
+```
+
+This metadata file can be added to the pangenome with the metadata command:
+
+```bash
+ppanggolin metadata -p pangenome.h5 --source pfam --metadata family_pfam_annotation.tsv --assign families
+```
+
+When writing GFF output with the `--add_metadata` flag:
+
+```bash
+ppanggolin write_genomes -p pangenome.h5 --proksee -o proksee_out --gff --add_metadata
+```
+
+A gene belonging to this family would have the following attribute in its GFF line: `family_pfam_accession=PF18894;family_pfam_description=This entry represents a probable metallopeptidase domain found in a variety of phage and bacterial proteomes.;family_pfam_type=domain`.
+
+```gff
+NC_010404.1	external	CDS	77317	77958	.	-	0	ID=ABAYE_RS00475;Parent=gene-ABAYE_RS00475;product=putative metallopeptidase;family=DYB08_RS16060;partition=persistent;rgp=NC_010404.1_RGP_0;family_pfam_accession=PF18894;family_pfam_description=This entry represents a probable metallopeptidase domain found in a variety of phage and bacterial proteomes.;family_pfam_type=domain
+```
+
+#### Metadata in Proksee Visualization
+
+Metadata can be seamlessly incorporated into Proksee JSON MAP files. These metadata details become accessible by simply hovering the mouse over the features.
+
+For instance, with the metadata previously added to the DYB08_RS16060 gene family, the Proksee visualization would resemble the example below:
+
+```{image} ../_static/proksee_metadata_example.png
+:align: center
+```
+
+
+#### Metadata in Table output
+
+Metadata is seamlessly incorporated into table output with the addition of extra columns. These columns follow the GFF attribute naming: 
+
+- gene metadata: `gene_<source>_<key>`
+- family metadata: `family_<source>_<key>`
+
+<!-- exemple -->
diff --git a/docs/user/Flat/gff.md b/docs/user/Flat/gff.md
@@ -0,0 +1,48 @@
+
+The `--gff` argument generates GFF files, each containing pangenome annotations for individual genomes within the pangenome. The GFF file format is a widely recognized standard in bioinformatics and can seamlessly integrate into downstream analysis tools.
+
+To generate GFF files from a pangenome HDF5 file, you can use the following command:
+
+```bash
+ppanggolin write_genomes -p pangenome.h5 --gff -o output
+```
+
+This command will create a gff directory within the output directory, with one GFF file per genome. 
+
+Pangenome annotations within the GFF are recorded in the attribute column of the file.
+
+These annotations are attached to two feature types: CDS features and RGP features.
+
+CDS features have the following attributes:
+
+- **family:** ID of the gene family to which the gene belongs.
+- **partition:** The partition of the gene family, categorized as persistent, shell, or cloud.
+- **module:** If the gene family belongs to a module, the module ID is specified with the key 'module'.
+- **rgp:** If the gene is part of a Region of Genomic Plasticity (RGP), the RGP name is specified with the key 'rgp'.
+
+Regions of Genomic Plasticity (RGPs) are written as features of type 'region'.
+
+RGPs have the following attributes:
+
+- The attribute 'spot' designates the spot ID where the RGP is inserted. When the RGP has no spot, the term 'No_spot' is used.
+- The 'Note' attribute specifies that this feature is an RGP.
+
+
+Here is an example showcasing the initial lines of the GFF file for the Acinetobacter baumannii AYE genome:
+
+```gff
+##gff-version 3
+##sequence-region NC_010401.1 1 5644
+##sequence-region NC_010402.1 1 9661
+##sequence-region NC_010403.1 1 2726
+##sequence-region NC_010404.1 1 94413
+##sequence-region NC_010410.1 1 3936291
+NC_010401.1	.	region	1	5644	.	+	.	ID=NC_010401.1;Is_circular=true
+NC_010401.1	ppanggolin	region	629	5591	.	.	.	Name=NC_010401.1_RGP_0;spot=No_spot;Note=Region of Genomic Plasticity (RGP)
+NC_010401.1	external	gene	629	1579	.	+	.	ID=gene-ABAYE_RS00005
+NC_010401.1	external	CDS	629	1579	.	+	0	ID=ABAYE_RS00005;Parent=gene-ABAYE_RS00005;product=replication initiation protein;family=ABAYE_RS00005;partition=cloud;rgp=NC_010401.1_RGP_0
+NC_010401.1	external	gene	1576	1863	.	+	.	ID=gene-ABAYE_RS00010
+NC_010401.1	external	CDS	1576	1863	.	+	0	ID=ABAYE_RS00010;Parent=gene-ABAYE_RS00010;product=hypothetical protein;family=ABAYE_RS00010;partition=cloud;rgp=NC_010401.1_RGP_0
+NC_010401.1	external	gene	2054	2572	.	-	.	ID=gene-ABAYE_RS00015
+NC_010401.1	external	CDS	2054	2572	.	-	0	ID=ABAYE_RS00015;Parent=gene-ABAYE_RS00015;product=tetratricopeptide repeat protein;family=HTZ92_RS18670;partition=shell;rgp=NC_010401.1_RGP_0
+```
diff --git a/docs/user/Flat/metrics.md b/docs/user/Flat/metrics.md
@@ -23,9 +23,11 @@ It could be necessary to get more information about the modules.
 Here we provide information about families, and we separate modules in 
 function of the partition. You can get this supplementary information 
 as such :
-```
+```bash
 ppanggolin metrics -p pangenome.h5 --info_modules
-...
+```
+
+```
 Modules : 3
 Families in Modules : 22  (min : 5, max : 9, sd : 2.08, mean : 7.33)
 	Sheel specific : 36.36  (sd : 4.62, mean : 2.67)

diff --git a/docs/user/Flat/module.md b/docs/user/Flat/module.md
@@ -1,7 +1,7 @@
 #### Functional modules
 This .tsv file lists the modules and the gene families that belong to them. It lists one family per line, and there are multiple line for each module.
 It is written along with other files with the following command:
-`ppanggolin write -p pangenome.h5 --modules`
+`ppanggolin write_pangenome -p pangenome.h5 --modules`
 
 It follows the following format:
 |column|description|
@@ -12,7 +12,7 @@ It follows the following format:
 #### Modules in organisms
 This .tsv file lists for each organism the modules that are present and how complete they are. Since there are some variability that are allowed in the module predictions, occasionnally some modules can be incomplete in some of the organisms where they are found.
 This file is written along with other files with the following command:
-`ppanggolin write -p pangenome.h5 --modules`
+`ppanggolin write_pangenome -p pangenome.h5 --modules`
 
 And it follows the following format:
 |column|description|
@@ -24,7 +24,7 @@ And it follows the following format:
 #### modules summary
 This .tsv file lists a few characteristics for each detected module. There is one line for each module.
 The file is written along with other files with the following command:
-`ppanggolin write -p pangenome.h5 --modules`
+`ppanggolin write_pangenome -p pangenome.h5 --modules`
 
 And it follows the following format:
 |column|description|
@@ -39,7 +39,7 @@ And it follows the following format:
 This command is available only if both modules and spots have been computed for your pangenome (see the command `all`, or the commands `spot` and `module` for that).
 It indicates which modules are present in which spot and in which RGP.
 The files are written with the following command:
-```ppanggolin write -p pangenome.h5 --spot_modules```
+```ppanggolin write_pangenome -p pangenome.h5 --spot_modules```
 The format of the 'modules_spots.tsv' file is the following:
 
 |column|description|
@@ -56,4 +56,4 @@ The file 'modules_RGP_lists.tsv' lists RGPs that have the same modules. Those RG
 |mod_list| a list of the modules that are in the indicated RGPs|
 |RGP_list| a list of RGP that include exactly the modules listed previously|
 
-This information can also be visualized through figures that can be drawn with `ppanggolin draw --spots` (see [Spot plots](https://github.com/labgem/PPanGGOLiN/wiki/Outputs#spot-plots), and which can display modules.
+This information can also be visualized through figures that can be drawn with `ppanggolin draw --spots` (see [Spot plots](https://github.com/labgem/PPanGGOLiN/wiki/Outputs#spot-plots)), which can also display modules.
diff --git a/docs/user/Flat/orgStat.md b/docs/user/Flat/orgStat.md
@@ -24,6 +24,6 @@ This file is made of 15 columns described in the following table
 
 It can be generated using the 'write' subcommand as such : 
 
-`ppanggolin write -p pangenome.h5 --stats`
+`ppanggolin write_pangenome -p pangenome.h5 --stats`
 
 This command will also generate the 'mean_persistent_duplication.tsv' file.
diff --git a/docs/user/Flat/partition.md b/docs/user/Flat/partition.md
@@ -2,4 +2,4 @@ Those files will be stored in the 'partitions' directory and will be named after
 
 You can generate those files as such :  
 
-` ppanggolin write -p pangenome.h5 --partitions`
+` ppanggolin write_pangenome -p pangenome.h5 --partitions`
diff --git a/docs/user/Flat/presAbs.md b/docs/user/Flat/presAbs.md
@@ -5,12 +5,12 @@ This file is basically a presence absence matrix. The columns are the genomes us
 
 It can be generated using the 'write' subcommand as such : 
 
-`ppanggolin write -p pangenome.h5 --Rtab`
+`ppanggolin write_pangenome -p pangenome.h5 --Rtab`
 
 ### matrix
 
 This file is a .csv file following a format alike the gene_presence_absence.csv file generated by [roary](https://sanger-pathogens.github.io/Roary/), and works with [scoary](https://github.com/AdmiralenOla/Scoary) if you want to do pangenome-wide association studies.
 
 It can be generated using the 'write' subcommand as such : 
 
-`ppanggolin write -p pangenome.h5 --csv`
+`ppanggolin write_pangenome -p pangenome.h5 --csv`
diff --git a/docs/user/Flat/proksee.md b/docs/user/Flat/proksee.md
@@ -0,0 +1,31 @@
+The `--proksee` argument generates JSON map files containing pangenome annotations, which can be visualized using Proksee at [https://proksee.ca/](https://proksee.ca/).
+
+To generate JSON map files, you can use the following command:
+
+```bash
+ppanggolin write_genomes -p pangenome.h5 --proksee -o output
+```
+
+This command will create a proksee directory within the output directory, with one JSON file per genome. 
+
+
+To load a JSON map file on Proksee, follow these steps:
+1. Navigate to the "Map JSON" tab.
+2. Upload your file using the browse button.
+3. Click the "Create Map" button to generate the visualization.
+
+A genome visualized by Proksee with PPanGGOLiN annotation appears as depicted below:
+
+
+```{image} ../_static/proksee_exemple_A_baumannii_AYE.png
+:align: center
+```
+
+*Image: Genome visualized by Proksee with PPanGGOLiN annotation.*
+
+
+The visualization consists of three tracks:
+- **Genes:** Color-coded by their gene family partition.
+- **RGP (Region of Genomic Plasticity):** The spot associated with each RGP is specified in the annotation of the object.
+- **Module:** Displays modules within the genome. The completion of the module is specified in the annotation of the object.
+
diff --git a/docs/user/Flat/projection.md b/docs/user/Flat/tables.md
@@ -1,4 +1,4 @@
-This option writes in a 'projection' directory. There will be a file written in the .tsv file format for every single genome in the pangenome.
+This option writes in a 'tables' directory. There will be a file written in the .tsv file format for every single genome in the pangenome.
 The columns of this file are described in the following table : 
 
 | Column               | Description                                                                                                                    |
@@ -18,4 +18,4 @@ The columns of this file are described in the following table :
 
 Those files can be generated as such : 
 
-`ppanggolin write -p pangenome.h5 --projection`
+`ppanggolin write_genomes -p pangenome.h5 --tables`
diff --git a/docs/user/Outputs.md b/docs/user/Outputs.md
@@ -5,9 +5,9 @@ PPanGGOLiN provides multiple outputs to describe a pangenome. In this section th
 
 In most cases it will provide with a HDF-5 file named "pangenome.h5". This file stores all the information about your pangenome and the analysis that were run. If given to ppanggolin through most of the subcommands, it will read information from it. This is practical as you can regenerate figures or output files, or rerun parts of the analysis without redoing everything.
 
-In this section, each parts will describe a possible output of PPanGGOLiN, and will be commented with the command line that generates it using the HDF5 file, which is assumed to be called 'pangenome.h5'.
+In this section, each part will describe a possible output of PPanGGOLiN, and will be commented with the command line that generates it using the HDF5 file, which is assumed to be called 'pangenome.h5'.
 
-When using the same subcommand (like 'write' or 'draw' that can help you generate multiple file each), you can provide multiple options to write all of the file formats that you desire at once.
+When using the same subcommand (like 'write_pangenome' or 'draw', each of which can generate multiple files), you can provide multiple options to write all of the file formats that you desire at once.
 
 ## PPanGGOLiN figures outputs
 
@@ -23,11 +23,14 @@ When using the same subcommand (like 'write' or 'draw' that can help you generat
 ```{include} Figures/spots.md
 ```
 
-## Rarefaction
+### Rarefaction
 ```{include} Figures/rarefaction.md
 ```
 
-## Write
+## Write flat outputs describing the pangenome
+
+The `write_pangenome` command writes 'flat' files that describe the pangenome and its elements.
+
 ### Organisms statistics
 ```{include} Flat/orgStat.md
 ```
@@ -39,7 +42,6 @@ The pangenome's graph can be given through multiple data formats, in order to ma
 ```{include} graphOut/GEXF.md
 ```
 
-
 #### json
 ```{include} graphOut/JSON.md
 ```
@@ -51,14 +53,6 @@ The pangenome's graph can be given through multiple data formats, in order to ma
 ```{include} Flat/dupplication.md
 ```
 
-### partitions
-```{include} Flat/partition.md
-```
-
-### projection
-```{include} Flat/projection.md
-```
-
 ### Gene families and genes
 ```{include} Flat/fam2gen.md
 ```
@@ -71,6 +65,34 @@ The pangenome's graph can be given through multiple data formats, in order to ma
 ```{include} Flat/module.md
 ```
 
+### Partitions
+```{include} Flat/partition.md
+```
+
+## Write genomes with pangenome annotations
+
+The `write_genomes` command writes 'flat' files that represent the genomes along with their associated pangenome elements.
+
+
+
+### Table with pangenome annotations
+```{include} Flat/tables.md
+```
+### GFF file
+```{include} Flat/gff.md
+```
+### JSON Map for Proksee visualisation
+```{include} Flat/proksee.md
+```
+### Adding Fasta Sequences into GFF and proksee JSON map Files
+
+```{include} Flat/genomes_fasta.md
+```
+
+### Incorporating Metadata into Tables, GFF, and Proksee Files
+```{include} Flat/genomes_metadata.md
+```
+
 ## Fasta
 ```{include} sequence/fasta.md
 ```
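
Note on the GFF output documented in `docs/user/Flat/gff.md` above: the pangenome annotations live in the ninth (attributes) column of CDS lines, as `key=value` pairs separated by `;`. The sketch below shows one way a downstream script might read them back. It is not part of PPanGGOLiN; the function names `parse_gff_attributes` and `pangenome_annotation` are hypothetical, and it assumes attribute values contain no `;` and need no percent-decoding.

```python
from typing import Dict, Optional

# Attribute keys documented for CDS features in gff.md.
PANGENOME_KEYS = ("family", "partition", "module", "rgp")


def parse_gff_attributes(attribute_column: str) -> Dict[str, str]:
    """Split a GFF3 attributes column ('key=value;key=value') into a dict.

    Assumption: values contain no ';' and are not percent-encoded.
    """
    attributes: Dict[str, str] = {}
    for pair in attribute_column.strip().split(";"):
        if not pair:
            continue  # tolerate a trailing ';'
        key, _, value = pair.partition("=")
        attributes[key] = value
    return attributes


def pangenome_annotation(gff_line: str) -> Optional[Dict[str, str]]:
    """Return the pangenome attributes of a CDS line, or None for other lines."""
    fields = gff_line.rstrip("\n").split("\t")
    if len(fields) != 9 or fields[2] != "CDS":
        return None  # comments, directives, gene and region features
    attributes = parse_gff_attributes(fields[8])
    return {key: attributes[key] for key in PANGENOME_KEYS if key in attributes}


# CDS line taken from the A. baumannii AYE example in gff.md.
EXAMPLE_CDS = (
    "NC_010401.1\texternal\tCDS\t629\t1579\t.\t+\t0\t"
    "ID=ABAYE_RS00005;Parent=gene-ABAYE_RS00005;"
    "product=replication initiation protein;"
    "family=ABAYE_RS00005;partition=cloud;rgp=NC_010401.1_RGP_0"
)

if __name__ == "__main__":
    print(pangenome_annotation(EXAMPLE_CDS))
```

Keys that are absent from a line (here `module`) are simply omitted from the result, which matches the documentation's statement that `module` and `rgp` appear only when they apply.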