Commit

edit module doc ready to review
JeanMainguy committed Dec 5, 2023
1 parent a99a470 commit e6cee8e
Showing 2 changed files with 105 additions and 68 deletions.
139 changes: 87 additions & 52 deletions docs/user/Modules/moduleOutputs.md
@@ -1,85 +1,120 @@
## Module outputs

### Functional modules
This `.tsv` file lists the modules and the gene families that belong to them. It lists one family per line, so there are multiple lines for each module.
It is written along with other files with the following command:
`ppanggolin write_pangenome -p pangenome.h5 --modules`

### Descriptive Tables for Predicted Modules

To describe predicted modules, various files can be generated, each delineating distinct characteristics of these modules.

To generate these tables, use the `write_pangenome` command with the `--modules` flag:

```bash
ppanggolin write_pangenome -p pangenome.h5 --modules -o my_output_dir
```

This command generates three tables, described below: `functional_modules.tsv`, `modules_in_genomes.tsv`, and `modules_summary.tsv`.


#### 1. Gene Family to Module Mapping Table

The `functional_modules.tsv` file lists modules with their corresponding gene families. Each line establishes a mapping between a gene family and its respective module.

It follows the following format:
|column|description|

|Column|Description|
|------|------------|
|module_id| The module identifier|
|family_id| the family identifier|
|module_id| Identifier for the module|
|family_id| Identifier for the family|
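
As an illustration of how this mapping can be consumed, here is a minimal Python sketch that groups family identifiers by module. It assumes the file starts with a header row naming the `module_id` and `family_id` columns; the sample rows and the function name `families_by_module` are hypothetical and only stand in for a real `functional_modules.tsv`.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt standing in for functional_modules.tsv.
# Assumption: the real file has a header row with these column names.
SAMPLE_FUNCTIONAL_MODULES = (
    "module_id\tfamily_id\n"
    "module_0\tfamA\n"
    "module_0\tfamB\n"
    "module_1\tfamC\n"
)


def families_by_module(handle):
    """Group family identifiers by module identifier, keeping file order."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        groups[row["module_id"]].append(row["family_id"])
    return dict(groups)


module_families = families_by_module(io.StringIO(SAMPLE_FUNCTIONAL_MODULES))
for module_id, families in module_families.items():
    print(module_id, len(families), ",".join(families))
```

With a real output directory, the `io.StringIO(...)` argument would be replaced by a file opened with `open(path, newline="")`.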

### Modules in organisms
This .tsv file lists, for each organism, the modules that are present and how complete they are. Since some variability is allowed in the module predictions, occasionally some modules can be incomplete in some of the organisms where they are found.
This file is written along with other files with the following command:
`ppanggolin write_pangenome -p pangenome.h5 --modules`

And it follows the following format:
|column|description|
|------|------------|
|module_id| The module identifier|
|organism| the organism which has the indicated module|
|completion| a value between 0.0 and 1.0 which indicates how complete (in terms of gene family) the module is in the given organism|
#### 2. Genome-wise Module Composition

The `modules_in_genomes.tsv` file provides a comprehensive overview of the modules present in each genome, detailing their completeness levels. Due to potential variability in module predictions, some modules might exhibit partial completeness in specific genomes where they are detected.

The structure of the `modules_in_genomes.tsv` file is outlined as follows:

| Column | Description |
|--------------|-----------------------------------------------|
| module_id | Identifier for the module |
| genome | Genome in which the indicated module is found |
| completion | Indicates the level of completeness (0.0 to 1.0) of the module in the specified genome based on gene family representation |
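
The `completion` column lends itself to simple filtering. The sketch below, in Python, lists the modules that are only partially present in a genome. It assumes a header row with the three column names above; the sample rows, the function name `incomplete_modules`, and the default threshold are hypothetical choices made for illustration.

```python
import csv
import io

# Hypothetical excerpt standing in for modules_in_genomes.tsv.
# Assumption: the real file has a header row with these column names.
SAMPLE_MODULES_IN_GENOMES = (
    "module_id\tgenome\tcompletion\n"
    "module_0\tgenomeA\t1.0\n"
    "module_0\tgenomeB\t0.75\n"
    "module_1\tgenomeA\t1.0\n"
)


def incomplete_modules(handle, threshold=1.0):
    """Return (module_id, genome, completion) for rows with completion below threshold."""
    partial = []
    for row in csv.DictReader(handle, delimiter="\t"):
        completion = float(row["completion"])
        if completion < threshold:
            partial.append((row["module_id"], row["genome"], completion))
    return partial


partial_hits = incomplete_modules(io.StringIO(SAMPLE_MODULES_IN_GENOMES))
for module_id, genome, completion in partial_hits:
    print(f"{module_id} is {completion:.0%} complete in {genome}")
```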

### modules summary
This .tsv file lists a few characteristics for each detected module. There is one line for each module.
The file is written along with other files with the following command:
`ppanggolin write_pangenome -p pangenome.h5 --modules`

#### 3. Modules Summary

The `modules_summary.tsv` file lists a few characteristics for each detected module. There is one line for each module.

And it follows the following format:

|column|description|
|------|------------|
|module_id| The module identifier|
|nb_families| The number of families which are included in the module. The families themselves are listed in the 'functional_modules.tsv' file.|
|nb_organisms|The number of organisms in which the module is found. Those organisms are listed in the 'modules_in_organisms.tsv' file.|
|nb_genomes|The number of genomes in which the module is found. Those genomes are listed in the 'modules_in_genomes.tsv' file.|
|partition| The average partition of the families in the module.|
|mean_number_of_occurrence| the mean number of times a module is present in each organism. The expected value is around one, but it can be more if it is a module often repeated in the genomes (like a phage).|
|mean_number_of_occurrence| the mean number of times a module is present in each genome. The expected value is around one, but it can be more if it is a module often repeated in the genomes (like a phage).|
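
Since `mean_number_of_occurrence` is expected to be around one, values clearly above one can flag modules that are repeated within genomes. The Python sketch below shows one way to pick them out. It assumes a header row with the column names listed above; the sample rows, the partition labels, the function name `repeated_modules`, and the 1.5 cutoff are all hypothetical.

```python
import csv
import io

# Hypothetical excerpt standing in for modules_summary.tsv.
# Assumption: the real file has a header row with these column names.
SAMPLE_MODULES_SUMMARY = (
    "module_id\tnb_families\tnb_genomes\tpartition\tmean_number_of_occurrence\n"
    "module_0\t5\t12\tshell\t1.0\n"
    "module_1\t8\t7\tcloud\t2.5\n"
)


def repeated_modules(handle, min_occurrence=1.5):
    """Return ids of modules whose mean occurrence per genome reaches min_occurrence."""
    return [
        row["module_id"]
        for row in csv.DictReader(handle, delimiter="\t")
        if float(row["mean_number_of_occurrence"]) >= min_occurrence
    ]


repeated = repeated_modules(io.StringIO(SAMPLE_MODULES_SUMMARY))
print(repeated)
```

The cutoff is arbitrary; a suitable value depends on the dataset.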

### spot modules

This command is available only if both modules and spots have been computed for your pangenome (see the command `all`, or the commands `spot` and `module` for that).
It indicates which modules are present in which spot and in which RGP.
The files are written with the following command:
### Mapping Modules with Spots and Regions of Genomic Plasticity (RGPs)

Predicted modules can be associated with Spots of insertion and Regions of Genomic Plasticity (RGPs) using the `write_pangenome` command with the `--spot_modules` flag as follows:

`ppanggolin write_pangenome -p pangenome.h5 --spot_modules`
```bash
ppanggolin write_pangenome -p pangenome.h5 --spot_modules -o my_output_dir
```

This command generates two tables: `modules_spots.tsv` and `modules_RGP_lists.tsv`, described below.

The format of the 'modules_spots.tsv' file is the following:
```{note}
These outputs are available only if modules, spots, and RGPs have been computed in your pangenome (see the command [`all`](../QuickUsage/quickWorkflow.md#ppanggolin-complete-workflow-analyses) or the commands [`spot`](../RGP/rgpPrediction.md#spot-prediction), [`rgp`](../RGP/rgpPrediction.md#rgp-detection), and [`module`](./modulePrediction.md#conserved-module-prediction) for that).
```

|column|description|
|------|------------|
|module_id| The module identifier|
|spot_id| the spot identifier|
Moreover, this information can be visualized through figures using the command `ppanggolin draw --spots` (refer to [Spot plots](../RGP/rgpOutputs.md#draw-spots), which can display modules).

The file `modules_RGP_lists.tsv` lists RGPs that have the same modules. Those RGPs can have different gene families, however they will not have any other module than those that are indicated. The format of the 'modules_RGP_lists.tsv' is the following:
#### 1. Associating Modules and Spots

|column|description|
|------|------------|
|representative_RGP| an RGP deemed representative for the group, and serving as a 'group of rgp id'(randomly picked)|
|nb_spots| The number of spots in which we see the RGPs which have the modules listed afterwards|
|mod_list| a list of the modules that are in the indicated RGPs|
|RGP_list| a list of RGP that include exactly the modules listed previously|
The `modules_spots.tsv` file indicates which modules are present in each spot.

Its format is as follows:

| Column | Description |
|------------|--------------------|
| module_id | Module identifier |
| spot_id | Spot identifier |
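
To see at a glance which modules occur in each spot, the two columns can be regrouped by spot. This is a minimal Python sketch assuming a header row with the `module_id` and `spot_id` column names; the sample identifiers and the function name `modules_by_spot` are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt standing in for modules_spots.tsv.
# Assumption: the real file has a header row with these column names.
SAMPLE_MODULES_SPOTS = (
    "module_id\tspot_id\n"
    "module_0\tspot_1\n"
    "module_1\tspot_1\n"
    "module_0\tspot_2\n"
)


def modules_by_spot(handle):
    """Map each spot identifier to the sorted list of modules found in it."""
    spots = defaultdict(set)
    for row in csv.DictReader(handle, delimiter="\t"):
        spots[row["spot_id"]].add(row["module_id"])
    return {spot: sorted(mods) for spot, mods in spots.items()}


spot_modules = modules_by_spot(io.StringIO(SAMPLE_MODULES_SPOTS))
for spot_id, mods in spot_modules.items():
    print(spot_id, mods)
```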

#### 2. Associating Modules and RGPs

The `modules_RGP_lists.tsv` file lists RGPs that contain the same modules. These RGPs may have different gene families, but they will not include any other modules apart from those indicated. The format of `modules_RGP_lists.tsv` is as follows:

This information can also be visualized through figures that can be drawn with `ppanggolin draw --spots` (see [Spot plots](../RGP/rgpOutputs.md#draw-spots), and which can display modules).
| Column | Description |
|--------------------|---------------------------------------------------------------------------------------------------|
| representative_RGP | An RGP picked at random to represent the group; it serves as the identifier of the group of RGPs |
| nb_spots | The number of spots where the RGPs containing the listed modules are observed |
| mod_list | A list of the modules present in the indicated RGPs |
| RGP_list | A list of the RGPs that include exactly the modules listed in `mod_list` |


### Module information
<!-- TODO: Need to be reformulate I think.. -->
It may be necessary to get more information about the modules.
Here we provide information about families, and we separate modules
according to their partition. You can get this supplementary information
as follows:

### Module Information

To gather additional insights into the modules, including information about families and their distribution across different partitions, you can use the following command:

```bash
ppanggolin metrics -p pangenome.h5 --info_modules
```

The command output provides the following details:

```yaml
- Modules: 3
- Families in Modules: 15
- Percent of Families:
- persistent: 0.0
- shell 53.33
- cloud 46.67
- Number of Families per Modules:
- min: 3
- max: 8
- sd: 2.65
- mean: 5
```

```
Modules : 3
Families in Modules : 22 (min : 5, max : 9, sd : 2.08, mean : 7.33)
Sheel specific : 36.36 (sd : 4.62, mean : 2.67)
Cloud specific : 63.64 (sd : 4.51, mean : 4.67)
```

34 changes: 18 additions & 16 deletions docs/user/Modules/modulePrediction.md
@@ -1,35 +1,37 @@
## panModule
# Conserved module prediction

Again, it works like 'workflow' but you can detect the conserved modules in your pangenome, you can use the **panModule** workflow, as such:
PPanGGOLiN is able to predict and work with conserved modules. Modules are groups of genes that are part of the variable genome, and often found together across the genomes of the pangenome. As such, they are conserved modules and potential functional modules.

```bash
ppanggolin panmodule --fasta ORGANISMS_FASTA_LIST
```
Further details can be found in the [panModule preprint](https://doi.org/10.1101/2021.12.06.471380)

The module prediction is launched after the pangenome partitioning with the default parameters.
If you want to tune the module detection, you can use the `module` command after the `workflow`.
## The panModule workflow

The panModule workflow generates a pangenome with predicted conserved modules from a specified set of genomes. This command extends the functionality of the `workflow` command by detecting conserved modules. Additionally, it generates descriptive TSV files detailing the predicted modules, whose formats are detailed [here](./moduleOutputs.md).

Further details can be found in the [panModule publication](https://doi.org/10.1101/2021.12.06.471380) as well as in the section.
To execute the panModule workflow, use the following command:

## Predict conserved module
```bash
ppanggolin panmodule --fasta GENOME_LIST_FILE
```
Replace `GENOME_LIST_FILE` with a tab-separated file listing each genome name and the path to the FASTA file of its genomic sequences, as described [here](../PangenomeAnalyses/pangenomeAnnotation.md#annotate-fasta-file). Alternatively, provide a list of GFF/GBFF files as input with the `--anno` parameter, as in the `workflow` and `annotate` commands.

The panmodule workflow predicts modules with default parameters. To fine-tune the detection, you can run the `module` command on a partitioned pangenome (obtained through the `workflow` command, for example) or use a configuration file, detailed further [here](../practicalInformation.md#configuration-file).


It is possible to predict and work with conserved modules using PPanGGOLiN. Modules are groups of genes that are part of the variable genome, and often found together in the different genomes. As such, they are conserved modules and potential functional modules.
## Predict conserved module

Once partitions have been computed, you can predict conserved modules. All the options of the `module` subcommand are for tuning the parameters for the analysis.
Details about each parameter and what they do is available in the related [preprint](https://www.biorxiv.org/content/10.1101/2021.12.06.471380v1).
The `module` command predicts conserved modules on a partitioned pangenome. The command has several options for tuning the prediction. Details about each parameter are available in the related [preprint](https://www.biorxiv.org/content/10.1101/2021.12.06.471380v1).

The command can be used simply as such:

`ppanggolin module -p pangenome.h5`

This will predict modules and store the results in the HDF5 file. If you wish to have descriptive tsv files, whose format is detailed [here](./moduleOutputs.md), you can use:
This will predict modules and store the results in the HDF5 pangenome file. If you wish to have descriptive tsv files, whose format is detailed [here](./moduleOutputs.md), you can use the `write_pangenome` command with the flag `--modules`:

`ppanggolin write -p pangenome.h5 --modules --output MYOUTPUTDIR`.
`ppanggolin write_pangenome -p pangenome.h5 --modules --output MYOUTPUTDIR`.

If your pangenome has spots of insertion that were predicted using the `spot` command (or the `panrgp` or `all` commands), you can also list the associations between the predicted spots and the predicted modules as such:
If spots of insertion have been predicted in your pangenome using the `spot` command (or as part of the `panrgp` or `all` workflow commands), you can also list the associations between the predicted spots and the predicted modules as such:

`ppanggolin write -p pangenome.h5 --spot_modules --output MYOUTPUTDIR`
`ppanggolin write_pangenome -p pangenome.h5 --spot_modules --output MYOUTPUTDIR`

The format of each file is given [here](./moduleOutputs.md).
