PPanGGOLiN 2.1.0
New Features
- Write the translated sequence of genes using MMSeqs2 with the --proteins option (documentation), which works like the other options in the ppanggolin fasta command (added in PR #205).
- Some information about contigs and genomes, such as organism name, strain, and dbx_ref information, is now extracted from annotation files (GBFF & GFF) and added to the pangenome as metadata (added in PR #227).
- The command write_metadata has been added to allow exporting metadata to TSV files. Check out the documentation for more details (added in PR #227).
- Add infer_singleton option in the workflow (added in PR #239).
- When clustering is given, it’s now possible to specify the representative gene of the cluster (added in PR #242).
Major Change
- Genes with joined coordinates (for example, frameshifts) in input annotation files (GFF or GBFF) are now handled. Such annotations were previously disregarded when encountered in GBFF files and improperly managed in GFF files. This affects how gene sequences are written and, consequently, clustering and all downstream pangenome results: graph, partition, RGP, spots, and modules. The change was measured and reported in PR #206. Its effect on pangenomes is not large, but it should be kept in mind when comparing results across versions. See also PR #240 and #249.
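To illustrate what joined coordinates mean for sequence extraction, here is a minimal sketch in Python. It is not PPanGGOLiN's implementation: the function name `extract_joined_sequence`, its arguments, and the toy contig are hypothetical, and it assumes 1-based inclusive coordinates listed in genomic order.

```python
def extract_joined_sequence(contig_seq, parts, strand="+"):
    """Concatenate the segments of a gene described by joined coordinates.

    parts: list of (start, stop) tuples, 1-based inclusive, in genomic order
    (an assumption made for this sketch).
    """
    # Slice each segment out of the contig and glue the pieces together.
    seq = "".join(contig_seq[start - 1:stop] for start, stop in parts)
    if strand == "-":
        # Reverse-complement the joined sequence for minus-strand genes.
        complement = str.maketrans("ACGTacgt", "TGCAtgca")
        seq = seq.translate(complement)[::-1]
    return seq


# Toy contig; the gene skips position 9, as a frameshift annotation might.
contig = "AAATGCCCGGGTTTTAA"
gene_seq = extract_joined_sequence(contig, [(3, 8), (10, 15)])
```

A gene written from a single start and stop would include the skipped base, which is why the written sequences, and therefore the clustering, can differ from earlier versions.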
Minor Changes
- Order genes in the whole genome MSA file (added in PR #200).
- Replace the return in the try block with an else statement to return the value found in try (added in PR #204).
- When writing MSA, partial genes are handled by removing the last one or two nucleotides before translation (added in PR #205).
- Change how method get_genes handles end position (added in PR #212).
- Improve GitHub CI workflow (added in PR #216, #220, #224, #225).
- PPanGGOLiN now supports using the soft-link option when building the MMSeqs2 database via subprocess, reducing temporary directory size (added in PR #214 and #229).
- Report subprocess (MMSeqs2, MAFFT, etc.) error message if it crashes (reported in issue #210, added in PR #229).
- When parsing annotation files, CDS are translated using the translation table code specified by the transl_table tag. If this tag is missing, the translation_table argument is now used, with a default value of 11 (reported in #226 and added in PR #230).
- Added an identifier to metadata in object and HDF5. This helps to identify the right metadata in a cross-reference (added in PR #235).
- Make the subprocess more detailed with info and error messages (added in PR #237).
- Add the protein sequence to the gene family when reading clustering (added in PR #238).
- Add gene information in RGP output (added in PR #239).
- Improve metadata management in commands projection and rgp_cluster (added in PR #244).
- Some developments for the PANORAMA project 🤫 (added in PR #248).
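Two of the changes above, trimming partial genes before translation and falling back to the translation_table argument when the transl_table tag is missing, can be sketched as follows. This is an illustrative sketch, not the code from the PRs: the helper names `trim_to_codons` and `resolve_translation_table` and the `qualifiers` dictionary are hypothetical. Only the tag name, the argument name, and the default of 11 come from the notes.

```python
DEFAULT_TRANSLATION_TABLE = 11  # default value stated in the release notes


def trim_to_codons(nucleotides):
    """Drop the trailing one or two nucleotides so the length is a multiple of 3."""
    remainder = len(nucleotides) % 3
    return nucleotides[:len(nucleotides) - remainder]


def resolve_translation_table(qualifiers, translation_table=DEFAULT_TRANSLATION_TABLE):
    """Prefer the feature's transl_table tag; otherwise use the argument.

    qualifiers: hypothetical dict of the tags parsed from one CDS feature.
    """
    value = qualifiers.get("transl_table")
    return int(value) if value is not None else translation_table


trimmed = trim_to_codons("ATGAAAT")                         # partial gene, 7 nt
table_from_tag = resolve_translation_table({"transl_table": "4"})
table_fallback = resolve_translation_table({})              # tag missing
```

Trimming from the end keeps the reading frame anchored at the gene start, so the codons that remain translate the same way as in a complete gene.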
Bug Fixes
- Fix the last genome missing in the whole genome MSA file (fixed in PR #200).
- Write only genes associated with the RGP when writing FASTA sequences for RGP (reported in issue #122, fixed in PR #202).
- Ensure proper handling of circular RGPs, addressing issues observed in the spot plot (reported in issue #124, fixed in PR #206).
- Fix gene ID mismatch in projection command with GBFF files as input genome (reported in issue #207, fixed in PR #208).
- Fix spot prediction in projection command (fixed in PR #209).
- Fix multiple spots per RGP handling in projection command (fixed in PR #211).
- Handle trailing whitespace at the end of GBFF file (reported in issue #203, fixed in PR #213).
- Correctly read "is_circular" from GFF files (fixed in PR #215).
- Fix RGP "looping" around circular contigs (fixed in PR #215).
- Write the gene name instead of the coordinates in RGP output files (reported in issue #218, fixed in PR #219).
- Write only the genes of the input genome in gene_to_gene_family.tsv file from projection (reported in issue #221, fixed in PR #228).
- Fix dup_margin default value (reported in issue #223 and fixed in PR #234).
- Fix missing translation_table handling (reported in issue #226 and fixed in PR #230).
- Fix spots to modules output file always empty (fixed in PR #236).
- Handle chevrons in GFF start and stop positions (fixed in PR #241).
- Ignore weird tRNA from Aragorn (fixed in PR #245).
- Fix display module on Proksee with gene overlapping contig (fixed in PR #246).
- Fix metadata-related issues (fixed in PR #247).
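As an illustration of the chevron fix, the sketch below shows one way a start or stop field carrying a leading `<` or `>` could be parsed. It assumes the chevron marks a partial boundary, as in GenBank-style locations. The notes do not describe how PR #241 handles it, and the function name `parse_coordinate` is hypothetical.

```python
def parse_coordinate(field):
    """Return (position, is_partial) for a start or stop field such as '<1' or '>1500'."""
    # A leading chevron is assumed to flag a boundary that extends past the contig.
    is_partial = field.startswith(("<", ">"))
    position = int(field.lstrip("<>"))
    return position, is_partial


start, start_is_partial = parse_coordinate("<1")
stop, stop_is_partial = parse_coordinate("1500")
```

Calling `int()` directly on a field such as `<1` raises a ValueError, which is the kind of failure a fix like this would need to avoid.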
New Contributor
We thank @ktmeaton, who made their first contribution in #200. 🎉