13. Overview of lsaBGC AutoAnalyze's Final Results
There are 7 major results in the Final_Results/ subdirectory of the output directory generated by lsaBGC-AutoAnalyze.py.
Note: example results from running lsaBGC-Easy.py with antiSMASH, GECCO, and DeepBGC for Cutibacterium avidum can be found on Google Drive.
Here are overviews of the major result files:
This is an Excel spreadsheet with 7 separate sheets:
- Data Dictionary - A link to the data dictionary on lsaBGC-PopGene.py results to interpret results in the sheets "Overview - Simple" and "Overview - Full".
- Samples used in AutoAnalyze - A listing of the samples used in lsaBGC-AutoAnalyze with clading/population information.
- Overview - Simple - A simplified version of the lsaBGC-PopGene.py results. Rows represent individual homolog groups found in a GCF presented in a consensus order for easy viewing of the full secondary metabolome for the taxa of interest.
- Multi-GCF HGs - Homolog groups which occur in multiple GCFs for follow-up using CORASON.
- GCF to Sample Listings - A listing showing all samples which have each GCF and the number of BGC fragments found to belong to the GCF per sample.
- GCFs and HGs to MIBiG Mapping - GCF/Homolog group mappings to MIBiG BGCs and proteins.
- Overview - Full - The full version of the lsaBGC-PopGene.py results with more columns!
IMPORTANT NOTE: Please pay attention to the column titled single-copy_in_GCF_context in the Overview sheets, which indicates whether evolutionary statistics can be properly interpreted for a homolog group.
The "Overview" reports are automatically sorted by the consensus order of homolog groups, allowing for easy, intuitive scrolling. For full information on each column, check out the data dictionary table on the lsaBGC-PopGene.py wiki.
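As a minimal sketch of how the single-copy_in_GCF_context column might be used, the Python snippet below keeps only rows flagged as single-copy. It assumes the rows of an "Overview" sheet have already been loaded as dictionaries keyed by column name (for instance after exporting the sheet to a tab-separated file) and that the flag is stored as text such as "True". Both are assumptions, so check the actual values in your own spreadsheet. The hg_id column in the example rows is hypothetical.

```python
# Values treated as "single-copy". The actual encoding used in the sheet is
# an assumption here; adjust this set after inspecting your own results.
TRUTHY = {"true", "1", "yes"}


def single_copy_rows(rows, flag_column="single-copy_in_GCF_context"):
    """Keep only rows whose single-copy flag looks true."""
    kept = []
    for row in rows:
        value = str(row.get(flag_column, "")).strip().lower()
        if value in TRUTHY:
            kept.append(row)
    return kept


# Hypothetical example rows; the "hg_id" column name is made up for illustration.
example_rows = [
    {"hg_id": "HG_1", "single-copy_in_GCF_context": "True"},
    {"hg_id": "HG_2", "single-copy_in_GCF_context": "False"},
]
print([row["hg_id"] for row in single_copy_rows(example_rows)])
```

Evolutionary statistics for the rows that are filtered out are not necessarily wrong, but they should be read with more caution.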
For more advanced evolutionary statistics, I recommend primarily consulting the Tajima's D and Median Beta-RD statistics. The dN/dS and Fst statistics are experimental implementations and currently not very informative.
Median Beta-RD is the median, across all pairs of genomes, of a gene's divergence relative to the divergence expected for that pair of genomes. The expectation is based on genome-wide values such as ANI/AAI or the divergence of single-copy core genes (e.g. ribosomal proteins).
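The idea can be sketched in a few lines of Python. This is an illustration only, not lsaBGC's implementation: it reads "gene divergence to expected divergence" as a simple ratio, which is an assumption, and the function name, input layout, and numbers are all hypothetical.

```python
from statistics import median


def median_beta_rd(pairwise):
    """Median of per-pair ratios of gene divergence to expected divergence.

    `pairwise` maps a (genome_a, genome_b) tuple to a
    (gene_divergence, expected_divergence) tuple. Treating Beta-RD as a
    plain ratio is an assumption made for this sketch.
    """
    ratios = []
    for gene_div, expected_div in pairwise.values():
        if expected_div <= 0:
            # Skip pairs with no expected divergence to avoid dividing by zero.
            continue
        ratios.append(gene_div / expected_div)
    return median(ratios) if ratios else None


# Made-up divergences for three genome pairs.
example_pairs = {
    ("genome_A", "genome_B"): (0.02, 0.04),
    ("genome_A", "genome_C"): (0.05, 0.05),
    ("genome_B", "genome_C"): (0.06, 0.03),
}
print(median_beta_rd(example_pairs))
```

Under this reading, a value near 1 means the gene diverges about as much as the genome-wide expectation for a typical pair of genomes.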
Tajima's D can be most intuitively thought of as the ratio of high-frequency to low-frequency minor allele positions along the multiple sequence alignment of a gene. Because microbial genomic datasets can be biased and over-represent certain lineages, it is important to consider de-replicating the input set of genomes prior to lsaBGC-PopGene.py/lsaBGC-AutoAnalyze.py analysis, so that the statistics are computed properly and are broadly representative of the taxa.
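To make the intuition concrete, here is a small Python sketch of the textbook Tajima's D formula (Tajima 1989) applied to an aligned set of nucleotide sequences. It is not taken from lsaBGC, whose implementation may handle gaps, ambiguous bases, or filtering differently. Skipping any alignment column that contains a non-ACGT character is a simplification assumed here, and the example alignments are made up.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(sequences):
    """Textbook Tajima's D for equal-length aligned nucleotide sequences.

    Returns None when the statistic is undefined (fewer than 4 sequences
    or no segregating sites).
    """
    seqs = [s.upper() for s in sequences]
    n = len(seqs)
    if n < 4:
        return None
    # Simplification for this sketch: drop columns with gaps or ambiguity codes.
    columns = [col for col in zip(*seqs) if set(col) <= set("ACGT")]
    variable = [col for col in columns if len(set(col)) > 1]
    s_sites = len(variable)  # number of segregating sites
    if s_sites == 0:
        return None
    # pi: mean number of differences over all pairs of sequences.
    n_pairs = n * (n - 1) / 2
    total_diffs = sum(
        sum(1 for a, b in combinations(col, 2) if a != b) for col in variable
    )
    pi = total_diffs / n_pairs
    # Constants from Tajima (1989).
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    variance = e1 * s_sites + e2 * s_sites * (s_sites - 1)
    if variance <= 0:
        return None
    return (pi - s_sites / a1) / sqrt(variance)


# Made-up alignments: one where every variant is a singleton (rare alleles),
# one where every variant is shared by half of the sequences.
rare_variants = ["CAAAAAAA", "ACAAAAAA", "AACAAAAA", "AAACAAAA"]
intermediate_variants = ["AAAAAAAA", "AAAAAAAA", "CCCCAAAA", "CCCCAAAA"]
print(tajimas_d(rare_variants), tajimas_d(intermediate_variants))
```

The first alignment gives a negative D (excess of low-frequency variants) and the second a positive D (excess of intermediate-frequency variants). Duplicating near-identical genomes in the input shifts these allele frequencies, which is why de-replication matters.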
Screenshot of "Overview - Simple" sheet with automatic conditional formatting/coloring of select columns:
Consensus_Sequence_Similarity_of_Homolog_Groups.pdf shows the distribution of GCFs and individual homolog groups across a species phylogeny. Coloring represents the similarity of the homolog group sequence to the consensus sequence across all samples (blue = similar to consensus, white = differing from consensus). Inspired by a figure from Drott et al. 2021. Only GCFs found in >10% of all samples in the analysis are shown. Homolog groups are shown in the consensus order from left to right.
Phylogeny_with_Labels.pdf provides complementary information, such as leaf names and population/clading IDs, that would make the other plot too cramped!
A multi-page PDF showing the conservation of homolog groups and a consensus order schematic for each GCF. Tajima's D and Beta-RD are also shown on the GCF schematic!
PDF of boxplots showing the full Beta-RD distribution for each GCF found in >10% of samples. Beta-RD is the divergence of GCF proteins relative to the expected divergence (based either on the average amino acid identity or on single-copy core gene alignments from GToTree) for each pair of genomes, so the distribution is made up of all pairwise comparisons. The three separate rows correspond to different Jaccard index thresholds with regard to the homolog groups shared between pairs of GCFs for consideration.
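The Jaccard index filter can be illustrated with a short Python sketch. The function names, the sample names, and the threshold of 0.75 are hypothetical; the thresholds actually used for the three rows are not stated here. The sketch also assumes each GCF instance is represented as the set of homolog group IDs found in one sample.

```python
from itertools import combinations


def jaccard_index(hgs_a, hgs_b):
    """Shared homolog groups divided by all homolog groups seen in either set."""
    a, b = set(hgs_a), set(hgs_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairs_passing(gcf_instances, threshold):
    """Return sample pairs whose GCF instances meet the Jaccard threshold.

    `gcf_instances` maps a sample name to the homolog group IDs of its
    instance of the GCF (an assumed layout for this sketch).
    """
    passing = []
    for sample_1, sample_2 in combinations(sorted(gcf_instances), 2):
        score = jaccard_index(gcf_instances[sample_1], gcf_instances[sample_2])
        if score >= threshold:
            passing.append((sample_1, sample_2))
    return passing


# Made-up GCF instances for three samples.
example_instances = {
    "sample_1": {"HG_1", "HG_2", "HG_3"},
    "sample_2": {"HG_2", "HG_3", "HG_4"},
    "sample_3": {"HG_1", "HG_2", "HG_3"},
}
print(pairs_passing(example_instances, threshold=0.75))
```

A stricter threshold restricts the Beta-RD distribution to pairs whose GCF instances have very similar gene content, while a looser one also admits pairs that share only part of the cluster.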
A FASTA file with representative sequences of homolog groups found in multiple GCFs of particular interest for further exploration using CORASON/EvoMining!
A directory with results from lsaBGC-See.py, which shows the distribution of GCF homolog groups across the species phylogeny. A key factor here is that lsaBGC-See.py is able to show fragmented pieces of the same GCF side by side along the phylogeny.
A directory with results from lsaBGC-ComprehenSeeIve.py. Unlike lsaBGC-See.py, these plots show the presence of homolog groups related to a GCF across all samples in the analysis. This was largely developed for lsaBGC-(Euk)-Easy workflows, where OrthoFinder is run on all genomes upfront, so this information is readily available. It can be used to assess whether GCF instances are likely missing (due to skipping lsaBGC-AutoExpansion, the default) or potentially falsely predicted (due to running lsaBGC-AutoExpansion). For lsaBGC-Easy workflows, the phylogeny corresponds to the species phylogeny computed by GToTree. The leaf tips are colored according to whether an instance of the GCF was detected and, if so, whether it was detected by direct prediction in the genomes by antiSMASH, GECCO, or DeepBGC, or by lsaBGC-AutoExpansion.