06. Refinement of BGCs Belonging to GCF

Rauf Salamzade edited this page Aug 1, 2022 · 4 revisions

lsaBGC-Refiner.py

Boundary prediction of BGCs by antiSMASH is often approximate and can merge multiple discrete BGCs into a single region when they lie near each other on a genome. This issue was highlighted in a recent study by the Kovacs lab that explored BGC diversity in the genus Bacillus.

We therefore developed lsaBGC-Refiner.py, which lets users automatically curate the BGCs belonging to a GCF by trimming each one to retain only the features found between two user-specified homolog groups.
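The trimming idea can be sketched in a few lines of Python. This is a conceptual illustration only, not the code lsaBGC-Refiner.py actually runs: the `trim_between` function, the gene tuples, and the homolog group identifiers are all hypothetical, and the choices to keep the boundary genes themselves and to return nothing when a boundary is missing are assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical representation: one (locus_tag, homolog_group_id) pair per gene,
# listed in the order the genes appear in the BGC.
Gene = Tuple[str, str]


def trim_between(genes: List[Gene], b1: str, b2: str) -> List[Gene]:
    """Keep only the genes lying between the two boundary homolog groups.

    Assumptions made for this sketch: the boundary genes are kept, the first
    occurrence of each boundary group is used, and a BGC missing either
    boundary yields an empty list.
    """
    idx1 = next((i for i, (_, hg) in enumerate(genes) if hg == b1), None)
    idx2 = next((i for i, (_, hg) in enumerate(genes) if hg == b2), None)
    if idx1 is None or idx2 is None:
        return []
    # The boundaries may appear in either order, depending on BGC orientation.
    start, end = sorted((idx1, idx2))
    return genes[start:end + 1]


# Toy BGC with made-up locus tags and homolog group identifiers.
bgc = [
    ("g1", "OG0000101"),
    ("g2", "OG0000057"),
    ("g3", "OG0000012"),
    ("g4", "OG0000033"),
    ("g5", "OG0000200"),
]
trimmed = trim_between(bgc, "OG0000057", "OG0000033")
print([tag for tag, _ in trimmed])  # ['g2', 'g3', 'g4']
```

Sorting the two indices makes the result independent of which boundary is passed first, which matters because the same BGC can be predicted on either strand in different genomes.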

Example

As an example, one of the major BGCs we found in M. luteus genomes was predicted to encode a terpene-related metabolite. Netzer et al. 2010 functionally characterized this BGC, identifying the key genes and their roles in producing the terpene carotenoid sarcinaxanthin. Using the characterized BGC as a reference, we trimmed the raw antiSMASH BGC predictions to produce refined BGC GenBank files and visualized the results before and after refinement with lsaBGC-See.py.

Usage

usage: lsaBGC-Refiner.py [-h] -g GCF_LISTING -m ORTHOFINDER_MATRIX [-i GCF_ID] -o OUTPUT_DIRECTORY [-p BGC_PREDICTION_SOFTWARE] -b1 FIRST_BOUNDARY_HOMOLOG -b2 SECOND_BOUNDARY_HOMOLOG

        Program: lsaBGC-Refiner.py
        Author: Rauf Salamzade
        Affiliation: Kalan Lab, UW Madison, Department of Medical Microbiology and Immunology

        This program will take in a list of homologous (ideally orthologous) BGC genbanks belonging to a single GCF and
        whittle them down to include only annotations/features in between user specified homolog groups. It is particularly
        useful for curation of GCFs which feature distinct BGCs aggregated together due to close physical proximity as
        described in: https://msystems.asm.org/content/6/2/e00057-21/article-info


optional arguments:
  -h, --help            show this help message and exit
  -g GCF_LISTING, --gcf_listing GCF_LISTING
                        BGC listings file for a gcf. Tab delimited: 1st column lists sample name while the 2nd column is the path to a BGC prediction in Genbank format.
  -m ORTHOFINDER_MATRIX, --orthofinder_matrix ORTHOFINDER_MATRIX
                        OrthoFinder homolog by sample matrix.
  -i GCF_ID, --gcf_id GCF_ID
                        GCF identifier.
  -o OUTPUT_DIRECTORY, --output_directory OUTPUT_DIRECTORY
                        Output directory.
  -p BGC_PREDICTION_SOFTWARE, --bgc_prediction_software BGC_PREDICTION_SOFTWARE
                        Software used to predict BGCs (Options: antiSMASH, DeepBGC, GECCO).
                        Default is antiSMASH.
  -b1 FIRST_BOUNDARY_HOMOLOG, --first_boundary_homolog FIRST_BOUNDARY_HOMOLOG
                        Identifier for the first homolog group to be used as boundary for pruning BGCs.
  -b2 SECOND_BOUNDARY_HOMOLOG, --second_boundary_homolog SECOND_BOUNDARY_HOMOLOG
                        Identifier for the second homolog group to be used as boundary for pruning BGCs.
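
As a rough illustration of how the inputs fit together, the sketch below writes a two-column, tab-delimited GCF listing file and assembles a command line from the flags documented above. All sample names, file paths, the GCF identifier, and the homolog group identifiers are placeholders invented for the example; substitute values from your own analysis. The command is only printed, not executed.

```python
import shlex
import tempfile
from pathlib import Path

# Placeholder working directory; use your own analysis directory instead.
workdir = Path(tempfile.mkdtemp())
listing = workdir / "gcf_listing.txt"

# Hypothetical samples and BGC GenBank paths (1st column: sample name,
# 2nd column: path to the BGC prediction in GenBank format).
rows = [
    ("sample_A", "/path/to/sample_A.region001.gbk"),
    ("sample_B", "/path/to/sample_B.region003.gbk"),
]
with open(listing, "w") as fh:
    for sample, gbk_path in rows:
        fh.write(f"{sample}\t{gbk_path}\n")

# Only flags listed in the usage message are used; every value is a placeholder.
cmd = [
    "lsaBGC-Refiner.py",
    "-g", str(listing),
    "-m", "/path/to/orthofinder_matrix.tsv",
    "-i", "GCF_1",
    "-o", str(workdir / "refined"),
    "-p", "antiSMASH",
    "-b1", "OG0000057",
    "-b2", "OG0000033",
]
print(shlex.join(cmd))
```

Building the command as a list and printing it with `shlex.join` makes it easy to inspect before running it, for instance with `subprocess.run(cmd, check=True)` once lsaBGC is installed and on the `PATH`.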