-
Notifications
You must be signed in to change notification settings - Fork 4
14. lsaBGC Easy Tutorial
Running lsaBGC-Easy can be as simple as:
# with the lsaBGC conda environment activated!
lsaBGC-Easy.py -n "Cutibacterium avidum" -g Additional_Genomes/ -c 20 -o lsaBGC_for_Cutibacterium_avidum/ -om panaroo
Here "Cutibacterium avidum" is the name of the taxa of interest. Additional_Genomes/
contains additional genomes in FASTA format which are absent in NCBI's Genbank database. E.g. these could be genomes that you might have recently sequenced and not made public yet. It is optional and you could run an analysis using entirely public genomes. The -om panaroo
argument requests usage of Panaroo instead of OrthoFinder, which makes sense here because we are looking at a single bacterial species. The number of threads requested is 20 (default is 4). And finally, the lsaBGC_for_Cutibacterium_avidum/
is the resulting directory.
This command took us approximately 15 minutes to run from start to finish (but note it is using 20 threads)!
If you want to just use lsaBGC-Easy for a restricted analysis on genomes you have available in FASTA format in some directory called Genomes/
, you can run lsaBGC-Easy.py
as such:
lsaBGC-Easy.py -n "None" -g Genomes/ -c 40 -o lsaBGC_Easy_Results/
This will not download additional genomes from NCBI.
Dereplication is automatically performed to select a smaller set of representative genomes from those provided, but if you wish to be inclusive of all genomes, you can turn off dereplication with the -sd
flag.
If you are confident that your server is able to handle a large number (N) of primary genomes, which will require an N^2 operation for performing all vs. all BLAST/DIAMOND analysis, you can override safety limits by issuing -x
. This will allow you to go past the default limit of 100 genomes, to a maximum of 300, following dereplication to prevent really long compute times.
Note, if running workflows via Docker, antiSMASH will simply run automatically if requested and does not require the multi-step described below.
To run lsaBGC-Easy with antiSMASH or DeepBGC, you will need to have these programs installed separately, because they are not included in the conda environment for lsaBGC (due to conflicts). In such a case, you will just run lsaBGC-Easy twice:
1. First run lsaBGC-Easy.py to initiate and setup task file for antiSMASH/DeepBGC predictions:
# with the lsaBGC conda environment activated!
lsaBGC-Easy.py -n "Cutibacterium avidum" -g Additional_Genomes/ -c 20 -o lsaBGC_for_Cutibacterium_avidum/
2. Run antiSMASH/DeepBGC commands iteratively/in-parallel with separate conda environment:
# with the antiSMASH (v6+) or DeepBGC conda environment activated!
# iterative running BGC prediction commands 1-by-1:
bash lsaBGC_for_Cutibacterium_avidum/bgc_prediction.cmds
# OR
# run 5 jobs in parallel using parallelJobs.py (script in lsaBGC) - where each antiSMASH job uses 4 threads
python /path/to/lsabgc-installation/scripts/parallelJobs.py -i lsaBGC_for_Cutibacterium_avidum/bgc_prediction.cmds -p 5
3. Restart lsaBGC-Easy once antiSMASH/DeepBGC predictions have finished.
# with the lsaBGC conda environment activated!
lsaBGC-Easy.py -n "Cutibacterium avidum" -g Additional_Genomes/ -c 20 -o lsaBGC_for_Cutibacterium_avidum/
GECCO (by Carroll and Larralde et al 2021) is the default software for BGC prediction because it is lightweight and reliable. It is included as a dependency of the lsaBGC conda environment.
Note, arguments used in the automatically generated anitSMASH commands are currently simplistic and include only the arguments: --output-dir
, --gene-finding-tool none
, and --output-basename
to allow for compatibility with older versions of antiSMASH (though exclusively tested with v6.0.0). We plan to make the option list for antiSMASH adjustable, to allow for additional analyses offered within antiSMASH, in the next release of lsaBGC.
GenBank assembly accessions are gathered for a species or genus from a listing from the most recent GTDB release (R214 for >v1.36, R207 before). This is to make sure genomes meet some standard and are of reasonable quality (i.e. fairly complete and not contaminated) and to assure proper taxonomic assignment. Genomes are then downloaded for the gathered accessions using from the NCBI GenBank database using ncbi-genome-download. Then, GToTree is used to create a species tree from a set of single-copy-genes (SCGs). The protein alignment of SCGs is also used to perform dereplication of the genome set to speed performance and more appropriately estimate evolutionary stats downstream.
Next, GECCO (by default; but also can use antiSMASH or DeepBGC) is run or commands are printed for each of the dereplicated set of genomes. GECCO is the default because it is light-weight and part of the lsaBGC conda environment. To use antiSMASH or DeepBGC instead, lsaBGC-Easy.py
will print out a task file with antiSMASH/DeepBGC commands which can be run iteratively or in parallel by the user using a separate conda environment for either tool. Then the user would simply rerun the same lsaBGC-Easy.py
command as before to pick up where the workflow left off. Checkpointing for certain files/directories throughout the workflow should make it easy to restart if the program aborts at any stage.
After BGC predictions have been achieved, BiG-SCAPE or lsaBGC-Cluster.py
can be used to cluster BGCs into GCFs and lsaBGC-Ready.py
is used to generate the inputs needed for lsaBGC-AutoExpansion.py
or lsaBGC-AutoAnalyze.py
. Note, since v1.35, lsaBGC-AutoExpansion.py
is no longer run by default to search for missing pieces/instances of GCFs due to assembly fragmentation. This is because it can lead to false positives for BGC-rich taxa (e.g. those with > 10 BGCs per genome) and in such cases lsaBGC-AutoExpansion.py
should be run manually, results investigated via lsaBGC-See.py
and false positive BGCs removed from GCF listings. For species- or lineage-specific analyses, in particular of taxa with fewer BGCs, we do recommend turning on lsaBGC-AutoExpansion.py
via the --perform_auto_expansion
flag to lead to better sensitivity and holistic understanding of GCF distributions phylogenetically
Finally, lsaBGC-AutoAnalyze.py
is run and produces the major results of interest to the user which are described on another Wiki page. GSeeF, since v1.33, is now also run as part of the workflow.
lsaBGC-Easy is highly automated but results should be interpreted with caution. In particular, please consider:
- Whether the default clustering of BGCs using thresholds equivalent to what we used for the investigation of four skin genera in the lsaBGC manuscript which may not be optimal for the genus/species you are looking at.
- Manual running of lsaBGC-Cluster can produce a report which showcases how GCFs might be comprised of multiple BGC classes, whether GCFs often are multi-copy per genome, etc. that allows users to select the delineation best suited for their research interests.
- BiG-SCAPE clustering, where the authors had optimized clustering thresholds based on a BGC-to-metabolite matched dataset, can also be requested via the ‘--use-bigscape’ argument. BiG-SCAPE also features a neat "glocal" mode for grouping together BGCs into GCFs that might be of interest.
- lsaBGC-AutoExpansion is turned off by default, this can result in missing the presence of GCFs in highly fragmented draft genomes because only complete BGCs (those far from scaffold/contig edges) are used by default as well. This can lead to faulty conclusions around phylogenetic breadth of GCFs. We thus recommend turning on the flag to run lsaBGC-AutoExpansion if dealing with taxa with low BGC counts per genome (where false positive detection is less likely), such as the four genera we investigated in the lsaBGC manuscript, or manually running lsaBGC-AutoExpansion and visually inspecting BGC predictions, e.g. using lsaBGC-See, to manually prune GCF listings for false positives.
- lsaBGC-AutoExpansion, when requested, is used to attempt to detect fuller instances of GCFs and avoid incorporation of more partial BGCs which are on contig/scaffold edges - however if a full version of such a GCF is not detected in another genome - then such rare GCFs could potentially be disregarded.
- lsaBGC-AutoExpansion has the potential to expand BGC prediction boundaries by antiSMASH, GECCO, and DeepBGC and synchronize the inclusion of auxiliary flanking genes as we showed in the lsaBGC manuscript. Currently, lsaBGC-AutoExpansion prioritizes retention of original BGC delineations by antiSMASH, GECCO, DeepBGC, but we might change this in future versions of the suite.
- To overcome differences in how BGC location is presented in antiSMASH, DeepBGC, and GECCO GenBanks, since version 1.37 we parse the location of BGCs in the GenBanks based on the type of BGC prediction software. For GECCO, where location information is not provided within the GenBank, we map BGCs to the genome to determine their location. BGCs which do not map 100% to the original genome are currently ignored from downstream analyses and are noted in a file within the lsaBGC-Ready results subdirectory.
- Note, homology could potentially be over or under-estimated for multi-domain proteins. Generally, GCF clustering is less affected because auxiliary "shell" proteins further help partition BGCs with similar core/key-domain containing proteins. For modular multi-domain proteins, such as NRPSs or PKSs, under-clustering might result if intra-taxa HGT of modules occurred. This is more likely for BGC-rich and high-diversity taxonomic groups.
- A key feature of the OrthoFinder methodology, which we use for ortholog group determination, is gene-length normalization (described in their first paper from 2015). In addition, we request more refined "hierarchical" ortholog groups to be computed by default because empirically it seems to better partition orthologs from paralogs, however, the duplication-loss-coalescent model used to achieve HOGs is not designed with bacteria in mind where horizontal gene transfer can be common (described in second paper from 2019). For the lsaBGC paper we thus used the coarser OrthoFinder clusters and these can still be requested via the
-mc
/run_coarse_orthofinder
. This will lead to more coarse ortholog groups. Note, lsaBGC marks if ortholog groups are found in multiple copies within the context of the GCF in the resulting spreadsheets. In addition, sometimes genes are grouped within coarser ortholog groups but not the more resolute hierarchical groups. Previously, (in versions 1.40 and earlier) we binned the remaining genes from a coarse orthogroup not found in HOGs together as an ortholog group with the prefix "OG" instead of "HOG". See the following GitHub ticket for further information. Because these missing genes in the "HOGs" are presumed putative xenologs marked by OrthoFinder, see this other GitHub ticket, in version 1.50, we changed the handling of these genes which drop off in the HOGs to try re-incorporate such genes into their nearest HOGs where possible, for the rest, we treat them as singletons. - In version v1.50, we also introduced Panaroo as a requestable option for finding ortholog groups. Panaroo is highly scalable and a great, highly accurate, method for inferring ortholog groups within the context of a single bacterial species or lineage, it will by default trigger the
--ignore_limits
flag in lsaBGC-Easy.py. -
Historical note: In version v1.38, we introduced SonicParanoid2 as an alternate method for determining ortholog groups, requestable via
-om sonicparanoid
. Due to difficulties in managing conflicts between dependencies, we unfortunately decided to later remove SonicParnaoid2 in v1.53.
lsaBGC-DiscoVary.py
- the metagenomics part of lsaBGC - is not currently integrated into lsaBGC-Easy.py
but might be in the future! This will likely look like the ability to provide metagenomes as an additional argument to search for the taxa-specific GCF instances in them + mine for novel SNVs (similar to and inspired by approaches for mining for novel variants of COVID).
lsaBGC-Euk-Easy allows users to run lsaBGC-Easy for eukaryotes, primarily fungal genomes. Note, the input is different than lsaBGC-Easy. The input for lsaBGC-Euk-Easy is a directory with full GenBanks (GenBanks with gene-calling already performed and provided as CDS features).
# with the lsaBGC conda environment activated!
# 1. download full GenBanks from NCBI using ncbi-genome-download for Aspergillus flavus
ncbi-genome-download -F genbank --flat-output -g "Aspergillus flavus" -o Aflavus_GenBanks_from_GenBank_All/ -s genbank fungi
# 2. run lsaBGC-Euk-Easy
lsaBGC-Euk-Easy.py -g Downloaded_Genomes/ -c 40 -o lsaBGC_for_Aspergillus_flavus/
Please note, lsaBGC-Euk-Easy is less well tested than lsaBGC-Easy so if issues pop up, please open up an issue/ticket on GitHub and share!
usage: lsaBGC-Easy.py [-h] -n TAXA_NAME [-g USER_GENOMES_DIRECTORY] -o OUTPUT_DIRECTORY [-p BGC_PREDICTION_SOFTWARE] [-x] [-gtm GTOTREE_MODEL] [-iib] [-b] [-bo BIGSCAPE_OPTIONS]
[-lci LSABGC_CLUSTER_INFLATION] [-lcj LSABGC_CLUSTER_JACCARD] [-lcr LSABGC_CLUSTER_SYNTENY] [-pae] [-sd] [-dt DEREPLICATE_THRESHOLD] [-pt POPULATION_THRESHOLD] [-py]
[-om ORTHOLOG_METHOD] [-mc] [-c CPUS] [-ao ANTISMASH_OPTIONS]
__ ___ _____ _____
/ / ___ ___ _ / _ ) / ___/ / ___/
/ / (_-</ _ `/ / _ |/ (_ / / /__
/_/ /___/\_,_/ /____/ \___/ \___/
*******************************************************************************************************************
Program: lsaBGC-Easy.py
Author: Rauf Salamzade
Affiliation: Kalan Lab, UW Madison, Department of Medical Microbiology and Immunology
QUICK DESCRIPTION:
Workflow to run the majority of lsaBGC functionalities for a specific taxa or manually defined set of user-provided
genomes or both. For information on the workflow and examples on how to run please take a look at the Wiki:
https://github.com/Kalan-Lab/lsaBGC/wiki/14.-lsaBGC-Easy-Tutorial
If interested in fungal investigation please check out lsaBGC-Euk-Easy.py.
*******************************************************************************************************************
CONSIDERATIONS:
* Check out the considerations wiki page: https://github.com/Kalan-Lab/lsaBGC/wiki/00.-Background-&-Considerations
* If interested in running just a directory of genomes (you don't want to downoad genomes from Genbank for some taxa)
then you can just set the required -n argument to "None".
* Using Panaroo for orthology will automatically result in turning on the '--ignore_limits' flag which will allow
up to 2000 genomes from proceeing from processing to lsaBGC-Ready and onwards.
*******************************************************************************************************************
OVERVIEW OF STEPS TAKEN:
- Check number of genomes for taxa is not too crazy (<1000) or too small (>10)
- Step 1: Get set of Genbank assembly accessions from recent GTDB release matching taxa.
- Step 2: Download all genomes in FASTA format using ncbi-genome-download and perform gene-calling with prodigal.
- Step 3: Run GToTree, Dereplicate, Group Samples into Populations/Clades, and Create genomes listing
- Step 4: Run GECCO based annotation of BGCs and crete BGC listing or Create Task File with antiSMASH/DeepBGC commands
- Step 5: Run lsaBGC-Ready.py with lsaBGC-Cluster or BiG-SCAPE
- Step 6: Run lsaBGC-AutoExpansion.py polishing to find GCF instances fragmented on multiple scaffolds
- Step 7: Run lsaBGC-AutoAnalyze.py
- Step 8: Run GSeeF.py
*******************************************************************************************************************
GToTree taxa models available (more resolute taxonomic groups will have more genes to use for building phylogeny):
(1) Actinobacteria, (2) Alphaproteobacteria, (3) Archaea, (4) Bacteria, (5) Bacteria_and_Archaea,
(6) Betaproteobacteria, (7) Chlamydiae, (8) Cyanobacteria, (9) Epsilonproteobacteria, (10) Firmicutes,
(11) Firmicutes, (12) Gammaproteobacteria, (13) Proteobacteria, (14) Tenericutes, (15) Universal_Hug_et_al
*******************************************************************************************************************
optional arguments:
-h, --help show this help message and exit
-n TAXA_NAME, --taxa_name TAXA_NAME
Name of the taxa of interest as listed in GTDB. If there is a space in the
name, please surround by quotes.
-g USER_GENOMES_DIRECTORY, --user_genomes_directory USER_GENOMES_DIRECTORY
A directory with additional genomes, e.g. those recently sequenced by the
user, belonging to the taxa. Accepted formats include FASTA.
Accepted suffices include: .fna, .fa, .fasta.
-o OUTPUT_DIRECTORY, --output_directory OUTPUT_DIRECTORY
Parent output/workspace directory.
-p BGC_PREDICTION_SOFTWARE, --bgc_prediction_software BGC_PREDICTION_SOFTWARE
Software used to predict BGCs (Options: antiSMASH, DeepBGC, GECCO). GECCO is
automatic, but for the other two, lsaBGC-Easy will produce a task-file which
you will need to run in a seperate environment with antiSMASH or DeepBGC installed,
then rerun lsaBGC-Easy.py using the original command [Default
is GECCO].
-x, --ignore_limits Ignore limitations on number of genomes allowed.
E.g. allow for analyses of taxa with more than 2000 genomes available before
dereplication and more than 100 genomes after dereplication. Not recommend, be
cautious!!! Also note, you can always delete "Dereplicated_Set_of_Genomes.txt"
in the results directory and redo dereplication with different threshold.
-gtm GTOTREE_MODEL, --gtotree_model GTOTREE_MODEL
SCG model for secondary GToTree analysis and what would be used for dereplication.
[Default is "Bacteria"].
-iib, --include_incomplete_bgcs
Whether to account for incomplete BGCs (those near contig edges) prior to clustering.
-b, --use_bigscape Use BiG-SCAPE for BGC clustering into GCFs instead of lsaBGC-Cluster. Recommended if
you want to include incomplete BGCs for clustering and are using antiSMASH.
-bo BIGSCAPE_OPTIONS, --bigscape_options BIGSCAPE_OPTIONS
Options for BiG-SCAPE clustering of BGCs if requested (should be surrounded by quotes).
[Default is "--hybrids-off --include_singletons"].
-lci LSABGC_CLUSTER_INFLATION, --lsabgc_cluster_inflation LSABGC_CLUSTER_INFLATION
Value for MCL inflation parameter to use in lsaBGC-Cluster [Default is 4.0].
-lcj LSABGC_CLUSTER_JACCARD, --lsabgc_cluster_jaccard LSABGC_CLUSTER_JACCARD
Minimal Jaccard Index cutoff to regard two BGCs as potentially homologous
in lsaBGC-Cluster [Default is 20.0].
-lcr LSABGC_CLUSTER_SYNTENY, --lsabgc_cluster_synteny LSABGC_CLUSTER_SYNTENY
Minimal absolute correlation coefficient to measure syntenic similarity and
regard two BGCs as potentially homologous in lsaBGC-Cluster [Default is 0.7].
-pae, --perform_auto_expansion
Perform lsaBGC-AutoExpansion.py to find missing pieces of BGCs due to assembly
fragmentation. Will increase sensitivity at the potential cost of false positives,
recommended for taxa with < ~10 BGCs per genome or more constrained
lineages/species. For genus-wide analyses, especially of BGC-rich taxa, please use
expansion manually and assess lsaBGC-See reports to filter false positives.
-sd, --skip_dereplication
Whether to skip dereplication based on GToTree alignments of SCGs - not
recommended and can cause issues if there are a lot of
genomes for the taxa of interest.
-dt DEREPLICATE_THRESHOLD, --dereplicate_threshold DEREPLICATE_THRESHOLD
Amino acid similarity threshold of SCGs for considering
two genomes as redundant [Default is 0.999].
-pt POPULATION_THRESHOLD, --population_threshold POPULATION_THRESHOLD
Amino acid similarity threshold of SCGs for considering
two genomes as belonging to the same population [Default is 0.99].
-py, --use_pyrodigal Use pyrodigal instead of prodigal.
-om ORTHOLOG_METHOD, --ortholog_method ORTHOLOG_METHOD
Software for inference of ortholog groups. (Options: OrthoFinder & Panaroo).
[Default is OrthoFinder].
-mc, --run_coarse_orthofinder
Use coarse clustering of homolog groups in OrthoFinder instead of more
resolute hierarchical determined homolog groups. There are some advantages to coarse
OGs, including their construction being deterministic.
-c CPUS, --cpus CPUS Total number of CPUs to use [Default is 4].
-ao ANTISMASH_OPTIONS, --antismash_options ANTISMASH_OPTIONS
Options for antiSMASH prediction analysis (should be surrounded by
quotes, in Docker - it is assumed each individual job will have 4 CPUs).
[Default is "--genefinding-tool none --cpus 4"]