Skip to content

Cluster munitions for microbial genomic sequence data

License

Notifications You must be signed in to change notification settings

rotheconrad/HeatSeq

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

62 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

HeatSeq™: cluster munitions for microbial genomics

HeatSeq™ Logo

HeatSeq™ allows the user to quickly and easily predict and visualize distance based clusters for all vs. all ANI, AAI, Mash distance and others. Excellent for exploring distance based genomic clades or sample groups. Developed to investigate intraspecies clades such as genomovars and phylogroups.

HeatSeq™ uses hierarchical clustering to generate distance based cluster predictions for genomic (ANI, AAI) and metagenomic (beta distance, MASH, Simka) data, or any other all vs. all distance type matrix, and it builds a clustered heatmap with column labels to highlight predicted cluster assignments and other user provided metadata. It outputs a seaborn clustermap in vector based PDF format along with separate PDF legend corresponding to each metadata row. It also outputs a metadata tab separated value (tsv) file with the predicted cluster assignments for each cluster threshold for each genome, metagenome, or sample along with a separate tsv file specifying the color assignments. These output files may be modified by the user and input back into the code to customize the final figure output.

Table of Contents

  1. Usage
    1. Case One: Default
    2. Case Two: User Defined
    3. Case Three: Custom Metadata
  2. Examples
    1. fastANI
    2. ANI
    3. AAI
    4. Mash
    5. Simka
    6. Distance
  3. Software Dependencies
  4. How to Cite
  5. Future Improvements

Example output

Finished heatmap figure

Final Figure Example

Predicted clusters table (see: files/example_predicted_clusters.tsv)

The genome/sample order in the predicted_clusters.tsv file matches the order of the clustermap. The outputs also include a meta_colors.tsv file where a color has been assigned to each predicted cluster for each distance threshold, and a distmat.tsv which is the distance matrix reordered to match the clustermap.

genomes clade-0 (96.0% ANI; t=0.04) clade-1 (97.4% ANI; t=0.026) clade-2 (98.0% ANI; t=0.02) clade-3 (98.7% ANI; t=0.013) clade-4 (99.2% ANI; t=0.008) clade-5 (99.8% ANI; t=0.002)
gnm_1383800v1 c0-006 c1-007 c2-008 c3-013 c4-032 c5-039
gnm_1389944v1 c0-006 c1-007 c2-008 c3-019 c4-048 c5-059
gnm_1380692v1 c0-006 c1-007 c2-008 c3-019 c4-048 c5-059
gnm_1380608v1 c0-006 c1-007 c2-008 c3-019 c4-048 c5-059
gnm_1380652v1 c0-006 c1-007 c2-008 c3-019 c4-048 c5-059
gnm_1388647v1 c0-006 c1-007 c2-008 c3-019 c4-048 c5-059
gnm_1388747v1 c0-006 c1-007 c2-008 c3-019 c4-048 c5-059
gnm_1383334v1 c0-006 c1-007 c2-008 c3-024 c4-053 c5-066
gnm_1388961v1 c0-006 c1-007 c2-008 c3-024 c4-053 c5-066
gnm_1389954v1 c0-006 c1-007 c2-008 c3-024 c4-053 c5-066
gnm_1389962v1 c0-006 c1-007 c2-008 c3-024 c4-053 c5-066
gnm_1389099v1 c0-006 c1-007 c2-008 c3-020 c4-049 c5-061
gnm_1383380v1 c0-006 c1-007 c2-008 c3-020 c4-049 c5-060

Additional output files (see tests folder for actual files)

  1. distmat.tsv - square data matrix sorted according to clustered heatmap
  2. meta_colors.tsv - colors corresponding to unique values in predicted_clusters.tsv file
  3. Legend_clade-x.pdf - Legends corresponding to the predicted clades. These can be added to the figure using Adobe Illustrator or other.

Usage

Their are three initial options:

  1. Predict clusters for local minimumns in the distance distribution (default).
  2. Predict clusters for user defined values (-user). e.g. -user 95 97.5 98.5 99.5
  3. Skip cluster prediction and use user defined metadata. e.g. -m meta_data.tsv -c meta_colors.tsv

And several additional parameters:

  1. -dtype, Specify data type. One of fastANI (default), ANI, AAI, Mash, Simka, Distance
    1. fastANI (default) - this takes the longform fastANI output and parses it into a distance matrix.
    2. ANI - this is for a square all vs. all matrix of ANI values with 100's down the diagonal.
    3. AAI - this is for a square all vs. all matrix of AAI values with 100's down the diagonal.
    4. Mash - this is for a Mash distance matrix.
    5. Simka - this is for a Simka distance matrix.
    6. Distance - this is any square distance matrix with values 0-1.
  2. -dmin, Specify the minimum value for colors in the figure.
  3. -dmax, Specify the maximum value for colors in the figure.
  4. -metric, set different cluster metrics. choose from:
    1. euclidean, seuclidean, correlation, hamming, jaccard, braycurtis
    2. See webpage/docs for scipy.spatial.distance.pdist for additional options and more info.
  5. -method, set different cluster method choose from:
    1. single, complete, average, weighted, centroid, median, ward
    2. See webpage/docs for scipy.cluster.hierarchy.linkage for more info.
  6. -no, Create clustermap of the data without cluster prediction or metadata.
  7. -bw, Adjust the KDE bandwidth parameter to control smoothing and valley prediction.
  8. -l, Add axis labels to plot. Default is no labels because they are typically difficult to read.
  9. -hcg, Specify the number of colors to use in the heatmap color gradient (2-11). If you would like to customize the colors, you can edit the simple color list in HeatSeq.py on lines 273-277 with any colors you'd like.
# print the help menu
python HeatSeq.py -h

Case One: Default

The default case uses scipy.stats.gaussian_kde and scipy.signal.find_peaks to estimate local minimums and maximums in the distance distribution. The local minimums are used as distance thresholds during the hierarchical clustering process.

python HeatSeq.py -i files/01_example_fastANI_allV.tsv -o tests/01_example_default

Note the bandwidth affects the smoothing in the KDE estimate which in turn influences the local minimum (valley) estimates. You can use the -bw parameter to experiment

KDE with peaks (local maximums) and valleys (local minimums)

Final Figure Example

Case Two: User Defined

The user defined case allows the local minimum estimates to be amended, appended, or changed entirely based on pure curiosity and speculation. In the example below we amended and appended the local minimums that were identified in case one using some intra-species level distance thresholds.

Use -user parameter to input desired thresholds

python HeatSeq.py -i files/01_example_fastANI_allV.tsv -o tests/02_example_default -user 96 97.4 98 98.7 99.2 99.8

Case Three: Custom Metadata

The custom metadata case allows the user to modify the predicted_clusters.tsv and the meta_colors.tsv files output from the case one or two outputs to include additional metadata categories and/or alter the colors. In this example we've added niche and clermontyping assignments for each genome and we've added color assignments for each unique niche or clermontyping category.

Alternately, Case one and/or two may be skipped entirely if the user prepares a properly formatted meta data file (predicted_clusters.tsv) and a meta colors file on their own.

Meta file formatting

See example files in tests folder

  1. Meta data: A tab separated metadata file with a row for each genome in the ANI file with the exact genome name, and columns for each meta value. File should include a header.
  2. Meta colors: A tab separated file of unique meta catagories and colors. Note: Each meta data category must be unique across all meta data (e.g. Site B1, Phylogroup B1 cannot both be B1).

Meta data example:

Genome Site Niche Phylogroup
Genome1 Site A1 Sheep B1
Genome2 Site B1 Cow E
Genome3 Site C1 Pig A

Meta colors example:

Site A1 #ffffb3
Site B1 #377eb8
Site C1 #ff7f00
Sheep #f781bf
Cow #4daf4a
Pig #3f1gy5
B1 #f781bf
E #4daf4a
A #3f1gy5

Use -m and -c parameters to input file paths

An easy way to customize the metadata is to run HeatSeq™ like Case One or Case Two, and then open the predicted_clusters.tsv and meta_colors.tsv files in Excel or other spreadsheet program to amend and/or append the desired data.

python HeatSeq.py -i files/01_example_fastANI_allV.tsv -o tests/03_example_default -m files/01_example_predicted_clusters.tsv -c files/01_example_meta_colors.tsv

(Return to Table of Contents)

Examples

fastANI

fastANI is the default case and was utilized in the case examples above. Here we show examples with some optional parameters

fastANI: https://github.com/ParBLiSS/FastANI

# Use fastANI in many to many WITHOUT the --matrix parameter
# if used --matrix and have the fastANI.matrix file use -dtype ANI outlined below.
./fastANI --ql [QUERY_LIST] --rl [REFERENCE_LIST] -o 01_example_fastANI_allV.tsv

# set min and max
python HeatSeq.py -i files/01_example_fastANI_allV.tsv -o tests/xa_example_default -dmin 90 -dmax 99 -user 98.5 96.5

# no metadata or predicted clusters
python HeatSeq.py -i files/01_example_fastANI_allV.tsv -o tests/xb_example_default -no True

# different method and metric
python HeatSeq.py -i files/01_example_fastANI_allV.tsv -o tests/xc_example_default -method centroid -metric correlation -user 99.5 98.5 97.5 96.5

ANI

This option is to accommodate ANI values estimated with other tools. Arrange the data into a square matrix and format at a tsv file.

This matrix should be actual ANI values with 100's along the diagonal (for self matches) and not 100-ANI with 0's on the diagonal.

python HeatSeq.py -i files/01_example_fastANI_allV.tsv -o tests/04_example_ANI -dtype ANI

this option has incomplete example and testing but should work.

AAI

This options alters the value range for hierarchical clustering and the heatmap to accommodate the lower percentage range of AAI estimates. Any AAI tool may be used. Arrange the data into a square matrix and format at a tsv file.

This matrix should be actual AAI values with 100's along the diagonal (for self matches) and not 100-AAI with 0's on the diagonal.

One AAI option is from the Enveomics Collection. Enveomics collection: http://enve-omics.ce.gatech.edu/enveomics/docs

  1. First run aai.rb for all vs all genomes: http://enve-omics.ce.gatech.edu/enveomics/docs?t=aai.rb
  2. Then run Table.df2dsit.R to convert to a square distance matrix: http://enve-omics.ce.gatech.edu/enveomics/docs?t=Table.df2dist.R
python HeatSeq.py -i files/05_example_AAI_allV.tsv -o tests/05_example_AAI -dtype AAI -l True -hcg 3 -m files/05a_metadata.tsv -c files/05b_metacolors.tsv

Mash

This option takes the output from Mash. It alters the value range for hierarchical clustering and the heatmap to fit 0-1. Diagonals (self matches) should be 0's.

Mash releases: https://github.com/marbl/Mash/releases Mash docs: https://mash.readthedocs.io/

# Step 1: create sketch of all genomes or metagenomes.
mkdir mySketches
for f in myFiles/*fasta; do n=`basename $f | cut -d. -f1`; ./mash sketch -o mySketches/$n $f; done
# * # myFiles may contain metagenomes or genomes in fastq or fasta format.

# Step 2: concatenate sketches in single file.
./mash paste allSketches mySketches/*.msh

# step 3. compute all vs all dist with allSketches.
./mash dist allSketches.msh allSketches.msh -t > files/06_example_mash_allV.tsv

# Step 4: predict clusters and build heatmap figure.
python HeatSeq.py -i files/06_example_mash_allV.tsv -o tests/06_example_MASH -dtype Mash -user 0.06 0.08 0.1

Simka

This option takes the output from Simka. It alters the value range for hierarchical clustering and the heatmap to fit 0-1. Diagonals (self matches) should be 0's.

Simka: https://github.com/GATB/simka

# Step 1: Generate all vs all input file.
>for f in ${reads_dir}/*1.fastq; do n=basename $f | cut -d_ -f1; echo '${n}:$f;${n}2.fastq'; done > simka_input.txt

# Step 2: run all vs all simka with 50GB and 12 cores.
simka -max-memory 50000 -nb-cores 12 -in simka_input.txt -out simka_out -out-tmp simka_temp

# Step 3: predict clusters and build heatmap figure.
python HeatSeq.py -i files/07_example_simka_allV.txt -o tests/07_example_SIMKA -dtype Simka -user 0.5 0.7 0.9

Distance

Any square matrix with values 0-1 may be used. Matrix file should be formatted as a tsv file with diagonals (self matches) equal to 0.

this option has incomplete example and testing but should work.

python HeatSeq.py -i files/08_example_distance_allV.tsv -o tests/08_example_distance -dtype Distance

(Return to Table of Contents)

Software Dependencies

External dependencies

Code in this repo uses only Python and Python packages which are listed below. However, it requires the user to independently compute distance (ANI, AAI, Mash etc.) with an external tool prior to running this code.

Use fastANI, enveomics, Mash, Simka etc.

Required packages for Python

Python and all packages can be easily installed with conda or pip. Prodigal, BLAST+ and MMseqs2 can also be installed easily with Conda. Just search "conda install name"

References

  1. Sanner MF. Python: a programming language for software integration and development. J Mol Graph Model. 1999 Feb 1;17(1):57-61.
  2. Van Rossum G, Drake FL. Python 3 Reference Manual. Scotts Valley, CA: CreateSpace; 2009.
  3. McKinney W, others. Data structures for statistical computing in python. In: Proceedings of the 9th Python in Science Conference. 2010. p. 51–6.
  4. Harris CR, Millman KJ, van der Walt SJ, Gommers R, Virtanen P, Cournapeau D, et al. Array programming with NumPy. Nature. 2020;585:357–62.
  5. Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, et al. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods. 2020;17:261–72.
  6. Hunter JD. Matplotlib: A 2D graphics environment. Computing in science & engineering. 2007;9(3):90–5.
  7. Waskom ML. Seaborn: statistical data visualization. Journal of Open Source Software. 2021 Apr 6;6(60):3021.

(Return to Table of Contents)

How to Cite

If you use any part of this workflow in your research please cite the following manuscript:

PLACEHOLDER FOR MANUSCRIPT CITATION AND LINK

Future Improvements

  1. Add additional parsing support for outputs from other tools to save the user from creating the distance matrix.
  2. Add support for checkM or other tools to create the option of selecting a cluster representative for each predicted cluster.
  3. Add cluster analysis tools and figures to evaluate the effectiveness of clustering at each distance threshold. Silhouette plots etc.
  4. Add KDE bandwidth parameter optimization.

(Return to Table of Contents)

About

Cluster munitions for microbial genomic sequence data

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages