HeatSeq™ allows the user to quickly and easily predict and visualize distance based clusters for all vs. all ANI, AAI, Mash distance and others. Excellent for exploring distance based genomic clades or sample groups. Developed to investigate intraspecies clades such as genomovars and phylogroups.
HeatSeq™ uses hierarchical clustering to generate distance based cluster predictions for genomic (ANI, AAI) and metagenomic (beta distance, MASH, Simka) data, or any other all vs. all distance type matrix, and it builds a clustered heatmap with column labels to highlight predicted cluster assignments and other user provided metadata. It outputs a seaborn clustermap in vector based PDF format along with separate PDF legend corresponding to each metadata row. It also outputs a metadata tab separated value (tsv) file with the predicted cluster assignments for each cluster threshold for each genome, metagenome, or sample along with a separate tsv file specifying the color assignments. These output files may be modified by the user and input back into the code to customize the final figure output.
The genome/sample order in the predicted_clusters.tsv file matches the order of the clustermap. The outputs also include a meta_colors.tsv file where a color has been assigned to each predicted cluster for each distance threshold, and a distmat.tsv which is the distance matrix reordered to match the clustermap.
genomes | clade-0 (96.0% ANI; t=0.04) | clade-1 (97.4% ANI; t=0.026) | clade-2 (98.0% ANI; t=0.02) | clade-3 (98.7% ANI; t=0.013) | clade-4 (99.2% ANI; t=0.008) | clade-5 (99.8% ANI; t=0.002) |
---|---|---|---|---|---|---|
gnm_1383800v1 | c0-006 | c1-007 | c2-008 | c3-013 | c4-032 | c5-039 |
gnm_1389944v1 | c0-006 | c1-007 | c2-008 | c3-019 | c4-048 | c5-059 |
gnm_1380692v1 | c0-006 | c1-007 | c2-008 | c3-019 | c4-048 | c5-059 |
gnm_1380608v1 | c0-006 | c1-007 | c2-008 | c3-019 | c4-048 | c5-059 |
gnm_1380652v1 | c0-006 | c1-007 | c2-008 | c3-019 | c4-048 | c5-059 |
gnm_1388647v1 | c0-006 | c1-007 | c2-008 | c3-019 | c4-048 | c5-059 |
gnm_1388747v1 | c0-006 | c1-007 | c2-008 | c3-019 | c4-048 | c5-059 |
gnm_1383334v1 | c0-006 | c1-007 | c2-008 | c3-024 | c4-053 | c5-066 |
gnm_1388961v1 | c0-006 | c1-007 | c2-008 | c3-024 | c4-053 | c5-066 |
gnm_1389954v1 | c0-006 | c1-007 | c2-008 | c3-024 | c4-053 | c5-066 |
gnm_1389962v1 | c0-006 | c1-007 | c2-008 | c3-024 | c4-053 | c5-066 |
gnm_1389099v1 | c0-006 | c1-007 | c2-008 | c3-020 | c4-049 | c5-061 |
gnm_1383380v1 | c0-006 | c1-007 | c2-008 | c3-020 | c4-049 | c5-060 |
- distmat.tsv - square data matrix sorted according to clustered heatmap
- meta_colors.tsv - colors corresponding to unique values in predicted_clusters.tsv file
- Legend_clade-x.pdf - Legends corresponding to the predicted clades. These can be added to the figure using Adobe Illustrator or other.
Their are three initial options:
- Predict clusters for local minimumns in the distance distribution (default).
- Predict clusters for user defined values (-user). e.g. -user 95 97.5 98.5 99.5
- Skip cluster prediction and use user defined metadata. e.g. -m meta_data.tsv -c meta_colors.tsv
And several additional parameters:
- -dtype, Specify data type. One of fastANI (default), ANI, AAI, Mash, Simka, Distance
- fastANI (default) - this takes the longform fastANI output and parses it into a distance matrix.
- ANI - this is for a square all vs. all matrix of ANI values with 100's down the diagonal.
- AAI - this is for a square all vs. all matrix of AAI values with 100's down the diagonal.
- Mash - this is for a Mash distance matrix.
- Simka - this is for a Simka distance matrix.
- Distance - this is any square distance matrix with values 0-1.
- -dmin, Specify the minimum value for colors in the figure.
- -dmax, Specify the maximum value for colors in the figure.
- -metric, set different cluster metrics. choose from:
- euclidean, seuclidean, correlation, hamming, jaccard, braycurtis
- See webpage/docs for scipy.spatial.distance.pdist for additional options and more info.
- -method, set different cluster method choose from:
- single, complete, average, weighted, centroid, median, ward
- See webpage/docs for scipy.cluster.hierarchy.linkage for more info.
- -no, Create clustermap of the data without cluster prediction or metadata.
- -bw, Adjust the KDE bandwidth parameter to control smoothing and valley prediction.
- -l, Add axis labels to plot. Default is no labels because they are typically difficult to read.
- -hcg, Specify the number of colors to use in the heatmap color gradient (2-11). If you would like to customize the colors, you can edit the simple color list in HeatSeq.py on lines 273-277 with any colors you'd like.
# print the help menu
python HeatSeq.py -h
The default case uses scipy.stats.gaussian_kde and scipy.signal.find_peaks to estimate local minimums and maximums in the distance distribution. The local minimums are used as distance thresholds during the hierarchical clustering process.
python HeatSeq.py -i files/01_example_fastANI_allV.tsv -o tests/01_example_default
Note the bandwidth affects the smoothing in the KDE estimate which in turn influences the local minimum (valley) estimates. You can use the -bw parameter to experiment
The user defined case allows the local minimum estimates to be amended, appended, or changed entirely based on pure curiosity and speculation. In the example below we amended and appended the local minimums that were identified in case one using some intra-species level distance thresholds.
Use -user parameter to input desired thresholds
python HeatSeq.py -i files/01_example_fastANI_allV.tsv -o tests/02_example_default -user 96 97.4 98 98.7 99.2 99.8
The custom metadata case allows the user to modify the predicted_clusters.tsv and the meta_colors.tsv files output from the case one or two outputs to include additional metadata categories and/or alter the colors. In this example we've added niche and clermontyping assignments for each genome and we've added color assignments for each unique niche or clermontyping category.
Alternately, Case one and/or two may be skipped entirely if the user prepares a properly formatted meta data file (predicted_clusters.tsv) and a meta colors file on their own.
See example files in tests folder
- Meta data: A tab separated metadata file with a row for each genome in the ANI file with the exact genome name, and columns for each meta value. File should include a header.
- Meta colors: A tab separated file of unique meta catagories and colors. Note: Each meta data category must be unique across all meta data (e.g. Site B1, Phylogroup B1 cannot both be B1).
Meta data example:
Genome | Site | Niche | Phylogroup |
---|---|---|---|
Genome1 | Site A1 | Sheep | B1 |
Genome2 | Site B1 | Cow | E |
Genome3 | Site C1 | Pig | A |
Meta colors example:
Site A1 | #ffffb3 |
---|---|
Site B1 | #377eb8 |
Site C1 | #ff7f00 |
Sheep | #f781bf |
Cow | #4daf4a |
Pig | #3f1gy5 |
B1 | #f781bf |
E | #4daf4a |
A | #3f1gy5 |
Use -m and -c parameters to input file paths
An easy way to customize the metadata is to run HeatSeq™ like Case One or Case Two, and then open the predicted_clusters.tsv and meta_colors.tsv files in Excel or other spreadsheet program to amend and/or append the desired data.
python HeatSeq.py -i files/01_example_fastANI_allV.tsv -o tests/03_example_default -m files/01_example_predicted_clusters.tsv -c files/01_example_meta_colors.tsv
fastANI is the default case and was utilized in the case examples above. Here we show examples with some optional parameters
fastANI: https://github.com/ParBLiSS/FastANI
# Use fastANI in many to many WITHOUT the --matrix parameter
# if used --matrix and have the fastANI.matrix file use -dtype ANI outlined below.
./fastANI --ql [QUERY_LIST] --rl [REFERENCE_LIST] -o 01_example_fastANI_allV.tsv
# set min and max
python HeatSeq.py -i files/01_example_fastANI_allV.tsv -o tests/xa_example_default -dmin 90 -dmax 99 -user 98.5 96.5
# no metadata or predicted clusters
python HeatSeq.py -i files/01_example_fastANI_allV.tsv -o tests/xb_example_default -no True
# different method and metric
python HeatSeq.py -i files/01_example_fastANI_allV.tsv -o tests/xc_example_default -method centroid -metric correlation -user 99.5 98.5 97.5 96.5
This option is to accommodate ANI values estimated with other tools. Arrange the data into a square matrix and format at a tsv file.
This matrix should be actual ANI values with 100's along the diagonal (for self matches) and not 100-ANI with 0's on the diagonal.
python HeatSeq.py -i files/01_example_fastANI_allV.tsv -o tests/04_example_ANI -dtype ANI
this option has incomplete example and testing but should work.
This options alters the value range for hierarchical clustering and the heatmap to accommodate the lower percentage range of AAI estimates. Any AAI tool may be used. Arrange the data into a square matrix and format at a tsv file.
This matrix should be actual AAI values with 100's along the diagonal (for self matches) and not 100-AAI with 0's on the diagonal.
One AAI option is from the Enveomics Collection. Enveomics collection: http://enve-omics.ce.gatech.edu/enveomics/docs
- First run aai.rb for all vs all genomes: http://enve-omics.ce.gatech.edu/enveomics/docs?t=aai.rb
- Then run Table.df2dsit.R to convert to a square distance matrix: http://enve-omics.ce.gatech.edu/enveomics/docs?t=Table.df2dist.R
python HeatSeq.py -i files/05_example_AAI_allV.tsv -o tests/05_example_AAI -dtype AAI -l True -hcg 3 -m files/05a_metadata.tsv -c files/05b_metacolors.tsv
This option takes the output from Mash. It alters the value range for hierarchical clustering and the heatmap to fit 0-1. Diagonals (self matches) should be 0's.
Mash releases: https://github.com/marbl/Mash/releases Mash docs: https://mash.readthedocs.io/
# Step 1: create sketch of all genomes or metagenomes.
mkdir mySketches
for f in myFiles/*fasta; do n=`basename $f | cut -d. -f1`; ./mash sketch -o mySketches/$n $f; done
# * # myFiles may contain metagenomes or genomes in fastq or fasta format.
# Step 2: concatenate sketches in single file.
./mash paste allSketches mySketches/*.msh
# step 3. compute all vs all dist with allSketches.
./mash dist allSketches.msh allSketches.msh -t > files/06_example_mash_allV.tsv
# Step 4: predict clusters and build heatmap figure.
python HeatSeq.py -i files/06_example_mash_allV.tsv -o tests/06_example_MASH -dtype Mash -user 0.06 0.08 0.1
This option takes the output from Simka. It alters the value range for hierarchical clustering and the heatmap to fit 0-1. Diagonals (self matches) should be 0's.
Simka: https://github.com/GATB/simka
# Step 1: Generate all vs all input file.
>for f in ${reads_dir}/*1.fastq; do n=basename $f | cut -d_ -f1; echo '${n}:$f;${n}2.fastq'; done > simka_input.txt
# Step 2: run all vs all simka with 50GB and 12 cores.
simka -max-memory 50000 -nb-cores 12 -in simka_input.txt -out simka_out -out-tmp simka_temp
# Step 3: predict clusters and build heatmap figure.
python HeatSeq.py -i files/07_example_simka_allV.txt -o tests/07_example_SIMKA -dtype Simka -user 0.5 0.7 0.9
Any square matrix with values 0-1 may be used. Matrix file should be formatted as a tsv file with diagonals (self matches) equal to 0.
this option has incomplete example and testing but should work.
python HeatSeq.py -i files/08_example_distance_allV.tsv -o tests/08_example_distance -dtype Distance
Code in this repo uses only Python and Python packages which are listed below. However, it requires the user to independently compute distance (ANI, AAI, Mash etc.) with an external tool prior to running this code.
Use fastANI, enveomics, Mash, Simka etc.
Python and all packages can be easily installed with conda or pip. Prodigal, BLAST+ and MMseqs2 can also be installed easily with Conda. Just search "conda install name"
- Sanner MF. Python: a programming language for software integration and development. J Mol Graph Model. 1999 Feb 1;17(1):57-61.
- Van Rossum G, Drake FL. Python 3 Reference Manual. Scotts Valley, CA: CreateSpace; 2009.
- McKinney W, others. Data structures for statistical computing in python. In: Proceedings of the 9th Python in Science Conference. 2010. p. 51–6.
- Harris CR, Millman KJ, van der Walt SJ, Gommers R, Virtanen P, Cournapeau D, et al. Array programming with NumPy. Nature. 2020;585:357–62.
- Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, et al. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods. 2020;17:261–72.
- Hunter JD. Matplotlib: A 2D graphics environment. Computing in science & engineering. 2007;9(3):90–5.
- Waskom ML. Seaborn: statistical data visualization. Journal of Open Source Software. 2021 Apr 6;6(60):3021.
If you use any part of this workflow in your research please cite the following manuscript:
PLACEHOLDER FOR MANUSCRIPT CITATION AND LINK
- Add additional parsing support for outputs from other tools to save the user from creating the distance matrix.
- Add support for checkM or other tools to create the option of selecting a cluster representative for each predicted cluster.
- Add cluster analysis tools and figures to evaluate the effectiveness of clustering at each distance threshold. Silhouette plots etc.
- Add KDE bandwidth parameter optimization.