Distance matrix of tensors? #108

jonbakerlab · 2025-01-10T18:49:12Z

Hey Cameron, thanks for the great suite of tools! We are using gemelli and have a question regarding testing the Phylo CTF output for significance. We have a dataset that is different sites from subjects (rather than longitudinal data), and applying Phylo CTF does give better separation on a biplot of our healthy and disease subjects than the individual samples did (i.e., biplots of various beta diversity metrics). It would seem to me that we should do some sort of statistical test to see if our healthy and disease groups are truly different based on data represented in the Phylo CTF subject biplot (e.g., a PERMANOVA). However, the 'distance_matrix.qza' file output by qiime gemelli phylogenetic-ctf-with-taxonomy contains the distances of all the samples, rather than distances of the subject tensors, so using it as input for qiime diversity beta-group-significance would seem to give the same result as the data before going through Phylo CTF...? Is there an actual output from Phylo CTF that we could use as input for a PERMANOVA, or how should we test for significance? Doesn't seem like any of the output from qiime gemelli phylogenetic-ctf-with-taxonomy would be usable for that? Thanks for your help!

The text was updated successfully, but these errors were encountered:

cameronmartino · 2025-01-10T21:05:50Z

Hi @jonbakerlab,

Thanks for the kind words and for using gemelli! It is awesome to hear that Phylo-CTF is helping extract signals in the data.

My preferred way is to split the distance matrix by timepoint/site and test for sig. differences by phenotype between subjects per time/site and then on aggregate. However, you are the second person to ask this question this month and what you are asking for makes sense, especially in the context of multiple body sites, which is why I am flagging this as a requested feature. I will get this output included in the next version update of the tool. In the meantime, you can generate it yourself. It will involve a bit of code but here is an example of how to calculate it using QIIME2 API IBD Example that should hopefully be easy to apply to your data. This will give a per subject distance (not per sample as is the default):

First run Phylo-CTF like the tutorial (you probably already did this step)

import os
import warnings
import qiime2 as q2
from qiime2.plugins.gemelli.actions import phylogenetic_ctf_with_taxonomy
from qiime2.plugins.fragment_insertion.methods import filter_features
# hide pandas Future/Deprecation Warning(s) for tutorial
warnings.filterwarnings("ignore", category=DeprecationWarning) 
warnings.simplefilter(action='ignore', category=FutureWarning)

# import table(s)
table = q2.Artifact.load('IBD-2538/data/table.qza')
rarefied_table = q2.Artifact.load('IBD-2538/data/rarefied-table.qza')
# import metadata
metadata = q2.Metadata.load('IBD-2538/data/metadata.tsv')
# import tree
tree = q2.Artifact.load('IBD-2538/data/sepp-insertion-tree.qza')
# import taxonomy
taxonomy = q2.Artifact.load('IBD-2538/data/taxonomy.qza')

# make a dir. for results
os.mkdir('IBD-2538/ctf-results')
# first ensure all the table features are in the tree
table = filter_features(table, tree).filtered_table
# now run phylo-CTF
ctf_results = phylogenetic_ctf_with_taxonomy(table, tree, taxonomy.view(q2.Metadata),
                                             metadata, 'host_subject_id', 'timepoint',
                                             min_feature_frequency=10)
# write all the artifacts to file by name
for art_tmp_name_, art_tmp_ in ctf_results._asdict().items():
    art_tmp_.save('IBD-2538/ctf-results/%s.qza' % (art_tmp_name_))

Now make a subject-level distance (your likely starting point)

import qiime2 as q2
from scipy.spatial import distance
from skbio import OrdinationResults, DistanceMatrix

# import and extract subject ordination (If not in QIIME2 don't need to extract just load as skbio type)
q2_subject_biplot = q2.Artifact.load('IBD-2538/ctf-results/subject_biplot.qza')
sk_subject_biplot = q2_subject_biplot.view(OrdinationResults)
# make subject distance
subject_U = sk_subject_biplot.samples.copy()
Udist = distance.cdist(subject_U.copy(),
                       subject_U.copy())
U_dist_res = DistanceMatrix(Udist, ids=subject_U.index)
U_dist_res_q2 = q2.Artifact.import_data('DistanceMatrix', U_dist_res)
# write the distances in skbio and QIIME2 formats
U_dist_res_q2.save('subject_distance.qza')
U_dist_res.write('subject_distance.txt')

I hope that solves your problem, let me know if you have any issues running the above code on your data.

Best,

Cameron

jonbakerlab · 2025-01-16T22:26:15Z

Awesome, thanks very much @cameronmartino! It did work with our data...our group differences weren't significant at the subject level after all 🙃 , but good to have a statistical test to tell us one way or the other.

cameronmartino · 2025-01-16T22:30:57Z

@jonbakerlab Often at the subject level, the lower sample size impacts the significance, or it can be that not all of the time points are significant individually. Might be worth splitting out the whole distance by time point and testing that way. Sometimes one non-sig timepoint can lead to an overall non-sig subject level difference.

jonbakerlab · 2025-01-23T20:32:07Z

@cameronmartino
Quick follow-up question: where is the rarefied-table used in the above code? We noticed it's also mentioned in the main tutorial, but doesn't seem to be used by any of the downstream steps? Related, the phylogenetic_ctf_with_taxonomy script runs the robust CLR within it, right? i.e. we don't need to rarify or normalize the feature table prior to phylogenetic_ctf_with_taxonomy?

cameronmartino · 2025-01-23T21:06:22Z

@jonbakerlab In most cases no, you do not need to use the rarefied-table with phylogenetic_ctf_with_taxonomy. If it looks like your sequencing depth may be confounded to your variable of interest (e.g., IBD vs. non-IBD) like the examples here or if you think the distance/ordination is being influenced by sequencing depth, which can happen. Then you can check if rarefying the table helps prevent that, for example on how to check here (see the qiime gemelli qc-rarefy command) although that example is for RPCA - the same method can be used for CTF. If this is the case you can always use the rarefied-table with CTF and RPCA, some choose to use it regardless of the QC results. I purposefully left it so the user can determine what they want to do, rather than forcing either case.

Let me know if you have any questions!

jonbakerlab · 2025-01-23T23:15:54Z

Awesome, thanks!

cameronmartino self-assigned this Jan 10, 2025

cameronmartino added the enhancement label Jan 10, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Distance matrix of tensors? #108

Distance matrix of tensors? #108

jonbakerlab commented Jan 10, 2025

cameronmartino commented Jan 10, 2025

jonbakerlab commented Jan 16, 2025

cameronmartino commented Jan 16, 2025

jonbakerlab commented Jan 23, 2025

cameronmartino commented Jan 23, 2025

jonbakerlab commented Jan 23, 2025

Distance matrix of tensors? #108

Distance matrix of tensors? #108

Comments

jonbakerlab commented Jan 10, 2025

cameronmartino commented Jan 10, 2025

First run Phylo-CTF like the tutorial (you probably already did this step)

Now make a subject-level distance (your likely starting point)

jonbakerlab commented Jan 16, 2025

cameronmartino commented Jan 16, 2025

jonbakerlab commented Jan 23, 2025

cameronmartino commented Jan 23, 2025

jonbakerlab commented Jan 23, 2025