Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Distance matrix of tensors? #108

Open
jonbakerlab opened this issue Jan 10, 2025 · 6 comments
Open

Distance matrix of tensors? #108

jonbakerlab opened this issue Jan 10, 2025 · 6 comments
Assignees
Labels
enhancement New feature or request

Comments

@jonbakerlab
Copy link

Hey Cameron, thanks for the great suite of tools! We are using gemelli and have a question regarding testing the Phylo CTF output for significance. We have a dataset that is different sites from subjects (rather than longitudinal data), and applying Phylo CTF does give better separation on a biplot of our healthy and disease subjects than the individual samples did (i.e., biplots of various beta diversity metrics). It would seem to me that we should do some sort of statistical test to see if our healthy and disease groups are truly different based on data represented in the Phylo CTF subject biplot (e.g., a PERMANOVA). However, the 'distance_matrix.qza' file output by qiime gemelli phylogenetic-ctf-with-taxonomy contains the distances of all the samples, rather than distances of the subject tensors, so using it as input for qiime diversity beta-group-significance would seem to give the same result as the data before going through Phylo CTF...? Is there an actual output from Phylo CTF that we could use as input for a PERMANOVA, or how should we test for significance? Doesn't seem like any of the output from qiime gemelli phylogenetic-ctf-with-taxonomy would be usable for that? Thanks for your help!

@cameronmartino cameronmartino self-assigned this Jan 10, 2025
@cameronmartino cameronmartino added the enhancement New feature or request label Jan 10, 2025
@cameronmartino
Copy link
Collaborator

Hi @jonbakerlab,

Thanks for the kind words and for using gemelli! It is awesome to hear that Phylo-CTF is helping extract signals in the data.

My preferred way is to split the distance matrix by timepoint/site and test for sig. differences by phenotype between subjects per time/site and then on aggregate. However, you are the second person to ask this question this month and what you are asking for makes sense, especially in the context of multiple body sites, which is why I am flagging this as a requested feature. I will get this output included in the next version update of the tool. In the meantime, you can generate it yourself. It will involve a bit of code but here is an example of how to calculate it using QIIME2 API IBD Example that should hopefully be easy to apply to your data. This will give a per subject distance (not per sample as is the default):

First run Phylo-CTF like the tutorial (you probably already did this step)

import os
import warnings
import qiime2 as q2
from qiime2.plugins.gemelli.actions import phylogenetic_ctf_with_taxonomy
from qiime2.plugins.fragment_insertion.methods import filter_features
# hide pandas Future/Deprecation Warning(s) for tutorial
warnings.filterwarnings("ignore", category=DeprecationWarning) 
warnings.simplefilter(action='ignore', category=FutureWarning)

# import table(s)
table = q2.Artifact.load('IBD-2538/data/table.qza')
rarefied_table = q2.Artifact.load('IBD-2538/data/rarefied-table.qza')
# import metadata
metadata = q2.Metadata.load('IBD-2538/data/metadata.tsv')
# import tree
tree = q2.Artifact.load('IBD-2538/data/sepp-insertion-tree.qza')
# import taxonomy
taxonomy = q2.Artifact.load('IBD-2538/data/taxonomy.qza')

# make a dir. for results
os.mkdir('IBD-2538/ctf-results')
# first ensure all the table features are in the tree
table = filter_features(table, tree).filtered_table
# now run phylo-CTF
ctf_results = phylogenetic_ctf_with_taxonomy(table, tree, taxonomy.view(q2.Metadata),
                                             metadata, 'host_subject_id', 'timepoint',
                                             min_feature_frequency=10)
# write all the artifacts to file by name
for art_tmp_name_, art_tmp_ in ctf_results._asdict().items():
    art_tmp_.save('IBD-2538/ctf-results/%s.qza' % (art_tmp_name_))

Now make a subject-level distance (your likely starting point)

import qiime2 as q2
from scipy.spatial import distance
from skbio import OrdinationResults, DistanceMatrix

# import and extract subject ordination (If not in QIIME2 don't need to extract just load as skbio type)
q2_subject_biplot = q2.Artifact.load('IBD-2538/ctf-results/subject_biplot.qza')
sk_subject_biplot = q2_subject_biplot.view(OrdinationResults)
# make subject distance
subject_U = sk_subject_biplot.samples.copy()
Udist = distance.cdist(subject_U.copy(),
                       subject_U.copy())
U_dist_res = DistanceMatrix(Udist, ids=subject_U.index)
U_dist_res_q2 = q2.Artifact.import_data('DistanceMatrix', U_dist_res)
# write the distances in skbio and QIIME2 formats
U_dist_res_q2.save('subject_distance.qza')
U_dist_res.write('subject_distance.txt')

I hope that solves your problem, let me know if you have any issues running the above code on your data.

Best,

Cameron

@jonbakerlab
Copy link
Author

Awesome, thanks very much @cameronmartino! It did work with our data...our group differences weren't significant at the subject level after all 🙃 , but good to have a statistical test to tell us one way or the other.

@cameronmartino
Copy link
Collaborator

@jonbakerlab Often at the subject level, the lower sample size impacts the significance, or it can be that not all of the time points are significant individually. Might be worth splitting out the whole distance by time point and testing that way. Sometimes one non-sig timepoint can lead to an overall non-sig subject level difference.

@jonbakerlab
Copy link
Author

@cameronmartino
Quick follow-up question: where is the rarefied-table used in the above code? We noticed it's also mentioned in the main tutorial, but doesn't seem to be used by any of the downstream steps? Related, the phylogenetic_ctf_with_taxonomy script runs the robust CLR within it, right? i.e. we don't need to rarify or normalize the feature table prior to phylogenetic_ctf_with_taxonomy?

@cameronmartino
Copy link
Collaborator

@jonbakerlab In most cases no, you do not need to use the rarefied-table with phylogenetic_ctf_with_taxonomy. If it looks like your sequencing depth may be confounded to your variable of interest (e.g., IBD vs. non-IBD) like the examples here or if you think the distance/ordination is being influenced by sequencing depth, which can happen. Then you can check if rarefying the table helps prevent that, for example on how to check here (see the qiime gemelli qc-rarefy command) although that example is for RPCA - the same method can be used for CTF. If this is the case you can always use the rarefied-table with CTF and RPCA, some choose to use it regardless of the QC results. I purposefully left it so the user can determine what they want to do, rather than forcing either case.

Let me know if you have any questions!

@jonbakerlab
Copy link
Author

Awesome, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants