Finding differentially abundant features between metagenomic and metatranscriptomic datasets #80

Open
johnne opened this issue Feb 1, 2025 · 0 comments

Thanks for a great package and several useful papers on compositional data analysis!

I'm working with a time series of samples for which metagenomic (mg) and metatranscriptomic (mt) data have been generated using Illumina sequencing. Reads from all samples have been mapped to the same features (a collection of genes predicted in 806 metagenome-assembled genomes (MAGs)). There are 84 samples in total, taken at the same location on different dates over three years, and for 19 sampling dates there are both mg and mt data.

In one part of the project I'm trying to identify MAGs whose proportion differs significantly between the two datasets, that is, MAGs whose transcriptional activity is significantly higher or lower than their metagenomic abundance would suggest. I read the Quinn et al. field guide paper and found the section on vertical data integration highly interesting, as it sounds like what I'm trying to do:

"use ALDEx2 to find features where mRNA abundance changes more than protein abundance, relative to a common reference (and vice versa)."

What I've done so far is to use the ALDEx2 package with this setup:

  1. subset both datasets to the 19 dates with paired omics data
  2. sum raw counts for each MAG in each sample
  3. create a condition vector with the omics type ('mg' and 'mt')
  4. run a modular ALDEx2 analysis on the 806 x 38 count matrix, including CLR transformation, calculation of paired t-test statistics and effect size estimation
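
The steps above can be sketched in plain Python on toy data. This is an illustration only, not the ALDEx2 implementation: ALDEx2 works on Monte Carlo Dirichlet instances rather than a single point estimate, and the gene names, MAG names, counts and the pseudocount of 0.5 below are all hypothetical.

```python
import math
from collections import defaultdict
from statistics import mean, stdev


def sum_counts_per_mag(gene_counts, gene_to_mag):
    """Step 2: sum raw gene counts into per-MAG counts for one sample."""
    totals = defaultdict(int)
    for gene, count in gene_counts.items():
        totals[gene_to_mag[gene]] += count
    return dict(totals)


def clr(counts, pseudocount=0.5):
    """CLR of one sample: log2 count minus the mean log2 count over all features.

    The pseudocount is an assumption used here to avoid log(0).
    """
    logs = {k: math.log2(v + pseudocount) for k, v in counts.items()}
    centre = mean(logs.values())
    return {k: v - centre for k, v in logs.items()}


def paired_t(diffs):
    """Paired t statistic computed from per-date differences (mt minus mg)."""
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))


# Hypothetical mapping of genes to MAGs.
gene_to_mag = {"g1": "MAG_A", "g2": "MAG_A", "g3": "MAG_B", "g4": "MAG_C"}

# Hypothetical raw counts for three paired sampling dates (step 1 already done).
mg = [
    {"g1": 10, "g2": 20, "g3": 30, "g4": 40},
    {"g1": 12, "g2": 18, "g3": 35, "g4": 35},
    {"g1": 9, "g2": 25, "g3": 28, "g4": 38},
]
mt = [
    {"g1": 40, "g2": 50, "g3": 5, "g4": 30},
    {"g1": 35, "g2": 60, "g3": 8, "g4": 25},
    {"g1": 50, "g2": 45, "g3": 4, "g4": 33},
]

# Steps 2 and 4 (CLR part): per-MAG counts, then CLR per sample.
mg_clr = [clr(sum_counts_per_mag(s, gene_to_mag)) for s in mg]
mt_clr = [clr(sum_counts_per_mag(s, gene_to_mag)) for s in mt]

# Step 4 (test part): paired t statistic per MAG across the paired dates.
t_stats = {}
for mag in sorted(set(gene_to_mag.values())):
    diffs = [t[mag] - g[mag] for g, t in zip(mg_clr, mt_clr)]
    t_stats[mag] = paired_t(diffs)
```

In this sketch a positive statistic means a MAG has a higher CLR value in the mt samples than in the paired mg samples. Multiple-testing correction and effect sizes are left out.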

Using this approach I've managed to identify a number of MAGs that appear to be differentially abundant between datasets (see attached figure), using criteria for log2 fold change, q-values and 95% confidence intervals.

My question is: is this a valid method, or are there modifications that should be made? Specifically, I'm wondering whether I should perform the CLR transformation differently, since the transformed values are currently computed with respect to the full dataset (mg + mt). Should I instead CLR-transform each dataset separately before running ALDEx2?
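
To make the joint-versus-separate distinction concrete, here is a small numeric check. It assumes the textbook CLR definition, in which each sample is centred on the mean log of its own features, and it uses hypothetical toy counts and a pseudocount of 0.5. It says nothing about how a particular ALDEx2 configuration behaves, for example with a non-default denominator.

```python
import math


def clr(sample, pseudocount=0.5):
    """Textbook CLR of one sample: log2 values centred on their own mean."""
    logs = [math.log2(x + pseudocount) for x in sample]
    centre = sum(logs) / len(logs)
    return [v - centre for v in logs]


# Hypothetical toy counts: rows are samples, columns are the same 4 features.
mg_samples = [[120, 30, 5, 845], [100, 45, 8, 900]]
mt_samples = [[400, 10, 60, 530], [350, 12, 75, 610]]

# "Joint": transform the combined mg + mt matrix in one pass.
joint = [clr(s) for s in mg_samples + mt_samples]

# "Separate": transform each dataset on its own, then stack the results.
separate = [clr(s) for s in mg_samples] + [clr(s) for s in mt_samples]

identical = all(
    math.isclose(a, b, abs_tol=1e-12)
    for row_j, row_s in zip(joint, separate)
    for a, b in zip(row_j, row_s)
)
```

Under this definition `identical` is true, because each sample's transform depends only on that sample's own counts as long as both datasets share the same feature set and zero handling.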

I understand that answering questions related to specific projects is asking a lot, but any insights or suggestions that you feel you have the time to offer would be greatly appreciated.

All the best,
John

Image
