MLE for proteomics data imputation #109

Open
ginnyintifa opened this issue Jan 7, 2023 · 4 comments

Comments

@ginnyintifa

Dear Team,

MLE is one of the imputation options; it calls the em.norm and imp.norm functions from the norm package and is implemented with MARGIN == 2.

I think MARGIN == 2 is a reasonable setting, since the original p*n data matrix (features in rows and samples in columns) would be transposed before being passed to the EM algorithm. During EM, each feature would therefore be an actual gene/protein/peptide.

But the issue is that proteomics data are always p >> n. For example, a TMT global proteome data set would have ~20000 proteins and a dozen samples. With such a large number of features, the EM algorithm is very expensive.
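To put rough numbers on that cost: assuming the EM step fits an unrestricted multivariate normal model (one mean per variable plus a full covariance matrix), the number of free parameters grows quadratically with the number of variables. The sketch below only does that arithmetic; n_mvn_params is a hypothetical helper, not part of norm or MsCoreUtils.

```r
# Hypothetical helper: free parameters of an unrestricted multivariate
# normal with p variables (p means + p * (p + 1) / 2 covariance terms).
n_mvn_params <- function(p) {
    p + p * (p + 1) / 2
}

# 24 samples as variables vs. 10000 or 20000 proteins as variables
params <- sapply(c(samples = 24, proteins_10k = 10000, proteins_20k = 20000),
                 n_mvn_params)
print(format(params, big.mark = ",", scientific = FALSE))
```

Which of the two dimensions ends up as the model's variables depends on the orientation in which the matrix reaches the EM routine, so this is only an order-of-magnitude illustration.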

I am running the impute_mle function on this data set (10k * 24) and haven't got any results yet.

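library(data.table)  # provides fread()

# Read the abundance table as a plain data.frame
dtmt = fread("ccRCC_prot_abundance_MD_3plex.tsv",
          stringsAsFactors = FALSE, data.table = FALSE)
# Drop the first five columns and convert the rest to a matrix
dd = as.matrix(dtmt[,-c(1:5)])
dtmt_res = MsCoreUtils::impute_mle(dd)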

Do you have any insights on this issue?

Thank you very much!

@lgatto
Member

lgatto commented Jan 7, 2023

  • I don't have any suggestions for speeding up the underlying implementation. You could possibly try to split your data into chunks and parallelise the imputation.
  • There's also the impute_mle2() function (see Imputation using MLE #100). I'll update the documentation, as I now see that it isn't explicitly mentioned in the MLE imputation paragraph.
  • Setting MARGIN == 2 imputes along the columns. If you want to impute along the features, you need to set it to 1. If you see a different behaviour, it's a bug and please do let me know. The discussion about the margins is actually more involved, I think, and will also depend on downstream applications.
  • As for imputation in general, I do think it's not straightforward, and my advice would be to (1) filter features that have too many missing values and (2) not to impute, unless you have to.
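A minimal sketch of the filtering and chunking suggestions above, in base R. The helper names (filter_by_na, impute_in_chunks, impute_row_mean) and the default thresholds are assumptions made for illustration, not part of MsCoreUtils. The row-mean imputer is only a stand-in so the example runs without extra packages; it is not MLE.

```r
# Hypothetical helper: keep features (rows) whose proportion of missing
# values is at most max_na.
filter_by_na <- function(x, max_na = 0.5) {
    x[rowMeans(is.na(x)) <= max_na, , drop = FALSE]
}

# Hypothetical helper: split the rows into chunks of at most chunk_size,
# apply impute_fun to each chunk, and bind the results in the original order.
impute_in_chunks <- function(x, impute_fun, chunk_size = 1000L) {
    idx <- split(seq_len(nrow(x)), ceiling(seq_len(nrow(x)) / chunk_size))
    res <- lapply(idx, function(i) impute_fun(x[i, , drop = FALSE]))
    do.call(rbind, unname(res))
}

# Stand-in imputer (NOT MLE): replace each NA by the mean of its row.
impute_row_mean <- function(m) {
    mu <- rowMeans(m, na.rm = TRUE)
    na <- which(is.na(m), arr.ind = TRUE)
    m[na] <- mu[na[, "row"]]
    m
}

set.seed(1)
x <- matrix(rnorm(60), nrow = 10,
            dimnames = list(paste0("prot", 1:10), NULL))
x[1, 1:5] <- NA  # 5 of 6 values missing: removed by the filter
x[2, 1] <- NA
x[7, 3] <- NA

xf <- filter_by_na(x, max_na = 0.5)
xi <- impute_in_chunks(xf, impute_row_mean, chunk_size = 4L)
```

To try this with MLE, one could pass impute_fun = MsCoreUtils::impute_mle and replace lapply() with parallel::mclapply(). Note that each chunk would then be fitted separately, so the imputed values may differ from those of a single fit on the full matrix.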

@lgatto
Member

lgatto commented Jan 7, 2023

By the way, if you are processing quantitative proteomics data, I highly recommend giving the QFeatures package a go.

@hsiaoyi0504

@lgatto Has there been any recent change to the MLE imputation? We are using imputation from MSnbase in a class. We noticed that something seems to have changed between versions, and the data now takes forever to impute with MLE.

@lgatto
Member

lgatto commented Oct 1, 2024

@hsiaoyi0504 - there have been changes in the past, such as adding support for the norm2 package (about 2 years ago), and then dropping it again last year because it was removed from CRAN. About 2 years ago, we also added a MARGIN argument that defines whether row-wise or column-wise imputation should be done.
