MLE for proteomics data imputation #109

ginnyintifa · 2023-01-07T03:30:24Z

Dear Team,

MLE is one of the imputation options. It calls the em.norm and imp.norm functions from the norm package and is implemented with MARGIN == 2.

I think MARGIN == 2 is a reasonable setting, since the p*n original data matrix (features in rows and samples in columns) would be transposed before being passed to the EM algorithm. Therefore, when running EM, each variable would be an actual gene/protein/peptide.

But the issue is that proteomics data is almost always p>>n. For example, a TMT global proteome data set would have ~20000 proteins and a dozen samples. With such a large number of features, the EM algorithm becomes very expensive.
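To put rough numbers on this, here is a back-of-the-envelope sketch. It assumes the EM step fits a multivariate normal model with one variable per protein, as described above; the figures (20000 proteins, 12 samples) are illustrative, and `mvn_params` is a hypothetical helper, not part of any package.

```r
# Free parameters of a p-variate normal model: p means + p(p+1)/2 covariances
mvn_params <- function(p) p + p * (p + 1) / 2

p <- 20000  # proteins treated as variables (illustrative)
n <- 12     # samples (illustrative)

n_par <- mvn_params(p)    # 200030000 parameters to estimate
per_sample <- n_par / n   # parameters per available sample
cov_gb <- p^2 * 8 / 1e9   # 3.2 GB for one dense p x p matrix of doubles
```

With only a dozen samples, the covariance matrix alone is far larger than the data used to estimate it, which is consistent with the long run times.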

I am trying this data set (10k * 24) with the impute_mle function and haven't got any results yet.

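library(data.table)
# Read the abundance table as a plain data.frame
dtmt = fread("ccRCC_prot_abundance_MD_3plex.tsv",
          stringsAsFactors = FALSE, data.table = FALSE)
# Drop the first five columns and convert the rest to a matrix
dd = as.matrix(dtmt[, -c(1:5)])
dtmt_res = MsCoreUtils::impute_mle(dd)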

Do you have any insights on this issue?

Thank you very much!


lgatto · 2023-01-07T10:31:11Z

I don't have any suggestion in terms of speeding up the underlying implementation. You could possibly try to split your data in chunks and parallelise the imputation.
There's also the impute_mle2() function (see Imputation using MLE #100). I'll update the documentation, as I now see that it isn't explicitly mentioned in the MLE imputation paragraph.
Setting MARGIN == 2 imputes along the columns. If you want to impute along the features, you need to set it to 1. If you see a different behaviour, it's a bug and please do let me know. The discussion about the margins is actually more involved, I think, and will also depend on downstream applications.
As for imputation in general, I do think it's not straightforward, and my advice would be to (1) filter features that have too many missing values and (2) not to impute, unless you have to.
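Two of the suggestions above (filtering features with too many missing values, and imputing in chunks) could be sketched roughly as follows. This is a minimal illustration, not MsCoreUtils code: `filter_na_rows`, `impute_in_chunks` and `impute_row_mean` are hypothetical helpers, the 0.5 threshold and the chunk size are arbitrary, and the row-mean imputation is only a stand-in so that the example runs without extra packages.

```r
library(parallel)

# Hypothetical helper: keep features (rows) whose fraction of missing values
# does not exceed `max_na`. The 0.5 default is an arbitrary illustration.
filter_na_rows <- function(x, max_na = 0.5) {
    x[rowMeans(is.na(x)) <= max_na, , drop = FALSE]
}

# Hypothetical helper: impute chunks of rows independently, then reassemble
# them in the original row order.
impute_in_chunks <- function(x, impute_fun, chunk_size = 500L, cores = 1L) {
    rows <- seq_len(nrow(x))
    idx <- split(rows, ceiling(rows / chunk_size))
    res <- mclapply(idx, function(i) impute_fun(x[i, , drop = FALSE]),
                    mc.cores = cores)
    out <- do.call(rbind, unname(res))
    dimnames(out) <- dimnames(x)
    out
}

# Stand-in imputation (row means), used only so the sketch runs on its own.
# It is not MLE.
impute_row_mean <- function(m) {
    for (i in seq_len(nrow(m))) {
        m[i, is.na(m[i, ])] <- mean(m[i, ], na.rm = TRUE)
    }
    m
}

x <- matrix(as.numeric(1:40), nrow = 10)
x[c(1, 12, 23, 34)] <- NA   # one missing value in each of rows 1 to 4
x[5, 1:3] <- NA             # row 5 is 75% missing and gets filtered out
x_filt <- filter_na_rows(x, max_na = 0.5)
x_imp <- impute_in_chunks(x_filt, impute_row_mean, chunk_size = 3L)
```

Passing `MsCoreUtils::impute_mle` as `impute_fun` would be the analogous call for MLE, but note that chunking changes the fitted model: covariances are only estimated within each chunk, so results will differ from a single run on the full matrix. On Windows, `mclapply` only supports `mc.cores = 1`.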

lgatto · 2023-01-07T12:50:11Z

By the way, if you are processing quantitative proteomics data, I highly recommend giving the QFeatures package a go.

hsiaoyi0504 · 2024-10-01T02:08:31Z

@lgatto Have there been any recent changes to MLE? We are using imputation from MSnbase in a class. We noticed that something seems to have changed between versions, and the data now takes forever to be imputed using MLE.

lgatto · 2024-10-01T06:03:38Z

@hsiaoyi0504 - there have been changes in the past, such as adding support for the norm2 package (about 2 years ago), and then dropping it again last year because it was removed from CRAN. About 2 years ago, we also added a MARGIN argument that defines whether row-wise or column-wise imputation should be done.
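As a reminder of what the row/column convention means, here is a small base R sketch using `apply`, where 1 refers to rows and 2 to columns. It only illustrates the MARGIN convention on a toy matrix (the names `prot1`, `s1`, etc. are made up) and says nothing about how the imputation functions use the argument internally.

```r
# Toy matrix: features (proteins) in rows, samples in columns
x <- matrix(c(1, NA, 3,
              4, 5, NA), nrow = 2, byrow = TRUE,
            dimnames = list(c("prot1", "prot2"), c("s1", "s2", "s3")))

# MARGIN = 1: the function is applied to each row (one result per feature)
row_na <- apply(x, 1, function(v) sum(is.na(v)))

# MARGIN = 2: the function is applied to each column (one result per sample)
col_na <- apply(x, 2, function(v) sum(is.na(v)))
```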

hsiaoyi0504 mentioned this issue Oct 1, 2024

Fragpipe-Analyst 500 internal server error Nesvilab/FragPipe-Analyst#49

Closed
