Applying CellRank to multiple samples (with possible batch effects) #1141

YitengDang · 2023-11-20T09:58:34Z

I have a large scRNA-seq dataset with multiple libraries for different experimental conditions (time, tissue, perturbation), which I have integrated using Scanorama.

Now I would like to run trajectory inference on this combined dataset with CellRank, but I am not sure what the best approach is. I see two options:

1. Apply trajectory inference to each sample separately, then merge the results. The problem is that the inferred time from one sample might not be comparable to that of another sample.
2. Apply trajectory inference to the integrated data. This only makes sense to me if the data integration directly transforms the counts (e.g. with scVI). If the integration only acts at the level of the dimensionality reduction (e.g. with Scanorama), two biologically similar cell populations from different libraries might still be assigned very different pseudotimes.

I found one paper discussing a method for dealing with exactly this problem, but I’m not sure how well it works: https://www.biorxiv.org/content/10.1101/2021.03.09.433671v1

If you have any thoughts or suggestions on how to approach this, let me know!

Marius1311 · 2023-11-20T20:45:23Z

Maintainer

Hi @YitengDang, thanks for your question. Could you please let me know which CellRank 2 kernel you were planning to apply? That influences how to approach data integration in this context.

YitengDang · 2023-11-27T12:58:11Z

Author

Thanks for the reply! I'm not 100% sure which kernel will work best, but for now my plan is to try the RealTimeKernel and the VelocityKernel. I have libraries across 3 developmental time points. For each time point I have 2 conditions (with multiple replicates per time point and condition), but there are strong batch effects between these conditions. Preliminary RNA velocity results were confusing: the inferred directions do not match biological knowledge, although this could be because the streamlines are not captured well in a low-dimensional representation. In any case, if the results seem inconsistent, I will switch to the PseudotimeKernel or combine it with the other kernels.

Marius1311 · 2023-11-29T08:57:07Z

Maintainer

Hi @YitengDang, if you have a time-series dataset, then I think the RealTimeKernel, combined with either the ConnectivityKernel or the PseudotimeKernel to include within-time-point dynamics, might work best for you. It would also make the integration problem easier, as you won't have to worry about integrating spliced/unspliced counts. There would be two parts to this challenge:

1. Applying the RealTimeKernel across time points. Under the hood, we use optimal transport for that, implemented in either moscot or WOT. These tools need to compute a cost between subsequent time points, so there needs to be some joint representation. I'm not sure whether you get that from Scanorama, or whether you just get a corrected k-NN graph. If you only get the graph and you use moscot, you could compute a graph-based distance and input that as a custom cost matrix (more information in the moscot docs). However, that cost would always have to be computed across time points, so you need to be able to compare the cells in the two (subsequent) time points.
2. Applying the ConnectivityKernel/PseudotimeKernel within time points. That should not be an issue, as these kernels only need access to a k-NN graph, which you have.
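To make the graph-based cost idea a bit more concrete, here is a minimal sketch, assuming the corrected k-NN graph is available as a sparse matrix of edge lengths over all cells. It uses shortest-path distances on that graph as the cost and keeps only the block going from cells at one time point to cells at the next. The function name `cross_time_graph_cost`, the toy graph, and the choice of capping unreachable pairs at ten times the largest finite distance are assumptions made for illustration. How the resulting matrix is handed to moscot is not shown; the moscot docs describe that.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra


def cross_time_graph_cost(distances, time_labels, t_source, t_target):
    """Shortest-path distances on a k-NN graph, from cells at t_source to cells at t_target."""
    time_labels = np.asarray(time_labels)
    src = np.flatnonzero(time_labels == t_source)
    tgt = np.flatnonzero(time_labels == t_target)
    # Treat the k-NN graph as undirected and run Dijkstra only from the source cells.
    all_dists = dijkstra(distances, directed=False, indices=src)
    cost = all_dists[:, tgt]
    # Pairs in disconnected components get an infinite distance. Cap them at a
    # large finite value (an arbitrary choice) so the matrix stays usable as a cost.
    finite = np.isfinite(cost)
    if finite.any() and not finite.all():
        cost = np.where(finite, cost, 10.0 * cost[finite].max())
    return cost


# Toy graph: five cells on a chain 0-1-2-3-4 with unit edge lengths.
rows = np.array([0, 1, 2, 3])
cols = np.array([1, 2, 3, 4])
graph = csr_matrix((np.ones(4), (rows, cols)), shape=(5, 5))
labels = np.array([0, 0, 1, 1, 1])  # cells 0-1 at time 0, cells 2-4 at time 1

cost = cross_time_graph_cost(graph, labels, t_source=0, t_target=1)
print(cost.shape)  # one row per source cell, one column per target cell
```

Note that this only gives meaningful costs if the graph actually connects cells from the two time points, which is the comparability requirement mentioned above.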

Alternatively, you could compute one cross-time-point trajectory for each condition, i.e. not integrate the data, and then make comparisons between conditions at the level of these trajectories. Hope this helps!
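For intuition about what combining an across-time-point kernel with a within-time-point kernel does, here is a small NumPy sketch. It assumes that a kernel combination boils down to a weighted sum of row-stochastic transition matrices with weights normalized to sum to one, which is my understanding of how CellRank treats expressions such as `0.8 * k1 + 0.2 * k2`. The helper names, the toy matrices, and the 0.8/0.2 weights are illustrative assumptions, not values recommended in this thread.

```python
import numpy as np


def row_normalize(matrix):
    """Scale each row so that it sums to one."""
    matrix = np.asarray(matrix, dtype=float)
    return matrix / matrix.sum(axis=1, keepdims=True)


def combine_transition_matrices(matrices, weights):
    """Weighted sum of row-stochastic matrices; weights are rescaled to sum to one."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return sum(w * row_normalize(m) for w, m in zip(weights, matrices))


# Toy example with three cells: cells 0 and 1 at the first time point, cell 2 at the second.
# Across time points: both early cells map to the late cell, which stays where it is.
T_across = np.array([[0.0, 0.0, 1.0],
                     [0.0, 0.0, 1.0],
                     [0.0, 0.0, 1.0]])
# Within time points: the two early cells are neighbours; the late cell has only itself.
T_within = np.array([[0.5, 0.5, 0.0],
                     [0.5, 0.5, 0.0],
                     [0.0, 0.0, 1.0]])

T_combined = combine_transition_matrices([T_across, T_within], weights=[0.8, 0.2])
print(T_combined)
```

Because each input is row-stochastic and the weights sum to one, the combined matrix is row-stochastic as well, so it can be used as a transition matrix directly.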

YitengDang · 2023-11-30T10:13:50Z

Author

Hi @Marius1311,
Thanks a lot for the answer, this definitely helps! I will try the methods you described. Indeed, Scanorama just generates a corrected kNN graph and no joint representation, so I will take a look at moscot.
I have a more specific follow-up question:

For each time point I have multiple replicates / sequencing libraries. Would you combine these replicates first before running a kernel on the data (in which case one would predict transitions between cells of different libraries), or would you apply these kernels to the separate libraries and then combine results by combining cell-cell transition matrices (in which case they would be block matrices with only transitions within each library)?
(I guess a similar question applies to inferring pseudotime, but this is probably outside the scope of CellRank.)

Marius1311 · 2023-11-30T10:29:29Z

Maintainer

I think that depends on whether transitions within these libraries would make sense biologically - if these are really just replicates that describe essentially the same biology, I think I would allow transitions across replicates, hence first combining the replicates, then applying the kernel.
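The difference between the two options can be illustrated with a toy sketch. It assumes a deliberately simple stand-in kernel (uniform transitions to each cell's k nearest neighbours in a shared embedding), which is not CellRank's ConnectivityKernel; the function name `knn_transition_matrix` and the one-dimensional toy embedding are made up for illustration. Running the kernel per library and stacking the results gives a block-diagonal matrix with no cross-library transitions, whereas pooling the replicates first allows them.

```python
import numpy as np


def knn_transition_matrix(X, k):
    """Row-stochastic matrix with uniform transitions to each cell's k nearest neighbours."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # a cell is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    T = np.zeros((n, n))
    T[np.arange(n)[:, None], neighbours] = 1.0 / k
    return T


# Two replicates of the same biology, interleaved along a 1D embedding.
X = np.array([[0.0], [2.0], [4.0], [1.0], [3.0], [5.0]])
library = np.array(["A", "A", "A", "B", "B", "B"])

# Option 1: pool the replicates, then run the kernel once.
T_joint = knn_transition_matrix(X, k=2)

# Option 2: run the kernel per library and place the results on the block diagonal.
T_block = np.zeros((len(X), len(X)))
for lib in np.unique(library):
    idx = np.flatnonzero(library == lib)
    T_block[np.ix_(idx, idx)] = knn_transition_matrix(X[idx], k=2)

# Total transition probability that crosses library boundaries.
cross = library[:, None] != library[None, :]
cross_joint = T_joint[cross].sum()
cross_block = T_block[cross].sum()
print(cross_joint, cross_block)
```

With the block-diagonal construction, each library is a closed system, so any downstream analysis on that matrix treats the replicates as unrelated populations.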
