Questions on knee plot analysis and matrix construction #467

bulahwu · 2024-10-10T15:00:06Z

Hi, I’m just starting out in this field.
We have an scRNA-seq dataset that includes 438 million paired-end reads (2x150bp) from approximately 8000 cells sourced from tissue samples. We processed the dataset following the Drop-seq Alignment Cookbook protocol.
Here’s the knee plot derived from our dataset, based on the reads-per-cell-barcode table generated by BAMTagHistogram. The plots vary only in the x-axis limits (from left to right: all, 1M, 100k, and 10k cell barcodes).
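For readers unfamiliar with this kind of plot, here is a minimal Python sketch of how such a curve can be computed. It assumes the knee plot shows the cumulative fraction of reads over cell barcodes ranked by descending read count. The function name and the example counts are hypothetical, and parsing of the BAMTagHistogram output is left out.

```python
from itertools import accumulate


def cumulative_read_fraction(reads_per_barcode):
    """Return the cumulative read fraction over barcodes ranked by descending read count."""
    ranked = sorted(reads_per_barcode, reverse=True)
    total = sum(ranked)
    if total == 0:
        return []
    # accumulate() yields the running sum of reads after each ranked barcode.
    return [running / total for running in accumulate(ranked)]


# Hypothetical counts: three high-count barcodes followed by a low-count tail.
counts = [5000, 4800, 4500, 30, 20, 10, 5, 5]
curve = cumulative_read_fraction(counts)
```

In an idealized dataset like this one the curve flattens right after the third barcode. A plot with no clear bend suggests that reads are spread more evenly across barcodes.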

I have two questions:

1. It’s challenging to estimate the cell count from this plot. What factors might contribute to its shape?
2. Can the barcodeRanks and emptyDrops functions from DropletUtils be used to identify the knee point? Since these tools require a raw matrix, do you have any suggestions for setting DigitalExpression parameters to construct such a matrix?

Thank you very much for your help!

jamesnemesh · 2024-10-10T15:17:25Z

Hi! This is a pretty ancient way to evaluate how many cells you have in your data, and you can see it's pretty challenging.

For single cell data (but not nuclei data) I would absolutely try using DropletUtils. You want to generate a matrix that contains both your cells and some empty droplets. One way to capture both that works in most situations is to set MIN_NUM_TRANSCRIPTS_PER_CELL=20 when running DigitalExpression. This filters the output matrix to cell barcodes with at least 20 UMIs, which keeps the matrix from being huge while still capturing the "empty" droplets, which will probably have many more than 20 transcripts.
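As a rough illustration of what that threshold does, here is a minimal Python sketch that keeps only cell barcodes with at least 20 UMIs. This is not the DigitalExpression implementation. The function name, the barcodes, and the counts are hypothetical.

```python
def filter_barcodes(umi_counts, min_transcripts=20):
    """Keep cell barcodes whose UMI count is at least min_transcripts."""
    return {bc: n for bc, n in umi_counts.items() if n >= min_transcripts}


# Hypothetical UMI counts per cell barcode.
umi_counts = {
    "AAACCCGGGTTT": 3200,  # likely a real cell
    "CCCGGGTTTAAA": 150,   # likely an empty droplet with ambient RNA
    "GGGTTTAAACCC": 45,    # likely an empty droplet with ambient RNA
    "TTTAAACCCGGG": 3,     # dropped by the threshold
}
kept = filter_barcodes(umi_counts)
```

The low-count barcodes that survive the filter are the ones a method like emptyDrops can use to model the ambient background.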

bulahwu · 2024-10-11T13:39:51Z

Thank you so much for your help! I followed your advice and was able to identify the knee point (1587 recovered from 8000 targeted cells).

I have a quick follow-up question regarding library prep quality and cell number identification.

Since I sequenced 2x150 bp, Read 1 includes sequence downstream of the barcode. I’m wondering if I could loosely define the bead motif as CB_UMI +/- 2 nt followed by a poly-T stretch >= 10 nt, and then calculate the fraction of Read 1s that contain this motif. Running a command like grep -E '^[ATCGN]{len_barcode-2,len_barcode+2}T{10,}' Read_1_file | wc -l (with the two len_barcode expressions replaced by actual numbers) on the dataset referred to above (my sample #1) revealed that about 55% of Read 1s contained this hypothetical bead motif. For comparison, I also looked at some publicly available datasets.
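The same check can be sketched in Python, which makes the length arithmetic explicit and returns the fraction directly. This is only a sketch under stated assumptions: the default cb_umi_len=20 assumes a 12 nt cell barcode plus an 8 nt UMI, both function names are hypothetical, and the input is assumed to be an uncompressed four-line-per-record FASTQ.

```python
import re


def fastq_sequences(lines):
    """Yield the sequence line (second of every four) from FASTQ lines."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()


def bead_motif_fraction(sequences, cb_umi_len=20, slack=2, min_poly_t=10):
    """Fraction of reads starting with CB+UMI (+/- slack nt) followed by a poly-T run."""
    lo, hi = cb_umi_len - slack, cb_umi_len + slack
    # Builds e.g. ^[ACGTN]{18,22}T{10,} for the default arguments.
    pattern = re.compile(rf"^[ACGTN]{{{lo},{hi}}}T{{{min_poly_t},}}")
    total = matched = 0
    for seq in sequences:
        total += 1
        if pattern.match(seq):
            matched += 1
    return matched / total if total else 0.0


# Hypothetical two-record FASTQ: one read with the motif, one without.
with_motif = "ACGTACGTACGTACGTACGT" + "T" * 12 + "ACGT"
without_motif = "ACGT" * 10
fastq_lines = [
    "@read1", with_motif, "+", "I" * len(with_motif),
    "@read2", without_motif, "+", "I" * len(without_motif),
]
fraction = bead_motif_fraction(fastq_sequences(fastq_lines))
```

For a real file, the lines could come from an open file handle, for example `bead_motif_fraction(fastq_sequences(open(path)))`, where `path` is a placeholder for the Read 1 FASTQ.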

Do you think this could be a rough indicator that the library prep isn’t optimal, and that sequencing further may not be justified? For example, in my sample #3, only around 2.8% of Read 1s have this motif, and we recovered 77 from 8000 targeted cells. I'm unsure if full sequencing is worth pursuing.

Thank you again for your insights!
