Questions on knee plot analysis and matrix construction #467

Open
bulahwu opened this issue Oct 10, 2024 · 2 comments

Comments

@bulahwu

bulahwu commented Oct 10, 2024

Hi, I’m just starting out in this field.
We have an scRNA-seq dataset that includes 438 million paired-end reads (2x150bp) from approximately 8000 cells sourced from tissue samples. We processed the dataset following the Drop-seq Alignment Cookbook protocol.
Here’s the knee plot derived from our dataset, based on the reads-per-cell-barcode table generated by BAMTagHistogram. The plots vary only in the x-axis limits (from left to right: all, 1M, 100k, and 10k cell barcodes).

[Image: knee_plot_dropseq_original_02]
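For readers who want to reproduce the curve behind a knee plot like this one, here is a minimal Python sketch. It assumes the reads-per-cell-barcode table is tab-separated with the read count in the first column and the cell barcode in the second; check that against your own file. The function names and the example rows are hypothetical.

```python
from itertools import accumulate


def parse_counts(lines):
    """Extract the read count (assumed to be column 1) from each table row."""
    counts = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and comment/header lines starting with '#'.
        if not line or line.startswith("#"):
            continue
        counts.append(int(line.split("\t")[0]))
    return counts


def cumulative_fraction(counts):
    """Cumulative fraction of reads, with barcodes ordered from most to fewest reads."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    if total == 0:
        return []
    return [running / total for running in accumulate(ordered)]


# Hypothetical rows standing in for a real reads-per-cell-barcode table.
example = ["# hypothetical header", "500\tAAAC", "300\tAAAG", "150\tAAAT", "50\tAACA"]
fractions = cumulative_fraction(parse_counts(example))
print(fractions)
```

Plotting `fractions` against barcode rank gives the knee plot; changing the x-axis limit only changes how many leading entries of the list are shown.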

I have two questions:

  1. It’s challenging to estimate the cell count from this plot. What factors might contribute to its shape?
  2. Can the barcodeRanks and emptyDrops functions from DropletUtils be used to identify the knee point? Since these tools require a raw matrix, do you have any suggestions for setting DigitalExpression parameters to construct such a matrix?

Thank you very much for your help!

@jamesnemesh
Collaborator

Hi! This is a pretty ancient way to evaluate how many cells you have in your data, and you can see it's pretty challenging.

For single cell data (but not nuclei data) I would absolutely try using DropletUtils. You want to generate a matrix that contains both your cells and some empty droplets. One way to capture both that works in most situations is to run DigitalExpression with MIN_NUM_TRANSCRIPTS_PER_CELL=20. This will filter the output matrix to cell barcodes with at least 20 UMIs, which keeps the matrix from being huge while still capturing the "empty" droplets, which will probably still have more transcripts than that threshold.
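To give a feel for what locating a knee on a barcode-rank curve involves, here is a rough Python sketch. It is not the method DropletUtils uses in barcodeRanks or emptyDrops; it is a simple stand-in heuristic that picks the rank farthest above the straight line joining the two ends of the log-log rank/UMI curve. The function name and the example counts are hypothetical.

```python
import math


def knee_rank(counts):
    """Return the 1-based rank farthest above the chord of the log-log curve.

    Simple heuristic for illustration only; not DropletUtils' algorithm.
    """
    ordered = sorted((c for c in counts if c > 0), reverse=True)
    # Too few points, or a flat curve, gives no meaningful knee.
    if len(ordered) < 3 or ordered[0] == ordered[-1]:
        return len(ordered)
    x_span = math.log10(len(ordered))
    y_low = math.log10(ordered[-1])
    y_span = math.log10(ordered[0]) - y_low
    best_rank, best_gap = 1, float("-inf")
    for rank, count in enumerate(ordered, start=1):
        # Rescale both axes to [0, 1] so the chord runs from (0, 1) to (1, 0).
        x = math.log10(rank) / x_span
        y = (math.log10(count) - y_low) / y_span
        gap = x + y - 1.0  # vertical distance above the chord
        if gap > best_gap:
            best_rank, best_gap = rank, gap
    return best_rank


# Hypothetical per-barcode UMI totals: five cell-like barcodes, five empty-like.
umi_totals = [1000, 950, 900, 850, 800, 30, 25, 20, 20, 20]
print(knee_rank(umi_totals))
```

On real data the curve is much noisier, which is one reason to prefer a maintained implementation such as DropletUtils over a hand-rolled heuristic like this.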

@bulahwu
Author

bulahwu commented Oct 11, 2024

Thank you so much for your help! I followed your advice and was able to identify the knee point (1587 recovered from 8000 targeted cells).

I have a quick follow-up question regarding library prep quality and cell number identification.

Since I sequenced 2x150 bp, Read 1 includes sequences downstream of the barcode. I’m wondering if I could loosely define the bead motif as: CB_UMI +/- 2nt, followed by a poly-T stretch >= 10nt, and then calculate the ratio of Read 1s that contain this motif versus those that don’t. Running a command like grep -E '^[ATCGN]{len_barcode-2,len_barcode+2}T{10,}' Read_1_file | wc -l (with the two bounds written out as numbers) on the dataset referred to above (my sample #1) revealed that about 55% of Read 1s contained this hypothetical bead motif. For comparison, I also looked at some publicly available datasets.
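The same motif count can be sketched in Python, which makes the length bounds explicit and restricts matching to the sequence line of each FASTQ record. This is a minimal sketch assuming an uncompressed four-line-per-record FASTQ; the function name, parameter names, and the two example reads are hypothetical, and the barcode plus UMI length of 20 is only an example value.

```python
import re


def bead_motif_fraction(fastq_lines, barcode_umi_len, slack=2, min_poly_t=10):
    """Fraction of reads starting with barcode+UMI (+/- slack) followed by poly-T."""
    low = max(barcode_umi_len - slack, 0)
    high = barcode_umi_len + slack
    # Equivalent to the grep pattern with the numeric bounds filled in.
    motif = re.compile(rf"^[ACGTN]{{{low},{high}}}T{{{min_poly_t},}}")
    total = hits = 0
    for index, line in enumerate(fastq_lines):
        # In a four-line FASTQ record the sequence is the second line.
        if index % 4 != 1:
            continue
        total += 1
        if motif.search(line.strip()):
            hits += 1
    return hits / total if total else 0.0


# Two hypothetical Read 1 records: one with the motif, one without.
records = [
    "@read1", "ACGT" * 5 + "T" * 12 + "GGCA", "+", "I" * 36,
    "@read2", "ACGT" * 5 + "G" * 12 + "GGCA", "+", "I" * 36,
]
print(bead_motif_fraction(records, barcode_umi_len=20))
```

For a real file, pass an open file handle instead of the list, for example `with open(path) as handle: bead_motif_fraction(handle, 20)`.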

[Image: beads_ratio_03]

Do you think this could be a rough indicator that the library prep isn’t optimal, and that sequencing further may not be justified? For example, in my sample #3, only around 2.8% of Read 1s have this motif, and we recovered 77 of 8000 targeted cells. I'm unsure if full sequencing is worth pursuing.

Thank you again for your insights!
