-
Notifications
You must be signed in to change notification settings - Fork 35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dropulation doublet rate very high #339
Comments
Hi,
I have not seen that sort of behavior. I have seen cases where the cell barcodes captured had very few informative UMIs, and doublet detection (and donor assignment) had a hard time figuring out what was going on. For example, if you try to run donor assignment on “empty” cell barcodes (where only cell free RNA is captured) performance of those cells is pretty bad.
Is that the result of one cell capture experiment? I ask because that’s a very large number of cells to have collected within a single run. If so, I would look at cell selection - you wouldn’t expect to capture that many cells in a single experiment. How many informative UMIs do you have for each cell for both AssignCellsToSamples and DetectDoublets - it would be good to have 200-500 informative UMIs per cell. If your number of informative UMIs is low (the programs output that as a column), then there may be problems with the sequencing depth (or other library prep) of your data, or your VCF may be too sparse. I would try plotting histograms of the number of informative UMIs for each program’s output. You might also create scatter plots with one axis being the total number of UMIs for a cell, and the other axis being the number of informative UMIs. You should see something like 20-50% of your UMIs for each cell are informative.
How many donors are in your VCF? Is your VCF from array data, or WGS? Did you attempt to demultiplex with all donors or a restricted list? When I run AssignCellsToSamples, I like to have the program consider many donors as possible assignments and be “blind” to which donors are actually in the pool, so I can validate that the donors assigned are the correct ones. If things are working, AssignCellsToSamples should be able to tell you which donors are in your pool without any “hints”, and do so before doublet detection.
Hopefully a few diagnostic plots will help us get a little closer to what’s going on here.
…-Jim
On Jun 23, 2023, at 3:32 PM, Terence Li ***@***.***> wrote:
Hi,
Thanks for providing this tool! We have four individuals multiplexed in a 10X Multiome fibroblast to stem cell reprogramming experiment. I'm writing because I have some strange demultiplexing results - almost all of the doublet p-values are >= 0.9 for both ATAC and RNA. I used the parameters as recommended in this discussion (namely MAX_ERROR_RATE = 0.05): #321 <#321>. I've attached the parameters for RNA/ATAC below.
drop_params.txt <https://github.com/broadinstitute/Drop-seq/files/11851525/drop_params.txt>
In the previous discussion, you mentioned that doublets are the cells where doublet_pval >= 0.9. However, using that threshold, we get ~98% of the cells called as doublets. Even using the doublet_pval threshold as 1.0, we see a high doublet proportion:
<https://user-images.githubusercontent.com/42584149/248375992-61ed4d2b-68c1-44be-b0e1-06a5c3910f39.png>
I was wondering if you had ideas regarding this behavior, perhaps further tuning MAX_ERROR_RATE? Let me know if you'd like me to provide additional details, such as the full output of dropulation.
—
Reply to this email directly, view it on GitHub <#339>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABCZXJ2Q44IPEJPBMEIMWTTXMXVODANCNFSM6AAAAAAZR6PKJM>.
You are receiving this because you are subscribed to this thread.
|
Thanks for the fast response! These results are coming from three different runs (across different timepoints in the experiment) - here are the cell numbers from CellRanger: Timepoint 1: 9487 There's some issues with the mapping for timepoints 2/3, the coverage per cell is much lower in those than in timepoint 1. We have four individuals multiplexed in each timepoint, and the donor VCF I'm giving as input only includes those four individuals. It's array data, and we imputed SNPs with the TopMed server. If I'm remembering correctly, we originally expected around 12,000-15,000 cells, but CellRanger ends up calling a lot more for timepoints 2/3 - would it be informative to limit the number of total cells that it can detect? |
Hi!
Those are some helpful plots! To give you an idea of what I normally would expect from a decent data set, I’m sharing a few of our standard plots:



As you can see, we have a fairly large number of informative UMIs for each cell, and the doublet p-value distribution is nicely bimodal - you could set the threshold almost anywhere between 0.1 and 0.9 and get the same result. A large proportion of the cells have the max (based on floating point math) scores, and few cells have FDRs > 0.05. This output is from a random experiment, and is fairly typical.
If you restricted your analysis to the set of cell barcodes that have at least 500 informative UMIs, does the data behave a bit better?
When you ran imputation, did you filter out low quality sites? I would filter to an R2 score >=0.9. We’ve found in practice that works quite well - the genotype data for the experiment above was generated with an array + imputation + filtering to high quality sites.
…-Jim
On Jun 23, 2023, at 5:32 PM, Terence Li ***@***.***> wrote:
Thanks for the fast response!
These results are coming from three different runs (across different timepoints in the experiment) - here are the cell numbers from CellRanger:
Timepoint 1: 9487
Timepoint 2: 21010
Timepoint 3: 37402
There's some issues with the mapping for timepoints 2/3, the coverage per cell is much lower in those than in timepoint 1.
I think that's what's happening for the doublet detection. Here are the plots for the number of informative UMIs and total vs. informative UMIs - the top left panel is across all ~70k cells across all three timepoints, and the following panels are split by timepoint:
<https://user-images.githubusercontent.com/42584149/248408709-319598d5-df9d-4a98-bebb-423f6160c04e.png>
<https://user-images.githubusercontent.com/42584149/248408721-a59ec904-06d2-419f-94db-15ab7e682a18.png>
We have four individuals multiplexed in each timepoint, and the donor VCF I'm giving as input only includes those four individuals. It's array data, and we imputed SNPs with the TopMed server.
If I'm remembering correctly, we originally expected around 12,000-15,000 cells, but CellRanger ends up calling a lot more for timepoints 2/3 - would it be informative to limit the number of total cells that it can detect?
—
Reply to this email directly, view it on GitHub <#339 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABCZXJ7GPGYG77KU5KME6ELXMYDPLANCNFSM6AAAAAAZR6PKJM>.
You are receiving this because you commented.
|
Hi @jamesnemesh and @terencewtli , Thank you for your inputs on this post. I wonder if @terencewtli has any updates or solutions of resolving this problem as I have really similar observations. Take one of my multiplexed sample for example (sequenced by NovaSeq X, ~25K cells after initial QC filtering when creating a Seurat object), cells in this multiplexed sample are from 4 individuals from 4 cell lines. Here's a bit of break down my procedures and the usage of
Any suggestions/inputs on these observations are much appreciated! Thank you so much! |
Hi James, Thank you so much for your inputs! While I'm testing some changes and checking VCFs, I have a quick question regarding the |
Technically, using either the 'raw' or processed VCF as input to |
Thank you for the information! I opted out |
Hi,
Thanks for providing this tool! We have four individuals multiplexed in a 10X Multiome fibroblast to stem cell reprogramming experiment. I'm writing because I have some strange demultiplexing results - almost all of the doublet p-values are >= 0.9 for both ATAC and RNA. I used the parameters as recommended in this discussion (namely
MAX_ERROR_RATE = 0.05
): #321. I've attached the parameters for RNA/ATAC below.drop_params.txt
In the previous discussion, you mentioned that doublets are the cells where doublet_pval >= 0.9. However, using that threshold, we get ~98% of the cells called as doublets. Even using the doublet_pval threshold as 1.0, we see a high doublet proportion:
I was wondering if you had ideas regarding this behavior, perhaps further tuning
MAX_ERROR_RATE
? Let me know if you'd like me to provide additional details, such as the full output of dropulation.The text was updated successfully, but these errors were encountered: