Dropulation high doublet call rate #321
Hi!
I would consider as doublets the cells where doublet_pval >= 0.9.
Additionally, you might try setting MAX_ERROR_RATE to a value like 0.05. We've found that when there's a fair amount of cell-free RNA, many cells are assigned as doublets because of the capture of that RNA from other donors. If you abhor an arbitrary threshold, you can instead run CellBender (https://github.com/broadinstitute/CellBender) and feed in per-cell allele frequencies and estimates of the fraction of cell-free RNA captured by each cell. Those options are captured by CELL_CONTAMINATION_ESTIMATE_FILE and ALLELE_FREQUENCY_ESTIMATE_FILE.
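For concreteness, a sketch of what adding those options might look like. The flag names come from this thread; the file paths and the 0.05 value are placeholders for illustration, not tested recommendations:

```shell
# Option 1: a fixed error-rate ceiling (the "arbitrary threshold").
DetectDoublets \
  --INPUT_BAM tagged_bam.bam \
  --VCF out_vcf.vcf \
  --SINGLE_DONOR_LIKELIHOOD_FILE assignments.tsv.gz \
  --OUTPUT likelihoods.tsv.gz \
  --MAX_ERROR_RATE 0.05

# Option 2: per-cell estimates (e.g. derived from CellBender) instead of a
# fixed threshold. File paths and formats here are placeholders.
DetectDoublets \
  --INPUT_BAM tagged_bam.bam \
  --VCF out_vcf.vcf \
  --SINGLE_DONOR_LIKELIHOOD_FILE assignments.tsv.gz \
  --OUTPUT likelihoods.tsv.gz \
  --CELL_CONTAMINATION_ESTIMATE_FILE contamination_per_cell.tsv \
  --ALLELE_FREQUENCY_ESTIMATE_FILE allele_frequencies.tsv
```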
As you’ve correctly identified, without some additional info, DetectDoublets is overly sensitive to cell-free RNA and over-calls doublets.
-Jim
On Oct 12, 2022, at 8:25 AM, drneavin wrote:
Hello,
I'm writing because I recently trialed Dropulation on a couple of datasets and found higher doublet-rate estimates than anticipated. I'd like to know whether some of the parameters I'm using might be incorrect, or whether additional parameters should be set. I have run Dropulation on two different datasets:
74 pools of PBMCs with ~16 individuals per pool and ~20,000 cells captured per pool
11 pools of fibroblasts with ~8 individuals per pool and ~10,000 cells captured per pool
I find the doublet estimate to be ~60-70%, but it should be between 5 and 20%. I have also run other demultiplexing methods and found the doublet rate to be within the expected range. The commands I'm currently running are:
AssignCellsToSamples \
  --INPUT_BAM tagged_bam.bam \
  --VCF Merged_MAF0.01.dose_GeneFiltered_hg38_nochr_NoAdditionalchr.vcf \
  --OUTPUT assignments.tsv.gz \
  --VCF_OUTPUT out_vcf.vcf \
  --CELL_BARCODE_TAG CB --MOLECULAR_BARCODE_TAG UB \
  --CELL_BC_FILE barcodes.tsv --SAMPLE_FILE Individuals.txt \
  --FUNCTION_TAG XF --EDIT_DISTANCE 1 --READ_MQ 10 --GQ_THRESHOLD 30 \
  --RETAIN_MONOMORPIC_SNPS false --FRACTION_SAMPLES_PASSING 0.5 \
  --IGNORED_CHROMOSOMES X --IGNORED_CHROMOSOMES Y --IGNORED_CHROMOSOMES MT \
  --ADD_MISSING_VALUES true --DNA_MODE false --SNP_LOG_RATE 1000 \
  --GENE_NAME_TAG gn --GENE_STRAND_TAG gs --GENE_FUNCTION_TAG gf \
  --STRAND_STRATEGY SENSE --LOCUS_FUNCTION_LIST CODING --LOCUS_FUNCTION_LIST UTR \
  --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT \
  --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 500000 \
  --CREATE_INDEX false --CREATE_MD5_FILE false \
  --GA4GH_CLIENT_SECRETS client_secrets.json \
  --help false --version false --showHidden false \
  --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
and
DetectDoublets \
  --INPUT_BAM tagged_bam.bam \
  --VCF out_vcf.vcf \
  --SINGLE_DONOR_LIKELIHOOD_FILE assignments.tsv.gz \
  --OUTPUT likelihoods.tsv.gz \
  --CELL_BARCODE_TAG CB --MOLECULAR_BARCODE_TAG UB \
  --CELL_BC_FILE barcodes.tsv --SAMPLE_FILE Individuals.txt \
  --EDIT_DISTANCE 1 --READ_MQ 10 --GQ_THRESHOLD 30 \
  --FRACTION_SAMPLES_PASSING 0.5 \
  --IGNORED_CHROMOSOMES X --IGNORED_CHROMOSOMES Y --IGNORED_CHROMOSOMES MT \
  --FORCED_RATIO 0.8 --TEST_UNEXPECTED_DONORS true --SCALE_LIKELIHOODS true \
  --DNA_MODE false \
  --GENE_NAME_TAG gn --GENE_STRAND_TAG gs --GENE_FUNCTION_TAG gf \
  --STRAND_STRATEGY SENSE --LOCUS_FUNCTION_LIST CODING --LOCUS_FUNCTION_LIST UTR \
  --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT \
  --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 500000 \
  --CREATE_INDEX false --CREATE_MD5_FILE false \
  --GA4GH_CLIENT_SECRETS client_secrets.json \
  --help false --version false --showHidden false \
  --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
I then consider the calls in the bestSample column of the DetectDoublets output to be the droplet calls. Please let me know if you see anything that could be optimized in this setup, or if the final cell calls should be made differently.
Thanks so much for your help!
I should include some lame R code here to give you a sense of how we generate a mapping from cell barcode to donor ID. Someday I should actually document this whole process with a cookbook as I've done for some of our other tools.
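The R code seems to have been dropped when this thread was archived. As a stand-in, here is a minimal Python sketch of the kind of mapping described: read the likelihood table and keep the best donor for cells below a doublet threshold. The column names (cell_barcode, bestSample, doublet_pval) are assumptions pieced together from this thread, so check them against your actual DetectDoublets output.

```python
import csv
import io

def barcode_to_donor(likelihoods_tsv: str, doublet_cutoff: float = 0.9) -> dict:
    """Build a cell barcode -> donor ID map, dropping likely doublets.

    Expects tab-separated text with columns cell_barcode, bestSample and
    doublet_pval (column names are assumptions; verify against your output).
    """
    mapping = {}
    for row in csv.DictReader(io.StringIO(likelihoods_tsv), delimiter="\t"):
        if float(row["doublet_pval"]) < doublet_cutoff:  # keep singlets only
            mapping[row["cell_barcode"]] = row["bestSample"]
    return mapping

demo = (
    "cell_barcode\tbestSample\tdoublet_pval\n"
    "AAACCTG\tdonor1\t0.01\n"
    "AAAGATG\tdonor2\t0.95\n"  # called a doublet, so excluded by default
)
```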
Great, thank you @jamesnemesh! That code will help, and I'll try changing the MAX_ERROR_RATE as well. For what it's worth, I don't think these pools have unusually high ambient RNA (<10%). Would that be enough to impact the doublet calling? Also, I noticed that the default ratio for doublet proportions is 0.8:0.2. I was curious about the reason for this ratio as opposed to 0.5:0.5, and am wondering if altering it would have a large impact?
Hi,
The way we look for doublets is to optimize a mixture that maximizes the likelihood of the data, and we cap the ratio at 4:1 (80/20). The reason we cap it, rather than letting the ratio run up to (but not including) 1, is that most cells have enough errors (PCR, cell-free RNA, genotyping) that doublet detection would otherwise optimize those cells as a mixture of mostly the correct donor plus a very small fraction of whichever random donor best matched the observed errors. The 0.8:0.2 cap is a tradeoff for how selective we want to be.
In real-world data, cell sizes can easily vary by an order of magnitude in UMI counts, so it's pretty plausible to have cells that are not 50/50 mixtures. If you take a look at the first figure of the supplement, we examine our sensitivity to detect doublets at different proportions. Using in-silico mixed data, the 50/50 doublets are super easy to detect, and you need more data to detect the more skewed doublets. If another algorithm you're looking at is tuned to detect only 50/50 doublets, then no doubt you'll call fewer doublets, but I'd bet you'll miss at least some of the true events. You could certainly try setting FORCED_RATIO=0.5 and see how that works (I haven't tested that, because of the real-world observations above). As for the <10% ambient RNA: my particular implementation gets pretty touchy at more than a few percent ambient RNA, which is why MAX_ERROR_RATE is there. It was a "cheap" fix that took our doublet rates from clearly over-called to close to the expected numbers.
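To illustrate the idea of optimizing a capped mixture, here is a toy sketch (not the actual Dropulation implementation) under a simple binomial read-count model: for each candidate mixing fraction up to the 4:1 cap, score the likelihood of the observed alt-allele read counts given the two donors' genotype dosages, and keep the best fraction.

```python
from math import comb, log

def mixture_loglik(pi, snps, err=0.01):
    """Log-likelihood of read counts for a pi:(1-pi) mix of donors A and B.

    snps: list of (dosage_a, dosage_b, alt_reads, total_reads), where dosage
    is the alt-allele count in {0, 1, 2}. Toy model: binomial reads around
    the mixture's expected alt fraction, with a small flat error rate.
    """
    ll = 0.0
    for da, db, k, n in snps:
        p = pi * da / 2 + (1 - pi) * db / 2  # expected alt-read fraction
        p = min(max(p, err), 1 - err)        # never exactly 0 or 1
        ll += log(comb(n, k)) + k * log(p) + (n - k) * log(1 - p)
    return ll

def best_mixture(snps):
    # Grid from 50/50 up to the 80/20 (4:1) cap discussed above.
    grid = [round(0.50 + 0.01 * i, 2) for i in range(31)]
    return max(grid, key=lambda pi: mixture_loglik(pi, snps))

# Donor A hom-alt, donor B hom-ref at five informative SNPs.
even_doublet = [(2, 0, 10, 20)] * 5    # half the reads carry A's allele
skewed_doublet = [(2, 0, 16, 20)] * 5  # 80% of reads carry A's allele
```

An even doublet is recovered as a 0.5 mixture, a skewed one lands at the 0.8 cap, illustrating why the cap matters: without it, singlet cells with a few stray reads would be "explained" as extreme mixtures.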
I should mention that there are two patterns of doublets you can detect. The first are doublets where both doublet_pval and best_pair_pvalue are > 0.9; we refer to these internally as "confident doublets". These are cells where the model thinks there are alleles that can't be explained by the primary donor, AND is highly confident about who the second donor is. There are also cells where the error rate may be higher, but the model can't settle on a particular donor as the second member of the pair - many donors give similar results. These are cases where doublet_pval >= 0.9 and best_pair_pvalue < 0.9. We think the "confident" doublets are the estimate of the doublet fraction that matches up best with loading. In most experiments, the non-confident second-donor doublets are a much smaller fraction, and by default we usually toss them to be conservative. However, in some data we've looked at recently (multiome data), the error rates were quite a bit higher, and we found we could retain those non-confident second-donor doublets in the data set. This was particularly useful for eQTL analysis, where the additional data in the non-confident doublets improved our ability to detect eQTLs more than any additional variance it introduced - for a different task you might want to be more conservative.
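The two-tier call described above can be summarized in a few lines. The thresholds are taken from this thread; this is a sketch, not Dropulation's own code:

```python
def classify_cell(doublet_pval: float, best_pair_pvalue: float,
                  cutoff: float = 0.9) -> str:
    """Three-way call following the thresholds described in this thread.

    confident_doublet: extra alleles AND a confidently identified 2nd donor.
    ambiguous_doublet: extra alleles, but no single convincing 2nd donor.
    """
    if doublet_pval >= cutoff and best_pair_pvalue >= cutoff:
        return "confident_doublet"
    if doublet_pval >= cutoff:
        return "ambiguous_doublet"
    return "singlet"
```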
It sounds like you're doing a bake-off of a few existing algorithms. You might look at GenerateSyntheticDoublets, which helps you generate BAMs with in-silico mixed doublets: you can select cells from real experiments that you are confident are singlets, then mix them in different proportions and see whether your algorithm of choice can detect them at the given mixture.
In any case, I’m interested to hear how the results go! Let me know if there’s anything else I can help you with.
-Jim
Hi @jamesnemesh, I'm happy to report that changing the MAX_ERROR_RATE helped. You're right, I'm comparing the different demultiplexing and doublet-detection methods - let me know if there are any additional parameters you think I should test for Dropulation. Thanks for pointing out GenerateSyntheticDoublets, and thanks again for all your help!
Hi,
On the speed front: because we treat each cell as an independent observation, you can run the donor assignment programs on subsets of the data. That program's documentation is a little hard to parse the first time, so here's an example:
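The worked example appears to have been lost when this thread was archived, so here is a hypothetical sketch of the sharding idea: split the barcode list, run the assignment program once per shard via CELL_BC_FILE, and concatenate the per-shard outputs. The actual AssignCellsToSamples call is shown commented out, since it needs your real inputs.

```shell
# Stand-in barcode list; in practice this is your real barcodes.tsv.
printf 'AAACCTG\nAAAGATG\nAACTCAG\nAACGTTG\n' > barcodes.tsv

# Two barcodes per shard -> files chunk_aa, chunk_ab, ...
split -l 2 barcodes.tsv chunk_

# One assignment run per shard; cells are independent, so the per-shard
# outputs can simply be concatenated afterwards.
# for f in chunk_*; do
#   AssignCellsToSamples --INPUT_BAM tagged_bam.bam --VCF my.vcf \
#     --CELL_BC_FILE "$f" --OUTPUT "assignments.$f.tsv.gz"
# done
```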
The other thing that can help is to use a BCF instead of a VCF file. Because I'm using the HTSJDK implementation of the BCF reader, you need to use Picard's conversion tool to produce a compatible BCF.
@drneavin If you don't mind, I had a few more thoughts on additional bake-off scenarios you might consider.
Our programs work just as well on related individuals as on unrelated individuals, including full trios.
We run AssignCellsToSamples without a SAMPLE_FILE. Because the program runs fairly quickly, we can easily use a VCF with 1000+ individuals (and pool sizes in excess of 100 individuals). When a donor that is in the VCF is introduced, we'll identify that donor properly from the set of all possible donors. When the donor introduced is not present in the VCF (a cryptic donor), the single-donor assignment qualities will be poor, and the cells will all be flagged as mixtures of two known donors. This makes sense, as a combination of two random donors explains a random set of alleles better than any single individual. It's possible Vireo would do really well with cryptic donors, since it has fewer priors - perhaps you want to keep the cells from that donor, knowing they all come from one true (but unknown) source, instead of throwing them away. We've encountered plenty of plate-swap and cryptic-donor issues, especially when working with third-party sources of samples, so this sort of utility is incredibly important in a production environment.
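As a sketch, the cryptic-donor-friendly setup just drops the SAMPLE_FILE argument so the whole VCF is the candidate pool. Paths are placeholders; the remaining flags would match the commands earlier in this thread:

```shell
# No --SAMPLE_FILE: every donor in the VCF is a candidate.
AssignCellsToSamples \
  --INPUT_BAM tagged_bam.bam \
  --VCF large_cohort.vcf \
  --CELL_BC_FILE barcodes.tsv \
  --OUTPUT assignments.tsv.gz
```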
Thanks @jamesnemesh! I think the first scenario will be hard for us to test with our current dataset, but there may be some publicly available data we could try. The second question is a good one, and not one I've compared yet. I think some of the methods will clearly just call the missing individuals as doublets, but it depends on how the parameters are set; I'll have to think about the best way to compare this across methods. Also, I was going to test out GenerateSyntheticDoublets.
For the first scenario, if you have a VCF with more samples in it than the set in your pool, you can run on the entire VCF. I imagine it's hard to get a wrong donor assignment, and to see where your algorithm makes mistakes, if you only allow it to pick valid choices. As for GenerateSyntheticDoublets: it looks like we're not exposing a stand-alone command line for it, but it can be accessed from the jar:
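The exact invocation didn't survive in this thread. One plausible form, assuming a standard jar layout (the jar name and classpath here are guesses - check your Drop-seq distribution), would be:

```shell
# Hypothetical: run the class directly from the Drop-seq jar.
java -cp dropseq.jar GenerateSyntheticDoublets --help
```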