Missing genotypes even after imputation #58
Comments
If you use the dosages, there is no missing data. Similarly, you have the genotype posterior probabilities for each sample for each of the three genotypes, and from these you can reset the GT value to always be 0/0, 0/1 or 1/1, based on whichever has the highest genotype posterior. You can do this either manually in the VCF (e.g. using R or Python), or I imagine bcftools or perhaps the GATK has this functionality. |
Hi @rwdavies - thank you so much for your speedy response! My apologies but I am very new to imputation - what do you mean by "dosages"? Thank you :) |
Sure, happy to explain. In practice, the dosages are the DS field of each per-sample, per-SNP entry in the VCF, e.g.
in the header of the VCF file it says
Now as to what they represent. First, start with the GP entries: these are the genotype posterior probabilities for the genotypes 0/0, 0/1 and 1/1. These are, under the model, P(Genotype = 0/0 | Data, parameters), and likewise for 0/1 and 1/1. As there are only three genotypes under consideration, these probabilities sum to 1.

Now the "dosage" DS entry is the expected number of alternate alleles. It is 0 × P(0/0) + 1 × P(0/1) + 2 × P(1/1), so it always lies between 0 and 2.

Finally, the genotype entry GT, in STITCH, is the most likely genotype, but scaled to missing if the GP is less than 0.90. So e.g.

Hope that helps, let me know if you have more questions |
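To make the relationship between GP, DS and GT concrete, here is a small worked sketch in Python. The GP values are made up for illustration, the function names are hypothetical, and the exact comparison used for the 0.90 cutoff (`>=` here) and the `./.` missing code are assumptions rather than STITCH's actual implementation.

```python
def dosage(gp):
    """Expected number of alternate alleles, given (P(0/0), P(0/1), P(1/1))."""
    p_ref, p_het, p_alt = gp
    return 0 * p_ref + 1 * p_het + 2 * p_alt


def call_genotype(gp, min_gp=0.90):
    """Most likely genotype, or './.' if its posterior is below min_gp.

    min_gp=0.90 mirrors the cutoff described in the thread; using >= for
    the comparison is an assumption.
    """
    labels = ("0/0", "0/1", "1/1")
    best = max(range(3), key=lambda i: gp[i])
    return labels[best] if gp[best] >= min_gp else "./."


confident = (0.01, 0.02, 0.97)  # made-up posteriors: clearly 1/1
uncertain = (0.05, 0.15, 0.80)  # made-up posteriors: best guess 1/1, but below 0.90

for gp in (confident, uncertain):
    print(gp, round(dosage(gp), 3), call_genotype(gp))
```

The second sample shows why a GT can be missing while the dosage is still defined: no single genotype reaches 0.90, yet the expected alternate allele count can always be computed from the three posteriors.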
Oh my goodness - this makes so much more sense now! Fantastically well explained - thank you so much. So I need to find a way to either 1) change the "but scaled to missing if the GP is less than 0.90" so that it is more lenient? or 2) use the posterior probabilities (or dosage) to update the Any ideas to point me in the right direction? I can't seem to find anything in bcftools. Thanks again |
I didn't find anything obvious in bcftools myself, though maybe I didn't look hard enough. I thought my Python was good enough to write this in 5 minutes but took more like 20. Oh well! Below should work with something like the following. It does the thing I just described. I hope it works, please do some sanity checking on your real data.
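The script posted with this comment is not preserved in this copy of the thread. As a stand-in, and not the author's original code, here is a minimal Python sketch of the approach described above: for every sample at every site, set GT to whichever of 0/0, 0/1 or 1/1 has the highest GP. It assumes biallelic diploid sites with a three-value GP field, reads plain or gzip/bgzip-compressed VCF, and writes uncompressed VCF. All function names are hypothetical.

```python
import gzip
import sys

GENOTYPES = ("0/0", "0/1", "1/1")


def reset_gt(sample_field, gt_idx, gp_idx):
    """Return one sample field with GT set to the genotype with the highest GP."""
    values = sample_field.split(":")
    if gp_idx >= len(values):
        return sample_field  # trailing fields dropped, nothing to use
    try:
        gp = [float(x) for x in values[gp_idx].split(",")]
    except ValueError:
        return sample_field  # GP is missing (".") or malformed
    if len(gp) != 3:
        return sample_field  # assumption: only biallelic diploid sites are handled
    values[gt_idx] = GENOTYPES[gp.index(max(gp))]  # ties go to the first genotype
    return ":".join(values)


def fix_line(line):
    """Rewrite GT for every sample on one VCF line; header lines pass through."""
    if line.startswith("#"):
        return line
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 10:
        return line  # no FORMAT/sample columns
    keys = cols[8].split(":")
    if "GT" not in keys or "GP" not in keys:
        return line
    gt_idx, gp_idx = keys.index("GT"), keys.index("GP")
    cols[9:] = [reset_gt(s, gt_idx, gp_idx) for s in cols[9:]]
    return "\t".join(cols) + "\n"


def main(in_path, out_path):
    opener = gzip.open if in_path.endswith(".gz") else open
    with opener(in_path, "rt") as fin, open(out_path, "w") as fout:
        for line in fin:
            fout.write(fix_line(line))


if __name__ == "__main__":
    if len(sys.argv) == 3:
        main(sys.argv[1], sys.argv[2])
```

Saved under a hypothetical name such as `reset_gt.py`, it could be run as `python reset_gt.py input.vcf.gz output.vcf`. As with any hard-calling step, it is worth sanity checking a few sites by hand, since low-confidence genotypes are now reported as if they were certain.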
|
This is incredible - thank you so much. What took you 20 minutes would have likely taken me weeks! |
It works! 👍 One thing I have noticed is that it inserts a number (for me it's I have just run this line of code to remove any lines that contain Thank you again for your help, you have saved me a lot of time and headaches! |
Hello,
I have successfully imputed the genotypes of interest, which has greatly reduced the missingness. However, after imputation there is still some missingness in the data - see plots below. Is there a way to force STITCH not to leave any missingness? I am hoping to apply a method where no missingness is allowed.
Thank you for your help in advance