Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing genotypes even after imputation #58

Open
EveTC opened this issue Nov 2, 2021 · 7 comments
Open

Missing genotypes even after imputation #58

EveTC opened this issue Nov 2, 2021 · 7 comments

Comments

@EveTC
Copy link

EveTC commented Nov 2, 2021

Hello,

I have successfully imputed the genotypes of interest which has greatly improved the missinigness. However, after imputation there is still some missingness in the data - see plots below. Is there a way to force STITCH not to leave any missingness? I am hoping to apply a method where no missingness is allowed.

Missingness prior to imputation (site level) Missingness post imputation (site level)
Eng_MissPreImpute_siteLevel Eng_MissPostImpute_siteLevel

Thank you for your help in advance

@rwdavies
Copy link
Owner

rwdavies commented Nov 2, 2021

If you use the dosages, there is no missing data. Similarly, you have the genotype posterior probabilities for each sample for each of the three genotypes, and from this, you can reset the GT value to always be 0/0, 0/1 or 1/1 based on whichever has the highest genotype posterior. You can do this either manually in the VCF (using e.g. R or Python etc), or I imagine bcftools or perhaps the GATK has this functionality

@EveTC
Copy link
Author

EveTC commented Nov 2, 2021

Hi @rwdavies - thank you so much for your speedy response!

My apologies but I am very new to imputation - what do you mean by "dosages"?

Thank you :)

@rwdavies
Copy link
Owner

rwdavies commented Nov 2, 2021

Sure, happy to explain.

So practically these are in the DS entry in the per-sample, per-SNP entry in the VCF. e.g.

GT:GP:DS        0/1:0.025,0.975,0.000:0.975     1/1:0.000,0.062,0.938:1.938

in the header of the VCF file it says

##FORMAT=<ID=GT,Number=1,Type=String,Description="Best Guessed Genotype with posterior probability threshold of 0.9">
##FORMAT=<ID=GP,Number=3,Type=Float,Description="Posterior genotype probability of 0/0, 0/1, and 1/1">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Dosage">

Now as to what they represent. First start with the GP entries the genotype posterior probabilities for the genotypes 0/0, 0/1 and 1/1. These are, under the model, P(Genotype = 0/0 | Data, parameters), for each of 0/0, 0/1 and 1/1. As there are only three genotypes under consideration, these probabilities will sum to 1.

Now the "dosage" DS entry is the expected number of alternate alleles. It is
dosage = Eexpected # of alternate alleles = sum genotype * P(genotype | data, parameters)
This means it is 0 times the genotype posterior probability for a genotype of 0/0, 1 times the GP for a genotype of 0/1, and 2 times the GP for a genotype of 1/1. It is therefore a value scaled between 0 and 2.

Finally the genotype entry GT, in STITCH, is the most likely genotype, but scaled to missing if the GP is less than 0.90.

So e.g. GT:GP:DS 0/1:0.025,0.975,0.000:0.975 is a sample that has genotye posterior probabilities of 0.025 for hom ref (0/0), 0.975 for het (1/1), and 0.000 for hom alt (1/1). This gives the corresponding dosage (0 * 0.025 + 0.975 * 1 + 0 * 0), and similarly, as the most likely genotype posterior is for the het genotype (0/1), and is above 0.90 (0.975 > 0.90), the GT entry is 0/1

Hope that helps, let me know if you have more questions

@EveTC
Copy link
Author

EveTC commented Nov 2, 2021

Oh my goodness - this makes so much more sense now! Fantastically well explained - thank you so much.

So I need to find a way to either 1) change the "but scaled to missing if the GP is less than 0.90" so that it is more lenient? or 2) use the posterior probabilities (or dosage) to update the ./. to the correct GT?

Any ideas to point me in the right direction? I cant seem to find anything in bcftools

Thanks again

@rwdavies
Copy link
Owner

rwdavies commented Nov 2, 2021

I didn't find anything obvious in bcftools myself, though maybe I didn't look hard enough.

I thought my Python was good enough to write this in 5 minutes but took more like 20. Oh well! Below should work with something like the following. It does the thing I just described. I hope it works, please do some sanity checking on your real data.

gunzip -c ~/proj/STITCH/test-results/mouse_tests/whole_region/stitch.chr19.vcf.gz | python temp.py - 
import sys

for line in sys.stdin:
    if line[0] == "#":
        print(line)
    else:
        x = line.split("\t")
        print(len(x))
        for i in range(9, len(x)):
            x2 = x[i].split(":")
            gp = [float(a) for a in x2[1].split(",")]
            max_index = gp.index(max(gp))
            if max_index == 0:
                x2[0] = "0/0"
            elif max_index == 1:
                x2[0] = "0/1"
            elif max_index == 2:
                x2[0] = "1/1"
            x[i] = x2[0] + ":" + x2[1] + ":"  + x2[2]
        print("\t".join(x).replace('\n',''))

@EveTC
Copy link
Author

EveTC commented Nov 3, 2021

This is incredible - thank you so much. What took you 20 minutes would have likely taken me weeks!
I will have a go with it today and check that it works - thank you so much again for all your help.

@EveTC
Copy link
Author

EveTC commented Nov 3, 2021

It works! 👍 One thing I have noticed is that it inserts a number (for me its 543) before each SNP record. I can just remove this with something like sed but thought I would let you know.

I have just ran this line of code to remove any line that contain 543 and any empy lines as well:
sed -e '/^543$/d' infilled.vcf | sed '/^[[:space:]]*$/d' > infilled_ed.vcf

Thank you again for your help, you have saved me alot of time and headaches!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants