no vcf file provided - SNPs with called genotype are imputed #97

Open
SelinaKlees opened this issue May 30, 2024 · 4 comments

Comments

@SelinaKlees

Hello,
I have WGS data, including the BAM files and the called SNPs as a VCF. I have now used STITCH to impute my SNPs based on the BAM files and the SNP positions. However, I realized that because I am not providing the VCF, STITCH also imputes genotypes that are not actually missing in the VCF. As a result, many genotypes differ between the called SNPs and the STITCH output. I only want to impute the genotypes that are missing in my VCF. Is there a way to do that with STITCH?
Best,
Selina

@rwdavies
Owner

rwdavies commented Jun 3, 2024

Hi,

What's the depth of coverage? (And how many samples?) If the depth of coverage is higher, you might not want to use STITCH. If the depth of coverage is lower, you might not want to trust your called genotypes from the WGS directly.

STITCH does not have a way to impute only some missing genotypes. If this is something like RADseq or GBS, my suggestion would be to run everything through STITCH, then make a merged set with the SNPs and genotypes from the RADseq or GBS data, and for SNPs that do not meet a certain QC filter there, use the imputed ones.
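
The merge suggested above could be sketched roughly as follows. This is a minimal illustration, not part of STITCH: the function name `merge_call_sets`, the site keys, and the dictionary layout are all hypothetical, and it assumes the genotypes have already been parsed out of the two VCFs and that a set of QC-passing sites has been decided separately.

```python
from typing import Dict, Set, Tuple

Site = Tuple[str, int]          # (chromosome, position); hypothetical layout
Genotypes = Dict[str, str]      # sample name -> genotype string, e.g. "0/1"


def merge_call_sets(
    called: Dict[Site, Genotypes],
    imputed: Dict[Site, Genotypes],
    sites_passing_qc: Set[Site],
) -> Dict[Site, Tuple[str, Genotypes]]:
    """Prefer called genotypes at sites that pass QC, else use imputed ones."""
    merged = {}
    for site in set(called) | set(imputed):
        if site in called and site in sites_passing_qc:
            merged[site] = ("called", called[site])
        elif site in imputed:
            merged[site] = ("imputed", imputed[site])
        # Sites that fail QC and were not imputed are left out.
    return merged


called = {("chr1", 100): {"s1": "0/1"}, ("chr1", 200): {"s1": "1/1"}}
imputed = {("chr1", 200): {"s1": "0/1"}, ("chr1", 300): {"s1": "0/0"}}
merged = merge_call_sets(called, imputed, sites_passing_qc={("chr1", 100)})
```

Each merged site records which source it came from, so the two sources can still be told apart downstream.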

Thanks,
Robbie

@SelinaKlees
Author

SelinaKlees commented Jun 13, 2024

Hi Robbie,
thanks for the reply!
I have WGS data for 96 samples at several coverages. Originally we had ~22x, and we then downsampled the reads to 8x, 4x, 2x, 1x, and 0.5x. We want to compare these datasets to answer the question "how low can we go?", so I ran STITCH on each of the six coverages.
So does this mean STITCH is better seen as a new variant-calling step for low-coverage sequencing data than as imputation of missing genotypes in an existing VCF file?
Best,
Selina

@rwdavies
Owner

If you have data at high coverage (>10X), you probably don't need to impute, provided you can tolerate a moderate missing data rate after filtering out genotypes with low GQ (say below 10 or 20).
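
As a rough sketch of that GQ filter, assuming the genotypes and GQ values for one site have already been read from the VCF into `(genotype, GQ)` pairs: the function names `mask_low_gq` and `missing_rate` and the default threshold are illustrative choices, not part of any tool.

```python
from typing import List, Optional, Tuple

MISSING = "./."  # conventional VCF notation for a missing diploid genotype


def mask_low_gq(
    genotypes: List[Tuple[str, Optional[int]]], min_gq: int = 20
) -> List[str]:
    """Set genotypes with GQ below min_gq (or with no GQ at all) to missing."""
    return [
        gt if gq is not None and gq >= min_gq else MISSING
        for gt, gq in genotypes
    ]


def missing_rate(genotypes: List[str]) -> float:
    """Fraction of genotypes that are missing at this site."""
    if not genotypes:
        return 0.0
    return sum(gt == MISSING for gt in genotypes) / len(genotypes)


site = [("0/1", 35), ("1/1", 8), ("0/0", None), ("0/0", 20)]
filtered = mask_low_gq(site, min_gq=20)
rate = missing_rate(filtered)
```

Checking the missing rate after masking shows whether the chosen threshold leaves a tolerable amount of missing data.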

I would say that, in its primary purpose, STITCH is neither a variant caller nor designed for imputing individual missing genotypes. It's designed for quite low coverage (<2X), where genotyping variants in individual samples is impossible. It also doesn't do variant calling per se, though it can help determine which variants are likely true positives: those that agree with their imputed background (that is, have a high INFO score).
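
A minimal sketch of keeping only high-INFO sites, assuming each site's INFO fields have already been parsed into a dictionary. The key name `INFO_SCORE` and the cutoff of 0.4 are assumptions for illustration; check the header of your own output for the actual field name and choose a cutoff that suits your data.

```python
from typing import Dict, List


def keep_high_info_sites(
    sites: List[Dict[str, float]],
    min_info: float = 0.4,          # placeholder cutoff, not a recommendation
    key: str = "INFO_SCORE",        # assumed field name
) -> List[Dict[str, float]]:
    """Return the sites whose INFO score is present and at least min_info."""
    return [s for s in sites if s.get(key) is not None and s[key] >= min_info]


sites = [
    {"pos": 100, "INFO_SCORE": 0.95},
    {"pos": 200, "INFO_SCORE": 0.12},
    {"pos": 300},  # no score recorded, so the site is dropped
]
kept = keep_high_info_sites(sites)
```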

Hope that helps. One last comment: 96 samples is good, but at 0.5X you might see much better accuracy if you imputed many more samples (e.g. 1000). So I would take any results you get at the lower coverages as advisory rather than definitive, i.e. assume things might get better with more samples (see the STITCH paper; we have a figure about this).

@SelinaKlees
Author

Thank you for the comment and advice! Yes, the low sample size will definitely be a discussion point in the manuscript.
Best,
Selina
