Samples with a large amount of missing sites after imputation #62
Comments
Hi Guy,

If there are 4 founders, then the maximum number of ancestral haplotypes is 8, so I would recommend increasing K to that. It looks like you have enough coverage and samples to support it. Setting gridWindowSize to the default value (I think NA) would make things a bit more accurate but slower. Relatedly, the recombination rate for Drosophila looks to be a bit higher than for humans / mice (which is what the default parameters are set for), based on a 30 second internet search.

STITCH can take in reference haplotypes and initialize the EM with them. That sounds like a good option for you if you have a good guess for the founding haplotypes. Take a look at the reference_haplotype_file and reference_legend_file options.

If you think your inferred founder haplotypes contain no phase switch errors, you can either set niterations to 1 (to impute using only information from the reference haplotypes), or to something low like 5 or 10 (which would use the founder haplotypes to initialize, then proceed normally). You'd probably want to turn off shuffleHaplotypeIterations here.

If you think your inferred founder haplotypes might contain phase switch errors, you can set niterations to something like 20, then keep a few shuffleHaplotypeIterations, e.g. 4, 8, 12. shuffleHaplotypeIterations looks for places where breaking and re-setting ancestral haplotypes improves the likelihood, i.e. here, phase switch errors among founders.

I would recommend disabling splitReadIterations and refillIterations if you go with reference haplotypes, especially if you're reasonably sure you have all the founder haplotypes, as they won't be necessary.

Best,
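The two scenarios above could be sketched in R roughly as follows. This is only an illustration: `founder_panel_args` is a hypothetical helper, the file names are placeholders, and using `NA` to switch off an iteration schedule is an assumption that should be checked against `?STITCH`. The `chr`, `K` and `nGen` values are taken from the thread.

```r
# Hypothetical helper (not part of STITCH): builds an argument list
# for STITCH::STITCH() following the advice in the comment above.
founder_panel_args <- function(trust_founder_phase = TRUE) {
  args <- list(
    chr = "chr2L",
    K = 8,                                         # 4 diploid founders -> up to 8 haplotypes
    nGen = 11,
    reference_haplotype_file = "ref_haps.txt.gz",  # placeholder path
    reference_legend_file = "ref_legend.txt.gz",   # placeholder path
    splitReadIterations = NA,                      # assumption: NA disables the step
    refillIterations = NA                          # assumption: NA disables the step
  )
  if (trust_founder_phase) {
    # No phase switch errors expected: few iterations, no shuffling
    args$niterations <- 5
    args$shuffleHaplotypeIterations <- NA
  } else {
    # Possible phase switch errors: more iterations, keep a few shuffles
    args$niterations <- 20
    args$shuffleHaplotypeIterations <- c(4, 8, 12)
  }
  args
}

args_trusted <- founder_panel_args(TRUE)
args_unsure <- founder_panel_args(FALSE)
# With STITCH installed and the remaining arguments (posfile, sample names,
# and so on) added, the list could be passed on with:
# do.call(STITCH::STITCH, args_trusted)
```

Building the arguments as a list keeps the two variants side by side and makes it easy to see which settings differ before starting a long run.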
Dear Robbie,

Thanks so much for this. I have tried to implement it.

Thanks again.

Cheers,
Guy

Loading required package: rrbgen
gzip: /home/reeves/diary_nov_stitch/ref_file/ref_legend.txt: not in gzip format
Apologies, I'll add that to the documentation.
I believe the haplotype, legend and recombination map files must be gzipped. I can't remember for which of them, if any, it is optional.
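A minimal base R sketch for checking whether a file is gzipped and writing a gzipped copy is shown below. The function names `is_gzipped` and `gzip_copy` are hypothetical helpers, and the demo file is a temporary toy legend using the header mentioned later in this thread.

```r
# Hypothetical helpers, base R only.

# TRUE if the file starts with the gzip magic bytes 0x1f 0x8b
is_gzipped <- function(path) {
  magic <- readBin(path, what = "raw", n = 2)
  identical(magic, as.raw(c(0x1f, 0x8b)))
}

# Write a gzipped copy of a plain-text file and return its path
gzip_copy <- function(path, out = paste0(path, ".gz")) {
  con <- gzfile(out, open = "wt")
  on.exit(close(con))
  writeLines(readLines(path), con)
  out
}

# Toy legend file, for demonstration only
plain <- tempfile(fileext = ".txt")
writeLines(c("rsID position a0 a1", "snp1 100 A G"), plain)
gz <- gzip_copy(plain)

plain_is_gz <- is_gzipped(plain)
copy_is_gz <- is_gzipped(gz)
```

Running the check on each reference file before starting STITCH would catch the "not in gzip format" error shown above without waiting for the run to fail.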
Hi Robbie,

I am getting this error:

[2021-12-10 00:12:44] Load and validate reference legend header

But as far as I can tell, the reference_legend_file is correctly formatted, with spaces as the column delimiters. Or have I misunderstood something about the IMPUTE format? Are there example files I could look at that I have missed?

Thanks,
Guy
I think the problem was that the reference_legend_file and the haplotype file contained some indels (length greater than 1 bp). When I removed them from both files, the error disappeared.
Hi,

For the first error, if you run through the examples, human example 2 uses a reference panel where the important columns are position, a0 and a1, and the rest are ignored.

Separately, yes, it is true that STITCH cannot accommodate indels properly, so they are not allowed in the input files.

Glad (I think) things are working properly for now. I'll see if I can update the README to make this clearer.

Best,
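As a rough illustration of removing indels from both reference files in step, here is a base R sketch on toy in-memory data. The values are invented for the example, the column names follow the header mentioned in this thread, and the haplotype file is assumed to have one row per legend row.

```r
# Toy legend: one SNP, one indel (a0 longer than 1 bp), one SNP
legend <- data.frame(
  rsID = c("snp1", "indel1", "snp2"),
  position = c(100L, 250L, 400L),
  a0 = c("A", "AT", "C"),
  a1 = c("G", "A", "T"),
  stringsAsFactors = FALSE
)

# Toy haplotypes: one row per legend row, 8 columns (4 diploid founders)
haps <- matrix(
  c(0, 1, 0, 1, 1, 0, 0, 1,
    1, 1, 0, 0, 1, 0, 1, 0,
    0, 0, 1, 1, 0, 1, 1, 0),
  nrow = 3, byrow = TRUE
)

# Keep only sites where both alleles are a single base
keep <- nchar(legend$a0) == 1 & nchar(legend$a1) == 1
legend_snps <- legend[keep, ]
haps_snps <- haps[keep, , drop = FALSE]

# The two files must stay row-aligned after filtering
stopifnot(nrow(legend_snps) == nrow(haps_snps))
```

Applying one logical index to both objects keeps the legend and haplotype rows aligned. The same filter would presumably need to be applied to the posfile as well.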
Hey,

To make things clearer for future users, I updated the README page. Can you please take a look, and let me know if that would have answered the questions you had when you were working through this?

Thanks
First of all, I would like to say that I am embarrassed that I forgot the header of the .gz reference_legend_file (rsID position a0 a1). I don't know how that happened; I thought I had checked and rechecked that. I read what you wrote in the README page and it is very clear, thank you very much.

I just want to check: even if you are using niterations == 1 and nhaps == K, should the output .vcf still be considered unphased?

Thanks,
Guy
Great! Sorry I didn't build a phaser into STITCH. Probably Beagle or IMPUTE2 would be your best bet on the output. If I had time I'd love to write a new version of STITCH. If and when I do, I'll definitely include phasing.
Hi,

I have been running STITCH on a 5838-sample dataset from an 11-generation Drosophila pedigree, in which all samples originate from 4 founder individuals. While 66% of the samples have < 2% missing sites, the remaining 33% have higher missing rates, up to 50%. High missing rates appear to be clustered in family trios (but not all siblings have high missing rates).
The missing rate does not appear to be correlated with mean genome coverage or depth (see graph).
Do you have any suggestions for how to reduce this amount of missing data without compromising the quality of imputation too much? (The Mendelian error rate in most families is low; again, high values are concentrated in particular families.)
Thanks
Guy
STITCH(
  tempdir = tempdir(),
  chr = "chr2L",
  chrStart = 1,
  chrEnd = 23506264,
  sampleNames_file = "sample_list5838.txt",
  posfile = "posChr2Lnoindel.txt",
  outputdir = paste0(getwd(), "/"),
  K = 5,
  nGen = 11,
  nCores = 10,
  outputSNPBlockSize = 1000,
  gridWindowSize = 10000,
  outputInputInVCFFormat = TRUE,
  regenerateInput = FALSE,
  originalRegionName = "chr2L",
  regenerateInputWithDefaultValues = TRUE
)
P.S. I have, I think, phased the four founders. Would trying QUILT have advantages?
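For reference, per-sample missing rates like those described in this post could be tallied along the following lines. This is a toy sketch: the genotype matrix, sample names and family labels are all invented for illustration, and how genotypes are read from the STITCH output is left out.

```r
# Toy genotype matrix: rows are sites, columns are samples, NA = missing call
geno <- matrix(
  c(0, 1, NA, 2,    # s1
    1, NA, NA, 0,   # s2
    2, 1, 0, 1),    # s3
  nrow = 4,
  dimnames = list(NULL, c("s1", "s2", "s3"))
)

# Fraction of missing sites per sample
missing_rate <- colMeans(is.na(geno))

# Samples above the 2% threshold used in the post
high_missing <- names(missing_rate)[missing_rate > 0.02]

# Hypothetical family labels, to see whether missingness clusters by family
family <- c(s1 = "famA", s2 = "famA", s3 = "famB")
family_mean <- tapply(missing_rate, family[names(missing_rate)], mean)
```

Comparing `family_mean` across families gives a quick view of whether high missingness is concentrated in particular trios.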
