Output for ancestral haplotype probabilities? #89
Hi,
I have 16-way multiparental populations with 2 RILs per population. The aim is to reconstruct the haploblocks from the 16 founders along the genome of the RILs. I'm therefore not so much interested in obtaining imputed genotypes as in the output dosages or haplotype probabilities. My question is whether, and where, I can find some kind of output containing the ancestry probabilities. I checked the output of the example and couldn't find it.
Best regards,
Jose
So I believe you're interested in this.
I get a VCF with entries like [VCF entry not preserved in this copy of the thread], where HD is defined as [FORMAT header line not preserved]. So, for instance, for the sample above at that SNP, the ancestral haplotype dosage for the 4 ancestral haps is [values not preserved]. Let me know if this is what you want or if you wanted something else?
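As a rough illustration of post-processing an HD entry (a minimal sketch only: it assumes HD is a comma-separated string of K per-haplotype dosages, which is an assumption since the VCF example above was not preserved):

```r
# Minimal sketch: parse one HD entry and find the highest-dosage founder.
# Assumes HD is a comma-separated string of K dosages (an assumption,
# since the original VCF example was not preserved in this thread).
hd_string <- "0.01,0.02,3.95,0.02"   # hypothetical entry for K = 4

hd <- as.numeric(strsplit(hd_string, ",")[[1]])
best_founder <- which.max(hd)        # index of the most likely ancestral hap
best_founder
#> [1] 3
```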
Thank you for your quick, great explanation :) I think that's indeed what I'm looking for. It would, however, be hard to post-process the VCF to simply obtain the most likely founder for every SNP/offspring. Something like: [desired output example not preserved]. I would like to emulate what Scott et al., 2016, did with STITCH to produce their figure (c). [figure not preserved]
Hi, I'm one of the authors on the paper with the figure you show. I did not create the figure, but I think it was done as follows. First, there is a perl script, vcf2df.pl, to extract the haplotype dosages from the STITCH VCF, available from https://github.com/michaelfscott/DIVERSE_MAGIC_WHEAT. Second, I think the plots are examples of stacked density charts, which can be produced as described here: https://r-graph-gallery.com/135-stacked-density-graph.html. Richard
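For the plotting step, a minimal sketch of a stacked chart might look like the following. The long-format data frame layout (one row per position/founder pair) is an assumption, and vcf2df.pl's actual output columns may differ; the gallery page uses geom_density, but for dosages along a chromosome a filled geom_area gives the same stacked look:

```r
library(ggplot2)

# Hypothetical long-format data: position, founder label, and dosage,
# as might be extracted from the STITCH VCF (column names are assumptions).
df <- data.frame(
  pos     = rep(seq(1e6, 2e6, length.out = 50), each = 4),
  founder = factor(rep(paste0("hap", 1:4), times = 50)),
  dosage  = runif(200)
)

# Stacked (filled) area chart: at each position, the founders' dosages
# are shown as proportions, similar to the figure being discussed.
ggplot(df, aes(x = pos, y = dosage, fill = founder)) +
  geom_area(position = "fill") +
  labs(x = "position (bp)", y = "haplotype dosage proportion")
```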
Thank you Richard, I will follow your instructions. Have a good day. Jose
Jose, you could also ask Mike Scott ([email protected]) exactly how he made the figure.
Hello again, I followed your guidelines and generated files with the highest-dosage founder for every SNP. The problem is that I am observing unexpectedly large numbers of recombination events, despite setting low expRate values. Can you think of any approach that could help ameliorate this? I noticed the option outputHaplotypeProbabilities; do you think this would work better for haplotype reconstruction? Best regards & thanks for the help provided,
The method can struggle and "get stuck" in local minima that have artifactual recombinations between ancestral haps. The method does try to find and resolve these, but it's not a fully rigorous approach.

One approach I use to evaluate this is to turn plot_shuffle_haplotype_attempts on, to get a representation of what the use of ancestral haplotypes looks like, on a per-sample level, for some of the samples. Feel free to post an example back here.

Another approach is to increase the total number of iterations, and increase the number of heuristic iterations that find and recover these ancestral haplotype recombination errors (shuffleHaplotypeIterations). So e.g. set 60 iterations, and set shuffleHaplotypeIterations = c(4, 8, 12, 16, 20, 24), or similar. Maybe also disable splitReadIterations.

To confirm, are you using K=16 for your run of STITCH? What's your total N? What does the plot of hapSum look like; does it seem like it's using all K=16?

Robbie
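Put together as a call, those suggestions might look something like the sketch below. Paths and nGen are placeholders, argument spellings (e.g. niterations) should be checked against the STITCH documentation, and splitReadIterations = NA as the way to disable that heuristic is my reading of the advice, not something stated in the thread:

```r
library(STITCH)

# Sketch of a run following the suggestions above; all values are
# placeholders or assumptions to verify against the STITCH docs.
STITCH(
  chr = "chr1",
  bamlist = "bamlist.txt",
  posfile = "pos.txt",
  outputdir = "stitch_out",
  K = 16,
  nGen = 4,                         # few intervening generations here
  niterations = 60,                 # more EM iterations overall
  shuffleHaplotypeIterations = c(4, 8, 12, 16, 20, 24),
  splitReadIterations = NA,         # assumed way to disable this heuristic
  plot_shuffle_haplotype_attempts = TRUE,
  nCores = 1
)
```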
By total N, I meant your total sample size.

For what you gave above: when you say "every run takes 16 founders and 2 additional RILs", how many RILs do you have? The ideal way to run STITCH would be with all samples at the same time (i.e. all RILs).

I agree the first plot looks far messier than I would like to see (though the interpretation of this depends on N). There are multiple samples that recombine at the same location. It certainly suggests the heuristics could use some work.

For shuffleHaplotypeIterations, I would recommend turning those off for the last 10 or 20 or so iterations, as that process (the shuffleHaplotypeIterations) also adds a lot of noise to help find good global results. So the last 10 or 20 iterations help get rid of that noise.

There should be a HapSum plot that looks much blockier? A rectangle full of colours.
Hi @rwdavies, I tried some of the changes you suggested last week, like nIterations = 100, shuffleHaplotypeIterations = seq(4, nIterations - 20, 4), and disabling splitReadIterations, besides some others like increasing maxRate or shuffle_bin_radius. Despite sometimes getting plots with fewer shuffles, it's really hard to reconstruct a realistic mosaic of the highest-dosage haplotypes, as there are still too many short haploblocks.

Regarding the families, I have 26 of them and, in each one, Brassica napus plants developed from 16 inbred founders were crossed through four generations, and 2 RILs were developed by selfing the same individual from the last generation. Therefore, I'm combining the 2 RILs with the 16 founders of the same family for each STITCH run, since I understand that adding unrelated individuals from other families might confuse the program. Coming back to your question, would it be N=18?

Here's the hapSum plot. [plot not preserved]

Best regards,
Hi again, I just realized that some misconceptions might have been holding me back. I have high-coverage data for all 18 samples. However, I was only adding the founders' high-coverage data to the genfile. Also, I was assuming that K = 'number of founders' should match the number of individuals in the genfile. Perhaps using the RILs' high-coverage data would improve haplotype reconstruction.

When I get the haplotype dosages, the founders are numbered from 1 to 16. Do these numbers refer in any way to the actual samples used? In other words, is the genfile data used for more precise imputation, or as the founders for reconstructing haplotypes among the RILs? Also, given that I have 26 families, does it make sense to use S=26 with K=16?

Sorry if the questions sound stupid. I hope you can shed some light on this. Best regards,
Hi again, When I added only the genotypes of the RILs and no information about the founders, the number of haplotypes was reasonably OK. I think the problem was that I thought STITCH took the samples in the gen file as founders to impute from, but apparently they need to be supplied as a reference. We can consider this issue resolved.

I started running with reference files (hap, legend and samples). I'm using a single VCF file for all founders, splitting them by population (16-way MPPs) in the samples file, and running STITCH with a different gen file for each pair of RILs in each population. I'll post a new issue about an error related to running STITCH with reference files. Best regards & thank you very much for your help,
Hi, OK, taking a step back and re-reading your original message, it's probably good to go back over some things.

The most beneficial use case of STITCH is when you have lots (hundreds to thousands) of low-coverage (<2x) samples without any knowledge of how the samples are related. STITCH then does a pretty good job imputing.

Here you have 16 inbred founders at high coverage, and for each of 26 populations you have 2 new RILs, with very few intervening generations, and those samples are also at high coverage, and you want to paint those samples as mosaics of the founders, correct?

@rmott might have suggestions, as this is much more up his alley, and it's a classic problem, though STITCH can do this too; you'd want to turn off any EM in STITCH for this. You'd want something like this example: [example not preserved]

Hope that helps; let me know if that makes sense and is useful? Best,

P.S. Sorry for my slow reply
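Since the original example was not preserved in this copy of the thread, here is only a guess at its shape: a run painting samples against a founder reference panel with the EM effectively turned off. Every parameter value below is an assumption to check against the STITCH documentation, not Robbie's actual example:

```r
library(STITCH)

# Hypothetical reconstruction of a "no EM" run against a founder panel.
# All values are assumptions; the original example was not preserved.
STITCH(
  chr = "chr1",
  bamlist = "bamlist.txt",
  posfile = "pos.txt",
  outputdir = "stitch_out",
  K = 16,
  nGen = 4,
  niterations = 1,                    # effectively no EM updates
  shuffleHaplotypeIterations = NA,    # heuristics off
  splitReadIterations = NA,
  refillIterations = NA,
  reference_haplotype_file = "founders.hap.gz",
  reference_legend_file = "founders.legend.gz",
  nCores = 1
)
```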
Hi, thank you for your suggestions.
That's correct. Interestingly, I read that Scott et al., 2021, who had a very similar population structure, turned off iterations as you suggested. Besides, they also turned off reference_shuffleHaplotypeIterations. Would you also advise that?
I changed the method. Now I called variants jointly among the 16 founders of one family and then used the output VCF files to produce the reference files. I applied exactly your proposed parameters to STITCH and ran it for this family on the BAM files of the RILs (without a genfile). I will keep you updated :) All best,
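For the VCF-to-reference-files step, one common route from a jointly called founder VCF to hap/legend/sample files is bcftools convert; a sketch is below (file names are placeholders, bcftools must be installed, and this may not be the exact command used here):

```r
# Sketch: produce founders.hap.gz / founders.legend.gz / founders.samples
# from the founders' jointly called VCF via bcftools (file names are
# placeholders; this may differ from the command actually used).
system("bcftools convert --haplegendsample founders founders.joint.vcf.gz")
```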
Hi again, After implementing the changes, the plots are beginning to look much better and the recombination patterns make more sense.

I have one question. I realized that when I provide the sample file and a population containing 16 individuals, it raises an error if the haplotype file contains 16 columns, as it requires it to have 32. In that case, K must be 32, or the same error is raised. Is there any way of specifying 16 founders and 16 reference samples together, something like telling it the samples are homozygous? Is there any advantage to using the sample file? Thanks again,
Sorry, the way it's done currently, it assumes that reference haplotypes are from "individuals" that are diploid, whereas you have diploid homozygous chromosomes, so "individuals" isn't quite right, as you want to just give the haplotypes once.

You shouldn't need to give a sample file, though, right? That's only needed if you want to do things like exclude populations on the fly, for instance if you were using a 1000 Genomes (humans) dataset, and prepared it once, but wanted to make separate panels for certain subpopulations.

Also, sorry for my slow reply.
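One workaround consistent with this (an assumption on my part, not something Robbie prescribes here) is to make each inbred founder look like a homozygous diploid by duplicating its haplotype column, so a 16-founder panel becomes 32 haplotype columns:

```r
# Sketch: duplicate each founder's haplotype column so 16 inbred founders
# become 16 homozygous "diploid individuals" (32 columns). File names are
# placeholders; .hap files are assumed to be space-separated 0/1 matrices.
hap <- as.matrix(read.table("founders.hap.gz"))   # sites x 16 haplotypes
stopifnot(ncol(hap) == 16)

# Interleave each column with a copy of itself: 1,1,2,2,...,16,16
hap32 <- hap[, rep(seq_len(ncol(hap)), each = 2)]

write.table(hap32, gzfile("founders.dup.hap.gz"),
            row.names = FALSE, col.names = FALSE)
```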