Crash when many samples have no informative reads #102
Hey, is this scaffold 700 bases long? Given the command
The scaffold is very short, but this command is running in a workflow that chops up regions by where there are SNPs, so the scaffold is slightly longer (867 bases). Is there some rule of thumb I can use to understand when/why STITCH will not be able to work on small scaffolds or short regions?
STITCH should be able to generate non-called GTs. I suppose this error may be due to the parallel function. Try reducing the number of cores to 1.
So my solution for this was never very elegant. The code is here to make fake sample reads when the sample has no reads that intersect SNPs in the region of interest. Are there perhaps 3 or fewer SNPs in this scaffold? Otherwise, I would suggest a simple rule might be to only run regions (e.g. scaffolds) with more than 5,000 bases. I don't have a strong basis for this number; it's just that, to make the imputation efficient with low-coverage data, the region should be long enough to leverage information across different samples. If you reproducibly hit this error with > 3 SNPs and > 5,000 bases, let us know; maybe something else is going on.
There are 106 SNPs, though the region is much shorter than 5000 bases. I'll restrict to scaffolds with > 5000 bases and see if that eliminates this kind of crash. If that works, I could also try reducing to 1 core for those regions and see if they run successfully.
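The rule of thumb discussed above (regions longer than 5,000 bases with more than a handful of SNPs) can be expressed as a small pre-check before launching STITCH on each region. This is an illustrative sketch; the thresholds and the function name are mine, not STITCH parameters:

```python
# Hedged sketch: skip regions unlikely to impute well, per the rule of
# thumb discussed in this thread. MIN_BP and MIN_SNPS are illustrative.
MIN_BP = 5000
MIN_SNPS = 6

def keep_region(start, end, snp_positions):
    """Return True if the region [start, end] is worth running."""
    length = end - start + 1
    n_snps = sum(start <= p <= end for p in snp_positions)
    return length > MIN_BP and n_snps > MIN_SNPS

# Example: an 867 bp scaffold is rejected on length alone,
# no matter how many SNPs it carries.
print(keep_region(69, 935, range(69, 936, 8)))  # False
```

A workflow could run this check per scaffold and route rejected regions to a no-call path instead of crashing mid-cohort.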
There are 106 SNPs in (762 - 64) bases? That doesn't obviously explain any error, but that is super high heterozygosity if these are true-positive SNPs. Are any of them exactly at the start or end of the region? I think I've written tests against those edge cases, but wonder if something is still going on. Another thing: you could set
It's a plant genome, so I wouldn't be shocked if some of this was just bad reference. The regions are chosen by a program that breaks them up by SNPs, so yes, first and last base have a SNP in them. I can put a buffer in though so that isn't the case in the future. (I am running with a large value for |
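The buffering idea mentioned above (making sure no SNP sits exactly on the first or last base of a region) might look like the sketch below. Note this is a workflow-side fix, not STITCH's own `buffer` option, and the function name and pad size are illustrative:

```python
def pad_region(start, end, snp_positions, pad=10):
    """Expand a region so no SNP sits exactly on its first or last base.

    snp_positions: a set of SNP coordinates; pad is an arbitrary step size.
    """
    while start in snp_positions:
        start -= pad
    while end in snp_positions:
        end += pad
    # never walk off the start of the scaffold
    return max(start, 1), end
```

A region-chopping tool could apply this before emitting regions, so the edge cases never reach STITCH.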
A quick follow-up: I restricted to regions > 5000 bp with more than 6 SNPs, but I am still seeing crashes when many of the CRAMs have no informative reads in an area. The log below shows a crash in a region with 241 variants in 7025 bp:

STITCH.R --chr=Scaffold01615 --regionStart=69 --regionEnd=7094 --buffer=1000000 --cramlist tmp_cramlist --posfile=tmp_region_positions.pos --K=10 --nGen=20 --nCores=64 --refillIterations=NA --downsampleToCov=50 --outputdir=. --splitReadIterations=NA --reference ref.fa --tempdir . --keepSampleReadsInRAM=FALSE --outputSNPBlockSize 1000 --gridWindowSize 1000 --output_filename stitch.region-871.vcf.gz

[2024-09-21 02:43:15] Running STITCH(chr = Scaffold01615, nGen = 20, posfile = tmp_region_positions.pos, K = 10, S = 1, outputdir = ., nStarts = , tempdir = ., bamlist = , cramlist = tmp_cramlist, sampleNames_file = , reference = Brassica_oleracea.BOL.dna.toplevel.fa, genfile = , method = diploid, output_format = bgvcf, B_bit_prob = 16, outputInputInVCFFormat = FALSE, downsampleToCov = 50, downsampleFraction = 1, readAware = TRUE, chrStart = NA, chrEnd = NA, regionStart = 69, regionEnd = 7094, buffer = 1000000, maxDifferenceBetweenReads = 1000, maxEmissionMatrixDifference = 1e+10, alphaMatThreshold = 1e-04, emissionThreshold = 1e-04, iSizeUpperLimit = 600, bqFilter = 17, niterations = 40, shuffleHaplotypeIterations = c(4, 8, 12, 16), splitReadIterations = NA, nCores = 64, expRate = 0.5, maxRate = 100, minRate = 0.1, Jmax = 1000, regenerateInput = TRUE, originalRegionName = NA, keepInterimFiles = FALSE, keepTempDir = FALSE, outputHaplotypeProbabilities = FALSE, switchModelIteration = NA, generateInputOnly = FALSE, restartIterations = NA, refillIterations = NA, downsampleSamples = 1, downsampleSamplesKeepList = NA, subsetSNPsfile = NA, useSoftClippedBases = FALSE, outputBlockSize = 1000, outputSNPBlockSize = 1000, inputBundleBlockSize = NA, genetic_map_file = , reference_haplotype_file = , reference_legend_file = , reference_sample_file = , reference_populations = NA, reference_phred = 20, reference_iterations = 40, reference_shuffleHaplotypeIterations = c(4, 8, 12, 16), output_filename = stitch.region-871.vcf.gz, initial_min_hapProb = 0.2, initial_max_hapProb = 0.8, regenerateInputWithDefaultValues = FALSE, plotHapSumDuringIterations = FALSE, plot_shuffle_haplotype_attempts = FALSE, plotAfterImputation = TRUE, save_sampleReadsInfo = FALSE, gridWindowSize = 1000, shuffle_bin_nSNPs = NULL, shuffle_bin_radius = 5000, keepSampleReadsInRAM = FALSE, useTempdirWhileWriting = FALSE, output_haplotype_dosages = FALSE, use_bx_tag = TRUE, bxTagUpperLimit = 50000)
[2024-09-21 02:43:15] Program start
[2024-09-21 02:43:15] Get and validate pos and gen
[2024-09-21 02:43:15] Done get and validate pos and gen
[2024-09-21 02:43:15] There are 0 variants in the left buffer region -999931 <= position < 69
[2024-09-21 02:43:15] There are 241 variants in the central region 69 <= position <= 7094
[2024-09-21 02:43:15] There are 0 variants in the right buffer region 7094 < position <= 1007094
[2024-09-21 02:43:15] Get CRAM sample names
[2024-09-21 02:43:20] Done getting CRAM sample names
[2024-09-21 02:43:20] Generate inputs
[2024-09-21 02:43:21] Load and convert CRAM 600 of 767
[2024-09-21 02:43:21] WARNING - sample S.566812105_S.566812105 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:21] WARNING - sample S.568012523_S.568012523 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:21] WARNING - sample S.567512357_S.567512357 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:21] WARNING - sample S.567412325_S.567412325 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:22] WARNING - sample S.568312628_S.568312628 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:22] WARNING - sample S.568412654_S.568412654 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:23] WARNING - sample S.566712069_S.566712069 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:24] WARNING - sample S.567312273_S.567312273 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:24] WARNING - sample S.568112542_S.568112542 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:24] WARNING - sample S.567312269_S.567312269 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:24] WARNING - sample S.567912497_S.567912497 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:25] WARNING - sample S.568812791_S.568812791 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:25] WARNING - sample S.566712064_S.566712064 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:25] WARNING - sample S.568612727_S.568612727 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:26] WARNING - sample S.567712427_S.567712427 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:26] WARNING - sample S.566411986_S.566411986 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:27] WARNING - sample S.568312621_S.568312621 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:27] WARNING - sample S.568312619_S.568312619 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:28] WARNING - sample S.568712746_S.568712746 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:30] Load and convert CRAM 100 of 767
[2024-09-21 02:43:31] WARNING - sample S.566612045_S.566612045 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:32] WARNING - sample S.567912494_S.567912494 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:33] Load and convert CRAM 700 of 767
[2024-09-21 02:43:33] WARNING - sample S.568912824_S.568912824 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:33] WARNING - sample S.566712073_S.566712073 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:33] WARNING - sample S.568612711_S.568612711 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:33] WARNING - sample S.568412661_S.568412661 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:34] WARNING - sample S.568812770_S.568812770 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:34] WARNING - sample S.568212601_S.568212601 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:34] Load and convert CRAM 400 of 767
[2024-09-21 02:43:35] WARNING - sample S.566712072_S.566712072 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:36] WARNING - sample S.567512356_S.567512356 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:36] WARNING - sample S.568512667_S.568512667 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:36] WARNING - sample S.568012520_S.568012520 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:37] WARNING - sample S.567712431_S.567712431 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:37] WARNING - sample S.567612390_S.567612390 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:37] WARNING - sample S.567312290_S.567312290 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:39] WARNING - sample S.568312633_S.568312633 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:39] WARNING - sample S.567912473_S.567912473 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:40] WARNING - sample S.568112557_S.568112557 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:40] WARNING - sample S.568012529_S.568012529 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:40] Load and convert CRAM 200 of 767
[2024-09-21 02:43:40] WARNING - sample S.566411977_S.566411977 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:41] WARNING - sample S.567312278_S.567312278 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:41] WARNING - sample S.567312286_S.567312286 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:41] WARNING - sample S.569012834_S.569012834 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:41] WARNING - sample S.568012517_S.568012517 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:43] Load and convert CRAM 500 of 767
[2024-09-21 02:43:44] WARNING - sample S.567112207_S.567112207 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:45] WARNING - sample S.567912465_S.567912465 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:46] WARNING - sample S.567512347_S.567512347 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:46] WARNING - sample S.567112223_S.567112223 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:46] WARNING - sample S.568812779_S.568812779 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:48] WARNING - sample S.567612375_S.567612375 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:48] WARNING - sample S.568912805_S.568912805 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:50] WARNING - sample S.569012835_S.569012835 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:51] WARNING - sample S.567312293_S.567312293 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:52] WARNING - sample S.568512684_S.568512684 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:52] WARNING - sample S.567312285_S.567312285 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:52] WARNING - sample S.567712403_S.567712403 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:53] WARNING - sample S.568512695_S.568512695 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:53] WARNING - sample S.566912160_S.566912160 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:54] WARNING - sample S.568412634_S.568412634 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:55] Load and convert CRAM 300 of 767
[2024-09-21 02:43:58] Done generating inputs
[2024-09-21 02:43:58] Copying files onto tempdir
[2024-09-21 02:47:05] Done copying files onto tempdir
[2024-09-21 02:47:05] Generate allele count
[2024-09-21 02:47:05] Error in readChar(con, 5L, useBytes = TRUE) : cannot open the connection
Error in check_mclapply_OK(out2) :
Such errors arose from reading or writing operations within some of the parallel processes. For debugging:
I have been periodically working on this issue. I've excluded all samples that have no informative reads and I'm still getting a crash. Here's the log from a run with
Are you able to make a tarball with all the inputs so I can try it on my side? Say, extract all the BAMs to just the scaffold of interest and I can run it through? Thanks for your continued interest in this despite the crashing!
The data was too large to upload to GitHub, so I've uploaded the tarball to a pre-signed AWS URL. The link will expire in 12 hours, so if you can't download it before then, let me know when you're ready and I'll make another URL.
@rwdavies I just wanted to check in: were you able to download the data?
Hi, yes I was able to, thanks. I spent a bit of time on it on Friday but ran into some weird C++ errors that I was struggling to reproduce. I need to find time to take another look. Maybe once Zilong is back from vacation I can work on it with him a bit too.
Great. I understand that the fix may take time; the annoying 12-hour limit just made me worried you didn't get the data. Thanks for looking into it. In case it comes up, this is running STITCH installed from conda, specifically
@rwdavies Just checking in: have you been able to verify the issue? We're trying to decide whether we should have one of our C++ experts take a look.
Hi @TedBrookings, I am just back from vacation. Could you share your data again? I'll look into it tomorrow. Thanks,
@Zilong-Li
Hi @TedBrookings, there are no reads in the region for one of your samples that pass the filters. I pushed a commit that will report the error and stop the program when this happens: c95a7dd. Maybe you can use a bigger region to include more informative reads. Best,
This region is a small scaffold that only some of the samples map to, so enlarging it isn't possible. Can you tell me which sample has no reads? I am currently just using the latest conda release, so building STITCH from scratch would take me some time. I had attempted to remove all the samples that had no informative reads: I checked that mapping quality >= 17, template length <= 600, and that each kept read overlapped at least one SNP with a base quality >= 17 and a base matching either REF or ALT. It's possible I made a mistake, though, so if you could point out the failing sample (or tell me if I missed a criterion), that might enable me to make a workaround.
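For reference, the pre-filter criteria listed above can be sketched as follows. This is a hedged mimic of the logic described (mapping quality >= 17, template length <= 600, at least one covered SNP with base quality >= 17 and a REF or ALT base), operating on already-parsed read records; the Read class and its field names are illustrative, not STITCH's or pysam's API:

```python
from dataclasses import dataclass

BQ_MIN = 17      # quality floor, mirroring STITCH's bqFilter = 17
ISIZE_MAX = 600  # template length cap, mirroring iSizeUpperLimit = 600

@dataclass
class Read:
    mapq: int   # mapping quality
    tlen: int   # observed template length (SAM TLEN-style)
    bases: dict # position -> (base, base_quality) for covered positions

def is_informative(read, snps):
    """snps: position -> (ref, alt). True if the read passes the filters
    and covers at least one SNP with a confident REF or ALT base."""
    if read.mapq < BQ_MIN or abs(read.tlen) > ISIZE_MAX:
        return False
    for pos, (ref, alt) in snps.items():
        hit = read.bases.get(pos)
        if hit is not None and hit[1] >= BQ_MIN and hit[0] in (ref, alt):
            return True
    return False
```

In practice the records would come from a CRAM via pysam; a sample with no reads for which `is_informative` returns True is the kind that triggers the error above.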
@Zilong-Li I noticed you made a new release that has the error message, as well as some changes to CIGAR-string parsing that might be relevant. Do you think you could push that release to bioconda?
Thanks. Will do. I thought it was auto-bumped.
Hey @TedBrookings, not sure if you found a way to check and filter your data. It's much easier to use samtools or minipileup to quickly check whether your BAM has read coverage at a specific position.
Hi @Zilong-Li, it took a while to figure out, but the problem was that one of the samples had a too-large template; I had missed that STITCH defines insert size differently than I expected. I had seen that STITCH iterated over reads individually using the
If you want, you can close this issue now, but I would suggest that at a minimum it would be nice to:
Finally, this is a bigger ask: I think it would be really nice to have a command-line option to output sample GTs etc. as NOCALL when the sample has no informative reads, rather than erroring. It seems like a reasonable use case to run STITCH on large low-coverage cohorts with ALT scaffolds, or on sub-sections of a contig; in either case, some samples may not have informative reads. I've gotten around this by pre-filtering with a Python tool that duplicates STITCH's logic for determining whether a CRAM/BAM has informative reads, pulling uninformative samples out of the cohort, running STITCH, then running a second tool to re-insert those samples into the cohort with all-NOCALL GTs. But this is a lot of work for each individual end user, and requires me to keep up to date with any changes to how STITCH decides whether a read is informative.
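The re-insertion step described above can be sketched on plain VCF text lines. This is an illustrative toy, not the actual tool; a real pipeline would use pysam or bcftools, and it assumes a GT-only FORMAT column for simplicity:

```python
def add_nocall_samples(vcf_lines, extra_samples):
    """Append excluded samples to a VCF body with ./. (NOCALL) genotypes.

    vcf_lines: the #CHROM header line plus tab-separated record lines.
    extra_samples: names of samples that were pre-filtered out.
    """
    out = []
    for line in vcf_lines:
        if line.startswith("#CHROM"):
            # extend the sample columns in the header
            out.append(line + "\t" + "\t".join(extra_samples))
        else:
            # every record gets a ./. genotype per re-inserted sample
            out.append(line + "\t" + "\t".join(["./."] * len(extra_samples)))
    return out
```

With a native NOCALL option in STITCH, this whole post-processing step (and the matching pre-filter) would disappear.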
Hi @TedBrookings, thanks for these nice suggestions! I'll try to implement them when I have time. BTW, the function used to get the insert size in STITCH is BamRecord.InsertSize(), which is the observed template length, instead of the
I'm working with a genome that has a large number of small scaffolds. In a cohort of 767 samples, 106 of the samples have no informative reads on one of the scaffolds. This results in a crash (log below). I'm wondering if it would be possible to add an option to exclude such samples and assign them NOCALL GTs. The alternative seems to be detecting this myself, making a VCF of all no-calls (e.g. with pysam), and merging them.
Run log