Crash when many samples have no informative reads #102

Open
TedBrookings opened this issue Sep 17, 2024 · 25 comments

@TedBrookings

I'm working with a genome that has a large number of small scaffolds. In a cohort of 767 samples, 106 of the samples have no informative reads on one of the scaffolds. This results in a crash (log below). I'm wondering if it would be possible to add an option that excludes such samples and assigns them NOCALL GTs. The alternative seems to be detecting this myself, making a VCF of all no-calls (e.g. with pysam), and merging it back in.
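
For concreteness, that pysam route would look roughly like the sketch below: write a VCF containing only ./. genotypes for the affected samples, then merge it with the STITCH output afterwards. The file names, sample names, and contig length here are placeholders, and I'm assuming a whitespace-delimited chr/pos/ref/alt posfile sorted by position.

```python
# Minimal sketch: build a no-call (./.) VCF for samples excluded because they
# have no informative reads, so it can be merged back with the STITCH output.
# File names, sample names, and the contig length are placeholders.
import pysam

posfile = "tmp_region_positions.pos"      # assumed: chr, pos (1-based), ref, alt
excluded = ["S.566812105_S.566812105",    # samples with no informative reads
            "S.567712412_S.567712412"]

header = pysam.VariantHeader()
header.add_line('##contig=<ID=Scaffold05208,length=867>')   # assumed length
header.add_line('##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">')
for sample in excluded:
    header.add_sample(sample)

with pysam.VariantFile("nocalls.vcf", "w", header=header) as out:
    with open(posfile) as fh:
        for line in fh:
            chrom, pos, ref, alt = line.split()[:4]
            rec = out.new_record(contig=chrom, start=int(pos) - 1,
                                 alleles=(ref, alt))
            for sample in excluded:
                rec.samples[sample]["GT"] = (None, None)   # ./.
            out.write(rec)

# bgzip-compress and tabix-index so the file can be merged later
pysam.tabix_index("nocalls.vcf", preset="vcf", force=True)
```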

Run log
+ STITCH.R --chr=Scaffold05208 --regionStart=64 --regionEnd=762 --buffer=1000000 --cramlist crams.list \
    --posfile=tmp_region_positions.pos --K=10 --nGen=20 --nCores=64 --refillIterations=NA --downsampleToCov=50 \
    --outputdir=. --splitReadIterations=NA --reference ref.dna.fa --tempdir . --keepSampleReadsInRAM=TRUE \
    --output_filename stitch.region-03721.vcf.gz
[2024-09-16 23:12:02] Running STITCH(chr = Scaffold05208, nGen = 20, posfile = tmp_region_positions.pos, K = 10, S = 1, outputdir = ., nStarts = , tempdir = ., bamlist = , cramlist = crams.list, sampleNames_file = , reference = ref.dna.fa, genfile = , method = diploid, output_format = bgvcf, B_bit_prob = 16, outputInputInVCFFormat = FALSE, downsampleToCov = 50, downsampleFraction = 1, readAware = TRUE, chrStart = NA, chrEnd = NA, regionStart = 64, regionEnd = 762, buffer = 1000000, maxDifferenceBetweenReads = 1000, maxEmissionMatrixDifference = 1e+10, alphaMatThreshold = 1e-04, emissionThreshold = 1e-04, iSizeUpperLimit = 600, bqFilter = 17, niterations = 40, shuffleHaplotypeIterations = c(4, 8, 12, 16), splitReadIterations = NA, nCores = 64, expRate = 0.5, maxRate = 100, minRate = 0.1, Jmax = 1000, regenerateInput = TRUE, originalRegionName = NA, keepInterimFiles = FALSE, keepTempDir = FALSE, outputHaplotypeProbabilities = FALSE, switchModelIteration = NA, generateInputOnly = FALSE, restartIterations = NA, refillIterations = NA, downsampleSamples = 1, downsampleSamplesKeepList = NA, subsetSNPsfile = NA, useSoftClippedBases = FALSE, outputBlockSize = 1000, outputSNPBlockSize = 10000, inputBundleBlockSize = NA, genetic_map_file = , reference_haplotype_file = , reference_legend_file = , reference_sample_file = , reference_populations = NA, reference_phred = 20, reference_iterations = 40, reference_shuffleHaplotypeIterations = c(4, 8, 12, 16), output_filename = stitch.region-03721.vcf.gz, initial_min_hapProb = 0.2, initial_max_hapProb = 0.8, regenerateInputWithDefaultValues = FALSE, plotHapSumDuringIterations = FALSE, plot_shuffle_haplotype_attempts = FALSE, plotAfterImputation = TRUE, save_sampleReadsInfo = FALSE, gridWindowSize = NA, shuffle_bin_nSNPs = NULL, shuffle_bin_radius = 5000, keepSampleReadsInRAM = TRUE, useTempdirWhileWriting = FALSE, output_haplotype_dosages = FALSE, use_bx_tag = TRUE, bxTagUpperLimit = 50000)
[2024-09-16 23:12:02] Program start
[2024-09-16 23:12:02] Get and validate pos and gen
[2024-09-16 23:12:02] Done get and validate pos and gen
[2024-09-16 23:12:02] There are 0 variants in the left buffer region -999936 <= position < 64
[2024-09-16 23:12:02] There are 106 variants in the central region 64 <= position <= 762
[2024-09-16 23:12:02] There are 0 variants in the right buffer region 762 < position <= 1000762
[2024-09-16 23:12:02] Get CRAM sample names
[2024-09-16 23:12:06] Done getting CRAM sample names
[2024-09-16 23:12:06] Generate inputs
[2024-09-16 23:12:06] Load and convert CRAM 600 of 767
[2024-09-16 23:12:06] WARNING - sample S.566812105_S.566812105 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:06] WARNING - sample S.567712412_S.567712412 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:06] WARNING - sample S.567512357_S.567512357 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:07] WARNING - sample S.567312269_S.567312269 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:07] WARNING - sample S.567412325_S.567412325 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:07] WARNING - sample S.568312628_S.568312628 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:07] WARNING - sample S.567212244_S.567212244 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:08] WARNING - sample S.566612041_S.566612041 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:08] WARNING - sample S.568412643_S.568412643 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:08] WARNING - sample S.568312619_S.568312619 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:08] WARNING - sample S.568812799_S.568812799 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:09] WARNING - sample S.566712069_S.566712069 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:09] WARNING - sample S.567012167_S.567012167 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:09] WARNING - sample S.567312273_S.567312273 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:10] WARNING - sample S.566712078_S.566712078 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:10] WARNING - sample S.567912501_S.567912501 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:10] WARNING - sample S.568112542_S.568112542 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:10] WARNING - sample S.568412654_S.568412654 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:10] WARNING - sample S.567712427_S.567712427 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:11] WARNING - sample S.568212572_S.568212572 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:11] WARNING - sample S.566812120_S.566812120 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:11] WARNING - sample S.568512679_S.568512679 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:11] Load and convert CRAM 100 of 767
[2024-09-16 23:12:11] WARNING - sample S.566612045_S.566612045 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:11] WARNING - sample S.568312621_S.568312621 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:12] WARNING - sample S.568612727_S.568612727 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:12] WARNING - sample S.568812791_S.568812791 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:14] WARNING - sample S.566912136_S.566912136 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:15] WARNING - sample S.568912824_S.568912824 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:15] WARNING - sample S.567012184_S.567012184 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:15] WARNING - sample S.568712754_S.568712754 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:16] WARNING - sample S.568412661_S.568412661 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:16] WARNING - sample S.568112552_S.568112552 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:17] WARNING - sample S.568812802_S.568812802 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:17] WARNING - sample S.567912496_S.567912496 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:17] WARNING - sample S.567712416_S.567712416 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:17] WARNING - sample S.566712073_S.566712073 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:17] Load and convert CRAM 400 of 767
[2024-09-16 23:12:18] WARNING - sample S.567012185_S.567012185 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:18] WARNING - sample S.568812770_S.568812770 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:18] WARNING - sample S.567012189_S.567012189 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:19] WARNING - sample S.566712072_S.566712072 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:19] WARNING - sample S.567612390_S.567612390 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:20] WARNING - sample S.567512356_S.567512356 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:20] WARNING - sample S.568512667_S.568512667 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:21] WARNING - sample S.568712744_S.568712744 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:21] WARNING - sample S.569012836_S.569012836 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:21] WARNING - sample S.566912135_S.566912135 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:21] Load and convert CRAM 700 of 767
[2024-09-16 23:12:21] WARNING - sample S.568312611_S.568312611 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:21] WARNING - sample S.567612380_S.567612380 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:21] WARNING - sample S.567312290_S.567312290 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:22] WARNING - sample S.568212587_S.568212587 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:22] WARNING - sample S.567712431_S.567712431 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:22] WARNING - sample S.568912804_S.568912804 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:22] WARNING - sample S.567312286_S.567312286 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:23] WARNING - sample S.567312278_S.567312278 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:23] WARNING - sample S.567912473_S.567912473 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:23] WARNING - sample S.567012193_S.567012193 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:23] WARNING - sample S.566612053_S.566612053 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:24] WARNING - sample S.568312633_S.568312633 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:24] WARNING - sample S.568012520_S.568012520 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:24] WARNING - sample S.568212581_S.568212581 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:24] WARNING - sample S.567112207_S.567112207 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:25] WARNING - sample S.568112547_S.568112547 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:25] WARNING - sample S.567012199_S.567012199 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:25] WARNING - sample S.568412647_S.568412647 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:25] WARNING - sample S.569012834_S.569012834 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:25] WARNING - sample S.568112557_S.568112557 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:25] Load and convert CRAM 200 of 767
[2024-09-16 23:12:26] WARNING - sample S.568012517_S.568012517 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:27] WARNING - sample S.568912828_S.568912828 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:27] WARNING - sample S.568912825_S.568912825 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:27] WARNING - sample S.568212577_S.568212577 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:27] WARNING - sample S.568912805_S.568912805 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:27] WARNING - sample S.567112208_S.567112208 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:28] WARNING - sample S.567912495_S.567912495 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:28] WARNING - sample S.566411977_S.566411977 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:28] Load and convert CRAM 500 of 767
[2024-09-16 23:12:28] WARNING - sample S.567712423_S.567712423 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:28] WARNING - sample S.567012169_S.567012169 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:28] WARNING - sample S.568512687_S.568512687 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:28] WARNING - sample S.566411964_S.566411964 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:29] WARNING - sample S.568112554_S.568112554 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:29] WARNING - sample S.567112209_S.567112209 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:29] WARNING - sample S.567312285_S.567312285 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:30] WARNING - sample S.567012194_S.567012194 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:30] WARNING - sample S.566712076_S.566712076 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:30] WARNING - sample S.567912464_S.567912464 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:30] WARNING - sample S.566812123_S.566812123 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:31] WARNING - sample S.567612382_S.567612382 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:31] WARNING - sample S.567112223_S.567112223 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:31] WARNING - sample S.567012165_S.567012165 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:31] WARNING - sample S.568312614_S.568312614 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:32] WARNING - sample S.568212575_S.568212575 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:32] WARNING - sample S.567312293_S.567312293 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:33] WARNING - sample S.568812795_S.568812795 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:33] WARNING - sample S.569012835_S.569012835 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:35] WARNING - sample S.568512664_S.568512664 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:35] WARNING - sample S.567612375_S.567612375 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:35] WARNING - sample S.568812779_S.568812779 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:37] WARNING - sample S.569012844_S.569012844 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:37] WARNING - sample S.568512684_S.568512684 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:38] WARNING - sample S.567712403_S.567712403 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:38] Load and convert CRAM 300 of 767
[2024-09-16 23:12:38] WARNING - sample S.568012527_S.568012527 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:38] WARNING - sample S.568512695_S.568512695 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:38] WARNING - sample S.566912160_S.566912160 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:38] WARNING - sample S.567312287_S.567312287 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:39] WARNING - sample S.568612729_S.568612729 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:39] WARNING - sample S.568212582_S.568212582 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:39] WARNING - sample S.568912832_S.568912832 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-16 23:12:41] Done generating inputs
[2024-09-16 23:12:41] Copying files onto tempdir
[2024-09-16 23:15:35] Done copying files onto tempdir
[2024-09-16 23:15:35] Begin loading all sample reads into memory
[2024-09-16 23:15:36] Error in readChar(con, 5L, useBytes = TRUE) : cannot open the connection

Error in check_mclapply_OK(out) : 
  An error occured during STITCH. The first such error is above
Calls: STITCH ... load_all_sampleReads_into_memory -> check_mclapply_OK
In addition: Warning messages:
1: In mclapply(1:length(sampleRanges), mc.cores = nCores, FUN = loadBamAndConvert_across_a_range,  :
  scheduled core 10 encountered error in user code, all values of the job will be affected
2: In mclapply(sampleRanges, mc.cores = nCores, FUN = function(sampleRange) { :
  scheduled core 10 encountered error in user code, all values of the job will be affected
Execution halted
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
@Zilong-Li
Collaborator

Hey,

Is this scaffold only about 700 bases long? Given the command --chr=Scaffold05208 --regionStart=64 --regionEnd=762, I don't think STITCH can work on a region that short.

@TedBrookings
Author

The scaffold is very short, but this command is running in a workflow that chops up regions by where the SNPs are, so the scaffold itself is slightly longer than the region (867 bases). Is there some rule of thumb I can use to understand when/why STITCH will not be able to work on small scaffolds or short regions?

@Zilong-Li
Collaborator

STITCH should be able to generate no-call GTs. I suspect this error may be coming from the parallel code. Try reducing the number of cores to 1.

@rwdavies
Owner

So my solution for this was never very elegant. The code here makes fake sample reads when the sample has no reads that intersect SNPs in the region of interest:
https://github.com/rwdavies/STITCH/blob/f579464cff8380afcfb62804f0cb9a4ff5dabe67/STITCH/R/functions.R#L2039C1-L2039C22
It's not actually random noise that these samples are given; it's just a very weak signal for the first three SNPs only. In any decently sized region, that should contribute almost no information. I suppose I could instead give such a sample a single read that intersects the first SNP only and shows equal, opposite evidence for the reference and alternate alleles. That would be smarter.

Are there perhaps 3 or fewer SNPs in this scaffold?

Otherwise, I would suggest a simple rule might be to only run regions (e.g. scaffolds) with more than 5,000 bases. I don't have a strong basis for this number; it's just that, to make the imputation efficient with low-coverage data, the region should be long enough to leverage information across different samples.

If you reproducibly hit this error with > 3 SNPs and > 5,000 bases, let us know; maybe something else is going on.
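
As a rough sketch, applying that rule of thumb to a genome-wide posfile could look like the following (the file name is a placeholder, and using the SNP span as a stand-in for scaffold length is an assumption):

```python
# Rough sketch: keep only scaffolds/regions with > 3 SNPs spanning > 5,000 bases,
# following the rule of thumb above; these are not hard limits inside STITCH.
from collections import defaultdict

MIN_SNPS = 3
MIN_SPAN = 5000

snps_per_chrom = defaultdict(list)
with open("all_positions.pos") as fh:          # assumed: chr, pos, ref, alt
    for line in fh:
        chrom, pos = line.split()[:2]
        snps_per_chrom[chrom].append(int(pos))

for chrom, positions in sorted(snps_per_chrom.items()):
    span = max(positions) - min(positions) + 1
    if len(positions) > MIN_SNPS and span > MIN_SPAN:
        print(chrom, min(positions), max(positions), sep="\t")
```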

@TedBrookings
Author

There are 106 SNPs, although across many fewer than 5,000 bases. I'll restrict to scaffolds with > 5,000 bases and see if that eliminates this kind of crash. If that works, I guess I could also try reducing to 1 core for the smaller regions and see if they are able to run successfully.

@rwdavies
Owner

There are 106 SNPs in (762 - 64) bases? That doesn't obviously explain any error, but that is a super high heterozygosity if these are true-positive SNPs.

Are any of them exactly at the start or end of the region? I think I've written tests against those edge cases, but I wonder if something is still going on.

Another thing: you could set tempdir to somewhere you have access to and, after a run fails, check whether any of the "input" files are missing. STITCH works by first processing the BAMs to get an internal representation of the data, stored in the "input" folder, whose file names should correspond in an obvious way to the samples being imputed.
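
For example, something along these lines could flag missing per-sample input files after a failed run; the *.RData pattern and the input folder location are assumptions, so adjust them to whatever you actually see on disk:

```python
# Sketch: compare the number of per-sample files STITCH wrote under
# <outputdir>/input against the number of samples in the cramlist.
# The "input/*.RData" pattern is an assumption about the file naming.
import glob

with open("crams.list") as fh:
    n_samples = sum(1 for line in fh if line.strip())

input_files = glob.glob("input/*.RData")
print(f"{len(input_files)} input files for {n_samples} samples")
if len(input_files) < n_samples:
    print("some per-sample input files appear to be missing")
```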

@TedBrookings
Author

TedBrookings commented Sep 18, 2024

It's a plant genome, so I wouldn't be shocked if some of this was just a bad reference. The regions are chosen by a program that breaks them up by SNPs, so yes, the first and last bases have SNPs at them exactly. I can add a buffer, though, so that isn't the case in the future. (I am already running with a large value for --buffer.)

@TedBrookings
Author

A quick follow-up: I restricted to regions > 5,000 bp with more than 6 SNPs, but I am still seeing crashes when many of the CRAMs have no informative reads in a region. The log below shows a crash in a region with 241 variants in 7,025 bp:

log
+ STITCH.R --chr=Scaffold01615 --regionStart=69 --regionEnd=7094 --buffer=1000000 --cramlist tmp_cramlist --posfile=tmp_region_positions.pos --K=10 --nGen=20 --nCores=64 --refillIterations=NA --downsampleToCov=50 --outputdir=. --splitReadIterations=NA --reference ref.fa --tempdir . --keepSampleReadsInRAM=FALSE --outputSNPBlockSize 1000 --gridWindowSize 1000 --output_filename stitch.region-871.vcf.gz
[2024-09-21 02:43:15] Running STITCH(chr = Scaffold01615, nGen = 20, posfile = tmp_region_positions.pos, K = 10, S = 1, outputdir = ., nStarts = , tempdir = ., bamlist = , cramlist = tmp_cramlist, sampleNames_file = , reference = Brassica_oleracea.BOL.dna.toplevel.fa, genfile = , method = diploid, output_format = bgvcf, B_bit_prob = 16, outputInputInVCFFormat = FALSE, downsampleToCov = 50, downsampleFraction = 1, readAware = TRUE, chrStart = NA, chrEnd = NA, regionStart = 69, regionEnd = 7094, buffer = 1000000, maxDifferenceBetweenReads = 1000, maxEmissionMatrixDifference = 1e+10, alphaMatThreshold = 1e-04, emissionThreshold = 1e-04, iSizeUpperLimit = 600, bqFilter = 17, niterations = 40, shuffleHaplotypeIterations = c(4, 8, 12, 16), splitReadIterations = NA, nCores = 64, expRate = 0.5, maxRate = 100, minRate = 0.1, Jmax = 1000, regenerateInput = TRUE, originalRegionName = NA, keepInterimFiles = FALSE, keepTempDir = FALSE, outputHaplotypeProbabilities = FALSE, switchModelIteration = NA, generateInputOnly = FALSE, restartIterations = NA, refillIterations = NA, downsampleSamples = 1, downsampleSamplesKeepList = NA, subsetSNPsfile = NA, useSoftClippedBases = FALSE, outputBlockSize = 1000, outputSNPBlockSize = 1000, inputBundleBlockSize = NA, genetic_map_file = , reference_haplotype_file = , reference_legend_file = , reference_sample_file = , reference_populations = NA, reference_phred = 20, reference_iterations = 40, reference_shuffleHaplotypeIterations = c(4, 8, 12, 16), output_filename = stitch.region-871.vcf.gz, initial_min_hapProb = 0.2, initial_max_hapProb = 0.8, regenerateInputWithDefaultValues = FALSE, plotHapSumDuringIterations = FALSE, plot_shuffle_haplotype_attempts = FALSE, plotAfterImputation = TRUE, save_sampleReadsInfo = FALSE, gridWindowSize = 1000, shuffle_bin_nSNPs = NULL, shuffle_bin_radius = 5000, keepSampleReadsInRAM = FALSE, useTempdirWhileWriting = FALSE, output_haplotype_dosages = FALSE, use_bx_tag = TRUE, bxTagUpperLimit = 50000)
[2024-09-21 02:43:15] Program start
[2024-09-21 02:43:15] Get and validate pos and gen
[2024-09-21 02:43:15] Done get and validate pos and gen
[2024-09-21 02:43:15] There are 0 variants in the left buffer region -999931 <= position < 69
[2024-09-21 02:43:15] There are 241 variants in the central region 69 <= position <= 7094
[2024-09-21 02:43:15] There are 0 variants in the right buffer region 7094 < position <= 1007094
[2024-09-21 02:43:15] Get CRAM sample names
[2024-09-21 02:43:20] Done getting CRAM sample names
[2024-09-21 02:43:20] Generate inputs
[2024-09-21 02:43:21] Load and convert CRAM 600 of 767
[2024-09-21 02:43:21] WARNING - sample S.566812105_S.566812105 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:21] WARNING - sample S.568012523_S.568012523 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:21] WARNING - sample S.567512357_S.567512357 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:21] WARNING - sample S.567412325_S.567412325 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:22] WARNING - sample S.568312628_S.568312628 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:22] WARNING - sample S.568412654_S.568412654 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:23] WARNING - sample S.566712069_S.566712069 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:24] WARNING - sample S.567312273_S.567312273 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:24] WARNING - sample S.568112542_S.568112542 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:24] WARNING - sample S.567312269_S.567312269 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:24] WARNING - sample S.567912497_S.567912497 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:25] WARNING - sample S.568812791_S.568812791 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:25] WARNING - sample S.566712064_S.566712064 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:25] WARNING - sample S.568612727_S.568612727 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:26] WARNING - sample S.567712427_S.567712427 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:26] WARNING - sample S.566411986_S.566411986 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:27] WARNING - sample S.568312621_S.568312621 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:27] WARNING - sample S.568312619_S.568312619 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:28] WARNING - sample S.568712746_S.568712746 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:30] Load and convert CRAM 100 of 767
[2024-09-21 02:43:31] WARNING - sample S.566612045_S.566612045 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:32] WARNING - sample S.567912494_S.567912494 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:33] Load and convert CRAM 700 of 767
[2024-09-21 02:43:33] WARNING - sample S.568912824_S.568912824 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:33] WARNING - sample S.566712073_S.566712073 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:33] WARNING - sample S.568612711_S.568612711 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:33] WARNING - sample S.568412661_S.568412661 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:34] WARNING - sample S.568812770_S.568812770 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:34] WARNING - sample S.568212601_S.568212601 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:34] Load and convert CRAM 400 of 767
[2024-09-21 02:43:35] WARNING - sample S.566712072_S.566712072 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:36] WARNING - sample S.567512356_S.567512356 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:36] WARNING - sample S.568512667_S.568512667 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:36] WARNING - sample S.568012520_S.568012520 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:37] WARNING - sample S.567712431_S.567712431 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:37] WARNING - sample S.567612390_S.567612390 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:37] WARNING - sample S.567312290_S.567312290 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:39] WARNING - sample S.568312633_S.568312633 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:39] WARNING - sample S.567912473_S.567912473 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:40] WARNING - sample S.568112557_S.568112557 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:40] WARNING - sample S.568012529_S.568012529 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:40] Load and convert CRAM 200 of 767
[2024-09-21 02:43:40] WARNING - sample S.566411977_S.566411977 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:41] WARNING - sample S.567312278_S.567312278 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:41] WARNING - sample S.567312286_S.567312286 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:41] WARNING - sample S.569012834_S.569012834 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:41] WARNING - sample S.568012517_S.568012517 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:43] Load and convert CRAM 500 of 767
[2024-09-21 02:43:44] WARNING - sample S.567112207_S.567112207 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:45] WARNING - sample S.567912465_S.567912465 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:46] WARNING - sample S.567512347_S.567512347 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:46] WARNING - sample S.567112223_S.567112223 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:46] WARNING - sample S.568812779_S.568812779 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:48] WARNING - sample S.567612375_S.567612375 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:48] WARNING - sample S.568912805_S.568912805 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:50] WARNING - sample S.569012835_S.569012835 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:51] WARNING - sample S.567312293_S.567312293 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:52] WARNING - sample S.568512684_S.568512684 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:52] WARNING - sample S.567312285_S.567312285 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:52] WARNING - sample S.567712403_S.567712403 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:53] WARNING - sample S.568512695_S.568512695 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:53] WARNING - sample S.566912160_S.566912160 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:54] WARNING - sample S.568412634_S.568412634 has no informative reads. It is being given random reads. Consider removing from analysis
[2024-09-21 02:43:55] Load and convert CRAM 300 of 767
[2024-09-21 02:43:58] Done generating inputs
[2024-09-21 02:43:58] Copying files onto tempdir
[2024-09-21 02:47:05] Done copying files onto tempdir
[2024-09-21 02:47:05] Generate allele count
[2024-09-21 02:47:05] Error in readChar(con, 5L, useBytes = TRUE) : cannot open the connection

Error in check_mclapply_OK(out2) :
An error occured during STITCH. The first such error is above
Calls: STITCH -> buildAlleleCount -> check_mclapply_OK
In addition: Warning messages:
1: In mclapply(1:length(sampleRanges), mc.cores = nCores, FUN = loadBamAndConvert_across_a_range, :
scheduled core 45 encountered error in user code, all values of the job will be affected
2: In mclapply(sampleRanges, mc.cores = nCores, FUN = buildAlleleCount_subfunction, :
scheduled core 45 encountered error in user code, all values of the job will be affected
Execution halted

@Zilong-Li
Collaborator

Such errors arise from read or write operations within some of the parallel processes. For debugging:

  1. Make sure I/O is okay.
  2. Set nCores=1

@TedBrookings
Author

I have been periodically working on this issue. I've excluded all samples that have no informative reads and I'm still getting a crash. Here's the log from a run with nCores=1 that crashes:

STITCH.R --chr=Scaffold01526 --regionStart=180 --regionEnd=8540 --buffer=1000000 --cramlist tmp_cramlist --posfile=tmp_region_positions.pos --K=10 --nGen=20 --nCores=1 --refillIterations=NA --downsampleToCov=50 --outputdir=. --splitReadIterations=NA --reference ref.fa --tempdir . --keepSampleReadsInRAM=FALSE --outputSNPBlockSize 1000 --gridWindowSize 1000 --output_filename tmp.stitch.region-5.vcf.gz
[2024-10-24 17:42:26] Running STITCH(chr = Scaffold01526, nGen = 20, posfile = tmp_region_positions.pos, K = 10, S = 1, outputdir = ., nStarts = , tempdir = ., bamlist = , cramlist = tmp_cramlist, sampleNames_file = , reference = Brassica_oleracea.BOL.dna.toplevel.fa, genfile = , method = diploid, output_format = bgvcf, B_bit_prob = 16, outputInputInVCFFormat = FALSE, downsampleToCov = 50, downsampleFraction = 1, readAware = TRUE, chrStart = NA, chrEnd = NA, regionStart = 180, regionEnd = 8540, buffer = 1000000, maxDifferenceBetweenReads = 1000, maxEmissionMatrixDifference = 1e+10, alphaMatThreshold = 1e-04, emissionThreshold = 1e-04, iSizeUpperLimit = 600, bqFilter = 17, niterations = 40, shuffleHaplotypeIterations = c(4, 8, 12, 16), splitReadIterations = NA, nCores = 1, expRate = 0.5, maxRate = 100, minRate = 0.1, Jmax = 1000, regenerateInput = TRUE, originalRegionName = NA, keepInterimFiles = FALSE, keepTempDir = FALSE, outputHaplotypeProbabilities = FALSE, switchModelIteration = NA, generateInputOnly = FALSE, restartIterations = NA, refillIterations = NA, downsampleSamples = 1, downsampleSamplesKeepList = NA, subsetSNPsfile = NA, useSoftClippedBases = FALSE, outputBlockSize = 1000, outputSNPBlockSize = 1000, inputBundleBlockSize = NA, genetic_map_file = , reference_haplotype_file = , reference_legend_file = , reference_sample_file = , reference_populations = NA, reference_phred = 20, reference_iterations = 40, reference_shuffleHaplotypeIterations = c(4, 8, 12, 16), output_filename = tmp.stitch.region-5.vcf.gz, initial_min_hapProb = 0.2, initial_max_hapProb = 0.8, regenerateInputWithDefaultValues = FALSE, plotHapSumDuringIterations = FALSE, plot_shuffle_haplotype_attempts = FALSE, plotAfterImputation = TRUE, save_sampleReadsInfo = FALSE, gridWindowSize = 1000, shuffle_bin_nSNPs = NULL, shuffle_bin_radius = 5000, keepSampleReadsInRAM = FALSE, useTempdirWhileWriting = FALSE, output_haplotype_dosages = FALSE, use_bx_tag = TRUE, bxTagUpperLimit = 50000)
[2024-10-24 17:42:26] Program start
[2024-10-24 17:42:26] Get and validate pos and gen
[2024-10-24 17:42:26] Done get and validate pos and gen
[2024-10-24 17:42:26] There are 0 variants in the left buffer region -999820 <= position < 180
[2024-10-24 17:42:26] There are 108 variants in the central region 180 <= position <= 8540
[2024-10-24 17:42:26] There are 0 variants in the right buffer region 8540 < position <= 1008540
[2024-10-24 17:42:26] Get CRAM sample names
[2024-10-24 17:43:42] Done getting CRAM sample names
[2024-10-24 17:43:42] Generate inputs
[2024-10-24 17:43:55] Load and convert CRAM 100 of 494
Error in sum(x) : invalid 'type' (list) of argument
Calls: STITCH ... loadBamAndConvert -> merge_reads_from_sampleReadsRaw
Execution halted

@rwdavies
Owner

Are you able to make a tarball with all the inputs that I can try on my side? Say, extract all the BAMs down to just the scaffold of interest, and I can run it through.

Thanks for your continued interest in this despite the crashing!

@TedBrookings
Author

The data was too large to upload to GitHub, so I've uploaded the tarball to a pre-signed AWS URL.

The link will expire in 12 hours, so if you can't download it before then, let me know when you're ready and I'll make another URL.

@TedBrookings
Author

@rwdavies I just wanted to check in, were you able to download the data?

@rwdavies
Owner

Hi, yes, I was able to, thanks. I spent a bit of time on it on Friday but ran into some weird C++ errors that I was struggling to reproduce. I need to find time to take another look. Maybe once Zilong is back from vacation, I can work on it with him a bit too.

@TedBrookings
Author

Great. I understand that the fix may take time; the annoying 12-hour limit just made me worried you didn't get the data. Thanks for looking into it. In case it comes up, this is running STITCH installed from conda, specifically bioconda::r-stitch=1.7.1.

@TedBrookings
Author

@rwdavies Just checking in, have you been able to verify the issue? We're trying to decide whether we should have one of our C++ experts take a look.

@Zilong-Li
Collaborator

Hi @TedBrookings,

I am just back from vacation. Could you share your data again? I'll look into it tomorrow.

Thanks,
Zilong.

@TedBrookings
Author

@Zilong-Li
Here's a new pre-signed URL.

@Zilong-Li
Collaborator

Hi @TedBrookings ,

One of your samples has no reads in the region that pass the filters. I pushed a commit (c95a7dd) that will show the error and stop the program when this happens. Maybe you can use a bigger region to include more informative reads.

Best,
Zilong.

@TedBrookings
Author

This region is a small scaffold that only some of the samples map to, so enlarging it isn't possible. Can you tell me which sample has no reads? I am currently just using the latest conda release, so building STITCH from scratch would take me some time.

I had attempted to remove all the samples that had no informative reads: I checked that mapping quality >= 17, template length <= 600, and that the read overlapped at least one SNP with base quality >= 17 and a base matching either REF or ALT. It's possible I made a mistake, though, so if you could point out the failing sample (or tell me if I missed a criterion), that might enable me to make a workaround.
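
Roughly, the check I'm doing looks like the sketch below; the thresholds reflect my reading of the defaults discussed here, and the CRAM path, reference, and posfile names are placeholders:

```python
# Sketch of my informative-read check: a read counts if it passes the mapping
# quality and template length filters and covers at least one posfile SNP with
# base quality >= 17 and a base that matches REF or ALT.
import pysam

MIN_MAPQ = 17
MIN_BQ = 17
MAX_TLEN = 600

snps = []                                         # (chrom, 0-based pos, REF, ALT)
with open("tmp_region_positions.pos") as fh:      # assumed: chr, pos, ref, alt
    for line in fh:
        chrom, pos, ref, alt = line.split()[:4]
        snps.append((chrom, int(pos) - 1, ref.upper(), alt.upper()))

def has_informative_read(cram_path, reference):
    with pysam.AlignmentFile(cram_path, "rc", reference_filename=reference) as cram:
        for chrom, pos0, ref, alt in snps:
            for read in cram.fetch(chrom, pos0, pos0 + 1):
                if (read.is_unmapped or read.is_secondary or read.is_supplementary
                        or read.is_duplicate or read.query_qualities is None
                        or read.mapping_quality < MIN_MAPQ
                        or abs(read.template_length) > MAX_TLEN):
                    continue
                for qpos, rpos in read.get_aligned_pairs(matches_only=True):
                    if (rpos == pos0 and read.query_qualities[qpos] >= MIN_BQ
                            and read.query_sequence[qpos].upper() in (ref, alt)):
                        return True
    return False

print(has_informative_read("sample.cram", "ref.dna.fa"))
```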

@TedBrookings
Author

@Zilong-Li I noticed you made a new release that has the error message, as well as some changes to CIGAR-string parsing that might be relevant. Do you think you could push that release to bioconda?

@Zilong-Li
Collaborator

Thanks, will do. I thought it was auto-bumped.

@Zilong-Li
Collaborator

Hey @TedBrookings ,

Not sure if you have found a way to check and filter your data yet. It's much easier to use samtools or minipileup to quickly check whether your BAM has read coverage at a specific position.
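
Or, staying in pysam, a quick coverage check might look like this (the paths and coordinates are placeholders):

```python
# Quick pysam analogue of a samtools-style depth check at a single position.
import pysam

with pysam.AlignmentFile("sample.cram", "rc", reference_filename="ref.fa") as cram:
    # count() uses 0-based, half-open coordinates, so this is position 180
    depth = cram.count("Scaffold01526", 179, 180)
    print("reads covering Scaffold01526:180 =", depth)
```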

@TedBrookings
Author

Hi @Zilong-Li, it took a while to figure out, but the problem was that one of the samples had a too-large template; I had missed that STITCH defines insert size differently than I expected. I had seen that STITCH iterates over reads individually using the HTSlib API, but missed that it loads BAMs/CRAMs with SeqLib, which sets insert size to abs(pos - mate_pos) + num_consumed_bases_this_read.
Once I adjusted for this, I was able to pre-filter samples and get STITCH to run successfully.
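
For anyone hitting the same thing, my pre-filter now approximates that definition roughly as in the sketch below. Whether "consumed bases" means query-consuming or reference-consuming CIGAR operations is my own reading, so treat that, the paths, and the coordinates as assumptions.

```python
# Sketch: approximate the SeqLib-style "full" insert size described above as
# abs(pos - mate_pos) + bases consumed by this read's CIGAR, and skip reads
# exceeding STITCH's iSizeUpperLimit (default 600).
import pysam

ISIZE_UPPER_LIMIT = 600
QUERY_CONSUMING_OPS = {0, 1, 4, 7, 8}   # M, I, S, =, X

def full_insert_size(read):
    consumed = sum(length for op, length in read.cigartuples
                   if op in QUERY_CONSUMING_OPS)
    return abs(read.reference_start - read.next_reference_start) + consumed

with pysam.AlignmentFile("sample.cram", "rc", reference_filename="ref.fa") as cram:
    for read in cram.fetch("Scaffold01526", 179, 8540):
        if read.is_unmapped or read.mate_is_unmapped or read.cigartuples is None:
            continue
        if full_insert_size(read) > ISIZE_UPPER_LIMIT:
            continue                     # STITCH would drop this read
        # otherwise the read can still count as potentially informative
```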

If you want, you can close this issue now, but I would suggest that at a minimum it would be nice to:

  1. Change the error message to indicate which sample(s) are uninformative. As is, I had to do a manual bisection search over hundreds of samples, excluding samples until I found the single failing one.
  2. Add to the user-facing documentation (e.g. here) which definition of insert size you are using. It's understandable that there isn't any great single-read definition of insert size, but this one is a bit unexpected. For example, my data set has a UMI on read1, so read1 and read2 will have different insert sizes.

Finally, this is a bigger ask, but I think it would be really nice to have a command-line option to output sample GTs, etc., as NOCALL when the sample has no informative reads, rather than erroring. It feels like a reasonable use case to run STITCH on large low-coverage cohorts with ALT scaffolds, or on sub-sections of a contig; in either case, it can happen that some samples have no informative reads. I've gotten around this by pre-filtering with a Python tool that duplicates STITCH's logic for determining whether a CRAM/BAM has informative reads, pulling uninformative samples out of the cohort, running STITCH, then running a second tool to re-insert those samples into the cohort with all NOCALL GTs. But this is a lot of work for each individual end user to do, and it requires me to keep up to date with any changes to how STITCH decides whether a read is informative.
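
For reference, the outer workflow looks roughly like the sketch below. The file names are placeholders, it assumes each sample ID appears in its CRAM path, and it assumes the no-call VCF has already been built (e.g. with pysam as described at the top of this issue).

```python
# Sketch of the workaround: drop uninformative samples from the cramlist, run
# STITCH on the remainder, then merge the pre-built no-call VCF back in.
import subprocess

def write_filtered_cramlist(cramlist, excluded, out_path):
    # assumes the sample ID appears somewhere in the CRAM path
    with open(cramlist) as fh, open(out_path, "w") as out:
        for line in fh:
            if not any(sample in line for sample in excluded):
                out.write(line)

excluded = ["S.566812105_S.566812105"]       # samples found to be uninformative
write_filtered_cramlist("crams.list", excluded, "crams.filtered.list")

subprocess.run(["STITCH.R", "--chr=Scaffold01526", "--regionStart=180",
                "--regionEnd=8540", "--buffer=1000000",
                "--cramlist", "crams.filtered.list",
                "--posfile=tmp_region_positions.pos", "--K=10", "--nGen=20",
                "--reference", "ref.fa", "--outputdir=.",
                "--output_filename", "stitch.region.vcf.gz"], check=True)

# both inputs to bcftools merge must be bgzipped and indexed
subprocess.run(["bcftools", "index", "-f", "-t", "stitch.region.vcf.gz"], check=True)
subprocess.run(["bcftools", "merge", "-Oz", "-o", "cohort.vcf.gz",
                "stitch.region.vcf.gz", "nocalls.vcf.gz"], check=True)
subprocess.run(["bcftools", "index", "-f", "-t", "cohort.vcf.gz"], check=True)
```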

@Zilong-Li
Collaborator

Hi @TedBrookings ,

Thanks for these nice suggestions! I'll try to implement them when I have time.

BTW, the function used to get the insert size in STITCH is BamRecord.InsertSize(), which is the observed template length, instead of the FullInsertSize you mentioned.
