Is there a way to avoid having to Load and convert BAMs..... each run #60

GuyReeves · 2021-11-24T12:50:11Z

HI

I was wondering if there is a way to avoid having to Load and convert BAMs each run which as I understand it creates the files in the /input and /Rdata folders.
The reason I ask is if I want to try different values of "K" can't I just skip the loading / generating steps?

Thanks

Guy

rwdavies · 2021-11-25T08:57:30Z

Hi Guy,

Yup, see option --regenerateInput, in particular, you can set it to FALSE if you're using the same region name (chr, regionStart, regionEnd), and it will use the processed input files in input/ (and I believe the sample names from a file in RData/). It will over-write output otherwise, so one recommendation I would have would be to rsync the previous outputdir to a new location, then run your new K there, then rsync back, probably to some separate master folder, the output you care about (e.g. the VCF).

Best,
Robbie

GuyReeves · 2021-11-26T12:59:32Z

Thanks that is great and very useful.

I read in other posts (see quote below) there is advice about how to select an optimal value for K, can you please point me to this as I cannot find it. Thanks Guy

" I've run the program, varying K and number of generations as suggested, and am now evaluating the output. It seems that the mean score, and number of sites with scores > 0.4 increase as K increases from"

rwdavies · 2021-11-29T11:58:28Z

I have some information on the main page which is probably a good place to start
https://github.com/rwdavies/STITCH#note-on-the-selection-of-k-and-ngen

GuyReeves · 2021-11-30T19:37:37Z

Hi

I am trying to use --regenerateInput

Was a bit confused about settings think I have it now

library(STITCH)
STITCH(tempdir = tempdir(), chrStart = 1, chrEnd = 23506264, chr = "chr2L", sampleNames_file = "sample_list5838.txt" , posfile = "posChr2Lnoindel.txt", outputdir = paste0(getwd(), "/"), K = 5, nGen = 11, nCores = 10, outputSNPBlockSize = 1000, gridWindowSize = 10000, outputInputInVCFFormat = TRUE, regenerateInput = FALSE, originalRegionName = "chr2L", regenerateInputWithDefaultValues = TRUE)

rwdavies · 2021-12-01T15:34:20Z

OK so is it working as expected now? If not clear I can make a demonstration test case

GuyReeves · 2021-12-01T16:01:08Z

if I include outputInputInVCFFormat = TRUE. this means that only the inputVCF file will be generated i.e. there is no imputation. I need to leave this out if want imputation iterations to occur correct?

Thanks

Guy

Loading required package: parallel
Loading required package: rrbgen
[2021-12-01 12:20:25] Running STITCH(chr = chr2L, nGen = 11, posfile = posChr2Lnoindel.txt, K = 5, S = 1, outputdir = /home/reeves/diary_nov_stitch/k_5list5838/, nStarts = , tempdir = /tmp/Rtmp73QwX2, bamlist = , cramlist = , sampleNames_file = sample_list5838.txt, reference = , genfile = , method = diploid, output_format = bgvcf, B_bit_prob = 16, outputInputInVCFFormat = TRUE, downsampleToCov = 50, downsampleFraction = 1, readAware = TRUE, chrStart = 1, chrEnd = 23506264, regionStart = NA, regionEnd = NA, buffer = NA, maxDifferenceBetweenReads = 1000, maxEmissionMatrixDifference = 1e+10, alphaMatThreshold = 1e-04, emissionThreshold = 1e-04, iSizeUpperLimit = 600, bqFilter = 17, niterations = 40, shuffleHaplotypeIterations = c(4, 8, 12, 16), splitReadIterations = 25, nCores = 10, expRate = 0.5, maxRate = 100, minRate = 0.1, Jmax = 1000, regenerateInput = FALSE, originalRegionName = chr2L, keepInterimFiles = FALSE, keepTempDir = FALSE, outputHaplotypeProbabilities = FALSE, switchModelIteration = NA, generateInputOnly = FALSE, restartIterations = NA, refillIterations = c(6, 10, 14, 18), downsampleSamples = 1, downsampleSamplesKeepList = NA, subsetSNPsfile = NA, useSoftClippedBases = FALSE, outputBlockSize = 1000, outputSNPBlockSize = 1000, inputBundleBlockSize = NA, genetic_map_file = , reference_haplotype_file = , reference_legend_file = , reference_sample_file = , reference_populations = NA, reference_phred = 20, reference_iterations = 40, reference_shuffleHaplotypeIterations = c(4, 8, 12, 16), output_filename = NULL, initial_min_hapProb = 0.2, initial_max_hapProb = 0.8, regenerateInputWithDefaultValues = TRUE, plotHapSumDuringIterations = FALSE, plot_shuffle_haplotype_attempts = FALSE, plotAfterImputation = TRUE, save_sampleReadsInfo = FALSE, gridWindowSize = 10000, shuffle_bin_nSNPs = NULL, shuffle_bin_radius = 5000, keepSampleReadsInRAM = FALSE, useTempdirWhileWriting = FALSE, output_haplotype_dosages = FALSE, use_bx_tag = TRUE, bxTagUpperLimit = 50000)
[2021-12-01 12:20:25] Program start
[2021-12-01 12:20:25] Get and validate pos and gen
[2021-12-01 12:20:26] Done get and validate pos and gen
[2021-12-01 12:20:26] Copying files onto tempdir
[2021-12-01 13:32:18] Done copying files onto tempdir
[2021-12-01 13:32:18] Generate allele count
[2021-12-01 14:17:18] Quantiles across SNPs of per-sample depth of coverage
[2021-12-01 14:17:18] 5% 25% 50% 75% 95%
[2021-12-01 14:17:18] 1.162 1.591 1.953 2.407 3.266
[2021-12-01 14:17:18] Done generating allele count
[2021-12-01 14:17:18] Prepare data to use to build vcf from input
[2021-12-01 14:32:57] Write block of VCF for samples:1-584
[2021-12-01 14:36:58] Write block of VCF for samples:585-1168
[2021-12-01 14:40:26] Write block of VCF for samples:1169-1752
[2021-12-01 14:43:38] Write block of VCF for samples:1753-2335
[2021-12-01 14:47:06] Write block of VCF for samples:2336-2919
[2021-12-01 14:50:51] Write block of VCF for samples:2920-3503
[2021-12-01 14:54:59] Write block of VCF for samples:3504-4086
[2021-12-01 15:02:26] Write block of VCF for samples:4087-4670
[2021-12-01 15:15:16] Write block of VCF for samples:4671-5254
[2021-12-01 15:56:43] Write block of VCF for samples:5255-5838
[2021-12-01 16:31:53] Build VCF from input
[2021-12-01 16:39:16] Done building VCF from input

rwdavies · 2021-12-03T12:04:26Z

Yes, that is to just make an output VCF that only contains values informed by the input information, with no imputation. Useful in particular if you want to compare results vs say Beagle which takes VCF input. I think that was why I wrote that option

GuyReeves · 2021-12-16T14:39:18Z

Hi
Can I check that if I run with --regenerateInput =' False the files in the input/ and RData/ folders do not use any parameters specified in the command, and that these two folders can be reused in any other combination of commands including e.g. -reference_haplotype_file...

Thanks Guy

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Is there a way to avoid having to Load and convert BAMs..... each run #60

Is there a way to avoid having to Load and convert BAMs..... each run #60

GuyReeves commented Nov 24, 2021

rwdavies commented Nov 25, 2021

GuyReeves commented Nov 26, 2021

rwdavies commented Nov 29, 2021

GuyReeves commented Nov 30, 2021 •

edited

Loading

rwdavies commented Dec 1, 2021

GuyReeves commented Dec 1, 2021

rwdavies commented Dec 3, 2021

GuyReeves commented Dec 16, 2021

Is there a way to avoid having to Load and convert BAMs..... each run #60

Is there a way to avoid having to Load and convert BAMs..... each run #60

Comments

GuyReeves commented Nov 24, 2021

rwdavies commented Nov 25, 2021

GuyReeves commented Nov 26, 2021

rwdavies commented Nov 29, 2021

GuyReeves commented Nov 30, 2021 • edited Loading

rwdavies commented Dec 1, 2021

GuyReeves commented Dec 1, 2021

rwdavies commented Dec 3, 2021

GuyReeves commented Dec 16, 2021

GuyReeves commented Nov 30, 2021 •

edited

Loading