Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is there a way to avoid having to Load and convert BAMs..... each run #60

Open
GuyReeves opened this issue Nov 24, 2021 · 8 comments
Open

Comments

@GuyReeves
Copy link

HI

I was wondering if there is a way to avoid having to Load and convert BAMs each run which as I understand it creates the files in the /input and /Rdata folders.
The reason I ask is if I want to try different values of "K" can't I just skip the loading / generating steps?

Thanks

Guy

@rwdavies
Copy link
Owner

Hi Guy,

Yup, see option --regenerateInput, in particular, you can set it to FALSE if you're using the same region name (chr, regionStart, regionEnd), and it will use the processed input files in input/ (and I believe the sample names from a file in RData/). It will over-write output otherwise, so one recommendation I would have would be to rsync the previous outputdir to a new location, then run your new K there, then rsync back, probably to some separate master folder, the output you care about (e.g. the VCF).

Best,
Robbie

@GuyReeves
Copy link
Author

Thanks that is great and very useful.

I read in other posts (see quote below) there is advice about how to select an optimal value for K, can you please point me to this as I cannot find it. Thanks Guy

" I've run the program, varying K and number of generations as suggested, and am now evaluating the output. It seems that the mean score, and number of sites with scores > 0.4 increase as K increases from"

@rwdavies
Copy link
Owner

I have some information on the main page which is probably a good place to start
https://github.com/rwdavies/STITCH#note-on-the-selection-of-k-and-ngen

@GuyReeves
Copy link
Author

GuyReeves commented Nov 30, 2021

Hi

I am trying to use --regenerateInput

Was a bit confused about settings think I have it now

library(STITCH)
STITCH(tempdir = tempdir(), chrStart = 1, chrEnd = 23506264, chr = "chr2L", sampleNames_file = "sample_list5838.txt" , posfile = "posChr2Lnoindel.txt", outputdir = paste0(getwd(), "/"), K = 5, nGen = 11, nCores = 10, outputSNPBlockSize = 1000, gridWindowSize = 10000, outputInputInVCFFormat = TRUE, regenerateInput = FALSE, originalRegionName = "chr2L", regenerateInputWithDefaultValues = TRUE)

@rwdavies
Copy link
Owner

rwdavies commented Dec 1, 2021

OK so is it working as expected now? If not clear I can make a demonstration test case

@GuyReeves
Copy link
Author

if I include outputInputInVCFFormat = TRUE. this means that only the inputVCF file will be generated i.e. there is no imputation. I need to leave this out if want imputation iterations to occur correct?

Thanks

Guy

Loading required package: parallel
Loading required package: rrbgen
[2021-12-01 12:20:25] Running STITCH(chr = chr2L, nGen = 11, posfile = posChr2Lnoindel.txt, K = 5, S = 1, outputdir = /home/reeves/diary_nov_stitch/k_5list5838/, nStarts = , tempdir = /tmp/Rtmp73QwX2, bamlist = , cramlist = , sampleNames_file = sample_list5838.txt, reference = , genfile = , method = diploid, output_format = bgvcf, B_bit_prob = 16, outputInputInVCFFormat = TRUE, downsampleToCov = 50, downsampleFraction = 1, readAware = TRUE, chrStart = 1, chrEnd = 23506264, regionStart = NA, regionEnd = NA, buffer = NA, maxDifferenceBetweenReads = 1000, maxEmissionMatrixDifference = 1e+10, alphaMatThreshold = 1e-04, emissionThreshold = 1e-04, iSizeUpperLimit = 600, bqFilter = 17, niterations = 40, shuffleHaplotypeIterations = c(4, 8, 12, 16), splitReadIterations = 25, nCores = 10, expRate = 0.5, maxRate = 100, minRate = 0.1, Jmax = 1000, regenerateInput = FALSE, originalRegionName = chr2L, keepInterimFiles = FALSE, keepTempDir = FALSE, outputHaplotypeProbabilities = FALSE, switchModelIteration = NA, generateInputOnly = FALSE, restartIterations = NA, refillIterations = c(6, 10, 14, 18), downsampleSamples = 1, downsampleSamplesKeepList = NA, subsetSNPsfile = NA, useSoftClippedBases = FALSE, outputBlockSize = 1000, outputSNPBlockSize = 1000, inputBundleBlockSize = NA, genetic_map_file = , reference_haplotype_file = , reference_legend_file = , reference_sample_file = , reference_populations = NA, reference_phred = 20, reference_iterations = 40, reference_shuffleHaplotypeIterations = c(4, 8, 12, 16), output_filename = NULL, initial_min_hapProb = 0.2, initial_max_hapProb = 0.8, regenerateInputWithDefaultValues = TRUE, plotHapSumDuringIterations = FALSE, plot_shuffle_haplotype_attempts = FALSE, plotAfterImputation = TRUE, save_sampleReadsInfo = FALSE, gridWindowSize = 10000, shuffle_bin_nSNPs = NULL, shuffle_bin_radius = 5000, keepSampleReadsInRAM = FALSE, useTempdirWhileWriting = FALSE, output_haplotype_dosages = FALSE, use_bx_tag = TRUE, bxTagUpperLimit = 50000)
[2021-12-01 12:20:25] Program start
[2021-12-01 12:20:25] Get and validate pos and gen
[2021-12-01 12:20:26] Done get and validate pos and gen
[2021-12-01 12:20:26] Copying files onto tempdir
[2021-12-01 13:32:18] Done copying files onto tempdir
[2021-12-01 13:32:18] Generate allele count
[2021-12-01 14:17:18] Quantiles across SNPs of per-sample depth of coverage
[2021-12-01 14:17:18] 5% 25% 50% 75% 95%
[2021-12-01 14:17:18] 1.162 1.591 1.953 2.407 3.266
[2021-12-01 14:17:18] Done generating allele count
[2021-12-01 14:17:18] Prepare data to use to build vcf from input
[2021-12-01 14:32:57] Write block of VCF for samples:1-584
[2021-12-01 14:36:58] Write block of VCF for samples:585-1168
[2021-12-01 14:40:26] Write block of VCF for samples:1169-1752
[2021-12-01 14:43:38] Write block of VCF for samples:1753-2335
[2021-12-01 14:47:06] Write block of VCF for samples:2336-2919
[2021-12-01 14:50:51] Write block of VCF for samples:2920-3503
[2021-12-01 14:54:59] Write block of VCF for samples:3504-4086
[2021-12-01 15:02:26] Write block of VCF for samples:4087-4670
[2021-12-01 15:15:16] Write block of VCF for samples:4671-5254
[2021-12-01 15:56:43] Write block of VCF for samples:5255-5838
[2021-12-01 16:31:53] Build VCF from input
[2021-12-01 16:39:16] Done building VCF from input

@rwdavies
Copy link
Owner

rwdavies commented Dec 3, 2021

Yes, that is to just make an output VCF that only contains values informed by the input information, with no imputation. Useful in particular if you want to compare results vs say Beagle which takes VCF input. I think that was why I wrote that option

@GuyReeves
Copy link
Author

Hi
Can I check that if I run with --regenerateInput =' False the files in the input/ and RData/ folders do not use any parameters specified in the command, and that these two folders can be reused in any other combination of commands including e.g. -reference_haplotype_file...

Thanks Guy

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants