Localize CRAM in the Whamg workflow instead of leveraging streaming from GCS directly #550
This PR updates the `RunWhamgOnCram` task to localize the CRAM file and its index, for compatibility with CoA, since the current approach of streaming data directly from GCS is not cross-cloud portable. We expect data localization to have a performance and cost impact compared to the current streaming approach. Therefore, I submitted two runs of the `GatherSampleEvidenceBatch` workflow: streaming (current latest `main`) and localization (this PR). Both submissions use the same inputs and options; I used the following inputs file, which contains 155 CRAM files. Additionally, I removed `manta_docker` and `melt_docker` from the above file, so only `whamg` was run while `manta` and `melt` were skipped, as they are not related to the changes introduced in this PR.

| Run | Workflow ID |
| --- | --- |
| `main` (stream samples) | `fc0bed95-e32d-4684-b1a6-0decdbabbb9c` |
| this PR (localize samples) | `299f98c1-fce3-4268-a8f4-a0966cb57878` |
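To quantify the expected performance impact, one option is to pull `start`/`end` timestamps for each run from Cromwell's metadata endpoint and compare wall-clock durations. The sketch below only illustrates the arithmetic: the server URL is a placeholder and the timestamps are made-up values, not the actual times of the runs above.

```shell
# Sketch: compare wall-clock time of the two runs via Cromwell metadata.
# CROMWELL_URL is a placeholder; the real metadata call would look like:
#   curl -s "$CROMWELL_URL/api/workflows/v1/<workflow-id>/metadata?includeKey=start&includeKey=end"
# The timestamps below are illustrative, not taken from the actual runs.
start="2023-05-01T10:00:00Z"
end="2023-05-01T12:30:00Z"

# Convert ISO-8601 timestamps to epoch seconds (GNU date) and subtract.
dur=$(( $(date -u -d "$end" +%s) - $(date -u -d "$start" +%s) ))
echo "wall-clock seconds: $dur"
```

Running the same calculation on both workflow IDs gives a rough streaming-vs-localization runtime comparison; cost would still need to be read from the billing export.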
The two workflows generate comparable outputs; you may take the following steps to compare some of the generated outputs.

Compare the files:

Note that I am not comparing the md5 of the Whamg output, because its output is not reproducible (see #201).
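As a sketch of the kind of file comparison meant here (the file names are hypothetical, and the toy VCFs below stand in for the real workflow outputs), comparing checksums of the record lines rather than the whole files sidesteps run-specific header metadata; per #201, the Whamg records themselves may still differ between runs.

```shell
# Sketch with hypothetical file names; substitute the actual run outputs.
# Create two toy VCFs that differ only in a header line, to illustrate why
# whole-file md5s can disagree even when the records are identical.
printf '##fileDate=2023-01-01\n#CHROM\tPOS\nchr1\t100\n' | gzip > stream.vcf.gz
printf '##fileDate=2023-01-02\n#CHROM\tPOS\nchr1\t100\n' | gzip > local.vcf.gz

# Whole-file md5s differ because of the header metadata:
md5sum stream.vcf.gz local.vcf.gz

# Record-level comparison drops '##' header lines before hashing:
a=$(zcat stream.vcf.gz | grep -v '^##' | md5sum | cut -d' ' -f1)
b=$(zcat local.vcf.gz  | grep -v '^##' | md5sum | cut -d' ' -f1)
[ "$a" = "$b" ] && echo "records match" || echo "records differ"
```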