Localize CRAM in the Whamg workflow instead of leveraging streaming from GCS directly #550

VJalili · 2023-06-15T19:28:52Z

This PR updates the RunWhamgOnCram task to localize the CRAM file and its index for compatibility with CoA, as the current data streaming approach from GCS is not cross-cloud portable.

We expect data localization will have a performance and cost impact compared to the current streaming approach. Therefore, I submitted two runs of the GatherSampleEvidenceBatch workflow: streaming (current latest main) and localization (this PR). Both submissions use the same inputs and options, I used the following inputs file.

inputs/build/ref_panel_1kg/test/GatherSampleEvidence/GatherSampleEvidence.json

which contains 155 cram files. Additionally, from the above file, I removed manta_docker and melt_docker hence only whamg was run and manta and melt was skipped as they are not related to the changes introduced in this PR.

	Worflow ID	Cost	Normalized per sample
This PR (localized samples)	`fc0bed95-e32d-4684-b1a6-0decdbabbb9c`	$42.05	$0.27
Latest `main` (stream samples)	`299f98c1-fce3-4268-a8f4-a0966cb57878`	$37.70	$0.24

The two workflows generate comparable outputs; you may take the following steps to test compare some of the generated outputs.

# Get metadata and two of the outputs it generates for 
# the feature branch (this PR).
% cromshell metadata fc0bed95-e32d-4684-b1a6-0decdbabbb9c > fc0bed95-e32d-4684-b1a6-0decdbabbb9c.json
% gsutil cp $(jq -r '.outputs."GatherSampleEvidenceBatch.wham_vcf"[0]' fc0bed95-e32d-4684-b1a6-0decdbabbb9c.json) feature_test_wham.vcf.gz
% gsutil cp $(jq -r '.outputs."GatherSampleEvidenceBatch.coverage_counts"[0]' fc0bed95-e32d-4684-b1a6-0decdbabbb9c.json) feature_test_coverage_counts.tsv.gz

# Get metadata and two of the outputs it generates for
# the latest main branch. 
% cromshell metadata 299f98c1-fce3-4268-a8f4-a0966cb57878 > 299f98c1-fce3-4268-a8f4-a0966cb57878.json
% gsutil cp $(jq -r '.outputs."GatherSampleEvidenceBatch.wham_vcf"[0]' 299f98c1-fce3-4268-a8f4-a0966cb57878.json) main_test_wham.vcf.gz
% gsutil cp $(jq -r '.outputs."GatherSampleEvidenceBatch.coverage_counts"[0]' 299f98c1-fce3-4268-a8f4-a0966cb57878.json) main_test_coverage_counts.tsv.gz

Compare the files:

% ls -lha *_wham.vcf.gz
-rw-r--r--  1 jvahid  staff   238K Jun 20 09:03 feature_test_wham.vcf.gz
-rw-r--r--  1 jvahid  staff   238K Jun 20 08:57 main_test_wham.vcf.gz

% ls -lha *_counts.tsv.gz
-rw-r--r--  1 jvahid  staff   165M Jun 20 09:09 feature_test_coverage_counts.tsv.gz
-rw-r--r--  1 jvahid  staff   165M Jun 20 09:19 main_test_coverage_counts.tsv.gz

% md5 *_counts.tsv.gz
MD5 (feature_test_coverage_counts.tsv.gz) = a85a83405ba5b9937df673680ab75919
MD5 (main_test_coverage_counts.tsv.gz) =    a85a83405ba5b9937df673680ab75919

Note that I am not comparing the md5 of wham output, because its output are not reproducible. #201

mwalker174 · 2024-01-18T20:14:17Z

@VJalili Is this still needed? I think the LocalizeReads issue was addressed by you here #619.

VJalili · 2024-01-19T00:38:13Z

Thanks for checking this @mwalker174! We will most probably not need this, so I am closing it. However, since it is the same branch I am currently working on (with temp hard coding) to run the batch pipeline on Azure, I may need to create a new PR on this branch after cleaning it up.

Localize files instead of streaming, for CoA compatibility.

7b112a0

VJalili marked this pull request as ready for review June 16, 2023 16:15

VJalili requested a review from mwalker174 June 16, 2023 16:15

VJalili added 3 commits September 15, 2023 08:36

Determine cromwell root using a more portable approach.

da698c7

Localize files instead of streaming, for CoA compatibility.

1b73c16

Localize files instead of streaming, for CoA compatibility.

60f0b9e

VJalili closed this Jan 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Localize CRAM in the Whamg workflow instead of leveraging streaming from GCS directly #550

Localize CRAM in the Whamg workflow instead of leveraging streaming from GCS directly #550

VJalili commented Jun 15, 2023 •

edited

Loading

mwalker174 commented Jan 18, 2024

VJalili commented Jan 19, 2024

Localize CRAM in the Whamg workflow instead of leveraging streaming from GCS directly #550

Localize CRAM in the Whamg workflow instead of leveraging streaming from GCS directly #550

Conversation

VJalili commented Jun 15, 2023 • edited Loading

mwalker174 commented Jan 18, 2024

VJalili commented Jan 19, 2024

VJalili commented Jun 15, 2023 •

edited

Loading