Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Localize CRAM in the Whamg workflow instead of leveraging streaming from GCS directly #550

Closed
wants to merge 4 commits into from

Conversation

VJalili
Copy link
Member

@VJalili VJalili commented Jun 15, 2023

This PR updates the RunWhamgOnCram task to localize the CRAM file and its index for compatibility with CoA, as the current data streaming approach from GCS is not cross-cloud portable.

We expect data localization will have a performance and cost impact compared to the current streaming approach. Therefore, I submitted two runs of the GatherSampleEvidenceBatch workflow: streaming (current latest main) and localization (this PR). Both submissions use the same inputs and options, I used the following inputs file.

inputs/build/ref_panel_1kg/test/GatherSampleEvidence/GatherSampleEvidence.json

which contains 155 cram files. Additionally, from the above file, I removed manta_docker and melt_docker hence only whamg was run and manta and melt was skipped as they are not related to the changes introduced in this PR.

Worflow ID Cost Normalized per sample
This PR (localized samples) fc0bed95-e32d-4684-b1a6-0decdbabbb9c $42.05 $0.27
Latest main (stream samples) 299f98c1-fce3-4268-a8f4-a0966cb57878 $37.70 $0.24

The two workflows generate comparable outputs; you may take the following steps to test compare some of the generated outputs.

# Get metadata and two of the outputs it generates for 
# the feature branch (this PR).
% cromshell metadata fc0bed95-e32d-4684-b1a6-0decdbabbb9c > fc0bed95-e32d-4684-b1a6-0decdbabbb9c.json
% gsutil cp $(jq -r '.outputs."GatherSampleEvidenceBatch.wham_vcf"[0]' fc0bed95-e32d-4684-b1a6-0decdbabbb9c.json) feature_test_wham.vcf.gz
% gsutil cp $(jq -r '.outputs."GatherSampleEvidenceBatch.coverage_counts"[0]' fc0bed95-e32d-4684-b1a6-0decdbabbb9c.json) feature_test_coverage_counts.tsv.gz

# Get metadata and two of the outputs it generates for
# the latest main branch. 
% cromshell metadata 299f98c1-fce3-4268-a8f4-a0966cb57878 > 299f98c1-fce3-4268-a8f4-a0966cb57878.json
% gsutil cp $(jq -r '.outputs."GatherSampleEvidenceBatch.wham_vcf"[0]' 299f98c1-fce3-4268-a8f4-a0966cb57878.json) main_test_wham.vcf.gz
% gsutil cp $(jq -r '.outputs."GatherSampleEvidenceBatch.coverage_counts"[0]' 299f98c1-fce3-4268-a8f4-a0966cb57878.json) main_test_coverage_counts.tsv.gz

Compare the files:

% ls -lha *_wham.vcf.gz
-rw-r--r--  1 jvahid  staff   238K Jun 20 09:03 feature_test_wham.vcf.gz
-rw-r--r--  1 jvahid  staff   238K Jun 20 08:57 main_test_wham.vcf.gz

% ls -lha *_counts.tsv.gz
-rw-r--r--  1 jvahid  staff   165M Jun 20 09:09 feature_test_coverage_counts.tsv.gz
-rw-r--r--  1 jvahid  staff   165M Jun 20 09:19 main_test_coverage_counts.tsv.gz

% md5 *_counts.tsv.gz
MD5 (feature_test_coverage_counts.tsv.gz) = a85a83405ba5b9937df673680ab75919
MD5 (main_test_coverage_counts.tsv.gz) =    a85a83405ba5b9937df673680ab75919

Note that I am not comparing the md5 of wham output, because its output are not reproducible. #201

@VJalili VJalili marked this pull request as ready for review June 16, 2023 16:15
@VJalili VJalili requested a review from mwalker174 June 16, 2023 16:15
@mwalker174
Copy link
Collaborator

@VJalili Is this still needed? I think the LocalizeReads issue was addressed by you here #619.

@VJalili
Copy link
Member Author

VJalili commented Jan 19, 2024

Thanks for checking this @mwalker174! We will most probably not need this, so I am closing it. However, since it is the same branch I am currently working on (with temp hard coding) to run the batch pipeline on Azure, I may need to create a new PR on this branch after cleaning it up.

@VJalili VJalili closed this Jan 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants