[ah_var_store] VCF max alt alleles [VS-1334] [VS-1357] (#8806)
mcovarr authored May 6, 2024
1 parent e1471f8 commit 9b9178d
Showing 15 changed files with 185 additions and 18 deletions.
17 changes: 12 additions & 5 deletions .dockstore.yml
@@ -164,6 +164,7 @@ workflows:
branches:
- master
- ah_var_store
- vs_1334_vcf_max_alt_alleles
tags:
- /.*/
- name: GvsImportGenomes
@@ -182,7 +183,6 @@ workflows:
branches:
- master
- ah_var_store
- rsa_vs_1328
tags:
- /.*/
- name: GvsPrepareRangesCallset
@@ -260,7 +260,6 @@ workflows:
# `master` is here so there will be *something* under the branches filter, but GVS WDLs are not on `master`
# so this shouldn't show up in the list of available versions in Dockstore or Terra.
- master
- vs_1099_composite_object_inputs
tags:
- /.*/
- name: GvsJointVariantCallsetCost
@@ -306,7 +305,7 @@ workflows:
branches:
- master
- ah_var_store
- vs_1342_maybe_stick_with_gar
- vs_1334_vcf_max_alt_alleles
tags:
- /.*/
- name: GvsIngestTieout
@@ -354,6 +353,16 @@ workflows:
- ah_var_store
tags:
- /.*/
- name: GvsTieoutVcfMaxAltAlleles
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/wdl/GvsTieoutVcfMaxAltAlleles.wdl
filters:
branches:
- ah_var_store
- master
- vs_1334_vcf_max_alt_alleles
tags:
- /.*/
- name: HailFromWdl
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/wdl/HailFromWdl.wdl
@@ -426,7 +435,6 @@ workflows:
- master
- ah_var_store
- EchoCallset
- vs_1315_plink_docker
- name: MergePgenHierarchicalWdl
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/wdl/MergePgenHierarchical.wdl
@@ -442,4 +450,3 @@
branches:
- ah_var_store
- EchoCallset
- vs_1315_plink_docker
4 changes: 2 additions & 2 deletions build_docker.sh
@@ -124,9 +124,9 @@ fi

echo "Building image to tag ${REPO_PRJ}:${GITHUB_TAG}..."
if [ -n "${IS_PUSH}" ]; then
docker build -t ${REPO_PRJ}:${GITHUB_TAG} --build-arg RELEASE=${RELEASE} --squash .
docker build -t ${REPO_PRJ}:${GITHUB_TAG} --build-arg RELEASE=${RELEASE} --squash . --iidfile /tmp/idfile.txt
else
docker build -t ${REPO_PRJ}:${GITHUB_TAG} --build-arg RELEASE=${RELEASE} .
docker build -t ${REPO_PRJ}:${GITHUB_TAG} --build-arg RELEASE=${RELEASE} . --iidfile /tmp/idfile.txt
fi
# Since we build the docker image with stages, the first build stage for GATK will be leftover in
# the local docker context after executing the above commands. This step reclaims that space automatically
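For context: `--iidfile` asks `docker build` to write the full ID of the freshly built image to the given path, which the manual push procedure below reads back instead of scraping `docker images`. A minimal sketch of what the file contains, assuming a successful build:

```bash
# After `docker build --iidfile /tmp/idfile.txt ...`, the file holds the image
# digest on a single line, in the form "sha256:<64 hex digits>".
FULL_IMAGE_ID=$(cat /tmp/idfile.txt)
echo "${FULL_IMAGE_ID}"   # e.g. sha256:0123abcd... (illustrative value)
```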
@@ -47,16 +47,29 @@ cd gatk
# Log in to Google Cloud
gcloud init
# Configure the credential helper for GCR
gcloud auth configure-docker
FULL_IMAGE_ID=$(cat /tmp/idfile.txt)
# Take the slice of this full Docker image ID that corresponds with the output of `docker images`:
IMAGE_ID=${FULL_IMAGE_ID:7:12}
# The GATK Docker image is based on gatkbase.
IMAGE_TYPE="gatkbase"
TAG=$(python3 ./scripts/variantstore/wdl/extract/build_docker_tag.py --image-id "${IMAGE_ID}" --image-type "${IMAGE_TYPE}")
BASE_REPO="broad-dsde-methods/gvs"
REPO_WITH_TAG="${BASE_REPO}/gatk:${TAG}"
docker tag "${IMAGE_ID}" "${REPO_WITH_TAG}"
# Configure the credential helper for GAR
gcloud auth configure-docker us-central1-docker.pkg.dev
# Tag and push
today="$(date -Idate | sed 's/-/_/g')"
git_hash="$(git rev-parse --short HEAD)"
DOCKER_TAG="us.gcr.io/broad-dsde-methods/broad-gatk-snapshots:varstore_${today}_${git_hash}"
DOCKER_IMAGE_ID=$(docker images | head -n 2 | tail -n 1 | awk '{print $3}')
docker tag $DOCKER_IMAGE_ID $DOCKER_TAG
docker push $DOCKER_TAG
GAR_TAG="us-central1-docker.pkg.dev/${REPO_WITH_TAG}"
docker tag "${REPO_WITH_TAG}" "${GAR_TAG}"
docker push "${GAR_TAG}"
echo "Docker image pushed to \"${GAR_TAG}\""
```
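As a sanity check on the slice arithmetic above: `${FULL_IMAGE_ID:7:12}` takes 12 characters starting at offset 7, which skips the 7-character `sha256:` prefix and yields the 12-hex-digit short ID that `docker images` prints. A runnable illustration with a made-up digest:

```bash
# Illustrative digest only; in the procedure above it comes from /tmp/idfile.txt.
FULL_IMAGE_ID="sha256:9b9178d0c0ffee0123456789abcdef0123456789abcdef0123456789abcdef01"
IMAGE_ID=${FULL_IMAGE_ID:7:12}
echo "${IMAGE_ID}"   # -> 9b9178d0c0ff
```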

Don't forget to shut down (and possibly delete) your VM once you're done!
3 changes: 2 additions & 1 deletion scripts/variantstore/docs/aou/AOU_DELIVERABLES.md
@@ -83,7 +83,7 @@
- This workflow needs to be run with the `extract_table_prefix` input from `GvsPrepareRangesCallset` step.
- This workflow needs to be run with the `filter_set_name` input from `GvsCreateFilterSet` step.
- This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
1. `GvsExtractCallset` / `GvsExtractCallsetPgenMerged` workflow
1. `GvsExtractCallset` / `GvsExtractCallsetPgenMerged` workflows ("small callset" Exome, Clinvar, and ACAF threshold extracts in VCF and PGEN formats respectively)
- You will need to run the `GvsPrepareRangesCallset` workflow for each "[Region](https://support.researchallofus.org/hc/en-us/articles/14929793660948-Smaller-Callsets-for-Analyzing-Short-Read-WGS-SNP-Indel-Data-with-Hail-MT-VCF-and-PLINK)" (interval list) for which a PGEN or VCF deliverable is required for the callset.
- This workflow transforms the data in the vet, ref_ranges, and samples tables into a schema optimized for extract.
- The `enable_extract_table_ttl` input should be set to `true` (the default value is `false`), which will add a TTL of two weeks to the tables it creates.
@@ -94,6 +94,7 @@
- Specify the same `call_set_identifier`, `dataset_name`, `project_id`, `extract_table_prefix`, and `interval_list` that were used in the `GvsPrepareRangesCallset` run documented above.
- Specify the `interval_weights_bed` appropriate for the PGEN / VCF extraction run you are performing. `gs://gvs_quickstart_storage/weights/gvs_full_vet_weights_1kb_padded_orig.bed` is the interval weights BED used for Quickstart.
- For `GvsExtractCallsetPgenMerged` only, select the workflow option "Retry with more memory" and choose a "Memory retry factor" of 1.5
- For `GvsExtractCallset`, make sure to specify the appropriate `maximum_alternate_alleles` value (currently 100); see the sketch after this list.
- These workflows do not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
1. `GvsCalculatePrecisionAndSensitivity` workflow
- Please see the detailed instructions for running the Precision and Sensitivity workflow [here](../../tieout/AoU_PRECISION_SENSITIVITY.md).
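A minimal sketch of wiring up the `maximum_alternate_alleles` input mentioned above, assuming a JSON inputs file for the `Run workflow with inputs defined by file paths` submission option (the file name is illustrative; the key follows the `GvsExtractCallset` input block shown below):

```bash
# Hypothetical inputs fragment -- only the new key is shown; a real submission
# would include all the other required GvsExtractCallset inputs as well.
cat > gvs_extract_inputs.json <<'EOF'
{
  "GvsExtractCallset.maximum_alternate_alleles": 100
}
EOF
```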
4 changes: 4 additions & 0 deletions scripts/variantstore/wdl/GvsExtractCallset.wdl
@@ -46,6 +46,7 @@ workflow GvsExtractCallset {
Float y_bed_weight_scaling = 4
Boolean is_wgs = true
Boolean convert_filtered_genotypes_to_nocalls = false
Int? maximum_alternate_alleles
}

File reference = "gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta"
@@ -238,6 +239,7 @@
emit_ads = emit_ads,
convert_filtered_genotypes_to_nocalls = convert_filtered_genotypes_to_nocalls,
write_cost_to_db = write_cost_to_db,
maximum_alternate_alleles = maximum_alternate_alleles,
}
}

@@ -331,6 +333,7 @@ task ExtractTask {
Int memory_gib

Int? local_sort_max_records_in_ram = 10000000
Int? maximum_alternate_alleles

# for call-caching -- check if DB tables haven't been updated since the last run
String max_last_modified_timestamp
@@ -385,6 +388,7 @@
~{true='--emit-ads' false='' emit_ads} \
~{true='' false='--use-vqsr-scoring' use_VQSR_lite} \
~{true='--convert-filtered-genotypes-to-no-calls' false='' convert_filtered_genotypes_to_nocalls} \
~{'--maximum-alternate-alleles ' + maximum_alternate_alleles} \
${FILTERING_ARGS} \
--dataset-id ~{dataset_name} \
--call-set-identifier ~{call_set_identifier} \
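One detail worth noting in the hunk above: in WDL, concatenating a string literal with an undefined optional value makes the whole `~{...}` placeholder evaluate to the empty string, so `--maximum-alternate-alleles` is passed only when `maximum_alternate_alleles` is actually set. A sketch of the two expansions (the surrounding command is elided, as in the hunk):

```bash
# maximum_alternate_alleles = 100   ->   ... --maximum-alternate-alleles 100 ...
# maximum_alternate_alleles unset   ->   ... (placeholder expands to nothing)
```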
3 changes: 3 additions & 0 deletions scripts/variantstore/wdl/GvsJointVariantCalling.wdl
@@ -62,6 +62,8 @@ workflow GvsJointVariantCalling {

File? training_python_script
File? scoring_python_script

Int? maximum_alternate_alleles
}

# If is_wgs is true, we'll use the WGS interval list; else, we'll use the Exome interval list. We'll currently use
@@ -223,6 +225,7 @@
do_not_filter_override = extract_do_not_filter_override,
drop_state = drop_state,
is_wgs = is_wgs,
maximum_alternate_alleles = maximum_alternate_alleles,
}

output {
3 changes: 3 additions & 0 deletions scripts/variantstore/wdl/GvsQuickstartHailIntegration.wdl
@@ -34,6 +34,8 @@ workflow GvsQuickstartHailIntegration {
String? workspace_bucket
String? workspace_id
String? submission_id

Int? maximum_alternate_alleles
}

String project_id = "gvs-internal"
@@ -87,6 +89,7 @@
workspace_bucket = effective_workspace_bucket,
workspace_id = effective_workspace_id,
submission_id = effective_submission_id,
maximum_alternate_alleles = maximum_alternate_alleles,
}

call ExtractAvroFilesForHail.GvsExtractAvroFilesForHail {
7 changes: 7 additions & 0 deletions scripts/variantstore/wdl/GvsQuickstartIntegration.wdl
@@ -28,6 +28,7 @@ workflow GvsQuickstartIntegration {
String? gatk_docker
String? hail_version
Boolean chr20_X_Y_only = true
Int? maximum_alternate_alleles
}

File full_wgs_interval_list = "gs://gcp-public-data--broad-references/hg38/v0/wgs_calling_regions.hg38.noCentromeres.noTelomeres.interval_list"
@@ -100,6 +101,7 @@
workspace_id = GetToolVersions.workspace_id,
submission_id = GetToolVersions.submission_id,
hail_version = effective_hail_version,
maximum_alternate_alleles = maximum_alternate_alleles,
}
call QuickstartHailIntegration.GvsQuickstartHailIntegration as GvsQuickstartHailVQSRClassicIntegration {
input:
@@ -126,6 +128,7 @@
workspace_id = GetToolVersions.workspace_id,
submission_id = GetToolVersions.submission_id,
hail_version = effective_hail_version,
maximum_alternate_alleles = maximum_alternate_alleles,
}

if (GvsQuickstartHailVQSRLiteIntegration.used_tighter_gcp_quotas) {
@@ -171,6 +174,7 @@
workspace_bucket = GetToolVersions.workspace_bucket,
workspace_id = GetToolVersions.workspace_id,
submission_id = GetToolVersions.submission_id,
maximum_alternate_alleles = maximum_alternate_alleles,
}
call QuickstartVcfIntegration.GvsQuickstartVcfIntegration as QuickstartVcfVQSRClassicIntegration {
input:
@@ -197,6 +201,7 @@
workspace_bucket = GetToolVersions.workspace_bucket,
workspace_id = GetToolVersions.workspace_id,
submission_id = GetToolVersions.submission_id,
maximum_alternate_alleles = maximum_alternate_alleles,
}

if (QuickstartVcfVQSRClassicIntegration.used_tighter_gcp_quotas) {
@@ -242,6 +247,7 @@
workspace_bucket = GetToolVersions.workspace_bucket,
workspace_id = GetToolVersions.workspace_id,
submission_id = GetToolVersions.submission_id,
maximum_alternate_alleles = maximum_alternate_alleles,
}

if (QuickstartVcfExomeIntegration.used_tighter_gcp_quotas) {
@@ -280,6 +286,7 @@
workspace_bucket = GetToolVersions.workspace_bucket,
workspace_id = GetToolVersions.workspace_id,
submission_id = GetToolVersions.submission_id,
maximum_alternate_alleles = maximum_alternate_alleles,
git_branch_or_tag = git_branch_or_tag,
}

3 changes: 3 additions & 0 deletions scripts/variantstore/wdl/GvsQuickstartVcfIntegration.wdl
@@ -32,6 +32,8 @@ workflow GvsQuickstartVcfIntegration {
String? workspace_bucket
String? workspace_id
String? submission_id

Int? maximum_alternate_alleles
}
String project_id = "gvs-internal"

@@ -108,6 +110,7 @@
git_branch_or_tag = git_branch_or_tag,
git_hash = effective_git_hash,
tighter_gcp_quotas = false,
maximum_alternate_alleles = maximum_alternate_alleles,
}

# Only assert identical outputs if we did not filter (filtering is not deterministic) OR if we are using VQSR Lite (which is deterministic)
2 changes: 1 addition & 1 deletion scripts/variantstore/wdl/GvsTieoutPgenToVcf.wdl
@@ -55,7 +55,7 @@ task Tieout {
# Now just the contigs we care about
zgrep -E '^##' 0000000000-quickit.vcf.gz | grep -E '^##contig=<ID=chr[12]?[0-9],|^##contig=<ID=chr[XY],' >> vcf_header.txt

# Tab-separated comparision-friendly columns
# Tab-separated comparison-friendly columns
acc=""
for item in \\#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT $(cat ../samples.txt)
do
98 changes: 98 additions & 0 deletions scripts/variantstore/wdl/GvsTieoutVcfMaxAltAlleles.wdl
@@ -0,0 +1,98 @@
version 1.0

import "GvsUtils.wdl" as Utils

workflow GvsTieoutVcfMaxAltAlleles {
input {
String unfiltered_output_gcs_path
String filtered_output_gcs_path
Int max_alt_alleles
String? variants_docker
}

if (!defined(variants_docker)) {
call Utils.GetToolVersions
}

String effective_variants_docker = select_first([variants_docker, GetToolVersions.variants_docker])

call Tieout {
input:
unfiltered_output_gcs_path = unfiltered_output_gcs_path,
filtered_output_gcs_path = filtered_output_gcs_path,
max_alt_alleles = max_alt_alleles,
variants_docker = effective_variants_docker,
}
}

task Tieout {
input {
String unfiltered_output_gcs_path
String filtered_output_gcs_path
Int max_alt_alleles
String variants_docker
}
command <<<
# Prepend date, time and pwd to xtrace log entries.
PS4='\D{+%F %T} \w $ '
set -o errexit -o nounset -o pipefail -o xtrace

mkdir unfiltered filtered compare

cd unfiltered
gcloud storage cp '~{unfiltered_output_gcs_path}/*' .

cd ..
cd filtered
gcloud storage cp '~{filtered_output_gcs_path}/*' .

cd ..

for unfiltered in unfiltered/*.vcf.gz
do
base="$(basename $unfiltered)"
filtered="filtered/${base}"
out=${base%-*}
bcftools isec $filtered $unfiltered -p compare/${out}
done

# Identify the 'isec' output files that represent entries which are "private" to either VCF file being compared.
find . -name README.txt | xargs grep -h private | awk '{print $1}' > private_files.txt

# Turn off errexit, the grep below is expected to return non-zero
set +o errexit

# Find all the private files, print out all the ALTs, look for any lines that do *not* have at least two commas.
# All ALT entries are expected to have two or more commas because the maximum number of ALT alleles was specified
# as two. So the unfiltered file would have entries like ALT1,ALT2,...,ALTN while the filtered file would not,
# and these should be the only differences between the two files.
cat private_files.txt | xargs -n 1 bcftools query -f '%ALT\n' | grep -E -v ',.*,' > unexpected.txt

rc=$?
if [[ $rc -eq 0 ]]
then
echo "Unexpected ALTs found, see 'unexpected' output for details." 1>&2
exit 1
fi

set -o errexit

# Similar to above, except making sure that we did observe some filtered sites. At the time of this
# writing there are 4001 sites filtered for > 2 entries; conservatively setting the threshold at 1000.
filtered_count=$(cat private_files.txt | xargs -n 1 bcftools query -f '%ALT\n' | grep -E ',.*,' | wc -l)
if [[ $filtered_count -lt 1000 ]]
then
echo "Unexpectedly found fewer than 1000 filtered ALTs, see 'tarball' output for details." 1>&2
exit 1
fi

tar cfvz tieout.tgz compare
>>>
output {
File tarball = "tieout.tgz"
File unexpected = "unexpected.txt"
}
runtime {
docker: variants_docker
}
}
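To see why the `grep` patterns in the command block above count commas: with `max_alt_alleles` set to two, every site that is "private" to the unfiltered VCF should carry three or more ALT alleles, i.e. an ALT column with at least two commas; anything with fewer commas is a discrepancy. A standalone sketch of the same test:

```bash
# ALT columns with fewer than two commas describe at most two alt alleles and
# should never appear among the private differences; grep -E -v ',.*,' finds them.
printf 'A\nA,C\nA,C,T\nA,C,G,T\n' | grep -E -v ',.*,'
# prints "A" and "A,C" -- exactly the cases the tieout flags as unexpected
```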
2 changes: 1 addition & 1 deletion scripts/variantstore/wdl/GvsUtils.wdl
@@ -73,8 +73,8 @@ task GetToolVersions {
# there are a handful of tasks that require the larger GNU libc-based `slim`.
String cloud_sdk_slim_docker = "gcr.io/google.com/cloudsdktool/cloud-sdk:435.0.0-slim"
String variants_docker = "us-central1-docker.pkg.dev/broad-dsde-methods/gvs/variants:2024-04-22-alpine-32804b134a75"
String gatk_docker = "us.gcr.io/broad-dsde-methods/broad-gatk-snapshots:varstore_2024_04_24_61922a303"
String variants_nirvana_docker = "us.gcr.io/broad-dsde-methods/variantstore:nirvana_2022_10_19"
String gatk_docker = "us-central1-docker.pkg.dev/broad-dsde-methods/gvs/gatk:2024-05-01-gatkbase-617d4d1c7f64"
String real_time_genomics_docker = "docker.io/realtimegenomics/rtg-tools:latest"
String gotc_imputation_docker = "us.gcr.io/broad-gotc-prod/imputation-bcf-vcf:1.0.5-1.10.2-0.1.16-1649948623"
String plink_docker = "us-central1-docker.pkg.dev/broad-dsde-methods/gvs/plink2:2024-04-23-slim-a0a65f52cc0e"