Update Scramble implementation (#722)

broadinstitute · Oct 25, 2024 · 5ea22a5
1 parent 3eccc48
commit 5ea22a5
Showing 23 changed files with 1,427 additions and 438 deletions.
diff --git a/README.md b/README.md
@@ -42,7 +42,6 @@ A structural variation discovery pipeline for Illumina short-read whole-genome s
 * A workflow execution system supporting the [Workflow Description Language](https://openwdl.org/) (WDL), either:
   * [Cromwell](https://github.com/broadinstitute/cromwell) (v36 or higher). A dedicated server is highly recommended.
   * or [Terra](https://terra.bio/) (note preconfigured GATK-SV workflows are not yet available for this platform)
-* Recommended: [MELT](https://melt.igs.umaryland.edu/). Due to licensing restrictions, we cannot provide a public docker image or reference panel VCFs for this algorithm.
 * Recommended: [cromshell](https://github.com/broadinstitute/cromshell) for interacting with a dedicated Cromwell server.
 * Recommended: [WOMtool](https://cromwell.readthedocs.io/en/stable/WOMtool/) for validating WDL/json files.
 
@@ -124,16 +123,18 @@ There are two scripts for running the full pipeline:
 
 #### Building inputs
 Example workflow inputs can be found in `/inputs`. Build using `scripts/inputs/build_default_inputs.sh`, which 
-generates input jsons in `/inputs/build`. Except the MELT docker image, all required resources are available in public 
+generates input jsons in `/inputs/build`. All required resources are available in public 
 Google buckets. 
 
 #### MELT
-**Important**: The example input files contain MELT inputs that are NOT public (see [Requirements](#requirements)). These include:
+**Important**: MELT has been replaced with [Scramble](https://github.com/GeneDx/scramble) for mobile element calling. While it is still possible to run GATK-SV with MELT, we no longer support it as a caller. It will be fully deprecated in the future.
+
+Due to licensing restrictions, we cannot redistribute MELT binaries or input files, including the docker image. Some default input files contain MELT inputs that are NOT public (see [Requirements](#requirements)) including:
 
 * `GATKSVPipelineSingleSample.melt_docker` and `GATKSVPipelineBatch.melt_docker` - MELT docker URI (see [Docker readme](https://github.com/talkowski-lab/gatk-sv-v1/blob/master/dockerfiles/README.md))
 * `GATKSVPipelineSingleSample.ref_std_melt_vcfs` - Standardized MELT VCFs ([GatherBatchEvidence](#gather-batch-evidence))
 
-The input values are provided only as an example and are not publicly accessible. In order to include MELT, these values must be provided by the user. MELT can be disabled by deleting these inputs and setting `GATKSVPipelineBatch.use_melt` to `false`.
+The input values are provided only as placeholders. In some workflows, MELT must be enabled with appropriate settings, by providing optional MELT inputs and/or with an explicit option e.g. `GATKSVPipelineBatch.use_melt` to `true`. We do not recommend running both Scramble and MELT together.
 
 #### Execution
 We recommend running the pipeline on a dedicated [Cromwell](https://github.com/broadinstitute/cromwell) server with a [cromshell](https://github.com/broadinstitute/cromshell) client. A batch run can be started with the following commands:
@@ -153,7 +154,7 @@ where `cromwell_config.json` is a Cromwell [workflow options file](https://cromw
 
 ## <a name="overview">Pipeline Overview</a>
 The pipeline consists of a series of modules that perform the following:
-* [GatherSampleEvidence](#gather-sample-evidence): SV evidence collection, including calls from a configurable set of algorithms (Manta, MELT, and Wham), read depth (RD), split read positions (SR), and discordant pair positions (PE).
+* [GatherSampleEvidence](#gather-sample-evidence): SV evidence collection, including calls from a configurable set of algorithms (Manta, Scramble, and Wham), read depth (RD), split read positions (SR), and discordant pair positions (PE).
 * [EvidenceQC](#evidence-qc): Dosage bias scoring and ploidy estimation
 * [GatherBatchEvidence](#gather-batch-evidence): Copy number variant calling using cn.MOPS and GATK gCNV; B-allele frequency (BAF) generation; call and evidence aggregation
 * [ClusterBatch](#cluster-batch): Variant clustering
@@ -253,18 +254,21 @@ The following sections briefly describe each module and highlights inter-depende
 ## <a name="gather-sample-evidence">GatherSampleEvidence</a>
 *Formerly Module00a*
 
-Runs raw evidence collection on each sample with the following SV callers: [Manta](https://github.com/Illumina/manta), [Wham](https://github.com/zeeev/wham), and/or [MELT](https://melt.igs.umaryland.edu/). For guidance on pre-filtering prior to `GatherSampleEvidence`, refer to the [Sample Exclusion](#sample-exclusion) section.
+Runs raw evidence collection on each sample with the following SV callers: [Manta](https://github.com/Illumina/manta), [Wham](https://github.com/zeeev/wham), [Scramble](https://github.com/GeneDx/scramble), and/or [MELT](https://melt.igs.umaryland.edu/). For guidance on pre-filtering prior to `GatherSampleEvidence`, refer to the [Sample Exclusion](#sample-exclusion) section.
+
+The `scramble_clusters` and `scramble_table` are generated as outputs for troubleshooting purposes but not consumed by any downstream workflows.
 
 Note: a list of sample IDs must be provided. Refer to the [sample ID requirements](#sampleids) for specifications of allowable sample IDs. IDs that do not meet these requirements may cause errors.
 
 #### Inputs:
 * Per-sample BAM or CRAM files aligned to hg38. Index files (`.bai`) must be provided if using BAMs.
 
 #### Outputs:
-* Caller VCFs (Manta, MELT, and/or Wham)
+* Caller VCFs (Manta, Scramble, MELT, and/or Wham)
 * Binned read counts file
 * Split reads (SR) file
 * Discordant read pairs (PE) file
+* Scramble intermediate clusters file and table (not needed downstream)
 
 ## <a name="evidence-qc">EvidenceQC</a>
 *Formerly Module00b*
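
The README change above advises against running Scramble and MELT together. A minimal sketch of a pre-submission check on a workflow inputs JSON follows. It assumes the `use_melt` and `use_scramble` flag names seen in the batch templates, and that a flag may appear either as a JSON boolean or as the string `"true"`/`"false"`. The function names are hypothetical and not part of GATK-SV.

```python
import json


def as_bool(value):
    """Interpret a JSON boolean or a "true"/"false" string as a bool."""
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() == "true"


def enabled_mei_callers(inputs, workflow="GATKSVPipelineBatch"):
    """Return the mobile element callers explicitly enabled in an inputs dict.

    Only explicit flags are inspected; workflow defaults are not modeled here.
    """
    callers = []
    for caller in ("scramble", "melt"):
        if as_bool(inputs.get(f"{workflow}.use_{caller}", False)):
            callers.append(caller)
    return callers


# Hypothetical inputs, written the way the templates quote their flags.
example = json.loads(
    '{"GATKSVPipelineBatch.use_melt": "true",'
    ' "GATKSVPipelineBatch.use_scramble": "true"}'
)
callers = enabled_mei_callers(example)
if len(callers) > 1:
    print("warning: both Scramble and MELT are enabled:", callers)
```

Because the check reads only explicit flags, an inputs file that omits both keys passes silently; the workflow's own defaults would still need to be checked in the WDL.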

diff --git a/dockerfiles/scramble/Dockerfile b/dockerfiles/scramble/Dockerfile
@@ -54,7 +54,7 @@ RUN mkdir -p /opt && cd /opt && \
 ENV LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
 
 # install scramble
-ARG SCRAMBLE_COMMIT="f320d604ac030e4a7fa96b0663bcae02994c7d94"
+ARG SCRAMBLE_COMMIT="56b5ae849d16ec1fc83ea1426b0ffc356ee6d99c"
 RUN mkdir /app && cd /app \
     && git clone https://github.com/mwalker174/scramble-gatk-sv.git \
     && cd scramble-gatk-sv \

diff --git a/dockerfiles/sv-base-mini/Dockerfile b/dockerfiles/sv-base-mini/Dockerfile
@@ -4,6 +4,7 @@ ARG UBUNTU_RELEASE="22.04"
 ARG HTSLIB_VERSION="1.15.1"
 ARG BEDTOOLS_VERSION="2.31.0"
 ARG VCFTOOLS_VERSION="0.1.16"
+ARG BWA_COMMIT="139f68fc4c3747813783a488aef2adc86626b01b"
 
 ARG APT_REQUIRED_PACKAGES="/opt/apt-required-packages.list"
 
@@ -14,7 +15,7 @@ ARG DEBIAN_FRONTEND=noninteractive
 RUN apt-get -qqy update --fix-missing && \
     apt-get -qqy dist-upgrade && \
     apt-get -qqy install --no-install-recommends \
-        ca-certificates autoconf automake bzip2 g++ make wget pkgconf python2 \
+        ca-certificates autoconf automake bzip2 g++ git make wget pkgconf python2 \
         libssl-dev libbz2-dev libcurl4-openssl-dev liblzma-dev libncurses-dev zlib1g-dev libdeflate-dev
 
 # install samtools
@@ -51,6 +52,19 @@ RUN wget -q https://github.com/arq5x/bedtools2/releases/download/v$BEDTOOLS_VERS
     mv bedtools.static /opt/bedtools/bin/bedtools && \
     chmod a+x /opt/bedtools/bin/bedtools
 
+# install bwa
+# must do from source because of compiler error in latest release (see https://github.com/lh3/bwa/issues/387)
+ARG BWA_COMMIT
+RUN cd /opt && \
+    git clone https://github.com/lh3/bwa.git && \
+    cd bwa && \
+    git checkout $BWA_COMMIT && \
+    make -s && \
+    cd .. && \
+    mkdir -p /opt/bin && \
+    mv /opt/bwa/bwa /opt/bin/ && \
+    rm -r bwa
+ENV PATH=/opt/bin:$PATH
 
 ############### stage 1: copy tools and install needed non-dev libraries
 FROM ubuntu:$UBUNTU_RELEASE
@@ -100,3 +114,4 @@ RUN tabix --version
 RUN bcftools --version
 RUN bedtools --version
 RUN vcftools --version
+RUN which bwa
diff --git a/...lates/terra_workspaces/cohort_mode/workflow_configurations/GatherSampleEvidence.json.tmpl b/...lates/terra_workspaces/cohort_mode/workflow_configurations/GatherSampleEvidence.json.tmpl
@@ -1,5 +1,11 @@
 {
   "GatherSampleEvidence.primary_contigs_list": "${workspace.primary_contigs_list}",
+  "GatherSampleEvidence.reference_bwa_alt":	"${workspace.reference_bwa_alt}",
+  "GatherSampleEvidence.reference_bwa_amb":	"${workspace.reference_bwa_amb}",
+  "GatherSampleEvidence.reference_bwa_ann":	"${workspace.reference_bwa_ann}",
+  "GatherSampleEvidence.reference_bwa_bwt":	"${workspace.reference_bwa_bwt}",
+  "GatherSampleEvidence.reference_bwa_pac":	"${workspace.reference_bwa_pac}",
+  "GatherSampleEvidence.reference_bwa_sa":	"${workspace.reference_bwa_sa}",
   "GatherSampleEvidence.reference_fasta": "${workspace.reference_fasta}",
   "GatherSampleEvidence.reference_index": "${workspace.reference_index}",
   "GatherSampleEvidence.reference_dict": "${workspace.reference_dict}",
@@ -12,6 +18,7 @@
 
   "GatherSampleEvidence.manta_region_bed": "${workspace.manta_region_bed}",
   "GatherSampleEvidence.manta_region_bed_index": "${workspace.manta_region_bed_index}",
+  "GatherSampleEvidence.mei_bed": "${workspace.mei_bed}",
   "GatherSampleEvidence.sd_locs_vcf": "${workspace.sd_locs_vcf}",
   "GatherSampleEvidence.melt_standard_vcf_header": "${workspace.melt_standard_vcf_header}",
 
@@ -22,6 +29,7 @@
   "GatherSampleEvidence.sv_pipeline_docker": "${workspace.sv_pipeline_docker}",
   "GatherSampleEvidence.manta_docker": "${workspace.manta_docker}",
   "GatherSampleEvidence.wham_docker": "${workspace.wham_docker}",
+  "GatherSampleEvidence.scramble_docker": "${workspace.scramble_docker}",
   "GatherSampleEvidence.genomes_in_the_cloud_docker" : "${workspace.genomes_in_the_cloud_docker}",
   "GatherSampleEvidence.gatk_docker" : "${workspace.gatk_docker}",
   "GatherSampleEvidence.gatk_docker_pesr_override": "${workspace.gatk_docker_pesr_override}",

diff --git a/inputs/templates/terra_workspaces/cohort_mode/workspace.tsv.tmpl b/inputs/templates/terra_workspaces/cohort_mode/workspace.tsv.tmpl
@@ -42,6 +42,12 @@ primary_contigs_list	{{ reference_resources.primary_contigs_list }}
 protein_coding_gtf	{{ reference_resources.protein_coding_gtf }}
 recalibrate_gq_model_file	{{ reference_resources.aou_recalibrate_gq_model_file }}
 reference_build	{{ reference_resources.reference_build }}
+reference_bwa_alt	{{ reference_resources.reference_bwa_alt }}
+reference_bwa_amb	{{ reference_resources.reference_bwa_amb }}
+reference_bwa_ann	{{ reference_resources.reference_bwa_ann }}
+reference_bwa_bwt	{{ reference_resources.reference_bwa_bwt }}
+reference_bwa_pac	{{ reference_resources.reference_bwa_pac }}
+reference_bwa_sa	{{ reference_resources.reference_bwa_sa }}
 reference_dict	{{ reference_resources.reference_dict }}
 reference_fasta	{{ reference_resources.reference_fasta }}
 reference_index	{{ reference_resources.reference_index }}
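
The six `reference_bwa_*` entries added to the cohort-mode configuration and workspace table follow one naming pattern. A minimal sketch that generates the Terra-style input mappings from the suffixes seen in this diff follows. The helper name is hypothetical, and the sketch only mirrors the template text rather than any GATK-SV tooling.

```python
import json

# Suffixes of the BWA index inputs added in this change.
BWA_INDEX_SUFFIXES = ("alt", "amb", "ann", "bwt", "pac", "sa")


def bwa_workspace_inputs(workflow):
    """Map each `<workflow>.reference_bwa_<suffix>` input to its workspace attribute."""
    return {
        f"{workflow}.reference_bwa_{suffix}": "${workspace.reference_bwa_" + suffix + "}"
        for suffix in BWA_INDEX_SUFFIXES
    }


entries = bwa_workspace_inputs("GatherSampleEvidence")
print(json.dumps(entries, indent=2))
```

Passing `"GATKSVPipelineSingleSample"` instead would produce the keys used in the single-sample template.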

diff --git a/...KSVPipelineSingleSample.no_melt.json.tmpl → ...mple/GATKSVPipelineSingleSample.json.tmpl b/...KSVPipelineSingleSample.no_melt.json.tmpl → ...mple/GATKSVPipelineSingleSample.json.tmpl
@@ -3,8 +3,6 @@
   "GATKSVPipelineSingleSample.batch" : "${this.sample_id}",
   "GATKSVPipelineSingleSample.bam_or_cram_file" : "${this.bam_or_cram_file}",
 
-  "GATKSVPipelineSingleSample.use_melt": "false",
-
   "GATKSVPipelineSingleSample.cutoffs" : "${workspace.ref_panel_cutoffs}",
 
   "GATKSVPipelineSingleSample.genotype_pesr_pesr_sepcutoff" : "${workspace.ref_panel_genotype_pesr_pesr_sepcutoff}",
@@ -49,7 +47,14 @@
   "GATKSVPipelineSingleSample.max_ref_panel_carrier_freq": 0.03,
   "GATKSVPipelineSingleSample.manta_region_bed" : "${workspace.reference_manta_region_bed}",
   "GATKSVPipelineSingleSample.manta_region_bed_index" : "${workspace.reference_manta_region_bed_index}",
+  "GATKSVPipelineSingleSample.mei_bed": "${workspace.mei_bed}",
   "GATKSVPipelineSingleSample.sd_locs_vcf" : "${workspace.reference_sd_locs_vcf}",
+  "GATKSVPipelineSingleSample.reference_bwa_alt" : "${workspace.reference_bwa_alt}",
+  "GATKSVPipelineSingleSample.reference_bwa_amb" : "${workspace.reference_bwa_amb}",
+  "GATKSVPipelineSingleSample.reference_bwa_ann" : "${workspace.reference_bwa_ann}",
+  "GATKSVPipelineSingleSample.reference_bwa_bwt" : "${workspace.reference_bwa_bwt}",
+  "GATKSVPipelineSingleSample.reference_bwa_pac" : "${workspace.reference_bwa_pac}",
+  "GATKSVPipelineSingleSample.reference_bwa_sa" : "${workspace.reference_bwa_sa}",
   "GATKSVPipelineSingleSample.reference_dict" : "${workspace.reference_dict}",
   "GATKSVPipelineSingleSample.reference_fasta" : "${workspace.reference_fasta}",
   "GATKSVPipelineSingleSample.reference_index" : "${workspace.reference_index}",

diff --git a/inputs/templates/terra_workspaces/single_sample/workspace.tsv.tmpl b/inputs/templates/terra_workspaces/single_sample/workspace.tsv.tmpl
@@ -38,6 +38,12 @@ reference_name	{{ reference_resources.name }}
 reference_allosome_file	{{ reference_resources.allosome_file }}
 reference_autosome_file	{{ reference_resources.autosome_file }}
 reference_bin_exclude	{{ reference_resources.bin_exclude }}
+reference_bwa_alt	{{ reference_resources.reference_bwa_alt }}
+reference_bwa_amb	{{ reference_resources.reference_bwa_amb }}
+reference_bwa_ann	{{ reference_resources.reference_bwa_ann }}
+reference_bwa_bwt	{{ reference_resources.reference_bwa_bwt }}
+reference_bwa_pac	{{ reference_resources.reference_bwa_pac }}
+reference_bwa_sa	{{ reference_resources.reference_bwa_sa }}
 reference_cnmops_exclude_list	{{ reference_resources.cnmops_exclude_list }}
 reference_contig_ploidy_priors	{{ reference_resources.contig_ploidy_priors }}
 reference_copy_number_autosomal_contigs	{{ reference_resources.copy_number_autosomal_contigs }}

diff --git a/inputs/templates/test/ApplyManualVariantFilter/ApplyManualVariantFilter.json.tmpl b/inputs/templates/test/ApplyManualVariantFilter/ApplyManualVariantFilter.json.tmpl
@@ -2,6 +2,6 @@
   "ApplyManualVariantFilter.vcf" :   {{ test_batch.clean_vcf | tojson }},
   "ApplyManualVariantFilter.prefix" : {{ test_batch.name | tojson }},
   "ApplyManualVariantFilter.sv_base_mini_docker":{{ dockers.sv_base_mini_docker | tojson }},
-  "ApplyManualVariantFilter.bcftools_filter": "SVTYPE==\"DEL\" && COUNT(ALGORITHMS)==1 && ALGORITHMS==\"wham\"",
-  "ApplyManualVariantFilter.filter_name": "filter_wham_only_del"
+  "ApplyManualVariantFilter.bcftools_filter": "(SVTYPE==\"DEL\" && COUNT(ALGORITHMS)==1 && ALGORITHMS==\"wham\") || (ALT==\"<INS:ME:SVA>\" && COUNT(ALGORITHMS)==1 && ALGORITHMS==\"scramble\" && HIGH_SR_BACKGROUND==1)",
+  "ApplyManualVariantFilter.filter_name": "high_algorithm_fp_rate"
 }
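
The updated `bcftools_filter` expression combines two single-caller conditions under one filter name. A rough Python rendering of the same boolean logic may help when reading it. It assumes each record is a plain dict with hypothetical keys (`svtype`, `alt`, `algorithms`, `high_sr_background`) and does not reproduce bcftools expression semantics beyond this case.

```python
def matches_high_fp_filter(record):
    """Mirror the two branches of the filter expression for one record."""
    algorithms = record.get("algorithms", [])
    single_caller = len(algorithms) == 1

    # Branch 1: deletions supported only by Wham.
    wham_only_del = (
        record.get("svtype") == "DEL" and single_caller and algorithms == ["wham"]
    )
    # Branch 2: SVA insertions supported only by Scramble with the
    # HIGH_SR_BACKGROUND flag set.
    scramble_only_sva = (
        record.get("alt") == "<INS:ME:SVA>"
        and single_caller
        and algorithms == ["scramble"]
        and bool(record.get("high_sr_background", False))
    )
    return wham_only_del or scramble_only_sva


records = [
    {"svtype": "DEL", "alt": "<DEL>", "algorithms": ["wham"]},
    {"svtype": "DEL", "alt": "<DEL>", "algorithms": ["manta", "wham"]},
    {"svtype": "INS", "alt": "<INS:ME:SVA>", "algorithms": ["scramble"],
     "high_sr_background": True},
    {"svtype": "INS", "alt": "<INS:ME:SVA>", "algorithms": ["scramble"]},
]
flags = [matches_high_fp_filter(r) for r in records]
print(flags)
```

Only the first and third example records satisfy the expression, so they are the ones the `high_algorithm_fp_rate` filter name would apply to under these assumptions.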
diff --git a/inputs/templates/test/GATKSVPipelineBatch/GATKSVPipelineBatch.FromSampleEvidence.json.tmpl b/inputs/templates/test/GATKSVPipelineBatch/GATKSVPipelineBatch.FromSampleEvidence.json.tmpl
@@ -1,9 +1,4 @@
 {
-  "GATKSVPipelineBatch.use_manta": "true",
-  "GATKSVPipelineBatch.use_wham": "true",
-  "GATKSVPipelineBatch.use_melt": "true",
-  "GATKSVPipelineBatch.use_scramble": "false",
-
   "GATKSVPipelineBatch.name": {{ test_batch.name | tojson }},
   "GATKSVPipelineBatch.ped_file": {{ test_batch.ped_file | tojson }},
   "GATKSVPipelineBatch.samples": {{ test_batch.samples | tojson }},
@@ -54,6 +49,14 @@
   "GATKSVPipelineBatch.GatherSampleEvidenceBatch.melt_standard_vcf_header": {{ reference_resources.melt_std_vcf_header | tojson }},
   "GATKSVPipelineBatch.GatherSampleEvidenceBatch.wham_include_list_bed_file": {{ reference_resources.wham_include_list_bed_file | tojson }},
   "GATKSVPipelineBatch.GatherSampleEvidenceBatch.preprocessed_intervals": {{ reference_resources.preprocessed_intervals | tojson }},
+  "GATKSVPipelineBatch.GatherSampleEvidenceBatch.mei_bed": {{ reference_resources.mei_bed | tojson }},
+
+  "GATKSVPipelineBatch.GatherSampleEvidenceBatch.reference_bwa_alt": {{ reference_resources.reference_bwa_alt | tojson }},
+  "GATKSVPipelineBatch.GatherSampleEvidenceBatch.reference_bwa_amb": {{ reference_resources.reference_bwa_amb | tojson }},
+  "GATKSVPipelineBatch.GatherSampleEvidenceBatch.reference_bwa_ann": {{ reference_resources.reference_bwa_ann | tojson }},
+  "GATKSVPipelineBatch.GatherSampleEvidenceBatch.reference_bwa_bwt": {{ reference_resources.reference_bwa_bwt | tojson }},
+  "GATKSVPipelineBatch.GatherSampleEvidenceBatch.reference_bwa_pac": {{ reference_resources.reference_bwa_pac | tojson }},
+  "GATKSVPipelineBatch.GatherSampleEvidenceBatch.reference_bwa_sa": {{ reference_resources.reference_bwa_sa | tojson }},
 
   "GATKSVPipelineBatch.EvidenceQC.wgd_scoring_mask": {{ reference_resources.wgd_scoring_mask | tojson }},
   "GATKSVPipelineBatch.EvidenceQC.run_vcf_qc": "false",

diff --git a/inputs/templates/test/GATKSVPipelineBatch/GATKSVPipelineBatch.json.tmpl b/inputs/templates/test/GATKSVPipelineBatch/GATKSVPipelineBatch.json.tmpl
@@ -1,9 +1,4 @@
 {
-  "GATKSVPipelineBatch.use_manta": "true",
-  "GATKSVPipelineBatch.use_wham": "true",
-  "GATKSVPipelineBatch.use_melt": "true",
-  "GATKSVPipelineBatch.use_scramble": "false",
-
   "GATKSVPipelineBatch.name": {{ test_batch.name | tojson }},
   "GATKSVPipelineBatch.ped_file": {{ test_batch.ped_file | tojson }},
   "GATKSVPipelineBatch.samples": {{ test_batch.samples | tojson }},
@@ -48,6 +43,14 @@
   "GATKSVPipelineBatch.GatherSampleEvidenceBatch.melt_standard_vcf_header": {{ reference_resources.melt_std_vcf_header | tojson }},
   "GATKSVPipelineBatch.GatherSampleEvidenceBatch.wham_include_list_bed_file": {{ reference_resources.wham_include_list_bed_file | tojson }},
   "GATKSVPipelineBatch.GatherSampleEvidenceBatch.preprocessed_intervals": {{ reference_resources.preprocessed_intervals | tojson }},
+  "GATKSVPipelineBatch.GatherSampleEvidenceBatch.mei_bed": {{ reference_resources.mei_bed | tojson }},
+
+  "GATKSVPipelineBatch.GatherSampleEvidenceBatch.reference_bwa_alt": {{ reference_resources.reference_bwa_alt | tojson }},
+  "GATKSVPipelineBatch.GatherSampleEvidenceBatch.reference_bwa_amb": {{ reference_resources.reference_bwa_amb | tojson }},
+  "GATKSVPipelineBatch.GatherSampleEvidenceBatch.reference_bwa_ann": {{ reference_resources.reference_bwa_ann | tojson }},
+  "GATKSVPipelineBatch.GatherSampleEvidenceBatch.reference_bwa_bwt": {{ reference_resources.reference_bwa_bwt | tojson }},
+  "GATKSVPipelineBatch.GatherSampleEvidenceBatch.reference_bwa_pac": {{ reference_resources.reference_bwa_pac | tojson }},
+  "GATKSVPipelineBatch.GatherSampleEvidenceBatch.reference_bwa_sa": {{ reference_resources.reference_bwa_sa | tojson }},
 
   "GATKSVPipelineBatch.EvidenceQC.wgd_scoring_mask": {{ reference_resources.wgd_scoring_mask | tojson }},
   "GATKSVPipelineBatch.EvidenceQC.run_vcf_qc": "false",