Update Scramble implementation (#722)
mwalker174 authored Oct 25, 2024
1 parent 3eccc48 commit 5ea22a5
Showing 23 changed files with 1,427 additions and 438 deletions.
18 changes: 11 additions & 7 deletions README.md
@@ -42,7 +42,6 @@ A structural variation discovery pipeline for Illumina short-read whole-genome s
* A workflow execution system supporting the [Workflow Description Language](https://openwdl.org/) (WDL), either:
* [Cromwell](https://github.com/broadinstitute/cromwell) (v36 or higher). A dedicated server is highly recommended.
* or [Terra](https://terra.bio/) (note preconfigured GATK-SV workflows are not yet available for this platform)
* Recommended: [MELT](https://melt.igs.umaryland.edu/). Due to licensing restrictions, we cannot provide a public docker image or reference panel VCFs for this algorithm.
* Recommended: [cromshell](https://github.com/broadinstitute/cromshell) for interacting with a dedicated Cromwell server.
* Recommended: [WOMtool](https://cromwell.readthedocs.io/en/stable/WOMtool/) for validating WDL/json files.

@@ -124,16 +123,18 @@ There are two scripts for running the full pipeline:

#### Building inputs
Example workflow inputs can be found in `/inputs`. Build using `scripts/inputs/build_default_inputs.sh`, which
generates input jsons in `/inputs/build`. Except the MELT docker image, all required resources are available in public
generates input jsons in `/inputs/build`. All required resources are available in public
Google buckets.

#### MELT
**Important**: The example input files contain MELT inputs that are NOT public (see [Requirements](#requirements)). These include:
**Important**: MELT has been replaced with [Scramble](https://github.com/GeneDx/scramble) for mobile element calling. While it is still possible to run GATK-SV with MELT, we no longer support it as a caller. It will be fully deprecated in the future.

Due to licensing restrictions, we cannot redistribute MELT binaries or input files, including the docker image. Some default input files contain MELT inputs that are NOT public (see [Requirements](#requirements)) including:

* `GATKSVPipelineSingleSample.melt_docker` and `GATKSVPipelineBatch.melt_docker` - MELT docker URI (see [Docker readme](https://github.com/talkowski-lab/gatk-sv-v1/blob/master/dockerfiles/README.md))
* `GATKSVPipelineSingleSample.ref_std_melt_vcfs` - Standardized MELT VCFs ([GatherBatchEvidence](#gather-batch-evidence))

The input values are provided only as an example and are not publicly accessible. In order to include MELT, these values must be provided by the user. MELT can be disabled by deleting these inputs and setting `GATKSVPipelineBatch.use_melt` to `false`.
The input values are provided only as placeholders. In some workflows, MELT must be enabled explicitly, by providing the optional MELT inputs and/or by setting an explicit option, e.g. `GATKSVPipelineBatch.use_melt` to `true`. We do not recommend running Scramble and MELT together.
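
As a rough illustration of the placeholder inputs described above, the following Python sketch drops the two non-public MELT inputs named in the README list from a generated inputs dictionary. The helper name `drop_melt_placeholders`, the suffix-matching rule, and the example values are assumptions made for illustration and are not part of GATK-SV. Whether a given workflow then runs without MELT depends on that workflow's own options.

```python
import json

# Input-name suffixes taken from the README list above.
# Matching on suffix is an assumption made for this sketch.
MELT_PLACEHOLDER_SUFFIXES = (".melt_docker", ".ref_std_melt_vcfs")


def drop_melt_placeholders(inputs):
    """Return a copy of a workflow inputs dict without the MELT placeholder entries."""
    return {
        key: value
        for key, value in inputs.items()
        if not key.endswith(MELT_PLACEHOLDER_SUFFIXES)
    }


if __name__ == "__main__":
    # Hypothetical example values; real inputs come from the generated JSON files.
    example = {
        "GATKSVPipelineBatch.melt_docker": "placeholder/melt:tag",
        "GATKSVPipelineBatch.name": "test_batch",
    }
    print(json.dumps(drop_melt_placeholders(example), indent=2))
```

The function returns a new dictionary rather than editing in place, so the original generated inputs stay available for comparison.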

#### Execution
We recommend running the pipeline on a dedicated [Cromwell](https://github.com/broadinstitute/cromwell) server with a [cromshell](https://github.com/broadinstitute/cromshell) client. A batch run can be started with the following commands:
@@ -153,7 +154,7 @@ where `cromwell_config.json` is a Cromwell [workflow options file](https://cromw

## <a name="overview">Pipeline Overview</a>
The pipeline consists of a series of modules that perform the following:
* [GatherSampleEvidence](#gather-sample-evidence): SV evidence collection, including calls from a configurable set of algorithms (Manta, MELT, and Wham), read depth (RD), split read positions (SR), and discordant pair positions (PE).
* [GatherSampleEvidence](#gather-sample-evidence): SV evidence collection, including calls from a configurable set of algorithms (Manta, Scramble, and Wham), read depth (RD), split read positions (SR), and discordant pair positions (PE).
* [EvidenceQC](#evidence-qc): Dosage bias scoring and ploidy estimation
* [GatherBatchEvidence](#gather-batch-evidence): Copy number variant calling using cn.MOPS and GATK gCNV; B-allele frequency (BAF) generation; call and evidence aggregation
* [ClusterBatch](#cluster-batch): Variant clustering
@@ -253,18 +254,21 @@ The following sections briefly describe each module and highlights inter-depende
## <a name="gather-sample-evidence">GatherSampleEvidence</a>
*Formerly Module00a*

Runs raw evidence collection on each sample with the following SV callers: [Manta](https://github.com/Illumina/manta), [Wham](https://github.com/zeeev/wham), and/or [MELT](https://melt.igs.umaryland.edu/). For guidance on pre-filtering prior to `GatherSampleEvidence`, refer to the [Sample Exclusion](#sample-exclusion) section.
Runs raw evidence collection on each sample with the following SV callers: [Manta](https://github.com/Illumina/manta), [Wham](https://github.com/zeeev/wham), [Scramble](https://github.com/GeneDx/scramble), and/or [MELT](https://melt.igs.umaryland.edu/). For guidance on pre-filtering prior to `GatherSampleEvidence`, refer to the [Sample Exclusion](#sample-exclusion) section.

The `scramble_clusters` and `scramble_table` files are generated as outputs for troubleshooting purposes but are not consumed by any downstream workflows.

Note: a list of sample IDs must be provided. Refer to the [sample ID requirements](#sampleids) for specifications of allowable sample IDs. IDs that do not meet these requirements may cause errors.

#### Inputs:
* Per-sample BAM or CRAM files aligned to hg38. Index files (`.bai`) must be provided if using BAMs.

#### Outputs:
* Caller VCFs (Manta, MELT, and/or Wham)
* Caller VCFs (Manta, Scramble, MELT, and/or Wham)
* Binned read counts file
* Split reads (SR) file
* Discordant read pairs (PE) file
* Scramble intermediate clusters file and table (not needed downstream)

## <a name="evidence-qc">EvidenceQC</a>
*Formerly Module00b*
2 changes: 1 addition & 1 deletion dockerfiles/scramble/Dockerfile
@@ -54,7 +54,7 @@ RUN mkdir -p /opt && cd /opt && \
ENV LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH

# install scramble
ARG SCRAMBLE_COMMIT="f320d604ac030e4a7fa96b0663bcae02994c7d94"
ARG SCRAMBLE_COMMIT="56b5ae849d16ec1fc83ea1426b0ffc356ee6d99c"
RUN mkdir /app && cd /app \
&& git clone https://github.com/mwalker174/scramble-gatk-sv.git \
&& cd scramble-gatk-sv \
17 changes: 16 additions & 1 deletion dockerfiles/sv-base-mini/Dockerfile
@@ -4,6 +4,7 @@ ARG UBUNTU_RELEASE="22.04"
ARG HTSLIB_VERSION="1.15.1"
ARG BEDTOOLS_VERSION="2.31.0"
ARG VCFTOOLS_VERSION="0.1.16"
ARG BWA_COMMIT="139f68fc4c3747813783a488aef2adc86626b01b"

ARG APT_REQUIRED_PACKAGES="/opt/apt-required-packages.list"

@@ -14,7 +15,7 @@ ARG DEBIAN_FRONTEND=noninteractive
RUN apt-get -qqy update --fix-missing && \
apt-get -qqy dist-upgrade && \
apt-get -qqy install --no-install-recommends \
ca-certificates autoconf automake bzip2 g++ make wget pkgconf python2 \
ca-certificates autoconf automake bzip2 g++ git make wget pkgconf python2 \
libssl-dev libbz2-dev libcurl4-openssl-dev liblzma-dev libncurses-dev zlib1g-dev libdeflate-dev

# install samtools
@@ -51,6 +52,19 @@ RUN wget -q https://github.com/arq5x/bedtools2/releases/download/v$BEDTOOLS_VERS
mv bedtools.static /opt/bedtools/bin/bedtools && \
chmod a+x /opt/bedtools/bin/bedtools

# install bwa
# must do from source because of compiler error in latest release (see https://github.com/lh3/bwa/issues/387)
ARG BWA_COMMIT
RUN cd /opt && \
git clone https://github.com/lh3/bwa.git && \
cd bwa && \
git checkout $BWA_COMMIT && \
make -s && \
cd .. && \
mkdir -p /opt/bin && \
mv /opt/bwa/bwa /opt/bin/ && \
rm -r bwa
ENV PATH=/opt/bin:$PATH

############### stage 1: copy tools and install needed non-dev libraries
FROM ubuntu:$UBUNTU_RELEASE
@@ -100,3 +114,4 @@ RUN tabix --version
RUN bcftools --version
RUN bedtools --version
RUN vcftools --version
RUN which bwa
@@ -1,5 +1,11 @@
{
"GatherSampleEvidence.primary_contigs_list": "${workspace.primary_contigs_list}",
"GatherSampleEvidence.reference_bwa_alt": "${workspace.reference_bwa_alt}",
"GatherSampleEvidence.reference_bwa_amb": "${workspace.reference_bwa_amb}",
"GatherSampleEvidence.reference_bwa_ann": "${workspace.reference_bwa_ann}",
"GatherSampleEvidence.reference_bwa_bwt": "${workspace.reference_bwa_bwt}",
"GatherSampleEvidence.reference_bwa_pac": "${workspace.reference_bwa_pac}",
"GatherSampleEvidence.reference_bwa_sa": "${workspace.reference_bwa_sa}",
"GatherSampleEvidence.reference_fasta": "${workspace.reference_fasta}",
"GatherSampleEvidence.reference_index": "${workspace.reference_index}",
"GatherSampleEvidence.reference_dict": "${workspace.reference_dict}",
@@ -12,6 +18,7 @@

"GatherSampleEvidence.manta_region_bed": "${workspace.manta_region_bed}",
"GatherSampleEvidence.manta_region_bed_index": "${workspace.manta_region_bed_index}",
"GatherSampleEvidence.mei_bed": "${workspace.mei_bed}",
"GatherSampleEvidence.sd_locs_vcf": "${workspace.sd_locs_vcf}",
"GatherSampleEvidence.melt_standard_vcf_header": "${workspace.melt_standard_vcf_header}",

@@ -22,6 +29,7 @@
"GatherSampleEvidence.sv_pipeline_docker": "${workspace.sv_pipeline_docker}",
"GatherSampleEvidence.manta_docker": "${workspace.manta_docker}",
"GatherSampleEvidence.wham_docker": "${workspace.wham_docker}",
"GatherSampleEvidence.scramble_docker": "${workspace.scramble_docker}",
"GatherSampleEvidence.genomes_in_the_cloud_docker" : "${workspace.genomes_in_the_cloud_docker}",
"GatherSampleEvidence.gatk_docker" : "${workspace.gatk_docker}",
"GatherSampleEvidence.gatk_docker_pesr_override": "${workspace.gatk_docker_pesr_override}",
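
The six `reference_bwa_*` inputs added in this file appear to correspond to the files of a BWA index plus the ALT contig file. A minimal Python sketch of generating those entries from a single index prefix follows. The helper name `bwa_index_inputs`, the example path, and the assumption that each file is named `<prefix>.<suffix>` are illustrative only and are not confirmed by this commit.

```python
# Suffixes taken from the reference_bwa_* input names in this commit.
BWA_INDEX_SUFFIXES = ("alt", "amb", "ann", "bwt", "pac", "sa")


def bwa_index_inputs(workflow, index_prefix):
    """Build the reference_bwa_* entries for one workflow.

    Assumes (hypothetically) that every index file sits at
    "<index_prefix>.<suffix>"; real resource paths may differ.
    """
    return {
        f"{workflow}.reference_bwa_{suffix}": f"{index_prefix}.{suffix}"
        for suffix in BWA_INDEX_SUFFIXES
    }


if __name__ == "__main__":
    # Hypothetical bucket path used only for demonstration.
    entries = bwa_index_inputs("GatherSampleEvidence", "gs://example-bucket/ref.fasta")
    for key, value in entries.items():
        print(key, value)
```

Generating the entries from one prefix keeps the six paths consistent, at the cost of assuming a naming scheme that should be checked against the actual reference bundle.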
@@ -42,6 +42,12 @@ primary_contigs_list {{ reference_resources.primary_contigs_list }}
protein_coding_gtf {{ reference_resources.protein_coding_gtf }}
recalibrate_gq_model_file {{ reference_resources.aou_recalibrate_gq_model_file }}
reference_build {{ reference_resources.reference_build }}
reference_bwa_alt {{ reference_resources.reference_bwa_alt }}
reference_bwa_amb {{ reference_resources.reference_bwa_amb }}
reference_bwa_ann {{ reference_resources.reference_bwa_ann }}
reference_bwa_bwt {{ reference_resources.reference_bwa_bwt }}
reference_bwa_pac {{ reference_resources.reference_bwa_pac }}
reference_bwa_sa {{ reference_resources.reference_bwa_sa }}
reference_dict {{ reference_resources.reference_dict }}
reference_fasta {{ reference_resources.reference_fasta }}
reference_index {{ reference_resources.reference_index }}
@@ -3,8 +3,6 @@
"GATKSVPipelineSingleSample.batch" : "${this.sample_id}",
"GATKSVPipelineSingleSample.bam_or_cram_file" : "${this.bam_or_cram_file}",

"GATKSVPipelineSingleSample.use_melt": "false",

"GATKSVPipelineSingleSample.cutoffs" : "${workspace.ref_panel_cutoffs}",

"GATKSVPipelineSingleSample.genotype_pesr_pesr_sepcutoff" : "${workspace.ref_panel_genotype_pesr_pesr_sepcutoff}",
@@ -49,7 +47,14 @@
"GATKSVPipelineSingleSample.max_ref_panel_carrier_freq": 0.03,
"GATKSVPipelineSingleSample.manta_region_bed" : "${workspace.reference_manta_region_bed}",
"GATKSVPipelineSingleSample.manta_region_bed_index" : "${workspace.reference_manta_region_bed_index}",
"GATKSVPipelineSingleSample.mei_bed": "${workspace.mei_bed}",
"GATKSVPipelineSingleSample.sd_locs_vcf" : "${workspace.reference_sd_locs_vcf}",
"GATKSVPipelineSingleSample.reference_bwa_alt" : "${workspace.reference_bwa_alt}",
"GATKSVPipelineSingleSample.reference_bwa_amb" : "${workspace.reference_bwa_amb}",
"GATKSVPipelineSingleSample.reference_bwa_ann" : "${workspace.reference_bwa_ann}",
"GATKSVPipelineSingleSample.reference_bwa_bwt" : "${workspace.reference_bwa_bwt}",
"GATKSVPipelineSingleSample.reference_bwa_pac" : "${workspace.reference_bwa_pac}",
"GATKSVPipelineSingleSample.reference_bwa_sa" : "${workspace.reference_bwa_sa}",
"GATKSVPipelineSingleSample.reference_dict" : "${workspace.reference_dict}",
"GATKSVPipelineSingleSample.reference_fasta" : "${workspace.reference_fasta}",
"GATKSVPipelineSingleSample.reference_index" : "${workspace.reference_index}",
@@ -38,6 +38,12 @@ reference_name {{ reference_resources.name }}
reference_allosome_file {{ reference_resources.allosome_file }}
reference_autosome_file {{ reference_resources.autosome_file }}
reference_bin_exclude {{ reference_resources.bin_exclude }}
reference_bwa_alt {{ reference_resources.reference_bwa_alt }}
reference_bwa_amb {{ reference_resources.reference_bwa_amb }}
reference_bwa_ann {{ reference_resources.reference_bwa_ann }}
reference_bwa_bwt {{ reference_resources.reference_bwa_bwt }}
reference_bwa_pac {{ reference_resources.reference_bwa_pac }}
reference_bwa_sa {{ reference_resources.reference_bwa_sa }}
reference_cnmops_exclude_list {{ reference_resources.cnmops_exclude_list }}
reference_contig_ploidy_priors {{ reference_resources.contig_ploidy_priors }}
reference_copy_number_autosomal_contigs {{ reference_resources.copy_number_autosomal_contigs }}
@@ -2,6 +2,6 @@
"ApplyManualVariantFilter.vcf" : {{ test_batch.clean_vcf | tojson }},
"ApplyManualVariantFilter.prefix" : {{ test_batch.name | tojson }},
"ApplyManualVariantFilter.sv_base_mini_docker":{{ dockers.sv_base_mini_docker | tojson }},
"ApplyManualVariantFilter.bcftools_filter": "SVTYPE==\"DEL\" && COUNT(ALGORITHMS)==1 && ALGORITHMS==\"wham\"",
"ApplyManualVariantFilter.filter_name": "filter_wham_only_del"
"ApplyManualVariantFilter.bcftools_filter": "(SVTYPE==\"DEL\" && COUNT(ALGORITHMS)==1 && ALGORITHMS==\"wham\") || (ALT==\"<INS:ME:SVA>\" && COUNT(ALGORITHMS)==1 && ALGORITHMS==\"scramble\" && HIGH_SR_BACKGROUND==1)",
"ApplyManualVariantFilter.filter_name": "high_algorithm_fp_rate"
}
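
To make the updated `bcftools_filter` expression easier to read, here is a small Python sketch of the same condition applied to one record. It is an approximation for illustration only: the function name and argument layout are assumptions, and it does not reproduce how bcftools evaluates multi-valued INFO fields.

```python
def high_algorithm_fp_rate(svtype, alt, algorithms, high_sr_background):
    """Return True if a record matches either clause of the filter expression.

    svtype: value of INFO/SVTYPE, e.g. "DEL"
    alt: ALT allele string, e.g. "<INS:ME:SVA>"
    algorithms: list of values in INFO/ALGORITHMS
    high_sr_background: True if the HIGH_SR_BACKGROUND flag is set
    """
    single_caller = len(algorithms) == 1
    # Clause 1: deletions supported only by Wham.
    wham_only_del = svtype == "DEL" and single_caller and algorithms[0] == "wham"
    # Clause 2: SVA insertions supported only by Scramble with high SR background.
    scramble_only_sva = (
        alt == "<INS:ME:SVA>"
        and single_caller
        and algorithms[0] == "scramble"
        and bool(high_sr_background)
    )
    return wham_only_del or scramble_only_sva


if __name__ == "__main__":
    print(high_algorithm_fp_rate("DEL", "<DEL>", ["wham"], False))
    print(high_algorithm_fp_rate("INS", "<INS:ME:SVA>", ["scramble"], False))
```

Because the filter now covers two callers, the more general `filter_name` value `high_algorithm_fp_rate` replaces `filter_wham_only_del`.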
@@ -1,9 +1,4 @@
{
"GATKSVPipelineBatch.use_manta": "true",
"GATKSVPipelineBatch.use_wham": "true",
"GATKSVPipelineBatch.use_melt": "true",
"GATKSVPipelineBatch.use_scramble": "false",

"GATKSVPipelineBatch.name": {{ test_batch.name | tojson }},
"GATKSVPipelineBatch.ped_file": {{ test_batch.ped_file | tojson }},
"GATKSVPipelineBatch.samples": {{ test_batch.samples | tojson }},
@@ -54,6 +49,14 @@
"GATKSVPipelineBatch.GatherSampleEvidenceBatch.melt_standard_vcf_header": {{ reference_resources.melt_std_vcf_header | tojson }},
"GATKSVPipelineBatch.GatherSampleEvidenceBatch.wham_include_list_bed_file": {{ reference_resources.wham_include_list_bed_file | tojson }},
"GATKSVPipelineBatch.GatherSampleEvidenceBatch.preprocessed_intervals": {{ reference_resources.preprocessed_intervals | tojson }},
"GATKSVPipelineBatch.GatherSampleEvidenceBatch.mei_bed": {{ reference_resources.mei_bed | tojson }},

"GATKSVPipelineBatch.GatherSampleEvidenceBatch.reference_bwa_alt": {{ reference_resources.reference_bwa_alt | tojson }},
"GATKSVPipelineBatch.GatherSampleEvidenceBatch.reference_bwa_amb": {{ reference_resources.reference_bwa_amb | tojson }},
"GATKSVPipelineBatch.GatherSampleEvidenceBatch.reference_bwa_ann": {{ reference_resources.reference_bwa_ann | tojson }},
"GATKSVPipelineBatch.GatherSampleEvidenceBatch.reference_bwa_bwt": {{ reference_resources.reference_bwa_bwt | tojson }},
"GATKSVPipelineBatch.GatherSampleEvidenceBatch.reference_bwa_pac": {{ reference_resources.reference_bwa_pac | tojson }},
"GATKSVPipelineBatch.GatherSampleEvidenceBatch.reference_bwa_sa": {{ reference_resources.reference_bwa_sa | tojson }},

"GATKSVPipelineBatch.EvidenceQC.wgd_scoring_mask": {{ reference_resources.wgd_scoring_mask | tojson }},
"GATKSVPipelineBatch.EvidenceQC.run_vcf_qc": "false",
@@ -1,9 +1,4 @@
{
"GATKSVPipelineBatch.use_manta": "true",
"GATKSVPipelineBatch.use_wham": "true",
"GATKSVPipelineBatch.use_melt": "true",
"GATKSVPipelineBatch.use_scramble": "false",

"GATKSVPipelineBatch.name": {{ test_batch.name | tojson }},
"GATKSVPipelineBatch.ped_file": {{ test_batch.ped_file | tojson }},
"GATKSVPipelineBatch.samples": {{ test_batch.samples | tojson }},
@@ -48,6 +43,14 @@
"GATKSVPipelineBatch.GatherSampleEvidenceBatch.melt_standard_vcf_header": {{ reference_resources.melt_std_vcf_header | tojson }},
"GATKSVPipelineBatch.GatherSampleEvidenceBatch.wham_include_list_bed_file": {{ reference_resources.wham_include_list_bed_file | tojson }},
"GATKSVPipelineBatch.GatherSampleEvidenceBatch.preprocessed_intervals": {{ reference_resources.preprocessed_intervals | tojson }},
"GATKSVPipelineBatch.GatherSampleEvidenceBatch.mei_bed": {{ reference_resources.mei_bed | tojson }},

"GATKSVPipelineBatch.GatherSampleEvidenceBatch.reference_bwa_alt": {{ reference_resources.reference_bwa_alt | tojson }},
"GATKSVPipelineBatch.GatherSampleEvidenceBatch.reference_bwa_amb": {{ reference_resources.reference_bwa_amb | tojson }},
"GATKSVPipelineBatch.GatherSampleEvidenceBatch.reference_bwa_ann": {{ reference_resources.reference_bwa_ann | tojson }},
"GATKSVPipelineBatch.GatherSampleEvidenceBatch.reference_bwa_bwt": {{ reference_resources.reference_bwa_bwt | tojson }},
"GATKSVPipelineBatch.GatherSampleEvidenceBatch.reference_bwa_pac": {{ reference_resources.reference_bwa_pac | tojson }},
"GATKSVPipelineBatch.GatherSampleEvidenceBatch.reference_bwa_sa": {{ reference_resources.reference_bwa_sa | tojson }},

"GATKSVPipelineBatch.EvidenceQC.wgd_scoring_mask": {{ reference_resources.wgd_scoring_mask | tojson }},
"GATKSVPipelineBatch.EvidenceQC.run_vcf_qc": "false",