# Add genotype filtering Terra workflow configs and documentation #695

---

**README documentation changes**

@@ -24,8 +24,10 @@ A structural variation discovery pipeline for Illumina short-read whole-genome sequencing

* [GenotypeBatch](#genotype-batch) - Genotyping
* [RegenotypeCNVs](#regenotype-cnvs) - Genotype refinement (optional)
* [MakeCohortVcf](#make-cohort-vcf) - Cross-batch integration, complex event resolution, and VCF cleanup
* [JoinRawCalls](#join-raw-calls) - Merges unfiltered calls across batches
* [SVConcordance](#svconcordance) - Calculates genotype concordance with raw calls
* [FilterGenotypes](#filter-genotypes) - Performs genotype filtering
* [AnnotateVcf](#annotate-vcf) - Functional and allele frequency annotation
* [Module 09](#module09) - QC and Visualization
* Additional modules - Mosaic and de novo
* [CI/CD](#cicd)

> **Review:** Should we delete this?
> **Reply:** Yeah, I will clean out the README with other unrelated changes when I do a full README-to-website transfer later.

@@ -159,8 +161,10 @@ The pipeline consists of a series of modules that perform the following:

* [FilterBatch](#filter-batch): Variant filtering; outlier exclusion
* [GenotypeBatch](#genotype-batch): Genotyping
* [MakeCohortVcf](#make-cohort-vcf): Cross-batch integration; complex variant resolution and re-genotyping; VCF cleanup
* [JoinRawCalls](#join-raw-calls): Merges unfiltered calls across batches
* [SVConcordance](#svconcordance): Calculates genotype concordance with raw calls
* [FilterGenotypes](#filter-genotypes): Performs genotype filtering
* [AnnotateVcf](#annotate-vcf): Annotations, including functional annotation, allele frequency (AF) annotation, and AF annotation with external population callsets
* [Module 09](#module09): Visualization, including scripts that generate IGV screenshots and RD plots
* Additional modules to be added: de novo and mosaic scripts

@@ -471,23 +475,123 @@ Combines variants across multiple batches, resolves complex variants, re-genotypes

#### Outputs:
* Finalized "cleaned" VCF and QC plots

## <a name="join-raw-calls">JoinRawCalls</a>

Merges raw unfiltered calls across batches. Concordance between these genotypes and the joint call set can be indicative of variant quality and is used downstream for genotype filtering.

#### Prerequisites:
* [ClusterBatch](#cluster-batch)

#### Inputs:
* Clustered Manta, Wham, depth, Scramble, and/or MELT VCF URIs ([ClusterBatch](#cluster-batch))
* PED file
* Reference sequence

#### Outputs:
* VCF of clustered raw calls
* Ploidy table

## <a name="svconcordance">SVConcordance</a>

Computes genotype concordance metrics between all variants in the joint call set and the raw calls.
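
For intuition, here is a minimal sketch of a per-variant non-reference genotype concordance calculation. This is not the workflow's implementation (concordance is computed by GATK over clustered eval/truth variant pairs); the record matching and field layout below are simplified assumptions.

```python
# Minimal sketch: fraction of non-reference "eval" genotypes that match the
# genotype in an already-matched "truth" record. The real workflow clusters
# eval/truth variants before comparing; here the pairing is assumed.
def non_ref_genotype_concordance(eval_record, truth_record):
    matched, total = 0, 0
    for sample, eval_call in eval_record["genotypes"].items():
        truth_call = truth_record["genotypes"].get(sample)
        if truth_call is None or eval_call == (0, 0):
            continue  # only score non-reference eval genotypes with a truth counterpart
        total += 1
        matched += int(sorted(eval_call) == sorted(truth_call))
    return matched / total if total else None

# Made-up example records
eval_rec = {"genotypes": {"s1": (0, 1), "s2": (1, 1), "s3": (0, 0)}}
truth_rec = {"genotypes": {"s1": (0, 1), "s2": (0, 1), "s3": (0, 1)}}
print(non_ref_genotype_concordance(eval_rec, truth_rec))  # 0.5
```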

#### Prerequisites:
* [MakeCohortVcf](#make-cohort-vcf)
* [JoinRawCalls](#join-raw-calls)

> **Review:** We should probably update the README with the four subsections of MakeCohortVcf and change this prerequisite to CleanVcf. At the very least, the MakeCohortVcf section should describe the subworkflows.
> **Reply:** You're right. I think this is a bit beyond the scope of this PR, but I am going to create a new ticket for the website update and will mention this.

#### Inputs:
* Cleaned ("eval") VCF URI ([MakeCohortVcf](#make-cohort-vcf))
* Joined raw call ("truth") VCF URI ([JoinRawCalls](#join-raw-calls))
* Reference dictionary URI

> **Review:** We should make our docs more consistent about specifying URIs or not - this varies throughout the README. I would lean towards just listing the files without saying URI every time.

#### Outputs:
* VCF with concordance annotations

## <a name="filter-genotypes">FilterGenotypes</a>

Performs genotype quality recalibration using a machine learning model based on [xgboost](https://github.com/dmlc/xgboost), and filters genotypes.

The ML model uses the following features (a sketch of assembling such a feature vector appears after the list):

* Genotype properties:
  * Non-reference and no-call allele counts
  * Genotype quality (GQ)
  * Supporting evidence types (EV) and respective genotype qualities (PE_GQ, SR_GQ, RD_GQ)
  * Raw call concordance (CONC_ST)
* Variant properties:
  * Variant type (SVTYPE) and size (SVLEN)
  * FILTER status
  * Calling algorithms (ALGORITHMS)
  * Supporting evidence types (EVIDENCE)
  * Two-sided SR support flag (BOTHSIDES_SUPPORT)
  * Evidence overdispersion flag (PESR_GT_OVERDISPERSION)
  * SR noise flag (HIGH_SR_BACKGROUND)
  * Raw call concordance (STATUS, NON_REF_GENOTYPE_CONCORDANCE, VAR_PPV, VAR_SENSITIVITY, TRUTH_AF)
* Reference context with respect to UCSC Genome Browser tracks:
  * RepeatMasker
  * Segmental duplications
  * Simple repeats
  * K-mer mappability (umap_s100 and umap_s24)
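
The sketch below illustrates how a few of these fields could be pulled out of a concordance-annotated VCF with pysam to form a per-genotype feature vector. The field names come from the list above; the encodings, defaults, and input filename are assumptions, and this is not the pipeline's actual featurization code.

```python
# Illustrative only -- not the pipeline's featurization code.
import pysam

SVTYPE_CODES = {"DEL": 0, "DUP": 1, "INS": 2, "INV": 3, "BND": 4, "CPX": 5, "CNV": 6}

def _num(value, default=-1):
    """Unwrap 1-element tuples and map None/missing to a default scalar."""
    if isinstance(value, tuple):
        value = value[0] if value else None
    return default if value is None else value

def _fmt(sample_record, key):
    """FORMAT field lookup that tolerates fields absent from a record."""
    try:
        return _num(sample_record[key])
    except KeyError:
        return -1

def genotype_features(record, sample):
    gt = record.samples[sample]
    alleles = [a for a in (gt.allele_indices or ()) if a is not None]
    info = record.info
    return [
        sum(1 for a in alleles if a > 0),              # non-reference allele count
        _fmt(gt, "GQ"),                                # genotype quality
        _fmt(gt, "PE_GQ"), _fmt(gt, "SR_GQ"), _fmt(gt, "RD_GQ"),
        SVTYPE_CODES.get(info.get("SVTYPE"), -1),      # variant type
        _num(info.get("SVLEN")),                       # variant size
        int("BOTHSIDES_SUPPORT" in info),              # two-sided SR support flag
        int("PESR_GT_OVERDISPERSION" in info),         # overdispersion flag
        int("HIGH_SR_BACKGROUND" in info),             # SR noise flag
        _num(info.get("NON_REF_GENOTYPE_CONCORDANCE")),
    ]

vcf = pysam.VariantFile("concordance_annotated.vcf.gz")  # hypothetical input
features = [genotype_features(rec, s) for rec in vcf for s in rec.samples]
```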

For ease of use, we provide a model pre-trained on high-quality data with truth data derived from long-read calls:
```
gs://gatk-sv-resources-public/hg38/v0/sv-resources/resources/v1/gatk-sv-recalibrator.aou_phase_1.v1.model
```

> **Review:** Do we want to update this to the Phase 2 model at some point?
> **Reply:** Created #729

See the SV "Genotype Filter" section on page 34 of the [All of Us Genomic Quality Report C2022Q4R9 CDR v7](https://support.researchallofus.org/hc/en-us/articles/4617899955092-All-of-Us-Genomic-Quality-Report-ARCHIVED-C2022Q4R9-CDR-v7) for further details on model training.

All valid genotypes are annotated with a "scaled logit" (SL) score, which is rescaled to non-negative adjusted GQs on [1, 99]. Note that the rescaled GQs should *not* be interpreted as probabilities. Original genotype qualities are retained in the OGQ field.
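
For intuition only, the sketch below shows one way a scaled-logit value could be squashed onto the [1, 99] GQ range. The workflow's exact transform is not documented here, so the logistic form and the scale constant are illustrative assumptions.

```python
import math

def sl_to_adjusted_gq(sl, scale=50.0):
    """Illustrative only: squash a scaled-logit (SL) score through a logistic
    function and map the result onto the integer range [1, 99]. The actual
    transform used by FilterGenotypes may differ; do not interpret the output
    as a probability."""
    p = 1.0 / (1.0 + math.exp(-sl / scale))
    return max(1, min(99, int(round(99 * p))))

# Example: more positive SL -> higher adjusted GQ
for sl in (-200, -13, 0, 93, 150):
    print(sl, sl_to_adjusted_gq(sl))
```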

A more positive SL score indicates a higher probability that the given genotype is not homozygous for the reference allele. Genotypes are therefore filtered using SL thresholds that depend on SV type and size. This workflow also generates QC plots using the [MainVcfQc](https://github.com/broadinstitute/gatk-sv/blob/main/wdl/MainVcfQc.wdl) workflow to review call set quality (see below for recommended practices).

This workflow can be run in one of two modes:

1. (Recommended) The user explicitly provides a set of SL cutoffs through the `sl_filter_args` parameter, e.g.
```
"--small-del-threshold 93 --medium-del-threshold 150 --small-dup-threshold -51 --medium-dup-threshold -4 --ins-threshold -13 --inv-threshold -19"
```
Genotypes with SL scores less than the cutoffs are set to no-call (`./.`). The above values were taken directly from Appendix N of the [All of Us Genomic Quality Report C2022Q4R9 CDR v7](https://support.researchallofus.org/hc/en-us/articles/4617899955092-All-of-Us-Genomic-Quality-Report-ARCHIVED-C2022Q4R9-CDR-v7). Users should adjust the thresholds depending on data quality and desired accuracy. Please see the arguments in [this script](https://github.com/broadinstitute/gatk-sv/blob/main/src/sv-pipeline/scripts/apply_sl_filter.py) for all available options.

2. (Advanced) The user provides truth labels for a subset of non-reference calls, and SL cutoffs are automatically optimized. These truth labels should be provided as a JSON file in the following format:
```
{
  "sample_1": {
    "good_variant_ids": ["variant_1", "variant_3"],
    "bad_variant_ids": ["variant_5", "variant_10"]
  },
  "sample_2": {
    "good_variant_ids": ["variant_2", "variant_13"],
    "bad_variant_ids": ["variant_8", "variant_11"]
  }
}
```
where "good_variant_ids" and "bad_variant_ids" are lists of variant IDs corresponding to non-reference (i.e. het or hom-var) sample genotypes that are true positives and false positives, respectively. SL cutoffs are optimized by maximizing the [F-score](https://en.wikipedia.org/wiki/F-score) with "beta" parameter `fmax_beta`, which modulates the weight given to precision over recall (lower values give higher precision).
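
To make the optimization concrete, here is a small self-contained sketch that scans candidate SL cutoffs and keeps the one maximizing the F-beta score over labeled non-reference genotypes (beta plays the role of `fmax_beta`). The per-type and per-size stratification used by the workflow is omitted, and the data are made up.

```python
# Illustrative sketch: choose the SL cutoff that maximizes the F-beta score
# over genotypes labeled as true positives ("good") or false positives ("bad").
# Lower beta weights precision more heavily than recall.
def f_beta(precision, recall, beta):
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def optimize_sl_cutoff(scored_labels, beta=0.5):
    """scored_labels: list of (sl_score, is_true_positive) for non-ref genotypes."""
    best_cutoff, best_f = None, -1.0
    for cutoff in sorted({sl for sl, _ in scored_labels}):
        kept = [(sl, good) for sl, good in scored_labels if sl >= cutoff]
        tp = sum(1 for _, good in kept if good)
        fp = len(kept) - tp
        fn = sum(1 for sl, good in scored_labels if good and sl < cutoff)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f = f_beta(precision, recall, beta)
        if f > best_f:
            best_cutoff, best_f = cutoff, f
    return best_cutoff, best_f

# Made-up example: SL scores with truth labels (True = good variant call)
labels = [(-40, False), (-10, False), (5, True), (20, True), (35, False), (60, True)]
print(optimize_sl_cutoff(labels, beta=0.5))  # -> (5, ~0.79)
```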

In both modes, the workflow additionally filters variants based on the "no-call rate", the proportion of genotypes that were filtered in a given variant. Variants exceeding the `no_call_rate_cutoff` are assigned a `HIGH_NCR` filter status.
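
The sketch below illustrates both filtering layers on a single toy variant: genotypes whose SL falls below the cutoff become no-calls, and the variant is flagged `HIGH_NCR` if the resulting no-call rate exceeds the cutoff. This is a simplification of the pipeline's filtering script, and the default cutoff value shown is an assumption.

```python
# Illustrative sketch of the two filtering layers described above:
#  1) set genotypes with SL below the cutoff to no-call (./.)
#  2) flag the variant HIGH_NCR if its no-call rate exceeds no_call_rate_cutoff
NO_CALL = (None, None)  # rendered as ./. in a VCF

def filter_variant(genotypes, sl_scores, sl_cutoff, no_call_rate_cutoff=0.05):
    """genotypes/sl_scores: dicts keyed by sample ID; cutoff values are illustrative."""
    filtered = {}
    no_calls = 0
    for sample, gt in genotypes.items():
        if sl_scores.get(sample) is not None and sl_scores[sample] < sl_cutoff:
            filtered[sample] = NO_CALL
            no_calls += 1
        else:
            filtered[sample] = gt
    ncr = no_calls / len(genotypes) if genotypes else 0.0
    filters = ["HIGH_NCR"] if ncr > no_call_rate_cutoff else ["PASS"]
    return filtered, ncr, filters

gts = {"s1": (0, 1), "s2": (1, 1), "s3": (0, 1)}
sls = {"s1": 120, "s2": -40, "s3": 10}
print(filter_variant(gts, sls, sl_cutoff=93))  # s2 and s3 become ./. -> NCR 0.67 -> HIGH_NCR
```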

We recommend users observe the following basic criteria to assess the overall quality of the filtered call set:

* Number of PASS variants (excluding BND) between 7,000 and 11,000.
* At least 75% of variants in Hardy-Weinberg equilibrium (HWE); a minimal per-variant HWE check is sketched below. Note that this could be lower, depending on how closely the cohort adheres to the assumptions of the Hardy-Weinberg model. However, HWE is expected to at least improve after filtering.
* Low *de novo* inheritance rate (if applicable), typically 5-10%.

> **Review:** I think we'll want to expand on our docs for assessing callset QC when we release the featured workspace. In particular, we should document the expectations for the size distribution, since we rely heavily on that, and SVs by type per genome. That's probably out of scope for this PR, but I wanted to mention it since it's related.

These criteria can be assessed from the plots in the `main_vcf_qc_tarball` output, which is generated by default.
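
As a rough companion to the HWE criterion above (the QC plots are the authoritative check), here is a minimal chi-square HWE test for a single biallelic site; the significance threshold mentioned in the comment is a conventional value, not a pipeline parameter.

```python
# Rough sketch of a per-variant Hardy-Weinberg equilibrium check using a
# chi-square statistic on biallelic genotype counts. For intuition only.
def hwe_chi_square(n_hom_ref, n_het, n_hom_alt):
    n = n_hom_ref + n_het + n_hom_alt
    if n == 0:
        return None
    p = (2 * n_hom_ref + n_het) / (2 * n)  # reference allele frequency
    q = 1 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_hom_ref, n_het, n_hom_alt)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

# Chi-square above ~3.84 (1 degree of freedom, alpha = 0.05) suggests departure from HWE
print(hwe_chi_square(700, 250, 50))
```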

#### Prerequisites:
* [SVConcordance](#svconcordance)

#### Inputs:
* VCF with genotype concordance annotations URI ([SVConcordance](#svconcordance))
* Ploidy table URI ([JoinRawCalls](#join-raw-calls))
* GQRecalibrator model URI
* Either a set of SL cutoffs or truth labels

#### Outputs:
* Filtered VCF
* Call set QC plots (optional)
* Optimized SL cutoffs with filtering QC plots and data tables (if running mode [2] with truth labels)
* VCF with only SL annotation and GQ recalibration (before filtering)

## <a name="annotate-vcf">AnnotateVcf</a>
*Formerly Module08Annotation*

Adds annotations, such as the inferred function and allele frequencies of variants, to the final VCF.

---

**FilterGenotypes Terra workflow configuration template** (new file)

{
  "FilterGenotypes.vcf": "${this.concordance_vcf}",
  "FilterGenotypes.output_prefix": "${this.sample_set_set_id}",
  "FilterGenotypes.ploidy_table": "${this.ploidy_table}",
  "FilterGenotypes.gq_recalibrator_model_file": "${workspace.recalibrate_gq_model_file}",
  "FilterGenotypes.sl_filter_args": "--small-del-threshold 93 --medium-del-threshold 150 --small-dup-threshold -51 --medium-dup-threshold -4 --ins-threshold -13 --inv-threshold -19",

  "FilterGenotypes.genome_tracks": [
    {{ reference_resources.recalibrate_gq_genome_track_repeatmasker | tojson }},
    {{ reference_resources.recalibrate_gq_genome_track_segdup | tojson }},
    {{ reference_resources.recalibrate_gq_genome_track_simple_repeats | tojson }},
    {{ reference_resources.recalibrate_gq_genome_track_umap100 | tojson }},
    {{ reference_resources.recalibrate_gq_genome_track_umap24 | tojson }}
  ],
  "FilterGenotypes.recalibrate_gq_args": [
    "--keep-homvar false",
    "--keep-homref true",
    "--keep-multiallelic true",
    "--skip-genotype-filtering true",
    "--min-samples-to-estimate-allele-frequency -1"
  ],

  "FilterGenotypes.ped_file": "${workspace.cohort_ped_file}",
  "FilterGenotypes.primary_contigs_fai": "${workspace.primary_contigs_fai}",
  "FilterGenotypes.site_level_comparison_datasets": [
    {{ reference_resources.ccdg_abel_site_level_benchmarking_dataset | tojson }},
    {{ reference_resources.gnomad_v2_collins_site_level_benchmarking_dataset | tojson }},
    {{ reference_resources.hgsv_byrska_bishop_site_level_benchmarking_dataset | tojson }},
    {{ reference_resources.thousand_genomes_site_level_benchmarking_dataset | tojson }}
  ],
  "FilterGenotypes.runtime_override_plot_qc_per_family": {
    "mem_gb": 15,
    "disk_gb": 100
  },

  "FilterGenotypes.linux_docker": "${workspace.linux_docker}",
  "FilterGenotypes.gatk_docker": "${workspace.gq_recalibrator_docker}",
  "FilterGenotypes.sv_base_mini_docker": "${workspace.sv_base_mini_docker}",
  "FilterGenotypes.sv_pipeline_docker": "${workspace.sv_pipeline_docker}"
}
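
A note on the templating in these configs: `${this...}` and `${workspace...}` expressions are Terra input expressions resolved when the workflow is launched, while `{{ ... | tojson }}` expressions are Jinja placeholders filled in when the workspace configs are generated. As a hedged illustration (the file names and values structure below are assumptions, not the repository's actual build tooling), rendering such a template could look like:

```python
# Hedged illustration of rendering a Terra config template with jinja2.
# Jinja expressions ({{ ... | tojson }}) are filled from a values mapping,
# while Terra expressions (${this...}, ${workspace...}) remain plain strings.
import json
import jinja2

# Hypothetical values that would normally come from a reference-resources file
reference_resources = {
    "recalibrate_gq_genome_track_repeatmasker": "gs://bucket/repeatmasker.bed.gz",
    "recalibrate_gq_genome_track_segdup": "gs://bucket/segdup.bed.gz",
}

template = jinja2.Template("""{
  "FilterGenotypes.vcf": "${this.concordance_vcf}",
  "FilterGenotypes.genome_tracks": [
    {{ reference_resources.recalibrate_gq_genome_track_repeatmasker | tojson }},
    {{ reference_resources.recalibrate_gq_genome_track_segdup | tojson }}
  ]
}""")

rendered = template.render(reference_resources=reference_resources)
config = json.loads(rendered)  # valid JSON; Terra expressions stay as ordinary strings
print(json.dumps(config, indent=2))
```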

---

**JoinRawCalls Terra workflow configuration template** (new file)

{
  "JoinRawCalls.gatk_docker": "${workspace.gatk_docker}",
  "JoinRawCalls.sv_base_mini_docker": "${workspace.sv_base_mini_docker}",
  "JoinRawCalls.sv_pipeline_docker": "${workspace.sv_pipeline_docker}",

  "JoinRawCalls.clustered_depth_vcfs": "${this.sample_sets.clustered_depth_vcf}",
  "JoinRawCalls.clustered_depth_vcf_indexes": "${this.sample_sets.clustered_depth_vcf_index}",

  "JoinRawCalls.clustered_manta_vcfs": "${this.sample_sets.clustered_manta_vcf}",
  "JoinRawCalls.clustered_manta_vcf_indexes": "${this.sample_sets.clustered_manta_vcf_index}",

  "JoinRawCalls.clustered_wham_vcfs": "${this.sample_sets.clustered_wham_vcf}",
  "JoinRawCalls.clustered_wham_vcf_indexes": "${this.sample_sets.clustered_wham_vcf_index}",

  "JoinRawCalls.clustered_scramble_vcfs": "${this.sample_sets.clustered_scramble_vcf}",
  "JoinRawCalls.clustered_scramble_vcf_indexes": "${this.sample_sets.clustered_scramble_vcf_index}",

  "JoinRawCalls.FormatVcfForGatk.formatter_args": "--fix-end",

  "JoinRawCalls.ped_file": "${workspace.cohort_ped_file}",

  "JoinRawCalls.contig_list": "${workspace.primary_contigs_list}",
  "JoinRawCalls.reference_fasta": "${workspace.reference_fasta}",
  "JoinRawCalls.reference_fasta_fai": "${workspace.reference_index}",
  "JoinRawCalls.reference_dict": "${workspace.reference_dict}",

  "JoinRawCalls.prefix": "${this.sample_set_set_id}"
}

---

**SVConcordance Terra workflow configuration template** (new file)

{
  "SVConcordance.gatk_docker": "${workspace.gatk_docker}",
  "SVConcordance.sv_base_mini_docker": "${workspace.sv_base_mini_docker}",

  "SVConcordance.eval_vcf": "${this.cleaned_vcf}",
  "SVConcordance.truth_vcf": "${this.joined_raw_calls_vcf}",

  "SVConcordance.output_prefix": "${this.sample_set_set_id}",

  "SVConcordance.contig_list": "${workspace.primary_contigs_list}",
  "SVConcordance.reference_dict": "${workspace.reference_dict}"
}

---

**Terra workspace values template**

@@ -5,6 +5,7 @@ gatk_docker {{ dockers.gatk_docker }}
gatk_docker_pesr_override {{ dockers.gatk_docker_pesr_override }}
gcnv_gatk_docker {{ dockers.gatk_docker }}
genomes_in_the_cloud_docker {{ dockers.genomes_in_the_cloud_docker }}
gq_recalibrator_docker {{ dockers.gq_recalibrator_docker }}

> **Review:** Can we get this merged with the other GATK docker soon?
> **Reply:** Unlikely, since this tool is still on a branch.

linux_docker {{ dockers.linux_docker }}
manta_docker {{ dockers.manta_docker }}
samtools_cloud_docker {{ dockers.samtools_cloud_docker }}

@@ -39,6 +40,7 @@ preprocessed_intervals {{ reference_resources.preprocessed_intervals }}
primary_contigs_fai {{ reference_resources.primary_contigs_fai }}
primary_contigs_list {{ reference_resources.primary_contigs_list }}
protein_coding_gtf {{ reference_resources.protein_coding_gtf }}
recalibrate_gq_model_file {{ reference_resources.aou_recalibrate_gq_model_file }}
reference_build {{ reference_resources.reference_build }}
reference_dict {{ reference_resources.reference_dict }}
reference_fasta {{ reference_resources.reference_fasta }}

> **Review:** Would it be helpful to add this documentation to the website, too?
> **Reply:** Yes, we'll want everything on the website by the time we release the featured workspace. I think it makes sense to get the README and dashboard updated so we can update the template Terra workspace, then update the website after - does that sound good?