Refine complex variants and translocations #736

Merged: 28 commits, Oct 29, 2024

@epiercehoffman (Collaborator) commented Oct 21, 2024

Updates

This PR updates and integrates the ManualReview workflow originally authored by @xuefzhao and previously reviewed by @cwhelan into the main GATK-SV pipeline.

  • Add RefineComplexVariants workflow (after CleanVcf) to refine complex variants and perform filtering of complex variants and translocations based on PE and RD evidence reassessment
    • Updates were made to parametrize the required PE count, reduce lines of code and number of files, correct handling of certain complex subtypes, review only PASS CPX variants, remove extraneous functionalities, and require only inputs produced by previous steps of GATK-SV
  • Rescue mobile element deletions (LINE1 and HERVK) that were converted to BND for lack of depth evidence (5-10kb) at the end of CleanVcf
    • Updates were made to cap the size of potential ME DELs at 10kb and to add SVLEN to converted records
  • Set FILTER to PASS if empty in FilterGenotypes
  • Recalculate QUAL after genotype filtering
  • Sanitize header at end of FilterGenotypes
  • Add RefineComplexVariants to documentation
  • Add downstream steps to dockstore.yml
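The mobile element deletion rescue described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the record layout (a dict with `svtype`, `pos`, `end`, `info`), the function name `rescue_me_del`, the `<DEL:ME:...>` ALT string, and the positive sign of SVLEN are all assumptions made for the example. Only the 10kb cap, the added SVLEN, and the removal of UNRESOLVED_TYPE come from the description. How overlap with LINE1 or HERVK is determined is not shown here.

```python
MAX_ME_DEL_SIZE = 10_000  # 10 kb cap stated in the PR description


def rescue_me_del(record, me_type):
    """Return a DEL:ME copy of a BND record, or None if it is not eligible.

    `record` is a hypothetical dict with keys 'svtype', 'pos', 'end', 'info'.
    `me_type` is 'LINE1', 'HERVK', or None; the overlap test that produces
    it is outside the scope of this sketch.
    """
    if record["svtype"] != "BND" or me_type not in ("LINE1", "HERVK"):
        return None
    length = record["end"] - record["pos"]
    if length <= 0 or length > MAX_ME_DEL_SIZE:
        return None  # too large (or malformed) to be rescued

    info = dict(record["info"])
    info["SVLEN"] = length              # SVLEN is added to converted records
    info.pop("UNRESOLVED_TYPE", None)   # dropped, per the Questions section

    rescued = dict(record)
    rescued["svtype"] = "DEL"
    rescued["alt"] = f"<DEL:ME:{me_type}>"  # assumed ALT format
    rescued["info"] = info
    return rescued
```

The input record is copied rather than modified, so a caller can compare the record before and after conversion.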

Questions

  • Rescued ME DELs still have UNRESOLVED_TYPE - do we want to remove? [Yes - removed]
  • After recalculating QUAL, QUAL is . for mCNVs - is this ok? [Yes - left as-is]
  • Are we ready to update the pipeline diagram for v1.0? I can add it to this PR if so [Yes - added]
    [image: terra_workflow_diagram_v1.0]

For the future

  • Rewrite RefineComplexVariants as a GATK tool for better coding practices, efficiency, and test coverage
  • Integrate PE and depth evidence assessment for CPX and CTX into GenotypeComplexVariants and eliminate this downstream step
  • Check for LINE1 and HERVK overlap when requiring depth evidence for large CNVs rather than correcting post-hoc
  • Revisit conversion of some INS variants to CPX
  • Add more thorough assessment of PE evidence for CTX and CPX, including dispersion, interruptions, etc.

Testing

  • Validate all JSONs with womtool & Terra validation script, and verify that JSONs are present for FilterGenotypes, CleanVcf, and RefineComplexVariants (FilterGenotypes JSON was not produced after rebase)
  • Tested RefineComplexVariants on 1KG Dragen callset in Terra with only inputs produced by previous steps of GATK-SV. It succeeded and the outputs were as expected: site counts changed only for INS (a subset converted to CPX) and CPX (some gained from INS conversions, some filtered as UNRESOLVED), and per-sample counts of CPX and CTX decreased as shown below
[four images, not reproduced here]
  • Tested CleanVcf on the reference panel on cromwell and verified that DEL:ME records were produced
  • Tested FilterGenotypes on the reference panel on cromwell and verified that QUAL was now in [0,99] and empty filter statuses had been converted to PASS
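The two FilterGenotypes properties verified above (QUAL in [0,99] and no empty FILTER) could be checked with a few lines of Python. This is a sketch under stated assumptions, not the verification actually used for this PR: the function name `check_filtered_vcf_lines` is hypothetical, the input is assumed to be uncompressed, tab-delimited VCF text, and a QUAL of `.` is accepted because the Questions section says it is left as-is for mCNVs.

```python
def check_filtered_vcf_lines(lines):
    """Return (line_number, problem) pairs for records that fail the checks."""
    problems = []
    for n, line in enumerate(lines, start=1):
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            problems.append((n, "fewer than 7 columns"))
            continue
        qual, filt = fields[5], fields[6]  # VCF columns 6 (QUAL) and 7 (FILTER)
        if filt in ("", "."):
            problems.append((n, "empty FILTER"))
        if qual != ".":  # '.' is tolerated (mCNVs)
            try:
                value = float(qual)
            except ValueError:
                problems.append((n, "non-numeric QUAL"))
                continue
            if not 0 <= value <= 99:
                problems.append((n, "QUAL outside [0,99]"))
    return problems
```

An empty result means every record passed both checks. For a bgzipped VCF the lines could be supplied by `gzip.open(path, "rt")`.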

Ongoing testing [Complete]

  • Re-testing RefineComplexVariants with updated docker image (previously tested with script overrides) [Complete]
  • Re-testing RefineComplexVariants with no-call caching for cost/runtime estimate and resource monitoring [Complete, cost $9.94 and ran for about 7 hours for ~3k samples]
  • Re-testing RefineComplexVariants after linting changes [Complete]
  • Testing FilterGenotypes on outputs of RefineComplexVariants [Complete]

@mwalker174 (Collaborator) commented:

Thank you! Yes to all three questions, I think.

Comment on lines +64 to +68
16. `16-RefineComplexVariants`: Complex variant and translocation refinement
17. `17-JoinRawCalls`: Combines unfiltered calls (from step 5) across batches
18. `18-SVConcordance`: Annotates variants with genotype concordance against raw calls
19. `19-FilterGenotypes`: Performs genotype filtering to improve precision and generates QC plots
20. `20-AnnotateVcf`: Cohort VCF annotations, including functional annotation, allele frequency (AF) annotation, and AF annotation with external population callsets
@mwalker174 (Collaborator):

You may need to rebase since the FilterGenotypes documentation is merged now.

@epiercehoffman (Collaborator, Author):

I rebased yesterday after you merged - I didn't add these lines from scratch, just updated the numbering

for info in cpx_sv:
    breakpoints = info[0]
    if info[1][0] == 'ab' and info[1][1] == 'b^':  # delINV
        common_1 = ['tabix', 'PE_metrics', breakpoints[0] + ':' + str(breakpoints[1] - flank_back) + '-' + str(breakpoints[1] + flank_front), '| grep', 'sample', '| awk', "'{if ($1==$4", '&&', '$3=="+" && $6=="+"', '&&', '$5>' + str(breakpoints[3] - flank_back), '&&', '$5<' + str(breakpoints[3] + flank_front), ") print}' | sed -e 's/$/\\t", info[2], "/'", '>>', pe_evidence]
@mwalker174 (Collaborator):

☠️

@epiercehoffman (Collaborator, Author):

Yeah...
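For readers trying to follow the line under discussion: the `grep`, `awk`, and `sed` stages it assembles amount to a row filter, which can be sketched in plain Python as below. This is an illustration only, not code from this PR. The function name `select_delinv_pe` and its parameters are hypothetical, the input lines are assumed to have already been restricted to the first breakpoint window (the `tabix` stage), `'sample'` is assumed to be a placeholder for a sample ID, and the appended tag stands in for `info[2]`. The column meanings (chrom, position, strand for each mate) are inferred from the awk expression.

```python
def select_delinv_pe(pe_lines, sample, far_breakpoint, flank_back, flank_front, tag):
    """Keep same-chromosome +/+ pairs whose second mate lies near far_breakpoint.

    Mirrors the quoted pipeline stage by stage:
      grep -> the sample string must occur in the line
      awk  -> $1==$4, $3=="+", $6=="+", and $5 strictly inside the window
      sed  -> append a tab and the tag to each surviving line
    """
    lower = far_breakpoint - flank_back
    upper = far_breakpoint + flank_front
    selected = []
    for raw in pe_lines:
        line = raw.rstrip("\n")
        if sample not in line:
            continue
        fields = line.split()  # awk splits on whitespace by default
        if len(fields) < 6:
            continue
        chrom1, strand1 = fields[0], fields[2]
        chrom2, pos2, strand2 = fields[3], int(fields[4]), fields[5]
        if chrom1 == chrom2 and strand1 == "+" and strand2 == "+" and lower < pos2 < upper:
            selected.append(f"{line}\t{tag}")
    return selected
```

Expressed this way, each condition can be unit tested on a handful of literal lines, which is hard to do with the command built as a list of string fragments.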

Comment on lines 623 to 626
# disk is cheap, read/write speed is proportional to disk size, and disk IO is a significant time factor:
# in tests on large VCFs, memory usage is ~1.0 * input VCF size
# the biggest disk usage is at the end of the task, with input + output VCF on disk
Int cpu_cores = 2 # speed up compression / decompression of VCFs
@mwalker174 (Collaborator):

Is this all relevant or just copy+paste? I'm not sure if multiple cores helps unless they're explicitly requested in the CLI.

@epiercehoffman (Collaborator, Author):

Yeah this was just copy/paste, I can change to 1 CPU

@mwalker174 (Collaborator) left a comment:

Good to merge assuming FilterGenotypes tests complete.

@epiercehoffman epiercehoffman merged commit d532812 into main Oct 29, 2024
9 checks passed
@epiercehoffman epiercehoffman deleted the eph_refine_cpx branch October 29, 2024 16:34