Tool to detect CRAM base corruption caused by GATK issue 8768 #8819
Merged
6 commits
b33e1b7 Cloudify the existing file diagnostics framework. (cmnbroad)
b43709a 8768 file diagnostics. (cmnbroad)
b4c1855 Add .tsv outputs, exit code, echo output. (cmnbroad)
d3970e5 Check in missing .tsv test cases. (cmnbroad)
6d67d4e Fix reference path. (cmnbroad)
def1c7c Address review comments (droazen)
144 changes: 144 additions & 0 deletions
src/main/java/org/broadinstitute/hellbender/tools/CRAMIssue8768Detector.java
@@ -0,0 +1,144 @@
package org.broadinstitute.hellbender.tools;

import org.broadinstitute.barclay.argparser.*;
import org.broadinstitute.hellbender.tools.filediagnostics.CRAMIssue8768Analyzer;
import org.broadinstitute.hellbender.cmdline.CommandLineProgram;
import org.broadinstitute.hellbender.cmdline.StandardArgumentDefinitions;
import org.broadinstitute.hellbender.engine.GATKPath;
import picard.cmdline.programgroups.OtherProgramGroup;

/**
 * A diagnostic tool that analyzes a CRAM file to look for possible base corruption caused by
 * <a href="https://github.com/broadinstitute/gatk/issues/8768">GATK issue 8768</a>.
 *
 * <p>This issue affects GATK versions 4.3.0.0 through 4.5.0.0, and is fixed in GATK 4.6.0.0.</p>
 *
 * <p>This issue also affects Picard versions 2.27.3 through 3.1.1, and is fixed in Picard 3.2.0.</p>
 *
 * <p>The bug is triggered when writing a CRAM file using one of the affected GATK/Picard versions,
 * and both of the following conditions are met:</p>
 *
 * <ul>
 *     <li>At least one read is mapped to the very first base of a reference contig</li>
 *     <li>The file contains more than one CRAM container (10,000 reads) with reads mapped to that same reference contig</li>
 * </ul>
 *
 * <p>When both of these conditions are met, the resulting CRAM file may have corrupt containers containing reads
 * with an incorrect sequence.</p>
 *
 * <p>This tool writes a report to an output text file indicating whether the CRAM file appears to have read base corruption caused by issue 8768,
 * and listing the affected containers. By default, the output report will have a summary of the average mismatch rate for all suspected bad containers
 * and a few presumed good containers in order to determine if there is a large difference in the base mismatch rate.</p>
 *
 * <p>Optionally, a TSV file with the same information as the textual report, but in tabular form, can be written
 * using the "--output-tsv" argument.</p>
 *
 * <p>To analyze the base mismatch rate for ALL containers, use the "--verbose" argument.</p>
 *
 * <p>Works on files ending in .cram.</p>
 * <br />
 *
 * <h4>Sample Usage:</h4>
 * <pre>
 * gatk CRAMIssue8768Detector \
 *   -I input.cram \
 *   -O output_report.txt \
 *   -R reference.fasta
 * </pre>
 * <pre>
 * gatk CRAMIssue8768Detector \
 *   -I input.cram \
 *   -O output_report.txt \
 *   -R reference.fasta \
 *   --output-tsv output_report_as_table.tsv
 * </pre>
 */
@ExperimentalFeature
@WorkflowProperties
@CommandLineProgramProperties(
        summary = "Analyze a CRAM file to check for base corruption caused by GATK issue 8768",
        oneLineSummary = "Analyze a CRAM file to check for base corruption caused by GATK issue 8768",
        programGroup = OtherProgramGroup.class
)
public class CRAMIssue8768Detector extends CommandLineProgram {
    // default average mismatch rate threshold above which we consider the file to be corrupt
    private static final double DEFAULT_MISMATCH_RATE_THRESHOLD = 0.05;

    @Argument(fullName = StandardArgumentDefinitions.INPUT_LONG_NAME,
            shortName = StandardArgumentDefinitions.INPUT_SHORT_NAME,
            doc = "Input path of CRAM file to analyze",
            common = true)
    @WorkflowInput
    public GATKPath inputPath;

    @Argument(fullName = StandardArgumentDefinitions.OUTPUT_LONG_NAME,
            shortName = StandardArgumentDefinitions.OUTPUT_SHORT_NAME,
            doc = "Output diagnostics text file",
            common = true)
    @WorkflowOutput
    public GATKPath textOutputPath;

    public static final String OUTPUT_TSV__ARG_NAME = "output-tsv";
    @Argument(fullName = OUTPUT_TSV__ARG_NAME,
            shortName = OUTPUT_TSV__ARG_NAME,
            doc = "Output diagnostics tsv file",
            optional = true)
    @WorkflowOutput
    public GATKPath tsvOutputPath;

    @Argument(fullName = StandardArgumentDefinitions.REFERENCE_LONG_NAME,
            shortName = StandardArgumentDefinitions.REFERENCE_SHORT_NAME,
            doc = "Reference for the CRAM file",
            common = true)
    // The reference is read by the tool, not written, so it is marked as a workflow input.
    @WorkflowInput
    public GATKPath referencePath;

    public static final String MISMATCH_RATE_THRESHOLD_ARG_NAME = "mismatch-rate-threshold";
    @Argument(fullName = MISMATCH_RATE_THRESHOLD_ARG_NAME,
            shortName = MISMATCH_RATE_THRESHOLD_ARG_NAME,
            doc = "Mismatch rate threshold above which we consider the file to be corrupt",
            optional = true)
    public double mismatchRateThreshold = DEFAULT_MISMATCH_RATE_THRESHOLD;

    public static final String VERBOSE_ARG_NAME = "verbose";
    @Argument(fullName = VERBOSE_ARG_NAME,
            shortName = VERBOSE_ARG_NAME,
            doc = "Calculate and print the mismatch rate for all containers",
            optional = true)
    public boolean verbose = false;

    public static final String ECHO_ARG_NAME = "echo-to-stdout";
    @Argument(fullName = ECHO_ARG_NAME,
            shortName = ECHO_ARG_NAME,
            doc = "Echo text output to stdout",
            optional = true)
    public boolean echoToStdout = false;

    private CRAMIssue8768Analyzer cramAnalyzer;

    @Override
    protected Object doWork() {
        cramAnalyzer = new CRAMIssue8768Analyzer(
                inputPath,
                textOutputPath,
                tsvOutputPath,
                referencePath,
                mismatchRateThreshold,
                verbose,
                echoToStdout);
        cramAnalyzer.doAnalysis();
        return cramAnalyzer.getRetCode();
    }

    @Override
    protected void onShutdown() {
        if ( cramAnalyzer != null ) {
            try {
                cramAnalyzer.close();
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
    }
}
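The detector's arguments describe a decision based on an average base mismatch rate compared against a threshold (0.05 by default). The analyzer itself is not shown on this page, so the following is only a minimal sketch of that idea, not the CRAMIssue8768Analyzer implementation. The class name, the method names, and the ungapped base-by-base comparison are all assumptions made for illustration.

```java
import java.util.List;

/** Hypothetical sketch of a mismatch-rate check; not the GATK analyzer code. */
final class MismatchRateSketch {
    private MismatchRateSketch() {}

    /**
     * Fraction of positions where the read base differs from the reference base.
     * Assumes an ungapped, same-length comparison and ignores case.
     */
    static double mismatchRate(final byte[] readBases, final byte[] refBases) {
        if (readBases.length != refBases.length) {
            throw new IllegalArgumentException("read and reference spans must have equal length");
        }
        if (readBases.length == 0) {
            return 0.0;
        }
        int mismatches = 0;
        for (int i = 0; i < readBases.length; i++) {
            final char readBase = Character.toUpperCase((char) readBases[i]);
            final char refBase = Character.toUpperCase((char) refBases[i]);
            if (readBase != refBase) {
                mismatches++;
            }
        }
        return (double) mismatches / readBases.length;
    }

    /** Mean of the per-container mismatch rates, or 0.0 for an empty list. */
    static double average(final List<Double> rates) {
        return rates.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
    }

    /** True when the average rate of the suspected-bad containers exceeds the threshold. */
    static boolean looksCorrupt(final List<Double> suspectedBadRates, final double threshold) {
        return average(suspectedBadRates) > threshold;
    }
}
```

Comparing the average for suspected-bad containers with the average for a few presumed-good containers, as the tool's report does, gives a baseline against which a large difference stands out.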
@@ -11,16 +11,17 @@
  */
 public class BAIAnalyzer extends HTSAnalyzer {

-    public BAIAnalyzer(final GATKPath inputPath, final File outputFile) {
-        super(inputPath, outputFile);
+    public BAIAnalyzer(final GATKPath inputPath, final GATKPath outputPath) {
+        super(inputPath, outputPath);
     }

     /**
      * Run the analyzer for the file.
      */
     protected void doAnalysis() {
-        System.out.println(String.format("\nOutput written to %s\n", outputFile));
-        BAMIndexer.createAndWriteIndex(inputPath.toPath().toFile(), outputFile, true);
+        System.out.println(String.format("\nOutput written to %s\n", outputPath));
+        // note this method is not nio aware
+        BAMIndexer.createAndWriteIndex(inputPath.toPath().toFile(), new File(outputPath.getRawInputString()), true);
     }

     @Override

Review comment (on the "not nio aware" note): Should probably note this limitation in the class-level docs as well
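The "not nio aware" note means the output location is turned into a `java.io.File`, which can only name a local file. One way to make that limitation explicit is to reject non-local locations before building the `File`. The sketch below uses only the JDK; the class and method names are hypothetical and this is not code from the PR. It assumes POSIX-style local paths.

```java
import java.io.File;
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical guard for APIs that accept only java.io.File; not part of the PR. */
final class LocalFileSketch {
    private LocalFileSketch() {}

    /**
     * Convert a raw path string to a local File, rejecting anything that names
     * a non-local location (for example a gs:// or hdfs:// URI).
     */
    static File toLocalFile(final String rawPath) {
        final URI uri;
        try {
            uri = new URI(rawPath);
        } catch (URISyntaxException e) {
            // Not parseable as a URI (for example it contains spaces): treat it as a plain local path.
            return new File(rawPath);
        }
        final String scheme = uri.getScheme();
        if (scheme == null) {
            // No scheme at all, so this is an ordinary relative or absolute path.
            return new File(rawPath);
        }
        if ("file".equalsIgnoreCase(scheme)) {
            // new File(URI) itself throws IllegalArgumentException for malformed file: URIs.
            return new File(uri);
        }
        throw new IllegalArgumentException(
                "Only local files are supported here, but got: " + rawPath);
    }
}
```

Failing early with a clear message is usually easier to diagnose than letting a cloud URI be silently interpreted as a relative local path.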
Add a 1-2 paragraph description of the bug here, including under what circumstances it gets triggered, how it manifests, and what versions of GATK/Picard it affects. We should also post this information in a comment to issue 8768.
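The trigger conditions listed in the detector's class documentation (a read mapped to the first base of a contig, and more than one container with reads on that contig) can be expressed as a simple screening predicate. This is a sketch under stated assumptions only: `ContainerSummary` and `contigsAtRisk` are hypothetical names, each container is assumed to hold reads from a single contig, and meeting the conditions only means a file written by an affected version may be corrupt, not that it is.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical screening of the issue 8768 trigger conditions; not GATK code. */
final class Issue8768TriggerSketch {
    private Issue8768TriggerSketch() {}

    /** Minimal per-container summary (assumed shape, for illustration only). */
    static final class ContainerSummary {
        final String contig;
        final int minAlignmentStart; // 1-based start of the earliest read in the container

        ContainerSummary(final String contig, final int minAlignmentStart) {
            this.contig = contig;
            this.minAlignmentStart = minAlignmentStart;
        }
    }

    /** Contigs that satisfy both trigger conditions, in sorted order. */
    static Set<String> contigsAtRisk(final List<ContainerSummary> containers) {
        final Map<String, Integer> containerCounts = new HashMap<>();
        final Set<String> hasReadAtFirstBase = new HashSet<>();
        for (final ContainerSummary container : containers) {
            containerCounts.merge(container.contig, 1, Integer::sum);
            if (container.minAlignmentStart == 1) {
                hasReadAtFirstBase.add(container.contig);
            }
        }
        final Set<String> atRisk = new TreeSet<>();
        for (final String contig : hasReadAtFirstBase) {
            // Condition 2: more than one container has reads on this contig.
            if (containerCounts.get(contig) > 1) {
                atRisk.add(contig);
            }
        }
        return atRisk;
    }
}
```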