PrintReads introduces N bases when encoding some CRAMs and changes sequence #8768

ilyasoifer · 2024-04-09T09:36:07Z

Bug Report

Affected tool(s) or class(es)

gatk PrintReads, picard MarkDuplicates

Affected version(s)

4.4.0.0 (htsjdk 3.0.5, picard 3.0.0)

Description

I apologize in advance if this issue belongs to htsjdk. When we work with some of the CRAMs and pass them through PrintReads or picard MarkDuplicates, "N" bases get introduced.

We think the problem happens when PrintReads writes the CRAM rather than when it reads it, because it does not happen if the output of PrintReads is a BAM.

We also noticed that this issue does not happen with earlier GATK (4.2.6.1), HTSJDK 2.24.1.

Happy to share the input files

Steps to reproduce

gatk PrintReads --input ultMerge.mt.cram --output ultMerge.mt.printreads.cram -R /data2/reference/Homo_sapiens_assembly19_1000genomes_decoy/Homo_sapiens_assembly19_1000genomes_decoy.fasta

gatk PrintReads --input ultMerge.mt.cram --output ultMerge.mt.printreads.bam -R /data2/reference/Homo_sapiens_assembly19_1000genomes_decoy/Homo_sapiens_assembly19_1000genomes_decoy.fasta

samtools view ultMerge.mt.gatk_printreads.cram | grep "038958_1-Z0011-5346565226"
038958_1-Z0011-5346565226	0	MT	3470	60	54M	*	0	0	GTGGTTTTTTTNTNTTTTGTTTTTTTNTTTTTGTGTTTTGTTTTTGTGTTTGTT	DDDDDDDDDDDDDDDDDDD:DD:DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD	AS:i:54	X1:i:0	XD:Z:CATGAGG_GGTGATC	a3:i:92	bi:Z:5346565226	f1:Z:CATGAGG	f2:Z:GGTGATC	i1:Z:Q44	i2:Z:Q27	pr:i:22	pt:i:15	px:i:3813	py:i:1262	rq:f:0.03	si:i:3750	tm:Z:AQ	tq:i:195	MD:Z:0T0C0T0T0C0A0C0C0A0A0A0G0A0G0C0C0C0C0T0A0A0A0A0C0C0C0G0C0C0A0C0A0T0C0T0A0C0C0A0T0C0A0C0C0C0T0C0T0A0C0A0T0C0A0	NM:i:54	RG:Z:Z0011


samtools view ultMerge.mt.gatk_printreads.bam | grep "038958_1-Z0011-5346565226"

038958_1-Z0011-5346565226	0	MT	3470	60	54M	*	0	0	TCTTCACCAAAGAGCCCCTAAAACCCGCCACATCTACCATCACCCTCTACATCA	DDDDDDDDDDDDDDDDDDD:DD:DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD	X1:i:0	f1:Z:CATGAGG	i1:Z:Q44	f2:Z:GGTGATC	i2:Z:Q27	a3:i:92	XD:Z:CATGAGG_GGTGATC	RG:Z:Z0011	AS:i:54	bi:Z:5346565226	si:i:3750	tm:Z:AQ	rq:f:0.03	tq:i:195	pr:i:22	pt:i:15	px:i:3813	py:i:1262

Expected behavior

BAM and CRAM outputs should behave the same

Actual behavior

BAM and CRAM outputs behave differently


gokalpcelik · 2024-04-09T12:04:42Z

These reads don't even look the same?

GTGGTTTTTTTNTNTTTTGTTTTTTTNTTTTTGTGTTTTGTTTTTGTGTTTGTT
TCTTCACCAAAGAGCCCCTAAAACCCGCCACATCTACCATCACCCTCTACATCA
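
To put a number on how different the two reads above are, a minimal Python sketch can count the positions where they disagree. The helper name `count_mismatches` is made up for illustration; the two strings are the CRAM and BAM sequences quoted in this comment.

```python
def count_mismatches(a: str, b: str) -> int:
    """Return the number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


# Sequence of the read as returned from the PrintReads CRAM output.
cram_seq = "GTGGTTTTTTTNTNTTTTGTTTTTTTNTTTTTGTGTTTTGTTTTTGTGTTTGTT"
# Sequence of the same read as returned from the PrintReads BAM output.
bam_seq = "TCTTCACCAAAGAGCCCCTAAAACCCGCCACATCTACCATCACCCTCTACATCA"

mismatches = count_mismatches(cram_seq, bam_seq)
print(f"{mismatches} of {len(bam_seq)} bases differ")
```

The count can be compared against the NM:i:54 tag that samtools reports for the CRAM copy of the read.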

ilyasoifer · 2024-04-09T12:18:47Z

That is true! Should I update the title of the issue to reflect this?

gokalpcelik · 2024-04-09T12:27:35Z

Can you check the REF_CACHE? Or alternatively you may provide samtools the reference fasta during cram view.

ilyasoifer · 2024-04-09T12:42:39Z

Thanks @gokalpcelik!
Seems that -T does not affect much:

samtools view -T /data2/reference/Homo_sapiens_assembly19_1000genomes_decoy/Homo_sapiens_assembly19_1000genomes_decoy.fasta ultMerge.mt.gatk_printreads.cram | grep "038958_1-Z0011-5346565226"
038958_1-Z0011-5346565226	0	MT	3470	60	54M	*	0	0	GTGGTTTTTTTNTNTTTTGTTTTTTTNTTTTTGTGTTTTGTTTTTGTGTTTGTT	DDDDDDDDDDDDDDDDDDD:DD:DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD	AS:i:54	X1:i:0	XD:Z:CATGAGG_GGTGATC	a3:i:92	bi:Z:5346565226f1:Z:CATGAGG	f2:Z:GGTGATC	i1:Z:Q44	i2:Z:Q27	pr:i:22	pt:i:15	px:i:3813	py:i:1262	rq:f:0.03	si:i:3750	tm:Z:AQ	tq:i:195	MD:Z:0T0C0T0T0C0A0C0C0A0A0A0G0A0G0C0C0C0C0T0A0A0A0A0C0C0C0G0C0C0A0C0A0T0C0T0A0C0C0A0T0C0A0C0C0C0T0C0T0A0C0A0T0C0A0	NM:i:54	RG:Z:Z0011

cmnbroad · 2024-04-09T13:01:09Z

@ilyasoifer Is there any way I can access the original CRAM (or better yet, a small subset of it consisting of just MT) that illustrates this issue, and the reference? It might be hard to debug without that. If that's not possible, a few suggestions: can you try using PrintReads to write the original CRAM (I would try just MT) first to a CRAM, then to a SAM, and also the original CRAM to a SAM, and see how those compare? It would also be useful to see what that read looks like if you use samtools view on the ORIGINAL CRAM. Do you know what software/version was used to write the original CRAM?

ilyasoifer · 2024-04-09T13:13:12Z

Hi @cmnbroad - thanks! I can upload to one of the buckets that are shared between Ultima and the Broad.
Could you provide your gcp account so I can give you permissions?
I am guessing that Megan Shand can share it through our joint slack channel if you prefer.

The original file was created using samtools v1.17.
I provided the results of writing BAM and CRAM with PrintReads above. Is that helpful, or would you prefer SAM? And if I just run
samtools view ultMerge.mt.cram, I get:

038958_1-Z0011-5346565226	0	MT	3470	60	54M	*	0	0	TCTTCACCAAAGAGCCCCTAAAACCCGCCACATCTACCATCACCCTCTACATCA	DDDDDDDDDDDDDDDDDDD:DD:DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD	bi:Z:5346565226	rq:f:0.03	pr:i:22	pt:i:15	px:i:3813	py:i:1262	si:i:3750	tq:i:195	tm:Z:AQ	i1:Z:Q44	f1:Z:CATGAGG	i2:Z:Q27	f2:Z:GGTGATC	a3:i:92	XD:Z:CATGAGG_GGTGATC	X1:i:0	AS:i:54	MD:Z:54	NM:i:0	RG:Z:Z0011
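
For comparing records like the one above across files, a small Python sketch can pull the NM tag out of a `samtools view` text line and turn it into a per-read mismatch fraction. It assumes the standard tab-separated SAM layout (SEQ in column 10, optional tags from column 12 on). The helper names are hypothetical, and the two example records are abbreviated copies of the reads in this thread, with qualities replaced by `*` and most tags dropped.

```python
def parse_sam_line(line: str) -> dict:
    """Split one SAM text line into the fields needed here plus a tag dictionary."""
    fields = line.rstrip("\n").split("\t")
    tags = {}
    for tag in fields[11:]:
        name, typ, value = tag.split(":", 2)
        # Only integer tags are converted; every other type is kept as text.
        tags[name] = int(value) if typ == "i" else value
    return {"name": fields[0], "pos": int(fields[3]), "seq": fields[9], "tags": tags}


def mismatch_fraction(record: dict):
    """NM divided by read length, or None when the record has no NM tag."""
    nm = record["tags"].get("NM")
    if nm is None:
        return None
    return nm / len(record["seq"])


# Abbreviated record from the original CRAM (NM:i:0).
original = parse_sam_line("\t".join([
    "038958_1-Z0011-5346565226", "0", "MT", "3470", "60", "54M", "*", "0", "0",
    "TCTTCACCAAAGAGCCCCTAAAACCCGCCACATCTACCATCACCCTCTACATCA", "*",
    "MD:Z:54", "NM:i:0", "RG:Z:Z0011",
]))
# Abbreviated record from the CRAM written by PrintReads (NM:i:54).
rewritten = parse_sam_line("\t".join([
    "038958_1-Z0011-5346565226", "0", "MT", "3470", "60", "54M", "*", "0", "0",
    "GTGGTTTTTTTNTNTTTTGTTTTTTTNTTTTTGTGTTTTGTTTTTGTGTTTGTT", "*",
    "NM:i:54", "RG:Z:Z0011",
]))

print(mismatch_fraction(original), mismatch_fraction(rewritten))
```

NM is an edit distance, so it equals the mismatch count only for reads without indels, such as this 54M alignment.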

cmnbroad · 2024-04-09T13:30:06Z

@ilyasoifer [email protected]. And don't worry about doing the PrintReads conversions I requested - if I have access to the original file and the reference I can debug this directly.

ilyasoifer · 2024-04-09T14:13:34Z

@cmnbroad, OK, great. Shared the files with Megan, she will transfer them over!

ilyasoifer · 2024-04-09T14:58:26Z

@cmnbroad - the reference is from here: gs://gcp-public-data--broad-references/Homo_sapiens_assembly19_1000genomes_decoy/

cmnbroad · 2024-04-09T22:09:01Z

Thanks for reporting this @ilyasoifer - it looks like a serious bug. I see what's going on and am working on a fix.

ilyasoifer · 2024-04-10T05:15:46Z

Thank you, good to hear that it was useful

ilyasoifer · 2024-05-10T09:48:16Z

@cmnbroad - hope all is well. I was wondering if there is any ETA when the fix that you made will be released?

Thanks!
Ilya

droazen · 2024-05-10T18:09:49Z

@ilyasoifer It will be part of the GATK 4.6 release, which should come out this month. We've also implemented a cram scanning tool that users can use to scan their crams to see if any are affected.

ilyasoifer · 2024-05-12T08:24:28Z

Ah, great, thanks for updating me!

droazen · 2024-06-29T19:22:40Z

Summary information about this bug:

This issue affects GATK versions 4.3.0.0 through 4.5.0.0, and is fixed in GATK 4.6.0.0. The PR with the fix is: #8900

This issue also affects Picard versions 2.27.3 through 3.1.1, and is fixed in Picard 3.2.0.

This bug is triggered when writing a CRAM file using one of the affected GATK/Picard versions, and both of the following conditions are met:

* At least one read is mapped to the very first base of a reference contig
* The file contains more than one CRAM container (10,000 reads) with reads mapped to that same reference contig

When both of these conditions are met, the resulting CRAM file may have corrupt containers associated with that contig containing reads with an incorrect sequence.
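
As a rough illustration of the two trigger conditions, the following Python sketch flags contigs that have a read starting at position 1 and more reads than fit in a single container. This is a simplified heuristic, not the logic of the actual detector: it assumes 10,000 reads per container as stated above, whereas the real container layout depends on the writer. The function name and the synthetic input are hypothetical.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

# Reads per CRAM container as quoted in the summary; treated as an assumption here.
READS_PER_CONTAINER = 10_000


def contigs_at_risk(alignments: Iterable[Tuple[str, int]]) -> Set[str]:
    """Return contigs meeting both trigger conditions.

    `alignments` yields (contig, 1-based start position) pairs for mapped reads.
    """
    reads_per_contig = Counter()
    starts_at_first_base = set()
    for contig, pos in alignments:
        reads_per_contig[contig] += 1
        if pos == 1:
            starts_at_first_base.add(contig)
    # Condition 1: a read at the first base. Condition 2: more than one container's worth of reads.
    return {c for c in starts_at_first_base
            if reads_per_contig[c] > READS_PER_CONTAINER}


# Synthetic input: MT has a read at position 1 and 10,001 reads in total;
# chr1 has many reads but none at position 1; alt1 has a read at position 1 but few reads.
example = [("MT", 1)] + [("MT", 100 + i) for i in range(10_000)]
example += [("chr1", 10_001 + i) for i in range(20_000)]
example += [("alt1", 1), ("alt1", 50)]

print(contigs_at_risk(example))
```

A contig reported by such a check is only a candidate for corruption; whether the written file is actually affected still has to be confirmed on the CRAM itself.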

Since many common references such as hg38 have N's at the very beginning of the autosomes and X/Y, many pipelines will not be affected by this bug. However, users of a telomere-to-telomere reference, users doing mitochondrial calling, and users with reads aligned to the alt sequences will want to scan their CRAM files for possible corruption.

The other mitigating circumstance is that when a CRAM is affected, the signal will be overwhelmingly obvious, with the mismatch rate typically jumping from sub-1% to 80-90% for the affected regions, making it likely to be caught by standard QC processes.
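
The size of that jump means even a crude check can separate affected regions from healthy ones. A minimal Python sketch, assuming mismatch and base counts have already been collected per container; the 0.5 threshold and the example counts are illustrative values chosen to sit between the two regimes, not numbers from this issue.

```python
from typing import List, Sequence, Tuple


def mismatch_rates(blocks: Sequence[Tuple[int, int]]) -> List[float]:
    """Convert (mismatched_bases, total_bases) pairs into mismatch rates."""
    return [mism / total if total else 0.0 for mism, total in blocks]


def flag_suspect_blocks(blocks: Sequence[Tuple[int, int]],
                        threshold: float = 0.5) -> List[int]:
    """Return indices of blocks whose mismatch rate exceeds the threshold."""
    return [i for i, rate in enumerate(mismatch_rates(blocks)) if rate > threshold]


# Illustrative counts only: two blocks below 1% and one block at 85%.
blocks = [(2_700, 540_000), (459_000, 540_000), (3_240, 540_000)]
flagged = flag_suspect_blocks(blocks)
print(flagged)
```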

A CRAM scanning tool called CRAMIssue8768Detector that can detect whether a particular CRAM file is affected by this bug was added in #8819, and was released as part of GATK 4.6.0.0

Included a unit test to check for the presence of the fix in HTSJDK 4.1.1 for the CRAM base corruption bug reported in #8768 Resolves #8768

A new diagnostic tool, CRAMIssue8768Detector, that analyzes a CRAM file to look for possible base corruption caused by #8768

This issue affects GATK versions 4.3.0.0 through 4.5.0.0, and is fixed in GATK 4.6.0.0. This issue also affects Picard versions 2.27.3 through 3.1.1, and is fixed in Picard 3.2.0.

The bug is triggered when writing a CRAM file using one of the affected GATK/Picard versions, and both of the following conditions are met:

* At least one read is mapped to the very first base of a reference contig
* The file contains more than one CRAM container (10,000 reads) with reads mapped to that same reference contig

When both of these conditions are met, the resulting CRAM file may have corrupt containers containing reads with an incorrect sequence.

This tool writes a report to an output text file indicating whether the CRAM file appears to have read base corruption caused by issue 8768, and listing the affected containers. By default, the output report will have a summary of the average mismatch rate for all suspected bad containers and a few presumed good containers in order to determine if there is a large difference in the base mismatch rate.

Optionally, a TSV file with the same information as the textual report, but in tabular form, can be written using the "--output-tsv" argument.

---------

Co-authored-by: David Roazen <[email protected]>

ilyasoifer assigned meganshand Apr 9, 2024

ilyasoifer changed the title ~~PrintReads introduces N bases when encoding some CRAMs~~ PrintReads introduces N bases when encoding some CRAMs and changes sequence Apr 9, 2024

droazen assigned cmnbroad Apr 9, 2024

droazen added PRIORITY_HIGH cram labels Apr 9, 2024

cmnbroad mentioned this issue Apr 9, 2024

Fix CRAMReferenceRegion updating. samtools/htsjdk#1708

Merged

droazen self-assigned this May 22, 2024

droazen added a commit that referenced this issue Jun 29, 2024

Update HTSJDK to 4.1.1 and Picard to 3.2.0

a070c47

Included a unit test to check for the presence of the fix in HTSJDK 4.1.1 for the CRAM base corruption bug reported in #8768 Resolves #8768

droazen mentioned this issue Jun 29, 2024

Update HTSJDK to 4.1.1 and Picard to 3.2.0 #8900

Merged

droazen closed this as completed in #8900 Jun 29, 2024

droazen closed this as completed in 64348bc Jun 29, 2024

broadinstitute deleted a comment Aug 2, 2024
