Extracting methylation data from aligned BAMs #63

jameson-orvis · 2024-06-19T21:05:52Z

Ask away!

Hi! Quick question on extracting methylation data from aligned reads. Our raw bams have been methylation called already, so it seems like the easiest way to extract methylation information would be to use modkit on the aligned bams at {alias}.ns.bam. However, using "modkit pileup" gives errors of the form

failed to get modbase info for record 1977e594-f368-4232-a0de-cb0dea59e5b6:0079:0393, Bad Input: ML tag malformed

The code for pore-c-py seems like it is designed to pass modified bam information to the digested aligned output bams, so I am confused as to why this is happening. It may be an issue with the way I am calling modkit, but before I attempt troubleshooting further I want to ask if it should be possible to use modkit to extract methylation data from BAMs aligned from the pipeline, or if there is another preferred I should be processing methylation calls. Thank you!

sarahjeeeze · 2024-06-21T09:20:51Z

Hi thanks for letting us know this, the ML tag is not meant to be malformed we will investigate and get back to you

Henrikkoe · 2024-07-15T15:55:59Z

Hi! I am also interested in extracting the methylation information and was wondering if there is anything new about this topic? I realised that you are using the ML:Z: tag but couldnt find any description for this in the SAM format specification. Could you maybe commend why you use this tag and not the ML:B:C?

cjw85 · 2024-07-15T19:13:45Z

Just looking at the code, I think you've hit on the error @Henrikkoe. The code was adapted to use pysam from code that directly wrote SAM text. The value of the tag is being created as a string and passed to pysam which gladly writes the string as a Z format tag rather than a B format tag. We will need to change this.

Henrikkoe · 2024-07-16T15:18:56Z

I just quickly changed the tags from ML:Z: to ML:B:C for a test set and got this error for "modkit pileup":

failed to get modbase info for record 55d81b4f-3394-4430-b554-1ada22653566:0477:0802, Bad Input: MN tag length 4874 and seq length 325 don't match

So I think for each alignment within a concatemer the MN tag should be updated according the alignment length.

cjw85 · 2024-07-17T11:15:27Z

Thanks @Henrikkoe,

That sounds like a secondary issue, after chopping up the read we need to invalidate the MN tag. This is an additional tag that was added to catch exactly the situation where a process might have trimmed a read but not taken care to trim MM and ML!

You should be able to test this further for us. The pore_c_py digest command has an option --remove_tags which could be added to the nextflow code to drop the tag from the trimmed reads: https://github.com/epi2me-labs/wf-pore-c/blob/master/modules/local/pore-c.nf#L65

We should be able to recalculate and amend MN (worst case we can discard it) for the monomer reads.

Henrikkoe · 2024-07-19T17:55:45Z

I did a small check and the --remove_tags option seems to remove as well the ML and MM tags. Also when specifying the MN tag.

cjw85 · 2024-07-19T18:07:53Z

I may have confused things: yesterday I made a release of pore-c-py that ensures MN is updated correctly. In the process I changed the behaviour to consistently remove all the mod tags if one is removed (because it makes no sense to leave them in a partial state).

Henrikkoe · 2024-07-22T18:28:34Z

I am still a bit confused about the tags that are removed and added at which time during processing. If I understood it correctly then is the input for pore-c-py digest an unaligned bam file with the called base modifications. At this moment in the pipeline there was no alignment and only the digest of the concatemer into monomers. These monomers can be alignments later in the processing and basically representing a partial state/alignment of the concatemer. Then the sequence/alignment length (MN) should be updated according to the new length. This is what I dont see in the bam file entries but I see (except of the ML:Z: to ML:B:C tag label) a correct representation of ML and MM tags. So the ML and MM tags are trimmed correctly. Only all alignments of a concatemer then have a wrong MN tag.

Here a truncated representation of a bam file entry:
653cba6d-c7e7-45bc-a49c-0011e16626e5:0717:1104 MN:i:2198
But I think it should be:
653cba6d-c7e7-45bc-a49c-0011e16626e5:0717:1104 MN:i:387

cjw85 · 2024-07-22T18:35:14Z

You are correct: in the current release the MN tag is not being updated correctly (and the ML tag is a string tag not and array tag). We need to make a new release to correct these two issues.

We've corrected the code in the pore-c-py package but not updated wf-pore-c to use that new code.

Henrikkoe · 2024-07-23T14:41:28Z

Ah okay I see, thank you! Do you already have an estimate when the wf-pore-c will be updated?

sarahjeeeze · 2024-07-29T15:44:35Z

Hi, this is updated now in the latest release v1.2.0

jameson-orvis · 2024-07-29T16:13:55Z

Thank you so much!!!

Henrikkoe · 2024-07-29T16:50:59Z

Awesome! Thanks

Henrikkoe · 2024-08-02T14:25:49Z

It seems that there is still an issue with a not correctly trimmed ML and MM tag for some reads/alignments.

5c0b5e95-4016-4c3e-9d36-227502762cd4:0099:0462 MN:i:363 correct
5c0b5e95-4016-4c3e-9d36-227502762cd4:0600:0658 MN:i:6040 not trimmed ML and MM, wrong MN tag
5c0b5e95-4016-4c3e-9d36-227502762cd4:2568:2748 MN:i:180 correct
5c0b5e95-4016-4c3e-9d36-227502762cd4:5400:5789 MN:i:389 correct
5c0b5e95-4016-4c3e-9d36-227502762cd4:4169:4457 MN:i:288 correct
etc.

sarahjeeeze · 2024-08-09T16:53:31Z

Oh no sorry, I think we missed a capitalisation. Will amend again and let you know when ready.

sarahjeeeze · 2024-08-14T16:09:49Z

Hi, this has now been released in version v1.2.2

jameson-orvis added the question Further information is requested label Jun 19, 2024

jameson-orvis changed the title ~~Extracting methylation data from methylation from aligned BAMs~~ Extracting methylation data from aligned BAMs Jun 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Extracting methylation data from aligned BAMs #63

Extracting methylation data from aligned BAMs #63

jameson-orvis commented Jun 19, 2024

sarahjeeeze commented Jun 21, 2024

Henrikkoe commented Jul 15, 2024

cjw85 commented Jul 15, 2024

Henrikkoe commented Jul 16, 2024

cjw85 commented Jul 17, 2024

Henrikkoe commented Jul 19, 2024 •

edited

Loading

cjw85 commented Jul 19, 2024

Henrikkoe commented Jul 22, 2024 •

edited

Loading

cjw85 commented Jul 22, 2024

Henrikkoe commented Jul 23, 2024

sarahjeeeze commented Jul 29, 2024

jameson-orvis commented Jul 29, 2024

Henrikkoe commented Jul 29, 2024

Henrikkoe commented Aug 2, 2024 •

edited

Loading

sarahjeeeze commented Aug 9, 2024

sarahjeeeze commented Aug 14, 2024

Extracting methylation data from aligned BAMs #63

Extracting methylation data from aligned BAMs #63

Comments

jameson-orvis commented Jun 19, 2024

Ask away!

sarahjeeeze commented Jun 21, 2024

Henrikkoe commented Jul 15, 2024

cjw85 commented Jul 15, 2024

Henrikkoe commented Jul 16, 2024

cjw85 commented Jul 17, 2024

Henrikkoe commented Jul 19, 2024 • edited Loading

cjw85 commented Jul 19, 2024

Henrikkoe commented Jul 22, 2024 • edited Loading

cjw85 commented Jul 22, 2024

Henrikkoe commented Jul 23, 2024

sarahjeeeze commented Jul 29, 2024

jameson-orvis commented Jul 29, 2024

Henrikkoe commented Jul 29, 2024

Henrikkoe commented Aug 2, 2024 • edited Loading

sarahjeeeze commented Aug 9, 2024

sarahjeeeze commented Aug 14, 2024

Henrikkoe commented Jul 19, 2024 •

edited

Loading

Henrikkoe commented Jul 22, 2024 •

edited

Loading

Henrikkoe commented Aug 2, 2024 •

edited

Loading