Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extracting methylation data from aligned BAMs #63

Open
jameson-orvis opened this issue Jun 19, 2024 · 16 comments
Open

Extracting methylation data from aligned BAMs #63

jameson-orvis opened this issue Jun 19, 2024 · 16 comments
Labels
question Further information is requested

Comments

@jameson-orvis
Copy link

Ask away!

Hi! Quick question on extracting methylation data from aligned reads. Our raw bams have been methylation called already, so it seems like the easiest way to extract methylation information would be to use modkit on the aligned bams at {alias}.ns.bam. However, using "modkit pileup" gives errors of the form

failed to get modbase info for record 1977e594-f368-4232-a0de-cb0dea59e5b6:0079:0393, Bad Input: ML tag malformed

The code for pore-c-py seems like it is designed to pass modified bam information to the digested aligned output bams, so I am confused as to why this is happening. It may be an issue with the way I am calling modkit, but before I attempt troubleshooting further I want to ask if it should be possible to use modkit to extract methylation data from BAMs aligned from the pipeline, or if there is another preferred I should be processing methylation calls. Thank you!

@jameson-orvis jameson-orvis added the question Further information is requested label Jun 19, 2024
@jameson-orvis jameson-orvis changed the title Extracting methylation data from methylation from aligned BAMs Extracting methylation data from aligned BAMs Jun 20, 2024
@sarahjeeeze
Copy link
Contributor

Hi thanks for letting us know this, the ML tag is not meant to be malformed we will investigate and get back to you

@Henrikkoe
Copy link

Hi! I am also interested in extracting the methylation information and was wondering if there is anything new about this topic? I realised that you are using the ML:Z: tag but couldnt find any description for this in the SAM format specification. Could you maybe commend why you use this tag and not the ML:B:C?

@cjw85
Copy link
Contributor

cjw85 commented Jul 15, 2024

Just looking at the code, I think you've hit on the error @Henrikkoe. The code was adapted to use pysam from code that directly wrote SAM text. The value of the tag is being created as a string and passed to pysam which gladly writes the string as a Z format tag rather than a B format tag. We will need to change this.

@Henrikkoe
Copy link

I just quickly changed the tags from ML:Z: to ML:B:C for a test set and got this error for "modkit pileup":

failed to get modbase info for record 55d81b4f-3394-4430-b554-1ada22653566:0477:0802, Bad Input: MN tag length 4874 and seq length 325 don't match

So I think for each alignment within a concatemer the MN tag should be updated according the alignment length.

@cjw85
Copy link
Contributor

cjw85 commented Jul 17, 2024

Thanks @Henrikkoe,

That sounds like a secondary issue, after chopping up the read we need to invalidate the MN tag. This is an additional tag that was added to catch exactly the situation where a process might have trimmed a read but not taken care to trim MM and ML!

You should be able to test this further for us. The pore_c_py digest command has an option --remove_tags which could be added to the nextflow code to drop the tag from the trimmed reads: https://github.com/epi2me-labs/wf-pore-c/blob/master/modules/local/pore-c.nf#L65

We should be able to recalculate and amend MN (worst case we can discard it) for the monomer reads.

@Henrikkoe
Copy link

Henrikkoe commented Jul 19, 2024

I did a small check and the --remove_tags option seems to remove as well the ML and MM tags. Also when specifying the MN tag.

@cjw85
Copy link
Contributor

cjw85 commented Jul 19, 2024

I may have confused things: yesterday I made a release of pore-c-py that ensures MN is updated correctly. In the process I changed the behaviour to consistently remove all the mod tags if one is removed (because it makes no sense to leave them in a partial state).

@Henrikkoe
Copy link

Henrikkoe commented Jul 22, 2024

I am still a bit confused about the tags that are removed and added at which time during processing. If I understood it correctly then is the input for pore-c-py digest an unaligned bam file with the called base modifications. At this moment in the pipeline there was no alignment and only the digest of the concatemer into monomers. These monomers can be alignments later in the processing and basically representing a partial state/alignment of the concatemer. Then the sequence/alignment length (MN) should be updated according to the new length. This is what I dont see in the bam file entries but I see (except of the ML:Z: to ML:B:C tag label) a correct representation of ML and MM tags. So the ML and MM tags are trimmed correctly. Only all alignments of a concatemer then have a wrong MN tag.

Here a truncated representation of a bam file entry:
653cba6d-c7e7-45bc-a49c-0011e16626e5:0717:1104 MN:i:2198
But I think it should be:
653cba6d-c7e7-45bc-a49c-0011e16626e5:0717:1104 MN:i:387

@cjw85
Copy link
Contributor

cjw85 commented Jul 22, 2024

You are correct: in the current release the MN tag is not being updated correctly (and the ML tag is a string tag not and array tag). We need to make a new release to correct these two issues.

We've corrected the code in the pore-c-py package but not updated wf-pore-c to use that new code.

@Henrikkoe
Copy link

Ah okay I see, thank you! Do you already have an estimate when the wf-pore-c will be updated?

@sarahjeeeze
Copy link
Contributor

Hi, this is updated now in the latest release v1.2.0

@jameson-orvis
Copy link
Author

Thank you so much!!!

@Henrikkoe
Copy link

Awesome! Thanks

@Henrikkoe
Copy link

Henrikkoe commented Aug 2, 2024

It seems that there is still an issue with a not correctly trimmed ML and MM tag for some reads/alignments.

5c0b5e95-4016-4c3e-9d36-227502762cd4:0099:0462 MN:i:363 correct
5c0b5e95-4016-4c3e-9d36-227502762cd4:0600:0658 MN:i:6040 not trimmed ML and MM, wrong MN tag
5c0b5e95-4016-4c3e-9d36-227502762cd4:2568:2748 MN:i:180 correct
5c0b5e95-4016-4c3e-9d36-227502762cd4:5400:5789 MN:i:389 correct
5c0b5e95-4016-4c3e-9d36-227502762cd4:4169:4457 MN:i:288 correct
etc.

@sarahjeeeze
Copy link
Contributor

Oh no sorry, I think we missed a capitalisation. Will amend again and let you know when ready.

@sarahjeeeze
Copy link
Contributor

Hi, this has now been released in version v1.2.2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Development

No branches or pull requests

4 participants