Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Matching sequences align as 2 indels in paf CIGAR string #63

Open
JingaJenga opened this issue Jan 18, 2024 · 13 comments
Open

Matching sequences align as 2 indels in paf CIGAR string #63

JingaJenga opened this issue Jan 18, 2024 · 13 comments

Comments

@JingaJenga
Copy link

Hi,

I'm observing an apparent bug in which AnchorWave produces paf output with a CIGAR string that reports perfectly matching sequences as containing a matching insertion and deletion.

For example, see the assembly fasta file MHC.diploid_assembly.fasta. (Note: all files here need to be gunzipped before being used.) It contains two haplotype sequences, which I align to each other using AnchorWave and the anchors file MHC.anchors.fa. There's one stretch of 190 bp where the sequences match perfectly (MHC.1:256393-256582 = MHC.2:255950-256139). But when I align them, I get the output file MHC.2_to_MHC.1.paf. Its CIGAR string in this position, instead of the expected 190=, is 21=8I8D161=. The 8-bp identical sequence that is marked as 8I8D is MHC.1:256414-256421 = MHC.2:255971-255978 = CAAACAAA.

I've observed this same behavior in some other cases too. The CIGAR string contains an unexpected NIND pattern, with N = an integer usually lower than 8.

I'm using anchorwave v1.2.1 and paftools.js version 2.26-r1175.

Thank you in advance for any help you can provide!

@baoxingsong
Copy link
Owner

Thanks so much for your detailed feedback.
May I have all the commands from minimap2 mapping and your minimap2 version information, please?

@JingaJenga
Copy link
Author

Sure! I'm using minimap version 2.24-r1122. Here's my minimap2 command: minimap2 -ax splice:hq -N10 --MD <target.fasta> <source.fasta>

Here's the output bam file that corresponds to the paf file I sent before: MHC.2_to_MHC.1.bam.gz. It contains the same weird CIGAR pattern of 21=8I8D161= instead of 190=.

@baoxingsong
Copy link
Owner

Thanks, but may I have you GFF(3) file please? I need to repeat your results to further debug the problem.

@JingaJenga
Copy link
Author

Here's MHC.gff.gz. This is the gff file I used to derive the anchors fasta file I sent earlier (MHC.anchors.fa). It's a subset of the genome-wide gff file at https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.gff.gz.

@baoxingsong
Copy link
Owner

Hi, I have not succeed to repeat your results.
For example, there is sequence named as "LOC105379664" in MHC.anchors.fa, while the GFF file tells me it is not a coding gene.
Could kindly share all your commands please?

@JingaJenga
Copy link
Author

Is there a reason why I should not be including non-coding genes in my anchors fasta file? Right now I'm including the CDS sequence if it's available but also including genes that don't have CDS sequences.

I'm also unsure what exactly this has to do with the alignment problem. Shouldn't the alignment process produce clean CIGAR strings regardless of whether the sequence is part of a CDS?

@baoxingsong
Copy link
Owner

Our initial idea was the coding gene is more conserved than non-coding genes. But including non-coding genes should not matter a lot.
If you did not use the anchowave gff2seq function to generate the MHC.anchors.fa file, it matters. AnchorWave ignores short CDS/exon records.
If you tuned the parameter of ancho-wave gff2seq to generate the MHC.anchors.fa file, then should turn the alignment parameters to make sure it parses the GFF file in the same as gff2seq.

@JingaJenga
Copy link
Author

Aha - thank you for that tip! You're absolutely right, I did not use anchorwave gff2seq to make my MHC.anchors.fa file. Instead I parsed the GFF manually. I did this because I'm analyzing a genomic subregion (MHC, ~5 Mbp) rather than a whole genome, so I needed to include as many anchors as possible, even non-coding genes.

As you can see from my MHC.anchors.fa file, the sequence for LOC105379664 is fairly long at 3,973 bp, but some of my other sequences are much shorter - one (MIR6833) is only 61 bp. I can imagine how that might cause problems. Do you have a recommendation for minimum anchor sequence length?

@baoxingsong
Copy link
Owner

In the AnchorWave pipeline, we used minimap2 to map the reference CDS to the reference and query genome.
Any reference full-length CDS mapping to a reference genome position different from the GFF record would be excluded for collinear analysis.
The minimap2 has a problem dealing with short CDS (a single CDS record, not the whole transcript). AnchorWave ignores those short (<20bp by default, tuned using the -m parameter) CDS records.

@JingaJenga
Copy link
Author

Thanks, this is really helpful to know! I tried tweaking my script to generate anchors files, so that it ignores exons and CDS's of length <20bp. This caused some of the anchor sequences to become shorter. But it didn't remove any anchor sequences entirely, and the anchors file still caused the same alignment error I saw before (the 21=8I8D161= in the CIGAR string.)

Do you have a sense of what's causing the alignment error? Is it inside minimap2, perhaps?

@baoxingsong
Copy link
Owner

I would need to repeat your results so I can help with debugging.

@JingaJenga
Copy link
Author

Which results are you trying to repeat? The creation of the anchors fasta file from the GFF, or the creation of the alignment paf file using anchorwave genoAli?

@baoxingsong
Copy link
Owner

I need every command, including data downloading, to repeat your results. The anchorwave genoAli also needs the GFF file as input. I could not understand why you would like to write a script to generate the CDS.fa file.
What kind of GFF was used as an input for anchorwave genoAli, then?
You should use the identical GFF and genome file as input for anchorwave gff2seq and anchorwave genoAli, and parse GFF file using identical parameters.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants