Matching sequences align as 2 indels in paf CIGAR string #63

JingaJenga · 2024-01-18T22:49:29Z

Hi,

I'm observing an apparent bug in which AnchorWave produces paf output with a CIGAR string that reports perfectly matching sequences as containing a matching insertion and deletion.

For example, see the assembly fasta file MHC.diploid_assembly.fasta. (Note: all files here need to be gunzipped before being used.) It contains two haplotype sequences, which I align to each other using AnchorWave and the anchors file MHC.anchors.fa. There's one stretch of 190 bp where the sequences match perfectly (MHC.1:256393-256582 = MHC.2:255950-256139). But when I align them, I get the output file MHC.2_to_MHC.1.paf. Its CIGAR string in this position, instead of the expected 190=, is 21=8I8D161=. The 8-bp identical sequence that is marked as 8I8D is MHC.1:256414-256421 = MHC.2:255971-255978 = CAAACAAA.

I've observed this same behavior in some other cases too. The CIGAR string contains an unexpected NIND pattern, with N = an integer usually lower than 8.

I'm using anchorwave v1.2.1 and paftools.js version 2.26-r1175.

Thank you in advance for any help you can provide!

The text was updated successfully, but these errors were encountered:

baoxingsong · 2024-01-19T03:04:13Z

Thanks so much for your detailed feedback.
May I have all the commands from minimap2 mapping and your minimap2 version information, please?

JingaJenga · 2024-01-19T03:40:35Z

Sure! I'm using minimap version 2.24-r1122. Here's my minimap2 command: minimap2 -ax splice:hq -N10 --MD <target.fasta> <source.fasta>

Here's the output bam file that corresponds to the paf file I sent before: MHC.2_to_MHC.1.bam.gz. It contains the same weird CIGAR pattern of 21=8I8D161= instead of 190=.

baoxingsong · 2024-01-22T02:56:41Z

Thanks, but may I have you GFF(3) file please? I need to repeat your results to further debug the problem.

JingaJenga · 2024-01-22T03:08:28Z

Here's MHC.gff.gz. This is the gff file I used to derive the anchors fasta file I sent earlier (MHC.anchors.fa). It's a subset of the genome-wide gff file at https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.gff.gz.

baoxingsong · 2024-01-29T14:33:16Z

Hi, I have not succeed to repeat your results.
For example, there is sequence named as "LOC105379664" in MHC.anchors.fa, while the GFF file tells me it is not a coding gene.
Could kindly share all your commands please?

JingaJenga · 2024-01-30T04:28:42Z

Is there a reason why I should not be including non-coding genes in my anchors fasta file? Right now I'm including the CDS sequence if it's available but also including genes that don't have CDS sequences.

I'm also unsure what exactly this has to do with the alignment problem. Shouldn't the alignment process produce clean CIGAR strings regardless of whether the sequence is part of a CDS?

baoxingsong · 2024-01-30T05:51:29Z

Our initial idea was the coding gene is more conserved than non-coding genes. But including non-coding genes should not matter a lot.
If you did not use the anchowave gff2seq function to generate the MHC.anchors.fa file, it matters. AnchorWave ignores short CDS/exon records.
If you tuned the parameter of ancho-wave gff2seq to generate the MHC.anchors.fa file, then should turn the alignment parameters to make sure it parses the GFF file in the same as gff2seq.

JingaJenga · 2024-01-31T01:09:54Z

Aha - thank you for that tip! You're absolutely right, I did not use anchorwave gff2seq to make my MHC.anchors.fa file. Instead I parsed the GFF manually. I did this because I'm analyzing a genomic subregion (MHC, ~5 Mbp) rather than a whole genome, so I needed to include as many anchors as possible, even non-coding genes.

As you can see from my MHC.anchors.fa file, the sequence for LOC105379664 is fairly long at 3,973 bp, but some of my other sequences are much shorter - one (MIR6833) is only 61 bp. I can imagine how that might cause problems. Do you have a recommendation for minimum anchor sequence length?

baoxingsong · 2024-01-31T01:50:59Z

In the AnchorWave pipeline, we used minimap2 to map the reference CDS to the reference and query genome.
Any reference full-length CDS mapping to a reference genome position different from the GFF record would be excluded for collinear analysis.
The minimap2 has a problem dealing with short CDS (a single CDS record, not the whole transcript). AnchorWave ignores those short (<20bp by default, tuned using the -m parameter) CDS records.

JingaJenga · 2024-01-31T21:14:43Z

Thanks, this is really helpful to know! I tried tweaking my script to generate anchors files, so that it ignores exons and CDS's of length <20bp. This caused some of the anchor sequences to become shorter. But it didn't remove any anchor sequences entirely, and the anchors file still caused the same alignment error I saw before (the 21=8I8D161= in the CIGAR string.)

Do you have a sense of what's causing the alignment error? Is it inside minimap2, perhaps?

baoxingsong · 2024-02-01T03:19:00Z

I would need to repeat your results so I can help with debugging.

JingaJenga · 2024-02-01T04:47:54Z

Which results are you trying to repeat? The creation of the anchors fasta file from the GFF, or the creation of the alignment paf file using anchorwave genoAli?

baoxingsong · 2024-02-01T09:16:35Z

I need every command, including data downloading, to repeat your results. The anchorwave genoAli also needs the GFF file as input. I could not understand why you would like to write a script to generate the CDS.fa file.
What kind of GFF was used as an input for anchorwave genoAli, then?
You should use the identical GFF and genome file as input for anchorwave gff2seq and anchorwave genoAli, and parse GFF file using identical parameters.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Matching sequences align as 2 indels in paf CIGAR string #63

Matching sequences align as 2 indels in paf CIGAR string #63

JingaJenga commented Jan 18, 2024

baoxingsong commented Jan 19, 2024

JingaJenga commented Jan 19, 2024

baoxingsong commented Jan 22, 2024

JingaJenga commented Jan 22, 2024

baoxingsong commented Jan 29, 2024

JingaJenga commented Jan 30, 2024

baoxingsong commented Jan 30, 2024

JingaJenga commented Jan 31, 2024

baoxingsong commented Jan 31, 2024

JingaJenga commented Jan 31, 2024

baoxingsong commented Feb 1, 2024

JingaJenga commented Feb 1, 2024

baoxingsong commented Feb 1, 2024

Matching sequences align as 2 indels in paf CIGAR string #63

Matching sequences align as 2 indels in paf CIGAR string #63

Comments

JingaJenga commented Jan 18, 2024

baoxingsong commented Jan 19, 2024

JingaJenga commented Jan 19, 2024

baoxingsong commented Jan 22, 2024

JingaJenga commented Jan 22, 2024

baoxingsong commented Jan 29, 2024

JingaJenga commented Jan 30, 2024

baoxingsong commented Jan 30, 2024

JingaJenga commented Jan 31, 2024

baoxingsong commented Jan 31, 2024

JingaJenga commented Jan 31, 2024

baoxingsong commented Feb 1, 2024

JingaJenga commented Feb 1, 2024

baoxingsong commented Feb 1, 2024