-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Matching sequences align as 2 indels in paf CIGAR string #63
Comments
Thanks so much for your detailed feedback. |
Sure! I'm using minimap version 2.24-r1122. Here's my minimap2 command: Here's the output bam file that corresponds to the paf file I sent before: MHC.2_to_MHC.1.bam.gz. It contains the same weird CIGAR pattern of |
Thanks, but may I have you GFF(3) file please? I need to repeat your results to further debug the problem. |
Here's MHC.gff.gz. This is the gff file I used to derive the anchors fasta file I sent earlier (MHC.anchors.fa). It's a subset of the genome-wide gff file at https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.gff.gz. |
Hi, I have not succeed to repeat your results. |
Is there a reason why I should not be including non-coding genes in my anchors fasta file? Right now I'm including the CDS sequence if it's available but also including genes that don't have CDS sequences. I'm also unsure what exactly this has to do with the alignment problem. Shouldn't the alignment process produce clean CIGAR strings regardless of whether the sequence is part of a CDS? |
Our initial idea was the coding gene is more conserved than non-coding genes. But including non-coding genes should not matter a lot. |
Aha - thank you for that tip! You're absolutely right, I did not use As you can see from my |
In the AnchorWave pipeline, we used minimap2 to map the reference CDS to the reference and query genome. |
Thanks, this is really helpful to know! I tried tweaking my script to generate anchors files, so that it ignores exons and CDS's of length <20bp. This caused some of the anchor sequences to become shorter. But it didn't remove any anchor sequences entirely, and the anchors file still caused the same alignment error I saw before (the Do you have a sense of what's causing the alignment error? Is it inside minimap2, perhaps? |
I would need to repeat your results so I can help with debugging. |
Which results are you trying to repeat? The creation of the anchors fasta file from the GFF, or the creation of the alignment paf file using |
I need every command, including data downloading, to repeat your results. The |
Hi,
I'm observing an apparent bug in which AnchorWave produces paf output with a CIGAR string that reports perfectly matching sequences as containing a matching insertion and deletion.
For example, see the assembly fasta file MHC.diploid_assembly.fasta. (Note: all files here need to be gunzipped before being used.) It contains two haplotype sequences, which I align to each other using AnchorWave and the anchors file MHC.anchors.fa. There's one stretch of 190 bp where the sequences match perfectly (MHC.1:256393-256582 = MHC.2:255950-256139). But when I align them, I get the output file MHC.2_to_MHC.1.paf. Its CIGAR string in this position, instead of the expected
190=
, is21=8I8D161=
. The 8-bp identical sequence that is marked as8I8D
is MHC.1:256414-256421 = MHC.2:255971-255978 =CAAACAAA
.I've observed this same behavior in some other cases too. The CIGAR string contains an unexpected
NIND
pattern, withN
= an integer usually lower than 8.I'm using anchorwave v1.2.1 and paftools.js version 2.26-r1175.
Thank you in advance for any help you can provide!
The text was updated successfully, but these errors were encountered: