How to treat --- in functional.transcripts.fasta files for BUSCO analyses #1

Open
meixilin opened this issue Feb 8, 2022 · 0 comments

meixilin commented Feb 8, 2022

Dear DNAzoo annotation team,

Thank you for building such an amazing consortium with chromosome-level assembly data. I am currently comparing some DNAzoo assemblies with assemblies on NCBI using BUSCO and have a small question. I noticed that in some <genomename>functional.transcripts.fasta files, the transcripts include runs of - characters, which BUSCO does not accept.

For example, in the Bryde's whale annotations:

In Balaenoptera_edeni_HiC.fasta_v2.functional.transcripts.fasta.gz there are records that look like this:

>Balaenoptera_027329-RA transcript Name:"Similar to Bzw1 Basic leucine zipper and W2 domain-containing protein 1 (Rattus norvegicus OX=10116)" offset:2 AED:0.00 eAED:0.00 QI:0|-1|1|1|-1|0|1|219|268
CTATGTTGACTGGTGTTCTTCTGGCTAATGGAACACTTAATGCATCCATTCTTAATAGCC
TTTATAATGAGAATTTGGTTAAAGAAGGGGTTTCAGCAGCTTTTGCTGTAAAGCTCTTTA
AATCATGGATAAATGAAAAAGATATCAATGCAGTAGCTGCAAGTCTTCGGAAAGTCAGCA
TGGATAACAGACTGATGGAACTTTTTCCTGCCAATAAACAAAGCGTTGAACACTTCACTA
AGTATTTTACTGAGGCAGGCTTGAAAGAACTCTCAGAGTATGTTCGAAATCAGCAAACCA
TAGGAGCTCGAAAGGAACTCCAGAAAGAACTTCAAGAACAGATGTCCCGTGGTGATCCAT
TTAAGGATATAATTTTGTATGTCAAGGAGGAGATGAAAAAAAACAACATCCCAGAACCCG
TTGTCATTGGGATAGTCTGGTCCAGCGTAATGAGCACCGTGGAATGGAACAAAAAGGAAG
AGCTTGTAGCAGAGCAGGCCATCAAGCACTTGAAGCAATACAGCCCTCTACTTGCTGCCT
TTACTACTCAAGGTCAGTCTGAGCTGACTCTGTTACTGAAGATTCAGGAGTATTGCTATG
ACAACATTCATTTCATGAAAGCCTTCCAGAAAATCGTGGTGCTTTTTTATAAAGCTGAAG
TCCTGAGTGAAGAGCCCATTTTGAAGTGGTATAAAGATGCACATGTTGCAAAGGGAAAAA
GTGTCTTCCTTGAGCAAATGAAAAAGTTTGTAGAGTGGCTCAAAAATGCTGAAGAAGAAT
CTGAGTCTGAAGCTGAAGAAGTTAGGAGTAATGGA--------CCCCGGCATGGCAAACA
GTTGAAGAACGGAGAAAACTGGATAGCTGACCT-TCCAGATAGTTGTTGGCACTCAGAAC
CACC-----TCAAG-----TACA--GCCATCCAAACCAGTAATTACATTGCTGCATTATT
TCTGTGTTAACTGTGAAAT-CTG--CTGCTTGTCTGTACCCTTGAAATGGAA-TAAAATT
TC-ATG

However, - is not a valid base for BUSCO, which fails with the following Biopython error:

Bio.Data.CodonTable.TranslationError: Codon '-TC' is invalid

Should I remove the - characters (GA--------CC to GACC) or replace them with N (GA--------CC to GANNNNNNNNCC) for BUSCO compatibility?
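
To make the two options concrete, here is a minimal sketch of both using only the Python standard library. The helper names (read_fasta, strip_gaps, mask_gaps, write_fasta) and the tiny demo record are hypothetical and exist only for illustration; this is not part of the DNAzoo pipeline or of BUSCO. Note that the options are not equivalent: removing - shortens the sequence and can shift the codon positions downstream of the gap, while replacing it with N keeps the original length.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def strip_gaps(seq):
    """Option 1: delete every '-' (sequence becomes shorter)."""
    return seq.replace("-", "")


def mask_gaps(seq):
    """Option 2: replace every '-' with 'N' (length is preserved)."""
    return seq.replace("-", "N")


def write_fasta(records, handle, width=60):
    """Write (header, sequence) pairs, wrapping sequence lines at `width`."""
    for header, seq in records:
        handle.write(header + "\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")


# Hypothetical two-line record, loosely modelled on the gapped lines above.
example = io.StringIO(">demo\nGGA--------CCCC\nTC-ATG\n")
records = list(read_fasta(example))

stripped = [(h, strip_gaps(s)) for h, s in records]
masked = [(h, mask_gaps(s)) for h, s in records]

out = io.StringIO()
write_fasta(masked, out)
print(out.getvalue(), end="")
```

For the real file, the same functions could be applied to a handle opened with gzip.open(path, "rt") instead of the in-memory example.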

Thank you very much in advance! And thank you for making this amazing pipeline available!

Best,
Meixi
