Error when running ConvertToRefFlat with GTF for rat #50
Comments
Hi @jentiger82, your GTF has multiple problems:
-Alec
I ran into this issue (with the current 2.3.0 release) for different reasons. I ensured that all rows had "gene_name" or "transcript_name" attributes. If I understand correctly, "gene_biotype" and "gene_version" are also required. I added these to an Ensembl GTF file but the issue persisted. I managed to identify the lines of the GTF file that could not be parsed. It appears that this issue was caused by gene names containing a semicolon ";" character, as AnnotationUtils.parseOptionalFields (called by GTFParser.parseLine) splits on this character. Is there any way to avoid or escape these characters? A sanity check for this would be helpful at least. As a workaround, I have removed these ";" characters from the gene_name attributes in my annotation.gtf file and it seems to work.
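For anyone who wants the kind of sanity check described above before running ConvertToRefFlat, here is a minimal standalone sketch in Python. It is not part of Drop-seq; the function name `find_suspect_lines` is made up for illustration, and it assumes attribute values are enclosed in balanced double quotes.

```python
import re

# A double-quoted attribute value; quotes are assumed to be balanced.
QUOTED_VALUE = re.compile(r'"([^"]*)"')


def find_suspect_lines(lines):
    """Return (line_number, value) pairs where a quoted value in column 9 contains ';'."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if line.startswith("#"):
            continue  # header or comment line
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9:
            continue  # not a complete GTF record
        for match in QUOTED_VALUE.finditer(columns[8]):
            if ";" in match.group(1):
                hits.append((number, match.group(1)))
    return hits
```

Calling `find_suspect_lines(open("annotation.gtf"))` lists the records that would need editing. Files that do not quote their values are not covered by this check.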
Hi @TomKellyGenetics, Not that your description isn't clear, but could you supply a minimal example GTF and sequence dictionary to reproduce the problem? Regards, Alec
Hi @alecw, thanks for getting back to me. Mainly reporting this in case others run into the issue since the error messages are rather cryptic and it took a while to narrow it down. I’m not a Java expert but looking at the source code, I’m not sure there’s a straightforward way to address it. Unless it’s possible to keep attributes enclosed in quotes? Or is it possible to distinguish semicolons followed by a space from those followed by an alphanumeric character? That’s how my workaround with sed works. I’ll try to get an MWE file ready and update this thread. Cheers,
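The sed command itself is not shown in the thread, so the following is only a rough Python equivalent of the idea described: treat a semicolon that is immediately followed by a word character as part of a value and remove it from column 9. The function name `strip_embedded_semicolons` is hypothetical. It assumes every real separator is followed by a space or the end of the line, so it would corrupt files that write `key "value";key "value"` without a space.

```python
import re

# Assumption: ';' directly followed by a letter, digit, or underscore is embedded in a value.
EMBEDDED_SEMICOLON = re.compile(r";(?=\w)")


def strip_embedded_semicolons(line, replacement=""):
    """Remove (or replace) embedded semicolons in column 9 of one GTF line."""
    if line.startswith("#"):
        return line  # leave header or comment lines alone
    columns = line.split("\t")
    if len(columns) < 9:
        return line  # not a complete GTF record
    columns[8] = EMBEDDED_SEMICOLON.sub(replacement, columns[8])
    return "\t".join(columns)
```

Passing `replacement="_"` keeps the two halves of the name visibly separate instead of joining them.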
@alecw here is the data for an MWE:
(base) tom@x86_64-conda_cos6-linux-gnu ➜ arabidopsis_thaliana_TAIR10_45 cat test.gtf
Code to reproduce the error:
Error message:
Update: this issue can be reproduced with ";" characters in gene names but the current version gives good error messages for missing attributes such as
Is there anything left to resolve in this issue? I'll note that ";" is a delimiter in the GTF file in column 9, so any file using it as part of a gene name is breaking the spec - we can't do much about that.
I'm no Java expert but I've narrowed the issue down to here. Gene names are in quotes in a GTF file so it should be possible to parse them with special characters. I've traced the issue to here where quotes are ignored and ";" is used as a separator.
This causes an issue as the remaining attributes will be split incorrectly.
It shouldn't be too much trouble to fix by changing how rows are parsed, although I'm mainly reporting this issue so that users can check for gene names with ";" characters and manually edit them if they encounter this edge case. Of course, it's completely understandable that this issue wasn't identified when testing on mouse or human samples that don't have genes with this character.
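To illustrate what quote-aware parsing of column 9 could look like, here is a small Python sketch. It is not the Drop-seq implementation (which is Java); the function names are invented for the example. It assumes quotes are balanced and that each field is a key followed by a space and a value.

```python
def split_attributes(attribute_column):
    """Split GTF column 9 on ';' while ignoring semicolons inside double quotes."""
    fields, current, in_quotes = [], [], False
    for char in attribute_column:
        if char == '"':
            in_quotes = not in_quotes  # toggle on every quote character
            current.append(char)
        elif char == ";" and not in_quotes:
            field = "".join(current).strip()
            if field:
                fields.append(field)
            current = []
        else:
            current.append(char)
    tail = "".join(current).strip()
    if tail:
        fields.append(tail)  # last field without a trailing ';'
    return fields


def parse_attributes(attribute_column):
    """Return a dict of key -> value with surrounding quotes removed."""
    attributes = {}
    for field in split_attributes(attribute_column):
        key, _, value = field.partition(" ")
        attributes[key] = value.strip().strip('"')  # a repeated key overwrites the earlier one
    return attributes
```

With `gene_id "G1"; gene_name "geneA;homolog1";` as input, a plain split on ";" produces a stray `homolog1"` fragment with no key, whereas `parse_attributes` keeps `geneA;homolog1` as the value of `gene_name`.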
That's pretty helpful, thanks for the detailed feedback! I'll note that some GTF files don't use quotes in the names, in which case we can't recover genes with semicolons in the names, but it should be possible to parse out those that remain in quotes. If I have some spare time, I'll spend a bit of time improving the parsing then bump this issue. Example of a GTF we're still using for some data sets:
Thanks for looking into it. If it were in a language I was more confident in, I’d try it out and send a PR. If it helps, I think it’s safe to assume that a semicolon followed by a quote or space is a separator. If a semicolon is in a gene name, it’s probably in the middle rather than the end: for example (“geneA;homolog1”). Unfortunately these “optional” attributes tend to vary a lot between databases.
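The heuristic suggested here can be sketched in a few lines of Python. This is only an illustration with a made-up function name, not a proposed patch. It treats a semicolon as a separator when it is followed by whitespace, a double quote, or the end of the string, so it also covers unquoted values as long as the embedded semicolon sits in the middle of the name.

```python
import re

# Heuristic from the thread: ';' followed by whitespace, a quote, or end of string separates fields.
SEPARATOR = re.compile(r';(?=[\s"]|$)')


def split_on_heuristic(attribute_column):
    """Split GTF column 9 on separator-like semicolons only, dropping empty pieces."""
    pieces = (piece.strip() for piece in SEPARATOR.split(attribute_column))
    return [piece for piece in pieces if piece]
```

As noted in the comment above, this breaks down when a name ends with a semicolon: `gene_name "geneA;";` is split in the wrong place because the embedded semicolon is followed by a quote.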
Hi there,
I get an error when I try to run tools that parse my GTF file, such as ConvertToRefFlat. This is the error I get:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
at org.broadinstitute.dropseqrna.annotation.AnnotationUtils.parseOptionalFields(AnnotationUtils.java:383)
at org.broadinstitute.dropseqrna.annotation.GTFParser.parseLine(GTFParser.java:114)
at org.broadinstitute.dropseqrna.annotation.GTFParser.next(GTFParser.java:88)
at org.broadinstitute.dropseqrna.annotation.GTFParser.next(GTFParser.java:39)
at htsjdk.samtools.util.PeekableIterator.advance(PeekableIterator.java:71)
at htsjdk.samtools.util.PeekableIterator.<init>(PeekableIterator.java:38)
at org.broadinstitute.dropseqrna.utils.FilteredIterator.<init>(FilteredIterator.java:37)
at org.broadinstitute.dropseqrna.annotation.GTFReader$FilteringGTFParser.<init>(GTFReader.java:109)
at org.broadinstitute.dropseqrna.annotation.GTFReader$FilteringGTFParser.<init>(GTFReader.java:107)
at org.broadinstitute.dropseqrna.annotation.GTFReader.load(GTFReader.java:74)
at org.broadinstitute.dropseqrna.annotation.GTFReader.load(GTFReader.java:70)
at org.broadinstitute.dropseqrna.annotation.GeneAnnotationReader.loadGTFFile(GeneAnnotationReader.java:67)
at org.broadinstitute.dropseqrna.annotation.GeneAnnotationReader.loadAnnotationsFile(GeneAnnotationReader.java:51)
at org.broadinstitute.dropseqrna.annotation.GeneAnnotationReader.loadAnnotationsFile(GeneAnnotationReader.java:61)
at org.broadinstitute.dropseqrna.annotation.ConvertToRefFlat.doWork(ConvertToRefFlat.java:71)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:295)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
at org.broadinstitute.dropseqrna.cmdline.DropSeqMain.main(DropSeqMain.java:42)
I think it must be a problem with the GTF file I have for rat, but I cannot see a problem when I compare it to the example GTF files that you have provided. Can you please take a look and help out with this issue?
I attached my GTF file.
Thanks, Jenny
rn6_ensemblGenes.zip