Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error when running ConvertToRefFlat with GTF for rat #50

Open
jentiger82 opened this issue Oct 24, 2018 · 9 comments
Open

Error when running ConvertToRefFlat with GTF for rat #50

jentiger82 opened this issue Oct 24, 2018 · 9 comments

Comments

@jentiger82
Copy link

Hi there,

when I try to run tools that parse through my GTF file like ConvertToRefFlat. This is the error I get:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
at org.broadinstitute.dropseqrna.annotation.AnnotationUtils.parseOptionalFields(AnnotationUtils.java:383)
at org.broadinstitute.dropseqrna.annotation.GTFParser.parseLine(GTFParser.java:114)
at org.broadinstitute.dropseqrna.annotation.GTFParser.next(GTFParser.java:88)
at org.broadinstitute.dropseqrna.annotation.GTFParser.next(GTFParser.java:39)
at htsjdk.samtools.util.PeekableIterator.advance(PeekableIterator.java:71)
at htsjdk.samtools.util.PeekableIterator.(PeekableIterator.java:38)
at org.broadinstitute.dropseqrna.utils.FilteredIterator.(FilteredIterator.java:37)
at org.broadinstitute.dropseqrna.annotation.GTFReader$FilteringGTFParser.(GTFReader.java:109)
at org.broadinstitute.dropseqrna.annotation.GTFReader$FilteringGTFParser.(GTFReader.java:107)
at org.broadinstitute.dropseqrna.annotation.GTFReader.load(GTFReader.java:74)
at org.broadinstitute.dropseqrna.annotation.GTFReader.load(GTFReader.java:70)
at org.broadinstitute.dropseqrna.annotation.GeneAnnotationReader.loadGTFFile(GeneAnnotationReader.java:67)
at org.broadinstitute.dropseqrna.annotation.GeneAnnotationReader.loadAnnotationsFile(GeneAnnotationReader.java:51)
at org.broadinstitute.dropseqrna.annotation.GeneAnnotationReader.loadAnnotationsFile(GeneAnnotationReader.java:61)
at org.broadinstitute.dropseqrna.annotation.ConvertToRefFlat.doWork(ConvertToRefFlat.java:71)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:295)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
at org.broadinstitute.dropseqrna.cmdline.DropSeqMain.main(DropSeqMain.java:42)

I think it must be a problem with the GTF file I have for Rat, but I cannot see a problem when I compare it to the example GTF files that you have provided. Can you please take a look and help out with this issue.
I attached my GTF file.

Thanks, Jenny
rn6_ensemblGenes.zip

@alecw
Copy link
Collaborator

alecw commented Oct 24, 2018

Hi @jentiger82 ,

Your GTF has multiple problems:

  • trailing space on each line. The program should handle this better, but currently it doesn't. This will be fixed in next release so trailing space is ignored, but for now delete the space at the end of each line.
  • each line needs a gene_name and transcript_name. Note that the error reporting for this is currently not working properly, so when you fix the first problem, you'll get a NullPointerException. The bad error reporting will be fixed in next release, but you'll need to add a gene_name and transcript_name. Presumably these could be the same as the gene_id and transcript_id, which you do have in your GTF.

-Alec

@TomKellyGenetics
Copy link

I ran into this issue (with the current 2.3.0 release) for different reasons. I ensured that all rows had "gene_name" or "transcript_name" attributes. If I understand correctly, "gene_biotype" and "gene_version" are also required. I added these to an Ensembl GTF file but the issue persisted.

I managed to identify the lines of the GTF file that were not able to be parsed. It appears that this issue was caused by gene names containing a semicolon ";" character as AnnotationUtils.parseOptionalFields called by GTFParser.parseLine splits on this character. Is there any way to avoid or escape these characters? A sanity check for this would be helpful at least.

As a workaround, I have removed these ":" characters from the gene_name attributes in my annotation.gtf file and it seem to work.

@alecw
Copy link
Collaborator

alecw commented Nov 14, 2019

Hi @TomKellyGenetics ,

Not that your description isn't clear, but could you supply a minimal example GTF and sequence dictionary to reproduce the problem?

Regards, Alec

@alecw alecw reopened this Nov 14, 2019
@TomKellyGenetics
Copy link

Hi @alecw, thanks for getting back to me. Mainly reporting this in case others run into the issue since the error messages are rather cryptic and it took a while to narrow it down.

I’m not a Java expert but looking at the source code, I’m not sure there’s a straightforward way to address it. Unless it’s possible to keep attributes enclosed in quotes? Or is it possible to distinguish between semicolons followed by a space from with an alphanumeric? That’s how my workaround with sed works.

I’ll try get an MWE file ready and update this thread.

Cheers,
Tom

@TomKellyGenetics
Copy link

TomKellyGenetics commented Nov 26, 2019

@alecw here is the data for an MWE:

>test.dict
@HD	VN:1.6
@SQ	SN:1	LN:119940	M5:ab59e5bcec68cd2a1361ea90205e2355	UR:file:/home/user/reference/DropSeqPipe/genus_species_build_release/test2.fa
@SQ	SN:2	LN:300000	M5:d9f95d83bcab2325d7800e20fc62cc8f	UR:file:/home/user/reference/DropSeqPipe/genus_species_build_release/test2.fa

(base) tom@x86_64-conda_cos6-linux-gnu ➜ arabidopsis_thaliana_TAIR10_45 cat test.gtf

>test.gtf
#!genome-build build-name
#!genome-version release-name
#!genome-date YYYY-MM
#!genome-build-accession accession-id
#!genebuild-last-updated YYYY-MM
1	database	gene	104641	106909	.	+	.	gene_id "ID1"; gene_name "GENE1"; transcript_name "ID1"; transcript_id "ID1"; gene_source "database"; gene_biotype "protein_coding"; gene_version "1";
1	database	transcript	104641	106909	.	+	.	gene_id "ID1"; transcript_id "ID1.1";  transcript_name "ID1.1"; gene_name "GENE1"; gene_source "database"; gene_biotype "protein_coding"; transcript_source "database"; transcript_biotype "protein_coding"; gene_version "1";
1	database	exon	104641	104923	.	+	.	gene_id "ID1"; transcript_id "ID1.1";  transcript_name "ID1.1"; exon_number "1"; gene_name "GENE1"; gene_source "database"; gene_biotype "protein_coding"; transcript_source "database"; transcript_biotype "protein_coding"; exon_id "ID1.1.exon1"; gene_version "1";
1	database	CDS	104770	104923	.	+	0	gene_id "ID1"; transcript_id "ID1.1";  transcript_name "ID1.1"; exon_number "1"; gene_name "GENE1"; gene_source "database"; gene_biotype "protein_coding"; transcript_source "database"; transcript_biotype "protein_coding"; protein_id "ID1.1"; gene_version "1";
1	database	start_codon	104770	104772	.	+	0	gene_id "ID1"; transcript_id "ID1.1";  transcript_name "ID1.1"; exon_number "1"; gene_name "GENE1"; gene_source "database"; gene_biotype "protein_coding"; transcript_source "database"; transcript_biotype "protein_coding"; gene_version "1";
2	database	gene	210150	214574	.	+	.	gene_id "ID2"; gene_name "GENE2”; transcript_name "ID2"; transcript_id "ID2";A1"; gene_source "database"; gene_biotype "protein_coding"; gene_version "1";
2	database	transcript	210150  214574	.	+	.	gene_id "ID2"; transcript_id "ID2.1"; gene_name "GENE2;A1"; gene_source "database"; gene_biotype "protein_coding"; transcript_source "database"; transcript_biotype "protein_coding"; gene_version "1";
2	database	exon	210150	210959	.	+	.	gene_id "ID2"; transcript_id "ID2.1";  transcript_name "ID2.1"; exon_number "1"; gene_name "GENE2;A1"; gene_source "database"; gene_biotype "protein_coding"; transcript_source "database"; transcript_biotype "protein_coding"; exon_id "ID2.1.exon1"; gene_version "1";
2	database	CDS	210338	210959	.	+	0	gene_id "ID2"; transcript_id "ID2.1";  transcript_name "ID2.1"; exon_number "1"; gene_name "GENE2;A1"; gene_source "database"; gene_biotype "protein_coding"; transcript_source "database"; transcript_biotype "protein_coding"; protein_id "ID2.1"; gene_version "1";
2	database	start_codon	210338	210340	.	+	0	gene_id "ID2"; transcript_id "ID2.1";  transcript_name "ID2.1"; exon_number "1"; gene_name "GENE2;A1"; gene_source "database"; gene_biotype "protein_coding"; transcript_source "database"; transcript_biotype "protein_coding"; gene_version "1";
2	database	exon	211695	21264	.	+	.	gene_id "ID2"; transcript_id "ID2.1";  transcript_name "ID2.1"; exon_number "2"; gene_name "GENE2;A1"; gene_source "database"; gene_biotype "protein_coding"; transcript_source "database"; transcript_biotype "protein_coding"; exon_id "ID2.1.exon2"; gene_version "1";
2	database	CDS	211695	21264	.	+	2	gene_id "ID2"; transcript_id "ID2.1";  transcript_name "ID2.1"; exon_number "2"; gene_name "GENE2;A1"; gene_source "database"; gene_biotype "protein_coding"; transcript_source "database"; transcript_biotype "protein_coding"; protein_id "ID2.1"; gene_version "1";
2	database	exon	212150	212198	.	+	.	gene_id "ID2"; transcript_id "ID2.1";  transcript_name "ID2.1"; exon_number "3"; gene_name "GENE2;A1"; gene_source "database"; gene_biotype "protein_coding"; transcript_source "database"; transcript_biotype "protein_coding"; exon_id "ID2.1.exon3"; gene_version "1";
2	database	CDS	212150	212198	.	+	1	gene_id "ID2"; transcript_id "ID2.1";  transcript_name "ID2.1"; exon_number "3"; gene_name "GENE2;A1"; gene_source "database"; gene_biotype "protein_coding"; transcript_source "database"; transcript_biotype "protein_coding"; protein_id "ID2.1"; gene_version "1";
2	database	exon	212283	212354	.	+	.	gene_id "ID2"; transcript_id "ID2.1";  transcript_name "ID2.1"; exon_number "4"; gene_name "GENE2;A1"; gene_source "database"; gene_biotype "protein_coding"; transcript_source "database"; transcript_biotype "protein_coding"; exon_id "ID2.1.exon4"; gene_version "1";

Code to reproduce the error:

java -jar dropseq.jar  ReduceGtf GTF=test.gtf SD=test.dict O=out.gtf

Error message:

INFO	2019-11-26 11:55:18	ReduceGtf	

********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
********** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-    Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
**********    ReduceGtf -GTF test.gtf -SD test.dict -O out.gtf
**********


11:55:18.627 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/tom/local/bin/Drop-seq_tools-2.3.0/jar/lib/picard-2.18.14.jar!/com/intel/gkl/native/libgkl_compression.so
[Tue Nov 26 11:55:18 JST 2019] ReduceGtf SEQUENCE_DICTIONARY=test.dict GTF=test.gtf OUTPUT=out.gtf    FEATURE_TYPE=[gene, transcript, exon] IGNORE_FUNC_TYPE=[pseudogene, polymorphic_pseudogene, TR_J_pseudogene, TR_V_pseudogene, IG_C_pseudogene, IG_J_pseudogene, IG_V_pseudogene] ENHANCE_GTF=true VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Tue Nov 26 11:55:18 JST 2019] Executing as [email protected] on Linux 4.4.0-75-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_151-8u151-b12-1~deb9u1-b12; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.3.0(34e6572_1555443285)
[Tue Nov 26 11:55:18 JST 2019] org.broadinstitute.dropseqrna.annotation.ReduceGtf done.     Elapsed time: 0.00 minutes.
Runtime.totalMemory()=2058354688
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
	at org.broadinstitute.dropseqrna.annotation.AnnotationUtils.parseOptionalFields(AnnotationUtils.java:386)
	at org.broadinstitute.dropseqrna.annotation.GTFParser.parseLine(GTFParser.java:114)

update: this issue can be reproduced with ";" characters in gene names but the current version gives good error messages for missing attributes such as transcript_name

@jamesnemesh
Copy link
Collaborator

Is there anything left to resolve in this issue? I'll note that ";" is a delimiter in the GTF file in column 9, so any file using it as part of a gene name is breaking the spec - we can't do much about that.

@TomKellyGenetics
Copy link

TomKellyGenetics commented Feb 19, 2020

I'm no Java expert but I've narrowed the issue down to here. Gene names are in quotes in a GTF file so it should be possible to parse them with special characters.

I've traced the issue to here where quotes are ignored and ";" is used as a separator.

This causes an issue as the remaining attributes will be split incorrectly.

final String geneName=attributesMap.get("gene_name");

It shouldn't be too much trouble to fix by changing how rows are parsed, although I'm mainly reporting this issue so that users can check for gene names with ";" characters and manually edit them if they encounter this edge case. Of course, it's completely understandable that this issue wasn't identified when testing on Mouse or Human samples that don't have genes with this character.

@jamesnemesh
Copy link
Collaborator

That's pretty helpful, thanks for the detailed feedback! I'll note that some GTF files don't use quotes in the names, in which case we can't recover genes with semicolons in the names, but it should be possible to parse out those that remain in quotes. If I have some spare time, I'll spend a bit of time improving the parsing then bump this issue.

Example of a GTF we're still using for some data sets:
#!genome-date 2013-12
#!genome-version GRCh38
chr1 mirbase exon 17369 17436 . - . gene_id ENSG00000278267; gene_version 1; transcript_id ENST00000619216; transcript_version 1; exon_number 1; gene_name MIR6859-1; gene_so
urce mirbase; gene_biotype miRNA; transcript_name MIR6859-1-201; transcript_source mirbase; transcript_biotype miRNA; exon_id ENSE00003746039; exon_version 1; tag basic; transcript_support_level NA;

@TomKellyGenetics
Copy link

Thanks for looking into it. If it were in a language I were more confident in, I’d try it out and send a PR.

If it helps, I think it’s safe to assume that a semicolon followed by a quote or space is a separator. If a semicolon is in a gene name, it’s probably in the middle rather than the end: for example (“geneA;homolog1”). Unfortunately these “optional” attributes tend to vary a lot between databases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants