Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ConvertToRefFlat fails while running Drop-seq_tools-2.4.0 #113

Open
geshtin opened this issue Oct 12, 2020 · 5 comments
Open

ConvertToRefFlat fails while running Drop-seq_tools-2.4.0 #113

geshtin opened this issue Oct 12, 2020 · 5 comments

Comments

@geshtin
Copy link

geshtin commented Oct 12, 2020

I have problems creating the refflat from the gtf file. It has been reported more often, I followed all suggestion, and as far as I can see the gtfs's look OK. I have one where it works:

1	gramene	gene	44289	49837	.	+	.	gene_id "Zm00001d027230"; gene_source "gramene"; gene_biotype "protein_coding"; gene_name "Zm00001d027230";
1	gramene	transcript	44289	49837	.	+	.	gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1	gramene	exon	44289	44947	.	+	.	gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "1"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; exon_id "Zm00001d027230_T001.exon1"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1	gramene	CDS	44351	44947	.	+	0	gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "1"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; protein_id "Zm00001d027230_P001"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1	gramene	start_codon	44351	44353	.	+	0	gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "1"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1	gramene	exon	45666	45803	.	+	.	gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "2"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; exon_id "Zm00001d027230_T001.exon2"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1	gramene	CDS	45666	45803	.	+	0	gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "2"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; protein_id "Zm00001d027230_P001"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1	gramene	exon	45888	46133	.	+	.	gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "3"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; exon_id "Zm00001d027230_T001.exon3"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1	gramene	CDS	45888	46133	.	+	0	gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "3"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; protein_id "Zm00001d027230_P001"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1	gramene	exon	46229	46342	.	+	.	gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "4"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; exon_id "Zm00001d027230_T001.exon4"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1	gramene	CDS	46229	46342	.	+	0	gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "4"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; protein_id "Zm00001d027230_P001"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1	gramene	exon	46451	46633	.	+	.	gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "5"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; exon_id "Zm00001d027230_T001.exon5"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1	gramene	CDS	46451	46633	.	+	0	gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "5"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; protein_id "Zm00001d027230_P001"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1	gramene	exon	47045	47262	.	+	.	gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "6"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; exon_id "Zm00001d027230_T001.exon6"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1	gramene	CDS	47045	47262	.	+	0	gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "6"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; protein_id "Zm00001d027230_P001"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1	gramene	exon	47650	48111	.	+	.	gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "7"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; exon_id "Zm00001d027230_T001.exon7"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1	gramene	CDS	47650	47992	.	+	1	gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "7"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; protein_id "Zm00001d027230_P001"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1	gramene	stop_codon	47993	47995	.	+	0	gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "7"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1	gramene	exon	48200	49247	.	+	.	gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "8"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; exon_id "Zm00001d027230_T001.exon8"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1	gramene	exon	49330	49837	.	+	.	gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "9"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; exon_id "Zm00001d027230_T001.exon9"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1	gramene	five_prime_utr	44289	44350	.	+	.	gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1	gramene	three_prime_utr	47996	48111	.	+	.	gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1	gramene	three_prime_utr	48200	49247	.	+	.	gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1	gramene	three_prime_utr	49330	49837	.	+	.	gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";

and one, where it fails:
1	araport11	gene	3631	5899	.	+	.	gene_id "AT1G01010"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding";
1	araport11	transcript	3631	5899	.	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; transcript_name "AT1G01010.1";
1	araport11	exon	3631	3913	.	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon1"; transcript_name "AT1G01010.1";
1	araport11	CDS	3760	3913	.	+	0	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1"; transcript_name "AT1G01010.1";
1	araport11	start_codon	3760	3762	.	+	0	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; transcript_name "AT1G01010.1";
1	araport11	exon	3996	4276	.	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "2"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon2"; transcript_name "AT1G01010.1";
1	araport11	CDS	3996	4276	.	+	2	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "2"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1"; transcript_name "AT1G01010.1";
1	araport11	exon	4486	4605	.	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "3"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon3"; transcript_name "AT1G01010.1";
1	araport11	CDS	4486	4605	.	+	0	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "3"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1"; transcript_name "AT1G01010.1";
1	araport11	exon	4706	5095	.	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "4"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon4"; transcript_name "AT1G01010.1";
1	araport11	CDS	4706	5095	.	+	0	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "4"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1"; transcript_name "AT1G01010.1";
1	araport11	exon	5174	5326	.	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "5"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon5"; transcript_name "AT1G01010.1";
1	araport11	CDS	5174	5326	.	+	0	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "5"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1"; transcript_name "AT1G01010.1";
1	araport11	exon	5439	5899	.	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "6"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon6"; transcript_name "AT1G01010.1";
1	araport11	CDS	5439	5627	.	+	0	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "6"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1"; transcript_name "AT1G01010.1";
1	araport11	stop_codon	5628	5630	.	+	0	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "6"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; transcript_name "AT1G01010.1";
1	araport11	five_prime_utr	3631	3759	.	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; transcript_name "AT1G01010.1";
1	araport11	three_prime_utr	5631	5899	.	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; transcript_name "AT1G01010.1";

with the following error:

INFO    2020-10-12 16:13:41     ConvertToRefFlat

********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
********** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
**********    ConvertToRefFlat -ANNOTATIONS_FILE /data/annotations/resources/genomes_rvd/Arabidopsis_thaliana/TAIR10/Arabidopsis_thaliana.TAIR10.48_rvd3b.Pt.Mt.gtf -SEQUENCE_DICTIONARY /data/annotations/resources/genomes_rvd/Arabidopsis_thaliana/TAIR10/TAIR10_chr_all.ShortId.DS.dict -OUTPUT /data/annotations/resources/genomes_rvd/Arabidopsis_thaliana/TAIR10/Arabidopsis_thaliana.TAIR10.48_rvd1b.refFlat
**********


16:13:42.607 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/data/annotations/tools/SingleCell/Drop-seq_tools-2.4.0/jar/lib/picard-2.20.5.jar!/com/intel/gkl/native/libgkl_compression.so
16:13:42.618 WARN  NativeLibraryLoader - Unable to load libgkl_compression.so from native/libgkl_compression.so (No such file or directory)
16:13:42.628 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/data/annotations/tools/SingleCell/Drop-seq_tools-2.4.0/jar/lib/picard-2.20.5.jar!/com/intel/gkl/native/libgkl_compression.so
16:13:42.628 WARN  NativeLibraryLoader - Unable to load libgkl_compression.so from native/libgkl_compression.so (No such file or directory)
[Mon Oct 12 16:13:42 CEST 2020] ConvertToRefFlat ANNOTATIONS_FILE=/data/annotations/resources/genomes_rvd/Arabidopsis_thaliana/TAIR10/Arabidopsis_thaliana.TAIR10.48_rvd3b.Pt.Mt.gtf SEQUENCE_DICTIONARY=/data/annotations/resources/genomes_rvd/Arabidopsis_thaliana/TAIR10/TAIR10_chr_all.ShortId.DS.dict OUTPUT=/data/annotations/resources/genomes_rvd/Arabidopsis_thaliana/TAIR10/Arabidopsis_thaliana.TAIR10.48_rvd1b.refFlat    VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Mon Oct 12 16:13:42 CEST 2020] Executing as [email protected] on Linux 4.15.0-23-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_171-8u171-b11-0ubuntu0.18.04.1-b11; Deflater: Jdk; Inflater: Jdk; Provider GCS is not available; Picard version: 2.4.0(3d2b3d8_1600201514)
INFO    2020-10-12 16:13:43     GTFParser       Seen many non-increasing record positions. Printing Read-names as well.
[Mon Oct 12 16:13:45 CEST 2020] org.broadinstitute.dropseqrna.annotation.ConvertToRefFlat done. Elapsed time: 0.05 minutes.
Runtime.totalMemory()=2194669568
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
        at org.broadinstitute.dropseqrna.annotation.AnnotationUtils.parseOptionalFields(AnnotationUtils.java:386)
        at org.broadinstitute.dropseqrna.annotation.GTFParser.parseLine(GTFParser.java:114)
        at org.broadinstitute.dropseqrna.annotation.GTFParser.next(GTFParser.java:88)
        at org.broadinstitute.dropseqrna.annotation.GTFParser.next(GTFParser.java:39)
        at htsjdk.samtools.util.PeekableIterator.advance(PeekableIterator.java:71)
        at htsjdk.samtools.util.PeekableIterator.next(PeekableIterator.java:57)
        at org.broadinstitute.dropseqrna.utils.FilteredIterator.next(FilteredIterator.java:92)
        at java.util.Iterator.forEachRemaining(Iterator.java:116)
        at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
        at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
        at org.broadinstitute.dropseqrna.annotation.GeneFromGTFBuilder.gatherByGeneName(GeneFromGTFBuilder.java:205)
        at org.broadinstitute.dropseqrna.annotation.GeneFromGTFBuilder.<init>(GeneFromGTFBuilder.java:48)
        at org.broadinstitute.dropseqrna.annotation.GTFReader.load(GTFReader.java:79)
        at org.broadinstitute.dropseqrna.annotation.GTFReader.load(GTFReader.java:74)
        at org.broadinstitute.dropseqrna.annotation.GeneAnnotationReader.loadGTFFile(GeneAnnotationReader.java:67)
        at org.broadinstitute.dropseqrna.annotation.GeneAnnotationReader.loadAnnotationsFile(GeneAnnotationReader.java:51)
        at org.broadinstitute.dropseqrna.annotation.GeneAnnotationReader.loadAnnotationsFile(GeneAnnotationReader.java:61)
        at org.broadinstitute.dropseqrna.annotation.ConvertToRefFlat.doWork(ConvertToRefFlat.java:71)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:305)
        at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
        at org.broadinstitute.dropseqrna.cmdline.DropSeqMain.main(DropSeqMain.java:42)

Both gf's have the same fields in column 9.
I see: "GTFParser Seen many non-increasing record positions"
what does this exactly mean?
So, what is wrong here? Is the order of the gene_id, transcript_id, gene_name, transcript_name important.
I assume the gene lines do not need a transcript id (btw adding this information to the gene lines did not solve my problems).
For transcript_name I simply used transcript_id
Also the one entry where there is a ";" in the gene name was removed.
And finally, there was a mismatch in the fasta and the gtf wrt the scaffold names (for Mt and Pt) this was also corrected.

Any suggestion is greatly appreciated,
Raymond

@geshtin
Copy link
Author

geshtin commented Oct 13, 2020

ls,
fyi.
I have been digging into it a bit more and came to some interesting observations.
I split up my file into smaller parts to see if somehwere there are some "strange entries"
Sometimes I came accross non-unique gene names (that is: two different gene_id entries, that have the same gene_name), But this was not a problem with just a few instances (when there were only, let's say, about 50 instance with duplicate names the refflat was created). But the entire file (more than 250 duplicate names) it crashes. Removing all these duplicate names (replacing all the gene_names by the gene_id entry): running fine.
I also noticed that "half entries" so starting at exon as first line (and no corresponding initial gene line): no problem.
So setting the VALIDATION_STRINGENCY=STRICT to some other value might also work (I did not test this), What is the limit of these type or errors??
Raymond

@seb-mueller
Copy link
Collaborator

seb-mueller commented Oct 13, 2020

Thanks for reporting, indeed, dropseq-tools was apparently developed with human/mouse in mind, which seems to be not what you use (Zea maize?). In particular, some of the annotations out there seems to have missing gene_name entries etc which still seems to be unresolved:
broadinstitute/Drop-seq#50

I'll have to go through your error log in detail, but in the meantime, maybe this is of help:
#109

For Arabidopsis, some names have a ; (so annoying, who puts that in a gene name?!), the issue above addresses this as well.
Ultimately, a new version is coming out soon, hopefully solving this issue in one way or another!

@geshtin
Copy link
Author

geshtin commented Oct 13, 2020

thanks for the response,
I saw those issue 50 and 109. so I tried various versions of adding gene_name and transcript_name.
In the end adding the gene_name to all lines (if not already there) and transcript_name (to all non gene lines) worked out fine.
The underlying problem for me was that the gene_name was not unique! A small number of non-unique entries was fine, but a large number (250) not. So i created unique gene_names and then creating the refflat file worked fine.
Indeed a problem with the gtf (or gff's) in general is that many people do strange things with them (like the ; in the gene_name) and often not follow the gtf specifications.
Raymond
btw, the reformatted entry above looks much better now ;)

@seb-mueller
Copy link
Collaborator

Indeed, non unique gene names are a big problem and probably to be brought to attention to the annotation databases/consortia! So good spot. Did you write code to make those gene names unique, if yes, you could add it here or in issue #109 or so? Maybe this can be integrated in the pipeline as an option.

@geshtin
Copy link
Author

geshtin commented Oct 13, 2020

i wrote some python code (so not Java), and it is not that nice code (true python programmers will have a lot of comments!). Not really suitable for a production environment. But it works for me, and only when the first attribute in the 9th column is the gene_id (often the case fortunately). Now I just overwrite ALL gene_names, preferably you want to keep the "nice" gene names, many people use those to get biological info. So this should be, I think handled.
But I don't mind adding the roughly 20 lines of code here if you are still interested!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants