GFA file of only gap records segfaults #30

sjackman · 2018-05-04T23:24:27Z

A GFA file ought to include both segment and gap records, so a file containing only gap records is incomplete. Even so, it'd be preferable if gfakluge didn't segfault when it encounters one.

H	VN:Z:2.0
G	*	6+	50+	121	58	FC:i:1
G	*	6+	225+	-57	58	FC:i:1
G	*	6+	298-	-83	8	FC:i:55
G	*	6-	62-	-80	9	FC:i:47
G	*	6-	171-	-67	41	FC:i:2

❯❯❯ gfak stats -A gaps.gfa
[1]    98433 segmentation fault

edawson · 2018-05-08T15:53:46Z

Yikes. I assume we'd prefer an error (e.g. "Segment not found for gap <gap_id>")?

Just to verify I understand correctly: this is not valid GFA, and we should never get GFA that has the records spread across multiple files like this, right?
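For illustration, here is a minimal Python sketch of that kind of check. It is not gfakluge's actual code (which is C++). It assumes GFA2 gap lines with tab-separated fields `G gid sid1 sid2 dist var`, where each segment reference ends in an orientation sign, as in the records above. `check_gaps` and its parameters are hypothetical names.

```python
def check_gaps(segment_names, gfa_lines):
    """Raise ValueError for the first gap that refers to an unknown segment.

    Hypothetical helper. Assumes GFA2 gap lines of the form
    G <gid> <sid1> <sid2> <dist> <var> [tags], separated by tabs.
    """
    for line_no, line in enumerate(gfa_lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "G":
            continue  # only gap records are checked here
        if len(fields) < 6:
            raise ValueError(f"line {line_no}: malformed gap record")
        gap_id = fields[1]
        for ref in fields[2:4]:
            # Drop the trailing orientation sign ("6+" -> "6").
            name = ref[:-1] if ref[-1:] in ("+", "-") else ref
            if name not in segment_names:
                raise ValueError(
                    f"line {line_no}: segment {name} not found for gap {gap_id}"
                )
```

With an empty set of segment names, the first gap record in the file above would raise an error naming segment 6 instead of crashing.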

sjackman · 2018-05-14T19:31:41Z

Short answer, yes. It's not valid GFA.

Long answer: ABySS produces a GFA file of the segment records and edge records. For large genomes this file can be quite large. In a second step, ABySS then uses the paired-end and mate-pair reads to estimate the distances between segments and outputs the gap records. Rather than make a copy of the potentially large S+E records, it outputs only the gap records. ABySS can handle reading a GFA file spread across multiple files for this reason. It'd be useful to me if GFAKluge could also read these split files. Your call of course whether you want to support that or not. It's easy enough to use either awk or abyss-todot (a misnomer now since it handles more than just GraphViz files) to combine these two GFA files into a single file for GFAKluge.
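As a rough sketch of what combining the two files amounts to, the Python below chains several sources of GFA lines into one stream. It stands in for the awk or abyss-todot step and is not taken from either tool. `merge_gfa` is a hypothetical name, and keeping only the first header line is an assumption that holds when both files carry the same `H` record.

```python
def merge_gfa(sources):
    """Yield the lines of several GFA line iterables as one stream.

    Hypothetical helper. Assumes every source carries an equivalent
    header, so only the first H record seen is kept.
    """
    seen_header = False
    for source in sources:
        for line in source:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            if line == "H" or line.startswith("H\t"):
                if seen_header:
                    continue  # drop repeated headers from later files
                seen_header = True
            yield line
```

Each source can be an open file handle, so passing the S+E file first and the G file second yields a single stream with the segments ahead of the gaps.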

edawson · 2018-05-17T10:29:58Z

Interesting. How big are these two files?

I have been thinking about restructuring the command line tools to not build the GFAKluge object when the graph isn't being modified. When I get around to this I'll add support for breaking the graph into multiple files (with a stern warning, of course).

Concatenating the gaps file onto the seqs/edges file with cat sounds like it might work as-is, unless I missed something.
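To make the idea of not building a graph object concrete, here is a small Python sketch of a read-only pass that only tallies record types. It is an illustration under assumptions, not how gfak stats works or what it reports. `count_records` is a hypothetical name.

```python
from collections import Counter


def count_records(gfa_lines):
    """Count records by their one-letter type without building any graph.

    Hypothetical helper. Nothing is resolved between records, so a file
    holding only gap records is handled like any other.
    """
    counts = Counter()
    for line in gfa_lines:
        line = line.strip()
        if line:
            # The record type is the first tab-separated field.
            counts[line.split("\t", 1)[0]] += 1
    return counts
```

Because no record ever looks up another, a pass like this has no segment lookup that could fail on a gap-only file.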

edawson · 2018-05-17T14:20:30Z

I guess I should mention: tools that don't modify the graph are:

stats
extract
diff

These tools would support ABySS's split-file format, with a warning. The rest of the tools should support the complete ((S + E) + (G)) file, even if it is very large, and should be able to handle it regardless of record order. I didn't intend to enforce a record order on GFA files in GFAKluge, but it seems I've done so by accident for gap records (and probably edges as well).
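One way to avoid depending on record order is to read every record first and resolve references only afterwards. The Python below is a sketch of that approach under assumptions, not a description of GFAKluge's parser. `load_records` and `unresolved_gap_refs` are hypothetical names, and the field positions assume GFA2 `S` and `G` lines.

```python
from collections import defaultdict


def load_records(gfa_lines):
    """Bucket records by type first, so later steps do not depend on file order."""
    records = defaultdict(list)
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0]:
            records[fields[0]].append(fields)
    return records


def unresolved_gap_refs(records):
    """Return sorted segment names that gaps refer to but no S record defines."""
    segments = {fields[1] for fields in records["S"] if len(fields) > 1}
    missing = set()
    for fields in records["G"]:
        for ref in fields[2:4]:
            # Drop the trailing orientation sign ("50+" -> "50").
            name = ref[:-1] if ref[-1:] in ("+", "-") else ref
            if name not in segments:
                missing.add(name)
    return sorted(missing)
```

Since the gaps are looked at only after all lines are read, a gap record that appears before its segments resolves the same way as one that appears after them.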

sjackman · 2018-05-18T23:29:05Z

> Interesting. How big are these two files?

For a human genome:
FASTA: 2.9 GB
S+E with * for sequences: 137 MB
G: 10 MB

Thanks again, Eric!

edawson added this to the v0.2.4 milestone May 12, 2018

edawson modified the milestones: v0.2.4, v0.3 Sep 18, 2018
