Direct GFF reading: 1001G scalability #101

josiahseaman · 2020-08-10T16:04:07Z

For the 1001 Genomes project we have 23 A. thaliana genomes of about 150 Mbp each. Each genome will have its own annotation of >30,000 genes, for a total of roughly 750,000 paths. This requires some thought about scalability.

Important note: the 1001G graph was previously built with RevealGraph; it is now built with seqwish.

@ekg a key question: will the alignment-based approach really scale to 750,000 short sequences, or will it break down at some point? My main concern is that at some point it becomes a statistical certainty that we'll get off-target matches (even at 100% identity) or that some annotation will fail to meet the alignment criteria. @AndreaGuarracino and @mandosoft mentioned needing to tweak the alignment parameters to get it to work with the 30 SARS-CoV-2 genes. The obvious case is short UTRs, where the sequences are clearly not unique but we annotate short stretches because of their context.
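To put a rough number on the off-target concern, here is a back-of-the-envelope sketch. It assumes uniformly random DNA, which real genomes are not: repeats make chance matches far more likely, so treat the result as a lower bound. The function name `expected_chance_hits` is hypothetical, and the genome and path counts are the approximate figures quoted above.

```python
def expected_chance_hits(query_len: int, target_bp: int) -> float:
    """Expected number of exact chance matches of one query of query_len bp
    in target_bp of uniformly random DNA, counting both strands.
    Edge effects (target_bp - query_len + 1 positions) are ignored."""
    return 2 * target_bp / 4 ** query_len


# Approximate figures from this issue: 23 genomes of ~150 Mbp, ~750,000 annotations.
total_bp = 23 * 150_000_000
n_annotations = 750_000

for length in (12, 16, 20, 30):
    per_query = expected_chance_hits(length, total_bp)
    # Expected chance hits per annotation, and summed over all annotations.
    print(length, per_query, per_query * n_annotations)
```

Under this model only very short features are at risk from chance alone, so the practical problem is more likely repeated sequence (such as the short UTRs mentioned above) than random collisions.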

Changes Needed

Remove list of path names from each chunk file
Stress test Annotation from alignment
Translate Annotation coordinates directly into Node ids rather than using alignment from sequence

josiahseaman · 2020-09-04T13:58:33Z

Christian has changed graph construction to use seqwish plus his own "node unrolling" solution for nodes with high traversal (repeats).

josiahseaman · 2020-09-04T17:08:42Z

GFF => GFA Paths Conversion Spec

@mandosoft Here are some procedures to guide you.
For each genome, read the annotation GFF in that coordinate frame and follow the same coordinates in the graph.

Use the path index server, or more directly GCSE or odgi index services, to get the name of the node that contains the nucleotide position where an annotation starts.
Mockup Version: Start the annotation at the beginning of the node. Copy the path substring from the GFA of the genome from the start to end position of the exon. (end - start) bp of nodes (inclusive).
Create a new path named after the gene, with the node traversals of its exons concatenated, and add it to the GFA or HandleGraph.
Output the new Graph and visualize it in Pantograph
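A minimal sketch of the mockup version, assuming the genome's path is already in memory as a list of `(node_id, orientation)` steps and node lengths are in a dict. In practice those lookups would come from the path index or odgi; the function names here (`step_offsets`, `steps_covering`, `gene_path_line`) are hypothetical and not part of any existing tool. GFF coordinates are 1-based and inclusive, so they are shifted to 0-based before the lookup.

```python
from bisect import bisect_right
from itertools import accumulate


def step_offsets(step_lengths):
    """0-based start offset of every step in a path, given each step's length."""
    return [0] + list(accumulate(step_lengths))[:-1]


def steps_covering(path, lengths, start, end):
    """Steps of `path` that overlap the GFF interval [start, end] (1-based, inclusive).

    Mockup behaviour: the result snaps to whole nodes, so it may begin before
    `start` and finish after `end`.
    """
    offsets = step_offsets([lengths[node] for node, _ in path])
    first = bisect_right(offsets, start - 1) - 1  # step containing the first base
    last = bisect_right(offsets, end - 1) - 1     # step containing the last base
    return path[first:last + 1]


def gene_path_line(gene, path, lengths, exons):
    """GFA1 P line for a gene: the node traversals of all exons, concatenated."""
    steps = []
    for start, end in exons:
        steps.extend(steps_covering(path, lengths, start, end))
    walk = ",".join(f"{node}{orient}" for node, orient in steps)
    return f"P\t{gene}\t{walk}\t*"


# Toy data, for illustration only.
lengths = {"1": 5, "2": 3, "3": 4, "4": 6}
genome_path = [("1", "+"), ("2", "+"), ("3", "+"), ("4", "+")]
print(gene_path_line("geneA", genome_path, lengths, [(7, 9), (14, 15)]))
```

If two exons touch the same node, that node appears twice in the gene path; whether to deduplicate such steps is a design choice this sketch leaves open.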

Harder version: Exact Start and End positions

Find the Node X an annotation starts in (including the second exon). If the annotation does not start at the node's first nucleotide, create two new nodes A = X[0:split] and B = X[split:]. The annotation will start on B. All references to Node X must now be replaced with both A and B.
- This changes the number of nodes per path, but I don't think this is ever indexed anywhere? It should be a linked list.
Same procedure for ending an annotation when it is not the last nucleotide in the node.
Node X is deleted from Graph.
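The split procedure above can be sketched on a toy in-memory graph. This assumes nodes are a dict of `id -> sequence` and paths are lists of `(node_id, orientation)` steps; `split_node` is a hypothetical helper, not an odgi or HandleGraph call. It only rewrites path steps: a real implementation would also have to rewire the edges (GFA L lines) that touch Node X.

```python
def split_node(nodes, paths, node_id, split, id_a, id_b):
    """Replace node `node_id` with A = seq[:split] and B = seq[split:] in place.

    Every path step that visited the old node now visits A then B
    (or B then A when the step is in reverse orientation).
    """
    seq = nodes[node_id]
    if not 0 < split < len(seq):
        raise ValueError("split must fall strictly inside the node")
    nodes[id_a] = seq[:split]
    nodes[id_b] = seq[split:]
    for name in list(paths):
        new_steps = []
        for step_id, orient in paths[name]:
            if step_id != node_id:
                new_steps.append((step_id, orient))
            elif orient == "+":
                new_steps.extend([(id_a, "+"), (id_b, "+")])
            else:
                # Reverse traversal reads B's reverse complement first, then A's.
                new_steps.extend([(id_b, "-"), (id_a, "-")])
        paths[name] = new_steps
    del nodes[node_id]  # Node X is deleted from the graph


# Toy data, for illustration only.
nodes = {"x": "ACGTAC", "y": "GG"}
paths = {"g1": [("x", "+"), ("y", "+")], "g2": [("y", "+"), ("x", "-")]}
split_node(nodes, paths, "x", 2, "a", "b")
print(nodes, paths)
```

An annotation that begins at offset `split` inside X can then start exactly on B. Ending an annotation mid-node uses the same call with the split placed just after the annotation's last base.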

ekg · 2020-09-04T17:57:27Z

My suggestion would be to make a script that produces the FASTA file for a given GFF based on a given reference while simultaneously producing a PAF that describes the mapping of each of the sequences in the GFF FASTA into the chosen reference sequence. Then, by adding this PAF and FASTA into the alignment, we can directly embed the annotations in the pangenome. The only code that has to be written is the conversion of the GFF ranges into PAF. BedTools already has a way to make the FASTA.
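As a rough illustration of the GFF to PAF step (this is not the linked gist, just a sketch under stated assumptions): each selected GFF feature becomes one PAF record that maps the whole feature sequence, as an exact match, onto its reference interval. The function name `gff_to_paf`, the fixed mapping quality of 60, and the `cg:Z:` CIGAR written with `M` operators are assumptions; check what the downstream aligner or graph builder expects. The strand column is only correct if the FASTA was extracted strand-aware, so that minus-strand features are reverse-complemented.

```python
def gff_to_paf(gff_lines, ref_lengths, feature_types=("gene",)):
    """Yield one PAF line per selected GFF feature.

    GFF coordinates are 1-based inclusive; PAF coordinates are 0-based half-open.
    `ref_lengths` maps each reference sequence name to its length in bp.
    """
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        seqid, _src, ftype, start, end, _score, strand, _phase, attrs = (
            line.rstrip("\n").split("\t")
        )
        if ftype not in feature_types:
            continue
        start0, end0 = int(start) - 1, int(end)
        length = end0 - start0
        # Use the ID attribute as the query name, falling back to the coordinates.
        fields = dict(kv.split("=", 1) for kv in attrs.split(";") if "=" in kv)
        name = fields.get("ID", f"{seqid}:{start0}-{end0}")
        paf_strand = strand if strand in ("+", "-") else "+"
        yield "\t".join(map(str, [
            name, length, 0, length,                  # query name, length, start, end
            paf_strand,
            seqid, ref_lengths[seqid], start0, end0,  # target name, length, start, end
            length, length, 60,                       # matches, block length, mapq (assumed)
            f"cg:Z:{length}M",                        # exact-match CIGAR (assumed form)
        ]))


# Toy data, for illustration only.
toy_gff = ["##gff-version 3", "chr1\ttoy\tgene\t11\t20\t.\t-\t.\tID=gene1;Name=demo"]
for record in gff_to_paf(toy_gff, {"chr1": 100}):
    print(record)
```

The query names in the PAF must match the FASTA headers exactly, so whichever naming scheme the FASTA extraction uses should be mirrored here.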


AndreaGuarracino · 2020-09-06T09:43:48Z

@josiahseaman, for the first alpha release of Pantograph, I did exactly a part of what Erik suggests. Now I have implemented the complete suggestion, and here is the draft, but working, gff2paf+fasta script to start tackling scalable annotation.

https://gist.github.com/AndreaGuarracino/855bb5202fe7a083e43f4ccf6abed35c

ekg · 2020-09-06T10:00:29Z

Thank you @AndreaGuarracino, your script clarifies how straightforward this is.

josiahseaman added the enhancement and question labels Aug 10, 2020

josiahseaman assigned ekg Aug 10, 2020

josiahseaman assigned mandosoft Sep 4, 2020
