Direct GFF reading: 1001G scalability #101
Comments
Christian has changed graph construction to use seqwish plus his own "node unrolling" solution for nodes with high traversal (repeats).
GFF => GFA Paths Conversion Spec
@mandosoft Here are some procedures to guide you.
Harder version: Exact Start and End positions
My suggestion would be to make a script that produces the FASTA file for a
given GFF based on a given reference while simultaneously producing a PAF
that describes the mapping of each of the sequences in the GFF FASTA into
the chosen reference sequence.
Then, by adding this PAF and FASTA into the alignment, we can directly
embed the annotations in the pangenome. The only code that has to be
written is the conversion of the GFF ranges into PAF. BedTools already has
a way to make the FASTA.
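A minimal sketch of the GFF-range-to-PAF step described above, in Python. It assumes GFF3 input, an exact full-length match of each feature against the reference, and a query name that matches whatever header the FASTA step produces. The function names, the `ref_lengths` dictionary, the fixed mapping quality of 255, and the `cg:Z:` CIGAR tag are illustrative assumptions, not part of any existing script.

```python
def parse_attributes(field):
    """Parse a GFF3 attribute column such as 'ID=gene1;Name=abc' into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key] = value
    return attrs


def gff_record_to_paf(line, ref_lengths):
    """Convert one GFF3 feature line into one PAF line (exact match to the reference).

    ref_lengths maps each reference sequence name to its length (hypothetical input).
    """
    seqid, _source, _ftype, start, end, _score, strand, _phase, attributes = (
        line.rstrip("\n").split("\t")
    )
    # GFF is 1-based and end-inclusive; PAF is 0-based and end-exclusive.
    t_start = int(start) - 1
    t_end = int(end)
    length = t_end - t_start
    attrs = parse_attributes(attributes)
    # Assumed naming scheme: the feature ID if present, otherwise the coordinates.
    q_name = attrs.get("ID", f"{seqid}:{t_start}-{t_end}")
    # Assumes minus-strand features were reverse-complemented in the FASTA.
    paf_strand = "-" if strand == "-" else "+"
    fields = [
        q_name, length, 0, length, paf_strand,        # query columns
        seqid, ref_lengths[seqid], t_start, t_end,    # target columns
        length, length, 255,                          # matches, block length, MAPQ
        f"cg:Z:{length}M",                            # assumed CIGAR tag: all matches
    ]
    return "\t".join(str(f) for f in fields)
```

For example, an exon at `Chr1:101-150` on a 1000 bp reference becomes a 50 bp query aligned to target positions 100 to 150.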
…On Fri, Sep 4, 2020 at 7:08 PM Josiah Seaman ***@***.***> wrote:
GFF => GFA Paths Conversion Spec
For each genome, read the annotation GFF in that coordinate frame and
follow the same coordinates in the graph.
- Use path index server or more directly GCSE or odgi index services
to get the node name that starts a nucleotide position of an annotation.
- Mockup Version: Start the annotation at the beginning of the node.
Copy the path substring from the GFA of the genome from the start to end
position of the exon. (end - start) bp of nodes (inclusive).
- Create a new path with the gene name, and node traversals
concatenated from multiple exons, add it to the GFA or HandleGraph.
- Output the new Graph and visualize it in Pantograph
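The mockup version above can be sketched in Python as follows. This assumes the genome's path is already available as an ordered list of `(node_id, node_length)` pairs and that annotation coordinates are 0-based and end-exclusive; both function names are hypothetical, and no real index service is called.

```python
def nodes_for_range(path_steps, start, end):
    """Return the node IDs of a path that overlap [start, end) on that path.

    path_steps: list of (node_id, node_length) in path order.
    start, end: 0-based, end-exclusive path coordinates.
    """
    covered = []
    offset = 0  # path coordinate where the current node begins
    for node_id, node_length in path_steps:
        node_end = offset + node_length
        if node_end > start and offset < end:
            covered.append(node_id)
        offset = node_end
        if offset >= end:
            break  # everything past this point lies after the annotation
    return covered


def gene_path(path_steps, exons):
    """Concatenate the node traversals of several exons into one gene path."""
    steps = []
    for start, end in exons:
        steps.extend(nodes_for_range(path_steps, start, end))
    return steps
```

Because whole nodes are copied, the resulting path can extend past the true exon boundaries, and a node shared by two adjacent exons appears twice.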
Harder version: Exact Start and End positions
- Find the Node X an annotation starts in (including the second exon). If
the annotation does not start at the node's first nucleotide, create two
new nodes with sequence A = X[0:split] and B = X[split:]. The annotation
will start on B. All references to Node X must now be replaced with both
A and B.
- This changes the number of nodes per path, but I don't think this
is ever indexed anywhere? It should be a linked list.
- Same procedure for ending an annotation when it is not the last
nucleotide in the node.
- Node X is deleted from the Graph.
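A minimal Python sketch of the node split described in the steps above. It assumes a toy in-memory graph: `nodes` maps node IDs to sequences and `paths` maps path names to lists of node IDs. Edges and reverse-orientation traversals are not modeled, and `split_node` is a hypothetical helper rather than a HandleGraph or odgi call.

```python
def split_node(nodes, paths, node_id, split, id_a, id_b):
    """Split nodes[node_id] at offset `split` into two new nodes and update every path.

    nodes: dict of node ID -> sequence.
    paths: dict of path name -> list of node IDs (forward orientation only).
    id_a, id_b: IDs for the two new nodes; an annotation would start on id_b.
    """
    sequence = nodes[node_id]
    if not 0 < split < len(sequence):
        raise ValueError("split must fall strictly inside the node")
    nodes[id_a] = sequence[:split]   # A = X[0:split]
    nodes[id_b] = sequence[split:]   # B = X[split:]
    del nodes[node_id]               # Node X is deleted from the graph
    for name, steps in paths.items():
        new_steps = []
        for step in steps:
            if step == node_id:
                new_steps.extend([id_a, id_b])  # every reference to X becomes A, B
            else:
                new_steps.append(step)
        paths[name] = new_steps
```

Ending an annotation inside a node would use the same call, with the annotation ending on the first of the two new nodes.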
@josiahseaman, for the first alpha release of Pantograph, I implemented part of what Erik suggests. I have now implemented the complete suggestion; here is a draft, but working, script: https://gist.github.com/AndreaGuarracino/855bb5202fe7a083e43f4ccf6abed35c
Thank you @AndreaGuarracino, your script clarifies how straightforward this is.
For the 1001 Genomes project we have 23 A. thaliana genomes of 150 Mbp each. Each genome will have its own annotation of >30,000 genes, for a total of roughly 750,000 paths. This requires some consideration about scalability.
Important Note: the 1001G graph was previously built with RevealGraph; it is now built with seqwish.
@ekg a key question is: will the approach of using alignment really scale to 750,000 short sequences, or will it break down at some point? My main concern is that at some point it becomes a statistical certainty that we'll get off-target matches (even at 100% identity) or that some annotation will fail to meet the alignment criteria. @AndreaGuarracino and @mandosoft mentioned needing to tweak the alignment parameters with the 30 SARS-CoV-2 genes in order to get it to work. The obvious case is short UTRs, where the sequences are clearly not unique but we annotate short stretches because of their context.
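One rough way to gauge the off-target concern is to count how many annotated sequences occur more than once, exactly, in their own genome. The Python sketch below assumes the genome and the feature sequences are already loaded as plain strings and only checks the forward strand; both function names are hypothetical, and a plain string scan like this would be far too slow for 150 Mbp genomes without a proper index.

```python
def count_occurrences(genome, query):
    """Count exact occurrences of query in genome, allowing overlaps."""
    count = 0
    position = genome.find(query)
    while position != -1:
        count += 1
        position = genome.find(query, position + 1)
    return count


def ambiguous_features(genome, features):
    """Return the names of features whose sequence occurs more than once.

    features: dict of feature name -> sequence (forward strand only).
    """
    return [
        name
        for name, sequence in features.items()
        if count_occurrences(genome, sequence) > 1
    ]
```

Features reported here cannot be placed by sequence alone, which is the situation described for short UTRs.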
Changes Needed