Direct GFF reading: 1001G scalability #101

Open
3 tasks
josiahseaman opened this issue Aug 10, 2020 · 5 comments
Assignees
Labels
enhancement New feature or request question Further information is requested

Comments

@josiahseaman
Member

josiahseaman commented Aug 10, 2020

For the 1001 Genomes project we have 23 A. thaliana genomes of 150 Mbp each. Each genome will have its own annotation of >30,000 genes, for a total of about 750,000 paths. This requires some consideration of scalability.

Important note: the 1001G graph was previously being built with RevealGraph; it is now built with seqwish.

@ekg a key question is: will the approach of using alignment really scale to 750,000 short sequences, or will it break down at some point? My main concern is that at some point it becomes a statistical certainty that we'll get off-target matches (even at 100% identity) or that some annotation will fail to meet the alignment criteria. @AndreaGuarracino and @mandosoft mentioned needing to tweak the alignment parameters with the 30 SARS-CoV-2 genes in order to get it to work. The obvious case is short UTRs, where the sequences are clearly not unique but we annotate short stretches because of their context.
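
To give a rough sense of scale for the off-target concern, here is a back-of-the-envelope sketch in Python. It assumes uniformly random, independent bases, which real genomes (repeats, gene families) violate, so it understates the real ambiguity. The function name, the parameters, and the query lengths tried are illustrative assumptions, not part of any tool mentioned in this thread.

```python
def expected_chance_matches(query_len, genome_len=150_000_000,
                            n_genomes=23, both_strands=True):
    """Expected number of exact matches of one random query by chance alone.

    Assumes every base is drawn uniformly and independently from A/C/G/T.
    """
    # Number of positions where the query could be placed across all genomes.
    positions = max(genome_len - query_len + 1, 0) * n_genomes
    if both_strands:
        positions *= 2
    # Probability that a random position matches all query_len bases.
    return positions / 4 ** query_len


for k in (10, 15, 20, 30):
    print(k, expected_chance_matches(k))
```

Under this simplified model, a query of about 16 bp or shorter is expected to have at least one exact match somewhere in the 23 genomes purely by chance, which is the regime short UTR annotations can fall into.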

Changes Needed

  • Remove list of path names from each chunk file
  • Stress test Annotation from alignment
  • Translate Annotation coordinates directly into Node ids rather than using alignment from sequence

@josiahseaman added the enhancement (New feature or request) and question (Further information is requested) labels Aug 10, 2020
@josiahseaman
Member Author

Christian has changed graph construction to use seqwish plus his own "node unrolling" solution for nodes with high traversal (repeats).

@josiahseaman
Member Author

josiahseaman commented Sep 4, 2020

GFF => GFA Paths Conversion Spec

@mandosoft Here are some procedures to guide you.
For each genome, read the annotation GFF in that coordinate frame and follow the same coordinates in the graph.

  • Use the path index server, or more directly the GCSE or odgi index services, to get the name of the node that contains the starting nucleotide position of an annotation.
  • Mockup version: start the annotation at the beginning of that node. Copy the substring of the genome's path in the GFA from the start position to the end position of the exon, i.e. the nodes covering (end - start) bp (inclusive).
  • Create a new path named after the gene, with the node traversals of its exons concatenated, and add it to the GFA or HandleGraph.
  • Output the new Graph and visualize it in Pantograph
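
The mockup version above can be sketched roughly as follows. This is a minimal illustration, not the Pantograph implementation: it replaces the index services with a plain cumulative-offset lookup over a GFA1 path, ignores gene strand, and all function names, the tiny example graph, and the exon coordinates are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def parse_gfa(text):
    """Return ({node_id: sequence}, {path_name: [(node_id, orient), ...]})."""
    nodes, paths = {}, {}
    for line in text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S":
            nodes[fields[1]] = fields[2]
        elif fields[0] == "P":
            paths[fields[1]] = [(s[:-1], s[-1]) for s in fields[2].split(",")]
    return nodes, paths


def steps_for_interval(nodes, steps, start, end):
    """Path steps overlapping the 1-based inclusive interval [start, end]."""
    lengths = [len(nodes[n]) for n, _ in steps]
    # starts[i] is the 0-based path offset of the first base of steps[i].
    starts = [0] + list(accumulate(lengths))[:-1]
    first = bisect_right(starts, start - 1) - 1
    last = bisect_right(starts, end - 1) - 1
    return steps[first:last + 1]


def annotation_path(nodes, paths, genome, name, exons):
    """Build a GFA P line for a gene from its exon coordinates on `genome`."""
    steps = []
    for start, end in exons:  # GFF coordinates: 1-based, inclusive
        steps.extend(steps_for_interval(nodes, paths[genome], start, end))
    walk = ",".join(n + o for n, o in steps)
    return "\t".join(["P", name, walk, "*"])


# Hypothetical four-node graph with a single genome path.
gfa = "\n".join([
    "S\t1\tACGT",
    "S\t2\tGG",
    "S\t3\tTTTAA",
    "S\t4\tC",
    "P\tgenomeA\t1+,2+,3+,4+\t*",
])
nodes, paths = parse_gfa(gfa)
gene_line = annotation_path(nodes, paths, "genomeA", "geneX", [(5, 7), (12, 12)])
print(gene_line)
```

Because the annotation is snapped to whole nodes, the resulting path can extend past the true exon boundaries; the exact-boundary variant below addresses that.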

Harder version: Exact Start and End positions

  • Find the Node X an annotation starts in (likewise for the second and later exons). If the start is not the first nucleotide of X, create two new nodes: A with sequence X[0:split] and B with sequence X[split:]. The annotation will start on B. All references to Node X must now be replaced with both A and B.
    • This changes the number of nodes per path, but I don't think this is ever indexed anywhere? It should be a linked list.
  • Same procedure for ending an annotation when it is not the last nucleotide in the node.
  • Node X is deleted from Graph.
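
The node-splitting steps above can be sketched like this. It is a simplified illustration that uses plain dictionaries instead of a HandleGraph, only rewrites paths (link/edge rewiring is left out), and uses hypothetical names (`split_node`, `spell`, the `2a`/`2b` ids). Note that a reverse traversal of X must visit B before A.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def spell(nodes, steps):
    """Sequence spelled by a path, reverse-complementing '-' traversals."""
    return "".join(
        nodes[n] if o == "+" else nodes[n].translate(COMPLEMENT)[::-1]
        for n, o in steps
    )


def split_node(nodes, paths, node_id, split, id_a, id_b):
    """Replace node_id by A = seq[:split] and B = seq[split:] in every path."""
    seq = nodes[node_id]
    if not 0 < split < len(seq):
        raise ValueError("split must fall strictly inside the node")
    nodes[id_a], nodes[id_b] = seq[:split], seq[split:]
    del nodes[node_id]  # Node X is removed from the graph
    for name, steps in paths.items():
        new_steps = []
        for nid, orient in steps:
            if nid != node_id:
                new_steps.append((nid, orient))
            elif orient == "+":
                new_steps += [(id_a, "+"), (id_b, "+")]
            else:
                # Reverse traversal reads B first, then A.
                new_steps += [(id_b, "-"), (id_a, "-")]
        paths[name] = new_steps


# Hypothetical graph: node "2" is traversed forward by one path, reverse by another.
nodes = {"1": "ACGT", "2": "GGCCTT", "3": "A"}
paths = {
    "genomeA": [("1", "+"), ("2", "+"), ("3", "+")],
    "genomeB": [("3", "-"), ("2", "-")],
}
before = {name: spell(nodes, steps) for name, steps in paths.items()}
split_node(nodes, paths, "2", 2, "2a", "2b")
after = {name: spell(nodes, steps) for name, steps in paths.items()}
print(before == after)  # every path should still spell the same sequence
```

Checking that each path spells the same sequence before and after the split is a cheap invariant to run when stress testing this on a large graph.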

@ekg

ekg commented Sep 4, 2020 via email

@AndreaGuarracino
Member

@josiahseaman, for the first alpha release of Pantograph, I did exactly a part of what Erik suggests. Now I have implemented the complete suggestion, and here is the draft, but working, gff2paf+fasta script to start tackling scalable annotation.

https://gist.github.com/AndreaGuarracino/855bb5202fe7a083e43f4ccf6abed35c

@ekg

ekg commented Sep 6, 2020

Thank you @AndreaGuarracino, your script clarifies how straightforward this is.

4 participants