Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: annotation of data in query coordinates #1578

Open
wants to merge 26 commits into
base: master
Choose a base branch
from

Conversation

ivan-aksamentov
Copy link
Member

@ivan-aksamentov ivan-aksamentov commented Mar 9, 2025

Resolves #980
Resolves #1571

Supersedes #982 (closes #982)

The idea is to output genome annotation which is similar to the input reference annotation, but for query sequences.

This is done by cloning the reference annotation and adjustung coordinates and some other fields to match a query sequence. A stream of these annotations is being written to the newly introduced output files GFF3 and TBL, as well as to the existing JSON/NDJSON files.

Work items:

  • Add --output-annotation-tbl to output GenBank .tbl file with query annotations
  • Add --output-annotation-gff to output .gff file with query annotations
  • Add annotation field containing JSON dump of the internal GeneMap struct into each entry in the results array in JSON and NDJSON outputs (the schemas of these formats are unstable, as before)
  • In GFF file, features for each sequence is being written with seqid column (column 1) set to sequence name in the input fasta file
  • In Genbank TBL file, each >Feature block header contains sequence name in the input fasta file
  • The annotations are being streamed to GFF, TBL and NDJSON files as sequences are processed. The JSON output is written in one go, as before.
  • Add web export

Deviations/particularities

  • "source" column (column 2) in GFF is set to "nextclade"

  • adds seq_index attribute, to allow to identify query sequences by index in fasta file, as opposed to by name only (unreliable in presence of duplicate names)

  • adds is_reverse_complement attribute when --retry-reverse-complement results in reverse-complemented sequence output. (Note that sequence name also contains a suffix in this case).

  • virtual genes and CDSes which were not present in the input reference annotation are being written. Nextclade creates them when input annotation contains only CDSes or only genes.

  • Name attribute is always added if not present. There are complex rules to deduce the optimal name when Nextclade parses input gff annotation. The result of it is added to output attributes if there's no Name attribute yet.

  • comments and pragmas from reference annotation file are not being output. Comments are not being parsed at all, and are lost in the output. The sequence-region pragma is parsed, but will require some more work to output it. The rest of pragmas are lost.

  • gene boundaries (not physically existing in CDS-based world, but somehow required in both, the GFF and TBL files) are calculated as a min/max of the boundaries of child CDSes

  • I think that codon_start in TBL is the same as phase in GFF, but not sure

  • strand orientation:

    • GFF: CDS strand is written as is. Genes, to my understanding, don't have strand orientation, because they can consist of multiple CDS segments with different orientation. So genes I left genes with . instead of a particular strand value (even if ).

    • TBL: for CDS the range boundaries are correctly swapped if strand is nagative. For genes, similarly to GFF, the strandedness is undefined, so I leave genes with forward orientation always.

    • does any of it need to be adjusted if we reverse-complement the query when --retry-reverse-complement is passed?

  • "translation" attribute is removed from output, to avoid emitting reference translations into query annotation

Defects

  • currently the seqid column in gff and feature name in tbl contain full sequence name line (id+ desc). Nextclade traditionally passes this value though without parsing, and up until now we haven't dealt with the restricted file firmats which don't allow spaces. In order to fix this, we need to split the seq_name on first space, either in fasta parser or when writing the gff and tbl files.

  • empty or semi-empty files are written when there's no input annotation

GFF: CDS strand is fixed in this commit. But  genes, to my understanding, don't have strand orientation, because they can consist of multiple CDS segments with different orientation. So genes I left genes with `.` instead of a particular strand value.

TBL: for CDS the range boundaries are correctly swapped, so no changes are needed. For genes, similarly to GFF, the strandedness is undefined, so I leave genes with forward orientation always.
@ivan-aksamentov ivan-aksamentov marked this pull request as ready for review March 10, 2025 02:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Annotation of data in query coordinates ENH: Output feature table for use during Genbank submission
2 participants