feat: annotation of data in query coordinates #1578

ivan-aksamentov · 2025-03-09T17:36:18Z

Resolves #980
Resolves #1571

Supersedes #982 (closes #982)

The idea is to output a genome annotation that is similar to the input reference annotation, but expressed in the coordinates of each query sequence.

This is done by cloning the reference annotation and adjusting coordinates and some other fields to match a query sequence. A stream of these annotations is written to the newly introduced GFF3 and TBL output files, as well as to the existing JSON/NDJSON files.
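
To illustrate the coordinate adjustment, here is a minimal sketch of mapping a reference range to query coordinates through a pairwise alignment. It is not the code from this PR: the function names `ref_to_qry_map` and `ref_range_to_qry`, the use of gapped alignment strings as input, and the choice to shrink a range to the positions that survive in the query are all assumptions made for illustration.

```rust
/// Map each reference position (0-based) to the query position aligned to it.
/// `None` marks reference positions that are deleted in the query.
/// Both inputs are rows of the same pairwise alignment, with '-' as the gap character.
fn ref_to_qry_map(ref_aln: &str, qry_aln: &str) -> Vec<Option<usize>> {
    let mut map = Vec::new();
    let mut qry_pos = 0usize;
    for (r, q) in ref_aln.bytes().zip(qry_aln.bytes()) {
        if r != b'-' {
            // This column holds a reference base: record what it aligns to.
            map.push(if q != b'-' { Some(qry_pos) } else { None });
        }
        if q != b'-' {
            // This column holds a query base: advance the query coordinate.
            qry_pos += 1;
        }
    }
    map
}

/// Convert a half-open reference range [begin, end) into a half-open query range.
/// Returns `None` if the range is out of bounds or entirely deleted in the query.
fn ref_range_to_qry(map: &[Option<usize>], begin: usize, end: usize) -> Option<(usize, usize)> {
    let mut covered = map.get(begin..end)?.iter().flatten();
    let first = *covered.next()?;
    let last = covered.last().copied().unwrap_or(first);
    Some((first, last + 1))
}
```

In this sketch, query insertions that fall between two mapped positions end up inside the resulting query range, while deletions at the edges shrink it.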

Work items:

- Add `--output-annotation-tbl` to output a GenBank `.tbl` file with query annotations
- Add `--output-annotation-gff` to output a `.gff` file with query annotations
- Add an `annotation` field, containing a JSON dump of the internal `GeneMap` struct, to each entry of the `results` array in the JSON and NDJSON outputs (the schemas of these formats remain unstable, as before)
- In the GFF file, features for each sequence are written with the seqid column (column 1) set to the sequence name from the input FASTA file
- In the GenBank TBL file, each `>Feature` block header contains the sequence name from the input FASTA file
- The annotations are streamed to the GFF, TBL and NDJSON files as sequences are processed. The JSON output is written in one go, as before.
- Add web export
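
As a rough illustration of what one streamed GFF record looks like, the sketch below formats a single GFF3 feature line from a 0-based, half-open range. The helper `gff3_line` and its signature are hypothetical and not taken from `gff3_writer.rs`; the sketch also skips the percent-escaping of attribute values that a real writer would need.

```rust
/// Format one GFF3 feature line (hypothetical helper, for illustration only).
/// `begin..end` is a 0-based, half-open range on the query sequence.
fn gff3_line(
    seqid: &str,
    kind: &str,
    begin: usize,
    end: usize,
    strand: char,
    phase: Option<u8>,
    attrs: &[(&str, &str)],
) -> String {
    // A missing phase (e.g. for genes) is written as "." in column 8.
    let phase = phase.map_or(".".to_owned(), |p| p.to_string());
    // Column 9: semicolon-separated key=value pairs, in the order given.
    let attrs = attrs
        .iter()
        .map(|(k, v)| format!("{k}={v}"))
        .collect::<Vec<_>>()
        .join(";");
    // GFF3 coordinates are 1-based and inclusive, so [begin, end) becomes begin+1..=end.
    // Column 2 (source) is fixed to "nextclade"; column 6 (score) is left undefined.
    format!("{seqid}\tnextclade\t{kind}\t{}\t{end}\t.\t{strand}\t{phase}\t{attrs}", begin + 1)
}
```

Because each line depends only on one feature of one sequence, lines like this can be written as soon as a sequence is processed, which is what makes streaming possible for GFF, TBL and NDJSON.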

Deviations/particularities

- "source" column (column 2) in GFF is set to "nextclade"
- adds a seq_index attribute, to allow identifying query sequences by their index in the FASTA file, rather than by name only (which is unreliable in the presence of duplicate names)
- adds an is_reverse_complement attribute when --retry-reverse-complement results in reverse-complemented sequence output. (Note that the sequence name also contains a suffix in this case.)
- virtual genes and CDSes which were not present in the input reference annotation are written. Nextclade creates them when the input annotation contains only CDSes or only genes.
- the Name attribute is always added if not present. There are complex rules to deduce the optimal name when Nextclade parses the input GFF annotation. The result is added to the output attributes if there's no Name attribute yet.
- comments and pragmas from the reference annotation file are not output. Comments are not parsed at all, and are lost in the output. The sequence-region pragma is parsed, but outputting it will require some more work. The rest of the pragmas are lost.
- gene boundaries (which do not physically exist in a CDS-based world, but are nevertheless required in both the GFF and TBL files) are calculated as the min/max of the boundaries of child CDSes
- I think that codon_start in TBL is the same as phase in GFF, but not sure
- strand orientation:
  - GFF: CDS strand is written as is. Genes, to my understanding, don't have strand orientation, because they can consist of multiple CDS segments with different orientation. So I left genes with . instead of a particular strand value (even if ).
  - TBL: for CDS the range boundaries are correctly swapped if the strand is negative. For genes, similarly to GFF, the strandedness is undefined, so I always leave genes with forward orientation.
  - does any of this need to be adjusted if we reverse-complement the query when --retry-reverse-complement is passed?
- "translation" attribute is removed from the output, to avoid emitting reference translations into the query annotation

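The gene-boundary and strand rules above can be sketched as follows. This is an illustration under assumptions, not the writer code from this PR: the types `Strand` and `Segment` and all function names are hypothetical. The `codon_start_from_phase` helper reflects one reading of the two formats, in which GFF3 `phase` takes values 0, 1, 2 and TBL `codon_start` takes values 1, 2, 3, so the two would differ by one rather than being identical; treat that as an assumption to verify.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Strand {
    Forward,
    Reverse,
}

/// A CDS segment as a 0-based, half-open range on the query (hypothetical type).
#[derive(Clone, Copy, Debug)]
struct Segment {
    begin: usize,
    end: usize,
    strand: Strand,
}

/// Gene boundaries as the min/max over child CDS segments.
/// Returns `None` when the gene has no child segments.
fn gene_bounds(segments: &[Segment]) -> Option<(usize, usize)> {
    let begin = segments.iter().map(|s| s.begin).min()?;
    let end = segments.iter().map(|s| s.end).max()?;
    Some((begin, end))
}

/// TBL interval: 1-based and inclusive, with start and stop swapped
/// for features on the reverse strand.
fn tbl_range(seg: &Segment) -> (usize, usize) {
    let (start, stop) = (seg.begin + 1, seg.end);
    match seg.strand {
        Strand::Forward => (start, stop),
        Strand::Reverse => (stop, start),
    }
}

/// Assumed relation between GFF3 `phase` (0, 1, 2) and TBL `codon_start` (1, 2, 3).
fn codon_start_from_phase(phase: u8) -> u8 {
    phase + 1
}
```

Note that `gene_bounds` ignores strand entirely, which matches the choice above of leaving gene strandedness undefined.
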
Defects

currently the seqid column in GFF and the feature name in TBL contain the full sequence name line (id + desc). Nextclade traditionally passes this value through without parsing, and up until now we haven't dealt with restricted file formats which don't allow spaces. In order to fix this, we need to split the seq_name on the first space, either in the FASTA parser or when writing the GFF and TBL files.
empty or semi-empty files are written when there's no input annotation
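
A minimal sketch of the proposed fix for the first defect, assuming the sequence ID is everything before the first whitespace character of the FASTA header. The function name `seq_id` is hypothetical, and splitting on any whitespace rather than only on a literal space is an assumption.

```rust
/// Return the sequence ID: the part of a FASTA header line before the first whitespace.
/// An empty name yields an empty ID instead of panicking.
fn seq_id(seq_name: &str) -> &str {
    // `split` always yields at least one item, so `unwrap_or` is purely defensive.
    seq_name.split(char::is_whitespace).next().unwrap_or("")
}
```

Applying this only when writing GFF and TBL would keep the full name available for the other outputs.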

By using `IndexMap` (order of iteration is the same as order of insertion) instead of `HashMap` (order of iteration is non-deterministic) we ensure a consistent order of features and attributes in annotation structs on all levels of the hierarchy. This will be useful when we start outputting these features.
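
To show why insertion order matters for the output, here is a std-only stand-in for an insertion-ordered map. It is not `IndexMap` itself and not code from this PR: `OrderedAttrs` is a hypothetical type backed by a `Vec`, so lookups are linear, whereas `IndexMap` provides the same ordering guarantee with hashed lookup.

```rust
/// A tiny insertion-ordered key-value store (illustrative stand-in for `IndexMap`).
#[derive(Default)]
struct OrderedAttrs {
    entries: Vec<(String, String)>,
}

impl OrderedAttrs {
    /// Insert or overwrite a value; an existing key keeps its original position.
    fn insert(&mut self, key: &str, value: &str) {
        if let Some(i) = self.entries.iter().position(|(k, _)| k == key) {
            self.entries[i].1 = value.to_owned();
        } else {
            self.entries.push((key.to_owned(), value.to_owned()));
        }
    }

    /// Render as a GFF-style attribute string, in insertion order.
    fn to_gff_attributes(&self) -> String {
        self.entries
            .iter()
            .map(|(k, v)| format!("{k}={v}"))
            .collect::<Vec<_>>()
            .join(";")
    }
}
```

With a plain `HashMap`, the same inserts could render in a different order from run to run, which would make output files hard to diff.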

GFF: CDS strand is fixed in this commit. But genes, to my understanding, don't have strand orientation, because they can consist of multiple CDS segments with different orientation. So I left genes with `.` instead of a particular strand value. TBL: for CDS the range boundaries are correctly swapped, so no changes are needed. For genes, similarly to GFF, the strandedness is undefined, so I always leave genes with forward orientation.

ivan-aksamentov added 11 commits March 6, 2025 13:42

refactor: split out cds segment range extraction logic

f22c14b

feat: port tbl writer

cb1faf5

From #982

feat: add rough version of tbl and gff output

5f2128c

refactor: rename genbank tbl writer class for consistency with others

9e50d88

refactor: move gff conversion to gff module

f386d13

fix: ensure 1-based indices in gff and tbl

3839ba3

refactor: rename field gene_map_qry to annotation

b1451fa

It will show up in output json, so let's make it a bit clearer what this is

fix: ensure parent-child relations in gff output add Name attribute

bb0f2c5

feat: don't write annotation json if empty

3f653a8

feat: write gff header

0265cc9

ivan-aksamentov commented Mar 10, 2025

packages/nextclade/src/run/nextclade_run_one.rs (outdated, resolved)

ivan-aksamentov commented Mar 10, 2025

packages/nextclade/src/io/gff3_writer.rs (outdated, resolved)

ivan-aksamentov commented Mar 10, 2025

packages/nextclade/src/io/genbank_tbl.rs (outdated, resolved)

ivan-aksamentov commented Mar 10, 2025

packages/nextclade-cli/src/cli/nextclade_cli.rs (resolved)

ivan-aksamentov marked this pull request as ready for review March 10, 2025 02:13

rneher and others added 13 commits March 10, 2025 10:31

feat: add map from aln to qry and utility functions ref->aln->qry

c3d198f

test: add simple script to read nextclade.gff and translate sequences…

089b3d2

… from the queries

Merge pull request #1579 from nextstrain/feat/use-qry-coords

2031f47

Feat/use qry coords

feat: filter out "translation" attribute from gff and tbl outputs

4b51ff8

test: extend maps_example test for aln_qry_map

cda82b2

test: add test for coordinate mapping of cds ranges from ref to qry

f5f2897

lint: remove unused import

29f8183

fix: only output seq id rather than id+desc to gff and tbl

39563ba

refactor: deduplicate gff and tbl attribute calculation

19cd982

fix: clear codon_start attribute to avoid duplication

7e19290

refactor: fix comment

1097e36

feat: remove seq index padding in gene id

ca4cf82

feat: avoid panic on empty sequence name

32ecb7e

docs: update auto-generated cli reference docs

e9f85b5
