feat: annotation of data in query coordinates #1578
Open: ivan-aksamentov wants to merge 26 commits into master from feat/qry-annotation
By using `IndexMap` (order of iteration is the same as order of insertion) instead of `HashMap` (order of iteration is non-deterministic) we ensure a consistent order of features and attributes in annotation structs on all levels of the hierarchy. This will be useful when we start to output these features.
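A minimal sketch of why insertion order matters when attributes are serialized. It assumes a std-only stand-in type, `OrderedAttrs` (a hypothetical name, not part of Nextclade), that mimics the ordering behaviour of `indexmap::IndexMap`; the real code presumably uses the `indexmap` crate directly.

```rust
use std::collections::HashMap;

/// Minimal insertion-ordered map: a std-only stand-in for `indexmap::IndexMap`.
/// Hypothetical type, for illustration only.
#[derive(Debug, Default)]
struct OrderedAttrs {
    /// Key-value pairs in insertion order.
    entries: Vec<(String, String)>,
    /// Key -> position in `entries`.
    index: HashMap<String, usize>,
}

impl OrderedAttrs {
    /// Inserts or overwrites a value. An existing key keeps its original position.
    fn insert(&mut self, key: &str, value: &str) {
        match self.index.get(key) {
            Some(&i) => self.entries[i].1 = value.to_owned(),
            None => {
                self.index.insert(key.to_owned(), self.entries.len());
                self.entries.push((key.to_owned(), value.to_owned()));
            }
        }
    }

    /// Renders attributes as a `key=value;key=value` string, in insertion order.
    /// No escaping of special characters is attempted in this sketch.
    fn to_attribute_string(&self) -> String {
        self.entries
            .iter()
            .map(|(k, v)| format!("{k}={v}"))
            .collect::<Vec<_>>()
            .join(";")
    }
}
```

Because iteration follows insertion order, repeated runs over the same input produce byte-identical attribute strings, which is not guaranteed when iterating a `HashMap`.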
It will show up in output json, so let's make it a bit clearer what this is
ivan-aksamentov
commented
Mar 10, 2025
GFF: CDS strand is fixed in this commit. But genes, to my understanding, don't have strand orientation, because they can consist of multiple CDS segments with different orientation. So I left genes with `.` instead of a particular strand value. TBL: for CDS the range boundaries are correctly swapped, so no changes are needed. For genes, similarly to GFF, the strandedness is undefined, so I leave genes with forward orientation always.
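The strand handling described above could be sketched roughly as follows. The `Strand` enum and both function names are hypothetical, and the sketch assumes internal ranges are 0-based and half-open; the actual Nextclade types may differ. GFF3 carries strand in its own column, while the TBL format has no strand column and marks the minus strand by writing start greater than stop.

```rust
/// Strand of a feature. `Unknown` corresponds to the GFF3 `.` value.
/// Hypothetical type, for illustration only.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Strand {
    Forward,
    Reverse,
    Unknown,
}

/// Symbol for the GFF3 strand column (column 7).
fn gff_strand_symbol(strand: Strand) -> char {
    match strand {
        Strand::Forward => '+',
        Strand::Reverse => '-',
        Strand::Unknown => '.',
    }
}

/// Converts a 0-based half-open range `[begin, end)` (an assumption about the
/// internal representation) into 1-based inclusive TBL coordinates.
/// For the reverse strand the two boundaries are swapped, so start > stop.
/// `Unknown` is written like `Forward`, matching "genes are always forward".
fn tbl_range(begin: usize, end: usize, strand: Strand) -> (usize, usize) {
    let (start, stop) = (begin + 1, end);
    if strand == Strand::Reverse {
        (stop, start)
    } else {
        (start, stop)
    }
}
```

With these definitions, a reverse-strand CDS covering `[0, 9)` is written to TBL as `9 1`, and a gene with undefined strand gets `.` in GFF and `1 9` in TBL.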
… from the queries
Feat/use qry coords
Resolves #980
Resolves #1571
Supersedes #982 (closes #982)
The idea is to output genome annotation which is similar to the input reference annotation, but for query sequences.
This is done by cloning the reference annotation and adjusting coordinates and some other fields to match a query sequence. A stream of these annotations is written to the newly introduced GFF3 and TBL output files, as well as to the existing JSON/NDJSON files.
Work items:
- `--output-annotation-tbl` to output GenBank `.tbl` file with query annotations
- `--output-annotation-gff` to output `.gff` file with query annotations
- `annotation` field, containing a JSON dump of the internal `GeneMap` struct, in each entry of the `results` array in JSON and NDJSON outputs (the schemas of these formats are unstable, as before)
- `seqid` column (column 1) set to sequence name in the input fasta file
- `>Feature` block header contains sequence name in the input fasta file

Deviations/particularities
- "source" column (column 2) in GFF is set to "nextclade"
- adds `seq_index` attribute, to allow identifying query sequences by index in the fasta file, as opposed to by name only (unreliable in presence of duplicate names)
- adds `is_reverse_complement` attribute when `--retry-reverse-complement` results in reverse-complemented sequence output. (Note that the sequence name also contains a suffix in this case.)
- virtual genes and CDSes which were not present in the input reference annotation are being written. Nextclade creates them when the input annotation contains only CDSes or only genes.
- `Name` attribute is always added if not present. There are complex rules to deduce the optimal name when Nextclade parses the input gff annotation. The result is added to the output attributes if there's no `Name` attribute yet.
- comments and pragmas from the reference annotation file are not being output. Comments are not parsed at all, and are lost in the output. The sequence-region pragma is parsed, but will require some more work to output. The rest of the pragmas are lost.
- gene boundaries (not physically existing in the CDS-based world, but somehow required in both the GFF and TBL files) are calculated as the min/max of the boundaries of child CDSes
- I think that `codon_start` in TBL is the same as `phase` in GFF, but not sure
- strand orientation:
  - GFF: CDS strand is written as is. Genes, to my understanding, don't have strand orientation, because they can consist of multiple CDS segments with different orientation. So I left genes with `.` instead of a particular strand value (even if ).
  - TBL: for CDS the range boundaries are correctly swapped if the strand is negative. For genes, similarly to GFF, the strandedness is undefined, so I leave genes with forward orientation always.
  - does any of it need to be adjusted if we reverse-complement the query when `--retry-reverse-complement` is passed?
- "translation" attribute is removed from output, to avoid emitting reference translations into query annotation
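Two of the rules in the list above are small enough to sketch in code: the min/max gene boundary calculation and the relation between GFF `phase` and TBL `codon_start`. The `Range` type and both function names are hypothetical. On the second point, the GFF3 specification defines `phase` as 0, 1 or 2 (the number of bases to skip to reach the first complete codon), while the INSDC `codon_start` qualifier takes 1, 2 or 3 (the position of the first base of the first complete codon). Under those definitions the two fields carry the same information but differ numerically by one; whether this PR applies that offset is not stated here.

```rust
/// Half-open coordinate range. Hypothetical type, for illustration only.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Range {
    begin: usize,
    end: usize,
}

/// Gene boundaries computed as the minimum `begin` and maximum `end`
/// over all child CDS segments. Returns `None` when there are no segments.
fn gene_bounds(cds_segments: &[Range]) -> Option<Range> {
    let begin = cds_segments.iter().map(|r| r.begin).min()?;
    let end = cds_segments.iter().map(|r| r.end).max()?;
    Some(Range { begin, end })
}

/// Maps a GFF3 `phase` (0, 1 or 2) to a `codon_start` value (1, 2 or 3).
/// Returns `None` for values outside the valid phase range.
fn codon_start_from_phase(phase: u8) -> Option<u8> {
    (phase <= 2).then(|| phase + 1)
}
```

For example, CDS segments `[100, 200)` and `[50, 120)` give a gene spanning `[50, 200)`, and a phase of `2` corresponds to a `codon_start` of `3`.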
Defects
- currently the seqid column in gff and the feature name in tbl contain the full sequence name line (id + desc). Nextclade traditionally passes this value through without parsing, and up until now we haven't dealt with restricted file formats which don't allow spaces. In order to fix this, we need to split the `seq_name` on the first space, either in the fasta parser or when writing the gff and tbl files.
- empty or semi-empty files are written when there's no input annotation
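The proposed fix for the first defect might look roughly like this. The function name is hypothetical, and the sketch splits on any whitespace (including tabs) rather than strictly on the first space character, which is an assumption about what is wanted.

```rust
/// Returns the id part of a FASTA header: everything up to the first
/// whitespace character. The leading `>` is assumed to be already stripped.
/// Returns an empty string for an empty or whitespace-only header.
fn seqid_from_seq_name(seq_name: &str) -> &str {
    seq_name.split_whitespace().next().unwrap_or("")
}
```

Applying this only when writing the gff and tbl files would leave the full name untouched in the other outputs, whereas applying it in the fasta parser would change the name everywhere; the description above leaves that choice open.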