feat: use json string to store additional metadata in fasta headers #101

Merged: ivan-aksamentov merged 1 commit into rust from feat/json-metadata-in-fasta on Jan 3, 2025

@ivan-aksamentov ivan-aksamentov commented Jan 2, 2025

The proposal is to use stringified JSON to add any additional data to FASTA headers, as opposed to a home-grown format.

Advantages:

  • Easier to parse: the FASTA header can be split on the first space, and the second part can then be parsed as JSON to extract all metadata, which is basically 2 lines of code in most languages. The home-grown format would either need a dedicated parser or could not be parsed reliably at all, especially if it is undocumented.
  • Easily recognizable without documentation: "oh, that's JSON! I can take the string after the first space and parse it with json.loads()"
  • Output contains field names, which might improve clarity if the meaning of values is not immediately obvious
  • New fields can be easily added and removed without any additional thinking about new delimiters, positions etc.
  • Escaping and unescaping of special characters in values is handled by the JSON writer and reader, so the output should always be parseable.
  • We could even define a schema and document the format to be super-strict and clear! (but probably too much for this particular case)
  • JSON is a compromise between human- and machine-readability
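
The "split on first space, then parse as JSON" step from the first bullet could look roughly like this in Python. This is only a sketch: `parse_fasta_header` is a hypothetical helper name, not something from this PR, and it assumes the sequence id itself contains no spaces and that a header without a metadata part should yield an empty dict.

```python
import json


def parse_fasta_header(line: str) -> tuple[str, dict]:
    """Split a FASTA header line into (sequence id, metadata dict)."""
    # Drop the leading '>' and split once, on the first space only,
    # so spaces inside the JSON part are left untouched.
    seq_id, _, rest = line.rstrip("\n").removeprefix(">").partition(" ")
    # Assumption: an id-only header simply has no metadata.
    metadata = json.loads(rest) if rest else {}
    return seq_id, metadata


header = '>11571779012938514380 {"path_name":"pCAV1344-40","block_id":1705098846223677255,"start":4961,"end":9839,"strand":"+"}'
seq_id, meta = parse_fasta_header(header)
print(seq_id, meta["path_name"], meta["start"], meta["end"], meta["strand"])
```

The field names come back as dictionary keys, so the consumer does not need to know the position of each value.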

Disadvantages:

  • Need to double-check whether FASTA headers allow JSON characters
  • Output is longer than just ad-hoc values
  • Output contains field names, which might be excessive if the meaning is already clear
  • It might unintentionally create an API surface in places where we don't want to maintain a public stable interface (but we can also put a disclaimer that the format is unstable and all fields are optional).
  • JSON is a compromise between human- and machine-readability

Example outputs:

Before:

```
>11571779012938514380 pCAV1344-40-1705098846223677255 [4961-9839|+]
```

After:

```
>11571779012938514380 {"path_name":"pCAV1344-40","block_id":1705098846223677255,"start":4961,"end":9839,"strand":"+"}
```

Note that in this particular example, because the values are delimited with - and the path_name itself contains -, it is difficult or impossible to reliably extract the values from the "before" format string. Changing the delimiter won't help here, because path_name comes from the input FASTA and can contain any characters. The only reliable way to write such a string is to have a list of reserved characters and a strategy for escaping them.
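
To illustrate the escaping point, here is a minimal Python sketch of the writing side. `format_header` is a hypothetical helper, not the code in this PR, and it assumes the sequence id contains no spaces. The JSON writer escapes quotes and newlines, so a path_name with delimiter-like characters still round-trips and the header stays on one line.

```python
import json


def format_header(seq_id: str, path_name: str, block_id: int,
                  start: int, end: int, strand: str) -> str:
    """Build a FASTA header with metadata as compact JSON (hypothetical helper)."""
    meta = {"path_name": path_name, "block_id": block_id,
            "start": start, "end": end, "strand": strand}
    # Compact separators give the whitespace-free style of the "after" example.
    return f">{seq_id} " + json.dumps(meta, separators=(",", ":"))


# A naive split of the old format cannot tell which '-' is the delimiter.
pieces = "pCAV1344-40-1705098846223677255".split("-")
print(len(pieces))

# A path name full of characters the old format treats as syntax.
tricky = 'a-b "c"\n[1-2|+]'
line = format_header("11571779012938514380", tricky, 1705098846223677255, 4961, 9839, "+")

# Reading it back: split on the first space, then let the JSON reader unescape.
_, _, rest = line[1:].partition(" ")
restored = json.loads(rest)["path_name"]
print(restored == tricky)
```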


mmolari commented Jan 3, 2025

I like this proposal! Green light from me 🟢. I don't think there are any forbidden characters in FASTA descriptions.

@ivan-aksamentov ivan-aksamentov merged commit 1ee6b2c into rust Jan 3, 2025
13 of 14 checks passed
@ivan-aksamentov ivan-aksamentov deleted the feat/json-metadata-in-fasta branch January 3, 2025 09:45