determine how to handle converting between formats #498

Open
jonorthwash opened this issue Jul 19, 2022 · 2 comments
Labels
backend data management related to storage and transportation of trees and treebanks question

Comments

@jonorthwash
Owner

Currently there are some issues related to converting between formats.

One problem with formats is that converting between them is always lossy. Even between CoNLL-U and CG3, quite a bit is lost. For example, only CoNLL-U supports enhanced dependencies and a distinction between XPOS and UPOS tags, and CG3 and CoNLL-U handle subtokens differently (and store different information about them, I think?).
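
To make the loss concrete, here is a rough Python sketch (not notatrix code) that reports which columns of a CoNLL-U token would be dropped when projecting onto a format with fewer fields. The column names are the standard CoNLL-U ones; `TARGET_FIELDS`, `parse_token`, and `lost_fields` are hypothetical names made up for illustration, and the field set is not a claim about what CG3 actually supports.

```python
# CoNLL-U column names, in order, per the Universal Dependencies format.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]

# Hypothetical: the fields some other format can represent. Illustrative only.
TARGET_FIELDS = {"id", "form", "lemma", "upos", "feats", "head", "deprel"}


def parse_token(line):
    """Split one tab-separated CoNLL-U token line into a field dict."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(CONLLU_FIELDS):
        raise ValueError(
            f"expected {len(CONLLU_FIELDS)} columns, got {len(values)}"
        )
    return dict(zip(CONLLU_FIELDS, values))


def lost_fields(token, supported):
    """Return the non-empty fields that the conversion would drop."""
    return {k: v for k, v in token.items()
            if k not in supported and v != "_"}


line = "2\tdogs\tdog\tNOUN\tNNS\tNumber=Plur\t3\tnsubj\t3:nsubj|5:nsubj\t_"
token = parse_token(line)
dropped = lost_fields(token, TARGET_FIELDS)
print(dropped)  # {'xpos': 'NNS', 'deps': '3:nsubj|5:nsubj'}
```

A report like this could also feed the warning in option 2 below, so that the user sees exactly what they would lose.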

So suppose the user wants to edit the corpus in a different format, and we try to preserve the information that format can't represent in an underlying format. If they then change the number or position of tokens, or edit something tied to the non-visible information, that information could easily get lost, or at least lost track of.

We have a few options for how to deal with this:

  1. We could just leave it as is, where data loss always happens.
  2. We could make it harder to switch formats, or at least to switch formats and then edit in the new format. Perhaps make other formats view-only by default, and display a modal when the user tries to start editing in a format other than the one the corpus is "stored in" (or was originally in), along the lines of "You will lose data. Only proceed if you're okay with that!"
  3. We could keep track of the data that would be lost more carefully, instead of just replacing the stored corpus with the new format, so that it's only really lost if the user does something that disrupts a particular token or the ability to keep track of its associated data. This would require implementing a better "format-neutral" way of storing data than what is already in notatrix.
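
For option 3, here is a minimal Python sketch of the idea, under some big assumptions: every token gets a stable `uid` that survives a round trip through the other format's editor, and each format can be described as a set of supported field names. `Token`, `NeutralSentence`, `export`, and `merge_edits` are all hypothetical names, not the notatrix data model. Keeping that `uid` attached when the user edits raw text is exactly the hard part that this sketch glosses over.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    uid: int  # stable identity, independent of position in the sentence
    fields: dict = field(default_factory=dict)


class NeutralSentence:
    """Hypothetical format-neutral store; not the notatrix data model."""

    def __init__(self, tokens):
        self.tokens = {t.uid: t for t in tokens}

    def export(self, supported):
        """Project each token onto the fields the target format supports."""
        return [
            {"uid": t.uid,
             **{k: v for k, v in t.fields.items() if k in supported}}
            for t in self.tokens.values()
        ]

    def merge_edits(self, edited, supported):
        """Apply edits made in the target format.

        Fields outside `supported` survive for tokens whose uid comes back;
        tokens missing from `edited` are dropped with their hidden data.
        """
        merged = {}
        for row in edited:
            old = self.tokens.get(row["uid"])
            hidden = {} if old is None else {
                k: v for k, v in old.fields.items() if k not in supported
            }
            visible = {k: v for k, v in row.items()
                       if k != "uid" and k in supported}
            merged[row["uid"]] = Token(row["uid"], {**hidden, **visible})
        self.tokens = merged


supported = {"form", "upos"}
sent = NeutralSentence([
    Token(1, {"form": "dogs", "upos": "NOUN", "xpos": "NNS"}),
    Token(2, {"form": "bark", "upos": "VERB", "xpos": "VBP"}),
])
view = sent.export(supported)
view[0]["upos"] = "PROPN"  # edit a visible field on token 1
del view[1]                # delete token 2 in the other format
sent.merge_edits(view, supported)
```

In this run, token 1 keeps its hidden `xpos` value even though the other format never showed it, while token 2's hidden data goes away only because the user deleted that token.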

What is preferred? Other ideas?

@jonorthwash jonorthwash added question backend data management related to storage and transportation of trees and treebanks labels Jul 19, 2022
@jonorthwash
Owner Author

Note from @ftyers, @mr-martian, and @TinoDidriksen: Enhanced dependencies are possible in CG3 using relations.

@jonorthwash
Owner Author

@ftyers prefers 2 or 3. I suggest 3 as the end goal, with 2 as an easier short-term stop-gap for now.
