determine how to handle converting between formats #498

jonorthwash · 2022-07-19T03:45:08Z

Currently there are some issues related to converting between formats.

One problem is that converting between formats is always lossy. Even between CoNLL-U and CG3, quite a bit is lost. For example, only CoNLL-U supports enhanced dependencies and a distinction between UPOS and XPOS tags, and CG3 and CoNLL-U handle subtokens differently (and store different information about them, I think?).

So suppose the user wants to edit the corpus in a different format, and we try to preserve the information that isn't native to that format in an underlying format. If they then change the number or position of tokens, or change something that the non-visible information depends on, that information could easily get lost, or at least lost track of.

We have a few options for how to deal with this:

1. We could just leave it as is, where data loss always happens.
2. We could make it harder to switch formats, or at least to switch formats and then edit in the new format. Perhaps make other formats view-only by default, and display a modal when the user tries to start editing in a format other than the one the corpus is "stored in" (or was originally in), along the lines of "You will lose data. Only proceed if you're okay with that!"
3. We could keep more careful track of the data that would be lost, so that it's only really lost if the user does something that disrupts a particular token or the ability to keep track of its associated data, as opposed to just replacing the stored corpus with the new format. This would require implementing a better "format-neutral" way of storing data than what is already in notatrix.
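
To make option 3 a bit more concrete, here is a rough sketch of one way a format-neutral store could behave. This is not how notatrix works today. `Token`, `NeutralSentence`, and the field names (`xpos`, `tags`) are all hypothetical, and it's in Python only to keep the sketch short. The idea is that each token has a stable id and keeps a per-format bag of fields, and an edit made in one format only discards the other formats' data for tokens that were actually disrupted.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    uid: int    # stable identity, independent of position in the sentence
    form: str
    # format name -> fields that only that format can express (hypothetical layout)
    extras: dict = field(default_factory=dict)


class NeutralSentence:
    def __init__(self, tokens):
        self.tokens = list(tokens)

    def export(self, fmt):
        """Return (uid, form, fields) rows containing only what `fmt` can show."""
        return [(t.uid, t.form, dict(t.extras.get(fmt, {}))) for t in self.tokens]

    def apply_edit(self, fmt, edited_rows):
        """Merge rows edited in `fmt` back into the neutral store.

        A row whose uid and form are unchanged keeps the data held for other
        formats. A new token, or one whose form changed, counts as disrupted
        and keeps only the fields supplied by `fmt`.
        """
        by_uid = {t.uid: t for t in self.tokens}
        next_uid = max(by_uid, default=0) + 1
        merged = []
        for uid, form, fields in edited_rows:
            old = by_uid.get(uid)
            if old is not None and old.form == form:
                # Undisturbed token: carry over everything the editor could not see.
                extras = {k: v for k, v in old.extras.items() if k != fmt}
                extras[fmt] = dict(fields)
            else:
                # Disrupted or new token: hidden data can no longer be trusted.
                if old is None:
                    uid = next_uid
                    next_uid += 1
                extras = {fmt: dict(fields)}
            merged.append(Token(uid, form, extras))
        self.tokens = merged
```

With something like this, editing the tags of one token in a CG3 view would leave the CoNLL-U-only fields of that token alone, while changing a token's form would drop them for that token only. Splits, merges, and dependency heads that point at removed tokens would need more thought than this sketch gives them.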

What is preferred? Other ideas?

jonorthwash · 2022-07-21T05:12:37Z

Note from @ftyers, @mr-martian, and @TinoDidriksen: Enhanced dependencies are possible in CG3 using relations.

jonorthwash · 2022-07-21T05:13:28Z

@ftyers prefers 2 or 3. I suggest 3 as the end goal, with 2 as an easier short-term stop-gap.
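
For the stop-gap, the logic behind option 2 could be quite small. Below is a minimal sketch, assuming we track which format the corpus is stored in. `EditGuard` and its method names are made up for illustration and don't correspond to anything in the current code. Any format other than the stored one is view-only until the user explicitly accepts the data-loss warning, at which point that format becomes the stored one.

```python
class EditGuard:
    """Tracks which format may be edited without a data-loss warning."""

    def __init__(self, stored_format):
        self.stored_format = stored_format

    def can_edit(self, view_format):
        # Only the format the corpus is stored in is editable by default.
        return view_format == self.stored_format

    def needs_warning(self, view_format):
        # The UI would show the "You will lose data" modal when this is True.
        return not self.can_edit(view_format)

    def confirm_lossy_edit(self, view_format):
        # Called after the user accepts the modal: from now on the corpus is
        # stored in the new format, and the old format becomes view-only.
        self.stored_format = view_format
```

The modal itself and the actual conversion are left out. This only covers the decision of when to show the warning.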

jonorthwash added question backend data management related to storage and transportation of trees and treebanks labels Jul 19, 2022
