Compatibility with ACT R library #56

Open
lucientisserand opened this issue Feb 6, 2024 · 2 comments

Comments

@lucientisserand

Expected behaviour
An ELAN file exported by pympi should be readable by the Annotated Corpus Toolkit (ACT), or should be formatted like an original ELAN file.

Actual behaviour
The ELAN file exported by pympi cannot be processed by ACT unless it is first opened and saved again in ELAN.

System information

  • python version: 3.10
  • os: Linux Mint 21.3
  • are you up to date with the latest master?: yes 1.70.2

Additional context
I work with both pympi and Oliver Ehmer's Annotated Corpus Toolkit for R (ACT), which are two great pieces of code for linguists working with ELAN.
I noticed that ELAN files exported with pympi (with or without the "pretty" parameter) cannot be processed directly by ACT (see below).
However, they can be processed once the file has been opened and then saved in ELAN.
So I diffed pympi's fresh export against the copy overwritten by ELAN and found these two specific issues when importing the pympi file into ACT:

  1. The file is not loaded at all. This error is apparently caused by the EAF version declared in the xsi:noNamespaceSchemaLocation attribute (3.0 is loaded, 2.8 is not).
  2. If issue 1 is corrected (2.8 to 3.0), the file is loaded but ACT then cannot find the time values. It works, however, once the space character before the TIME_SLOT closing tag is removed.

Workaround found
If I bulk replace the version number (2.8 to 3.0) and bulk remove the space character before every closing XML tag, the file is successfully processed by ACT.
Since original ELAN files are not formatted this way, I thought it was more of a pympi issue than an ACT issue.
So maybe some slight modifications to the export would be welcome in pympi?
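
For anyone who wants to script the workaround, here is a minimal sketch of the two bulk replacements. It is not part of pympi or ACT. It assumes that the schema location ends in "EAFv2.8.xsd", that the root element carries VERSION="2.8", and that the space in question is the one written just before "/>" on empty elements such as TIME_SLOT. The function names are hypothetical. Note that this only changes the declared version; it does not check that the file really conforms to EAF 3.0.

```python
import re


def patch_eaf_text(text: str) -> str:
    """Apply the two textual changes from the workaround to EAF content."""
    # Assumption: the export declares version 2.8 in these two places.
    text = text.replace("EAFv2.8.xsd", "EAFv3.0.xsd")
    text = text.replace('VERSION="2.8"', 'VERSION="3.0"')
    # Assumption: the problematic space is the one before "/>" on empty elements.
    text = re.sub(r"[ \t]+/>", "/>", text)
    return text


def patch_eaf_file(src: str, dst: str) -> None:
    """Read the pympi export at `src` and write the patched copy to `dst`."""
    with open(src, encoding="utf-8") as f:
        patched = patch_eaf_text(f.read())
    with open(dst, "w", encoding="utf-8") as f:
        f.write(patched)
```

Usage would look like `patch_eaf_file("export.eaf", "export_act.eaf")`, where both file names are placeholders. Writing to a separate file keeps the original export untouched.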

Thank you for your work,
Lucien

@dopefishh
Owner

Failing on extra spacing before closing XML tags signals a fragile XML parser on ACT's side. However, I'm not opposed to generating stricter XML without this spacing, as it doesn't change the semantics of the file.
Increasing the version can be done, but we have to make sure that the generated file really is 3.0 compliant. Since the major version is increased, I assume there are some backwards incompatible changes between 2.x and 3.x.
The specification can be found here: https://www.mpi.nl/tools/elan/EAF_Annotation_Format_3.0_and_ELAN.pdf
In a previous issue we found that it probably is compatible though (#29).

So in short, yes please, I would be happy to accept merge requests for this.
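
To illustrate the point about spacing, here is a small sketch using only Python's standard xml.etree.ElementTree, independent of pympi and ACT. The TIME_SLOT attribute values are made up. It shows that the two spellings of an empty element parse to the same data, and that ElementTree itself writes the space before "/>" when serializing an empty element.

```python
import xml.etree.ElementTree as ET

# Two spellings of the same empty element (attribute values are illustrative).
with_space = '<TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0" />'
without_space = '<TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>'

a = ET.fromstring(with_space)
b = ET.fromstring(without_space)

# A conforming parser sees the same tag, attributes and (absent) text.
same = (a.tag, a.attrib, a.text) == (b.tag, b.attrib, b.text)

# Re-serializing with ElementTree yields the form with a space before "/>".
serialized = ET.tostring(a, encoding="unicode")

print(same)
print(serialized)
```

So removing the space is a purely cosmetic change for any conforming XML parser, which is why it should be safe to do on the writer's side.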

@lucientisserand
Author

Totally agree. When I have time, I will look into the differences between 2.8 and 3.0 before trying to propose something (I am still a beginner in Python, but learning by doing). I may also suggest that ACT handle the space character case, since it is valid XML syntax.
In the meantime, I hope some people find the workaround useful if they are blocked.
