Lexing does not appear to respect the declared order of lexer rules. #164

Open
NigelWSewell opened this issue Jul 16, 2023 · 3 comments
Labels
documentation Improvements or additions to documentation

Comments

@NigelWSewell

Description

While writing a JavaDoc extractor, I found that the lexer does not appear to follow the behaviour described in the documentation, which states:

The order in which terminal rules are defined is critical as the lexer will always return the first match.

In the first screenshot, the grammar extracts the correct text into the syntax tree, so the remaining task is to define some terminal rules that ignore everything else.

Screenshot from 2023-07-16 15-18-57

After adding the 'IGNORE' rule, the syntax tree no longer contains the earlier matches; the later 'IGNORE' rule wins instead.

Screenshot from 2023-07-16 15-19-25

This seems to contradict the documented statement about the order of terminal rules.

Grammar Used

grammar JavaDocExtractor

entry Model: (docs+=JDoc)*;

terminal JDoc: ('/**' -> '*/');

hidden terminal CR: '\r'+;
hidden terminal LF: '\n'+;
//hidden terminal IGNORE: /.+?/;

Test Input


/** foo 1 */
person John
person Jane

/* foo 2*/


Hello John!
Hello Jane!

/** foo 4*/
@msujew
Member

msujew commented Jul 16, 2023

@NigelWSewell It seems like the documentation skipped over the small detail that we move terminals that can potentially match whitespace characters to the front as a performance optimization. See here.

Note that unlike in Xtext, it's not recommended in Langium to have a catch-all terminal. Langium's underlying lexer implementation (Chevrotain) works quite differently from ANTLR and catch-all terminals will always lead to trouble (even if the order of tokens is correct). A catch-all token will always consume the rest of the input, as even making it non-greedy doesn't work.

Instead, lexer errors are dealt with on a diagnostics level, and unexpected characters are simply omitted from the token stream.
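
To illustrate the reordering described above, here is a minimal TypeScript sketch of a first-match lexer. It is a toy model, not Langium's or Chevrotain's actual code: the `Terminal` interface and the `canMatchWhitespace`, `moveWhitespaceFirst` and `lex` helpers are hypothetical names invented for this example, and the JDoc regular expression is an assumed approximation of `'/**' -> '*/'`.

```typescript
// Toy model of a first-match lexer. NOT Langium's or Chevrotain's real code.
interface Terminal {
    name: string;
    pattern: RegExp;
    hidden: boolean;
}

// Hypothetical check: can the pattern match a single whitespace character?
function canMatchWhitespace(pattern: RegExp): boolean {
    return [' ', '\t', '\r', '\n'].some(ws => pattern.test(ws));
}

// Assumed model of the optimization: whitespace-capable terminals go first,
// all other terminals keep their declared relative order.
function moveWhitespaceFirst(terminals: Terminal[]): Terminal[] {
    const front = terminals.filter(t => canMatchWhitespace(t.pattern));
    const rest = terminals.filter(t => !canMatchWhitespace(t.pattern));
    return [...front, ...rest];
}

// Returns the visible tokens; at each offset the first matching terminal wins.
function lex(input: string, terminals: Terminal[]): string[] {
    const visible: string[] = [];
    let offset = 0;
    while (offset < input.length) {
        const rest = input.slice(offset);
        let consumed = 0;
        for (const t of terminals) {
            const m = new RegExp('^(?:' + t.pattern.source + ')').exec(rest);
            if (m && m[0].length > 0) {
                if (!t.hidden) visible.push(t.name + ':' + m[0]);
                consumed = m[0].length;
                break;
            }
        }
        // Unexpected characters are skipped rather than tokenized.
        offset += consumed > 0 ? consumed : 1;
    }
    return visible;
}

// Terminals in the order they are declared in the grammar from this issue.
const terminals: Terminal[] = [
    { name: 'JDoc', pattern: /\/\*\*[\s\S]*?\*\//, hidden: false },
    { name: 'CR', pattern: /\r+/, hidden: true },
    { name: 'LF', pattern: /\n+/, hidden: true },
    { name: 'IGNORE', pattern: /.+?/, hidden: true },
];
const sample = '/** foo 1 */\nperson John\n/** foo 4*/';

const declaredOrder = lex(sample, terminals);
const reordered = lex(sample, moveWhitespaceFirst(terminals));
console.log(declaredOrder); // two JDoc tokens
console.log(reordered);     // empty: IGNORE matched first at every position
```

With the declared order, both JDoc comments survive. Once the whitespace-capable terminals are moved to the front (IGNORE counts as one, since `.` matches a space), IGNORE wins at every position and no JDoc token is produced, which matches the behaviour seen in the second screenshot.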

@NigelWSewell
Author

NigelWSewell commented Jul 16, 2023

@msujew That would explain the behaviour well enough.

Is there a workaround to this? Either:

  • A way of forcing strict declaration order.
  • Ignoring other syntax errors
  • A complete non-whitespace character set to catch other unwanted text.
  • Something else I've not thought of.

Either way I'm sure this is a question/mistake many people from ANTLR/XText will encounter so this can be a good opportunity to improve the documentation.

p.s.: Thanks for working on Sunday!

@msujew
Member

msujew commented Jul 16, 2023

Is there a workaround to this?

Not directly in the grammar, though you can override the DefaultTokenBuilder to prevent the behavior. We should probably add a flag to disable the optimization.

Either way I'm sure this is a question/mistake many people from ANTLR/XText will encounter so this can be a good opportunity to improve the documentation.

I assume so as well. We should probably mention that in the docs.

@msujew added the recipe and documentation labels and removed the recipe label on Jul 16, 2023