[discussion] Preprocessing requires a handwritten parser approach #22

Open
Lotes opened this issue Jan 22, 2025 · 1 comment
Labels
priority-medium Should be resolved in 1-3 sprints

Comments

@Lotes
Contributor

Lotes commented Jan 22, 2025

We think there is a fundamental issue with using Langium and its underlying library Chevrotain when it comes to implementing a preprocessor. Why?

  • Langium manages only one AST per document. But since we have a transformation step beforehand, we actually have two versions of each document: unexpanded and expanded.
  • The lexer/parser architecture of Langium does not allow efficient rescanning or reparsing of an internally changed document.
  • Also, things like skipping the next n lines with %SKIP are not easy to solve with Langium. There is no control down to the level of individual Chevrotain instructions; behavior is only controllable at the granularity of grammar constructs.
  • A similar thing goes for %activating and %deactivating variables: we do not have such an abstraction of control/template flow.
  • Code like the following is difficult to handle in Chevrotain:
%dcl A char;
%A = 'B';
dcl A%C fixed bin(31); /* becomes dcl BC fixed bin(31); */

It would be worthwhile to go for a hand-written parser for the preprocessor, which would allow us to control the flow and to ignore parts of the input line-wise.
Using Langium for the second phase (all macros expanded) should be less of a problem. But we would likely need to find a mapping between expanded and unexpanded code in order to add language server support for the preprocessor statements.
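
One way such a mapping could be represented is as a list of segment records, similar to a source map. The TypeScript sketch below is purely illustrative (ExpansionSegment and ExpansionMap are hypothetical names, not part of any existing code base); it assumes the hand-written preprocessor emits expanded text segment by segment and records where each segment came from:

// Hypothetical mapping between expanded (post-preprocessing) and
// unexpanded (original) document offsets, similar to a source map.
interface ExpansionSegment {
    expandedStart: number;   // offset in the expanded text
    expandedLength: number;  // length of the emitted text
    originalStart: number;   // offset of the source span that produced it
    originalLength: number;  // length of that original source span
}

class ExpansionMap {
    private segments: ExpansionSegment[] = [];

    record(segment: ExpansionSegment): void {
        this.segments.push(segment);
    }

    // Map an offset in the expanded text back to the original document.
    toOriginal(expandedOffset: number): number | undefined {
        const segment = this.segments.find(s =>
            expandedOffset >= s.expandedStart &&
            expandedOffset < s.expandedStart + s.expandedLength);
        if (!segment) {
            return undefined;
        }
        // Clamp to the original span, since the expansion may be longer or shorter.
        const delta = Math.min(expandedOffset - segment.expandedStart, segment.originalLength);
        return segment.originalStart + delta;
    }
}

With such a map, diagnostics and navigation results computed on the expanded text could be translated back to positions in the original document, including positions inside preprocessor statements.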

@msujew
Contributor

msujew commented Jan 22, 2025

Possible Implementation Approach

The PL/I compiler doesn’t use a traditional lexer/parser architecture. Instead, it goes through the input character-by-character, expanding and evaluating preprocessor code as needed.

The proposed solution should do the same: go through the input character by character, simulating the PL/I compiler during its preprocessing phase. A naive implementation of this approach would, however, lead to a few problems:

  • Completely evaluating all preprocessor code would result in those preprocessor symbols being lost, thereby preventing LSP navigation to those symbols.
  • Running everything on the evaluated text will lead to offset inconsistencies with the original text. As text is expanded or contracted, offsets will no longer be accurate. For example, dcl A%C fixed bin(31); from above becomes dcl BC fixed bin(31);, which is one character shorter, so every offset after the expansion shifts.

Instead, there are multiple ways to tackle this issue. This implementation approach outlines a way of incorporating Langium into the parsing process, reducing the amount of work required to support preprocessor directives in a first step. In a second step, we can completely replace Langium:

  1. Write a new Langium Lexer implementation from scratch (a rough sketch follows after this list). This implementation goes through the input character by character:
    1. If encountering a “normal” token (e.g. keyword, ID, etc.), yield the corresponding token instance
    2. If encountering a preprocessor directive (starting with %), execute/interpret the code within
    3. If encountering a preprocessor variable in normal code, expand the variable with its text:
      1. Within the context of the expansion, perform rescanning:
      2. This effectively means performing the three steps above (normal token / directive / variable) again
      3. Once the whole variable has been expanded, modify the resulting tokens to have the offset of the preprocessor variable and return those new tokens.
      4. Reset all offset data to the end of the processed variable and continue lexing
  2. The previous step yields a list of tokens for the expanded code, with the added benefit that the offsets have been kept in line with the original text.
  3. Parsing can continue as normal
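
The following is a rough TypeScript sketch of the scanning loop described in step 1 (all types and helper methods here are hypothetical placeholders, not existing Langium or Chevrotain APIs). Its only purpose is to show how tokens produced while rescanning an expansion can keep the offsets of the original preprocessor variable:

interface Token {
    image: string;
    startOffset: number; // always points into the ORIGINAL (unexpanded) text
    endOffset: number;
    type: string;
}

class PreprocessorLexer {
    // %dcl'd preprocessor variables and their current values
    private variables = new Map<string, string>();

    tokenize(text: string): Token[] {
        return this.scan(text);
    }

    // `pinned` carries the original offsets of a preprocessor variable while
    // its expansion is being rescanned (the three scanning steps applied recursively).
    private scan(text: string, pinned?: { start: number; end: number }): Token[] {
        const tokens: Token[] = [];
        let pos = 0;
        while (pos < text.length) {
            if (text[pos] === '%') {
                // Preprocessor directive: execute/interpret it (%dcl, %SKIP, assignments, ...)
                pos = this.executeDirective(text, pos);
                continue;
            }
            const token = this.scanNormalToken(text, pos);
            if (!token) {
                pos++; // skip whitespace/punctuation in this sketch
                continue;
            }
            pos = token.endOffset + 1;
            const expansion = this.variables.get(token.image);
            if (expansion !== undefined) {
                // Preprocessor variable in normal code: expand it and rescan the
                // expansion; all resulting tokens keep the variable's original offsets.
                const original = pinned ?? { start: token.startOffset, end: token.endOffset };
                tokens.push(...this.scan(expansion, original));
            } else {
                // Normal token: pin it to the variable's offsets if we are inside
                // an expansion, otherwise keep its own offsets.
                tokens.push(pinned ? { ...token, startOffset: pinned.start, endOffset: pinned.end } : token);
            }
        }
        return tokens;
    }

    private executeDirective(text: string, pos: number): number {
        // Placeholder: evaluate the directive (updating `variables`, skipping lines
        // for %SKIP, etc.) and return the position right after it.
        const end = text.indexOf(';', pos);
        return end === -1 ? text.length : end + 1;
    }

    private scanNormalToken(text: string, pos: number): Token | undefined {
        const match = /^[A-Za-z_][A-Za-z0-9_]*/.exec(text.slice(pos));
        if (!match) {
            return undefined;
        }
        return { image: match[0], startOffset: pos, endOffset: pos + match[0].length - 1, type: 'ID' };
    }
}

Parsing (step 3) can then run on these tokens directly, since their offsets already refer to the unexpanded document.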

@montymxb added the priority-medium (Should be resolved in 1-3 sprints) label on Jan 30, 2025