[discussion] Preprocessing requires a handwritten parser approach #22

Open
Lotes opened this issue Jan 22, 2025 · 1 comment
Labels
priority-medium Should be resolved in 1-3 sprints

Comments

@Lotes
Contributor

Lotes commented Jan 22, 2025

We think there is a fundamental issue with using Langium and its underlying library Chevrotain when it comes to implementing a preprocessor. Why?

  • Langium manages only one AST per document. But since we have a transformation step beforehand, we actually have two versions of each document: unexpanded and expanded.
  • The lexer/parser architecture of Langium does not allow efficient rescanning or reparsing of an internally changed document.
  • Also, things like skipping the next n lines with %SKIP are not easy to solve with Langium. There is no control down to the level of individual Chevrotain instructions; behavior is only controllable at the granularity of grammar constructs.
  • A similar thing goes for %activating and %deactivating variables: we do not have such an abstraction of control/template flow.
  • Code like the following is difficult to handle in Chevrotain:
%dcl A char;
%A = 'B';
dcl A%C fixed bin(31); /* becomes dcl BC fixed bin(31); */

It would be worthwhile to go for a hand-written parser for the preprocessor, which would allow us to control the flow and to ignore parts of the input line-wise.
Using Langium for the second phase (all macros expanded) should be less of a problem. But we would likely need to find a mapping between expanded and unexpanded code in order to add language server support for the preprocessor statements.
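
One way such a mapping could be represented is as a list of segment records, similar to a source map. The TypeScript sketch below is purely illustrative (ExpansionSegment and ExpansionMap are hypothetical names, not part of any existing code base); it assumes the hand-written preprocessor emits expanded text segment by segment and records where each segment came from:

// Hypothetical mapping between expanded (post-preprocessing) and
// unexpanded (original) document offsets, similar to a source map.
interface ExpansionSegment {
    expandedStart: number;   // offset in the expanded text
    expandedLength: number;  // length of the emitted text
    originalStart: number;   // offset of the source span that produced it
    originalLength: number;  // length of that original source span
}

class ExpansionMap {
    private segments: ExpansionSegment[] = [];

    record(segment: ExpansionSegment): void {
        this.segments.push(segment);
    }

    // Map an offset in the expanded text back to the original document.
    toOriginal(expandedOffset: number): number | undefined {
        const segment = this.segments.find(s =>
            expandedOffset >= s.expandedStart &&
            expandedOffset < s.expandedStart + s.expandedLength);
        if (!segment) {
            return undefined;
        }
        // Clamp to the original span, since the expansion may be longer or shorter.
        const delta = Math.min(expandedOffset - segment.expandedStart, segment.originalLength);
        return segment.originalStart + delta;
    }
}

With such a map, diagnostics and navigation results computed on the expanded text could be translated back to positions in the original document, including positions inside preprocessor statements.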

@msujew
Contributor

msujew commented Jan 22, 2025

Possible Implementation Approach

The PL/I compiler doesn’t use a traditional lexer/parser architecture. Instead, it goes through the input character-by-character, expanding and evaluating preprocessor code as needed.

The proposed solution should do the same: go through the input character by character, simulating the PL/I compiler during its preprocessing phase. A naive implementation of this approach would, however, lead to a few problems:

  • Completely evaluating all preprocessor code would result in those preprocessor symbols being lost, thereby preventing LSP navigation to those symbols.
  • Running everything on the evaluated text will lead to offset inconsistencies with the original text. As text is expanded or contracted, offsets will no longer be accurate. For example, dcl A%C fixed bin(31); from above becomes dcl BC fixed bin(31);, which is one character shorter, so every offset after the expansion shifts.

Instead, there are multiple ways to tackle this issue. This implementation approach outlines a way of incorporating Langium into the parsing process, reducing the amount of work required to support preprocessor directives in a first step. In a second step, we can completely replace Langium:

  1. Write a new Langium Lexer implementation from scratch (a rough sketch follows after this list). This implementation goes through the input character by character:
    1. If encountering a “normal” token (e.g. keyword, ID, etc.), yield the corresponding token instance
    2. If encountering a preprocessor directive (starting with %), execute/interpret the code within
    3. If encountering a preprocessor variable in normal code, expand the variable with its text:
      1. Within the context of the expansion, perform rescanning:
      2. This effectively means performing the three steps above (normal token / directive / variable) again
      3. Once the whole variable has been expanded, modify the resulting tokens to have the offset of the preprocessor variable and return those new tokens.
      4. Reset all offset data to the end of the processed variable and continue lexing
  2. The previous step yields a list of tokens for the expanded code, with the added benefit that the offsets have been kept in line with the original text.
  3. Parsing can continue as normal
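
The following is a rough TypeScript sketch of the scanning loop described in step 1 (all types and helper methods here are hypothetical placeholders, not existing Langium or Chevrotain APIs). Its only purpose is to show how tokens produced while rescanning an expansion can keep the offsets of the original preprocessor variable:

interface Token {
    image: string;
    startOffset: number; // always points into the ORIGINAL (unexpanded) text
    endOffset: number;
    type: string;
}

class PreprocessorLexer {
    // %dcl'd preprocessor variables and their current values
    private variables = new Map<string, string>();

    tokenize(text: string): Token[] {
        return this.scan(text);
    }

    // `pinned` carries the original offsets of a preprocessor variable while
    // its expansion is being rescanned (the three scanning steps applied recursively).
    private scan(text: string, pinned?: { start: number; end: number }): Token[] {
        const tokens: Token[] = [];
        let pos = 0;
        while (pos < text.length) {
            if (text[pos] === '%') {
                // Preprocessor directive: execute/interpret it (%dcl, %SKIP, assignments, ...)
                pos = this.executeDirective(text, pos);
                continue;
            }
            const token = this.scanNormalToken(text, pos);
            if (!token) {
                pos++; // skip whitespace/punctuation in this sketch
                continue;
            }
            pos = token.endOffset + 1;
            const expansion = this.variables.get(token.image);
            if (expansion !== undefined) {
                // Preprocessor variable in normal code: expand it and rescan the
                // expansion; all resulting tokens keep the variable's original offsets.
                const original = pinned ?? { start: token.startOffset, end: token.endOffset };
                tokens.push(...this.scan(expansion, original));
            } else {
                // Normal token: pin it to the variable's offsets if we are inside
                // an expansion, otherwise keep its own offsets.
                tokens.push(pinned ? { ...token, startOffset: pinned.start, endOffset: pinned.end } : token);
            }
        }
        return tokens;
    }

    private executeDirective(text: string, pos: number): number {
        // Placeholder: evaluate the directive (updating `variables`, skipping lines
        // for %SKIP, etc.) and return the position right after it.
        const end = text.indexOf(';', pos);
        return end === -1 ? text.length : end + 1;
    }

    private scanNormalToken(text: string, pos: number): Token | undefined {
        const match = /^[A-Za-z_][A-Za-z0-9_]*/.exec(text.slice(pos));
        if (!match) {
            return undefined;
        }
        return { image: match[0], startOffset: pos, endOffset: pos + match[0].length - 1, type: 'ID' };
    }
}

Parsing (step 3) can then run on these tokens directly, since their offsets already refer to the unexpanded document.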

@montymxb added the priority-medium (Should be resolved in 1-3 sprints) label on Jan 30, 2025