Removal of zero-width matches in generated tree-sitter grammar files, preliminary implementation #475
+163
−36
Tree-sitter has a known limitation: it does not handle symbols that can match zero-width (a.k.a. "empty") strings (documented here). The only exception seems to be the start symbol.
However, BNFC supports zero-width-matching symbols, and they are used in many examples and test cases (e.g. the list macros in BNFC). It would be impractical not to support them in the tree-sitter backend.
It is possible to eliminate all zero-width-matching symbols from a grammar. First, all symbols that can match zero-width strings need to be identified. These symbols can then be converted to valid tree-sitter symbols by removing the empty RHS branches from their rules. Next, all rules of the entire grammar need to be processed, wrapping every reference to the aforementioned symbols in `optional(..)`. Finally, since adding `optional(..)` can propagate zero-width matching to other symbols, this needs to be repeated until it converges on a fixed point or the zero-width matching has propagated to the start symbol.

This PR implements a preliminary version of the algorithm described above. The converging process is not implemented yet; currently, it only handles simple cases where there is a single layer of zero-width matching. This should be sufficient for most cases, such as the list-related internal macros provided in BNFC.
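The single-layer case can be sketched as follows. This is a minimal illustration in Python, not the code in this PR: the grammar representation (a dict from symbol name to a list of alternatives), the function names, and the `("optional", sym)` marker are all assumptions made for the example. It also assumes every affected symbol keeps at least one non-empty alternative.

```python
from typing import Dict, List, Set, Tuple, Union

# Hypothetical representation: each symbol maps to its alternatives, and each
# alternative is a list of symbol names. Names without rules are terminals.
Grammar = Dict[str, List[List[str]]]
Item = Union[str, Tuple[str, str]]  # "sym" or ("optional", "sym")


def nullable_symbols(grammar: Grammar) -> Set[str]:
    """Symbols that can match the empty string, found by fixed-point iteration."""
    nullable: Set[str] = set()
    changed = True
    while changed:
        changed = False
        for sym, alts in grammar.items():
            if sym in nullable:
                continue
            # all() over an empty alternative is True, so empty branches count.
            if any(all(s in nullable for s in alt) for alt in alts):
                nullable.add(sym)
                changed = True
    return nullable


def eliminate_one_layer(grammar: Grammar, start: str) -> Dict[str, List[List[Item]]]:
    """Drop empty branches and wrap references to those symbols in optional."""
    # Only symbols with a literal empty branch are handled (one layer).
    direct = {s for s, alts in grammar.items() if [] in alts and s != start}
    result: Dict[str, List[List[Item]]] = {}
    for sym, alts in grammar.items():
        new_alts: List[List[Item]] = []
        for alt in alts:
            if not alt and sym in direct:
                continue  # remove the empty branch
            new_alts.append([("optional", s) if s in direct else s for s in alt])
        result[sym] = new_alts
    return result


def newly_zero_width(result: Dict[str, List[List[Item]]], start: str) -> Set[str]:
    """Non-start symbols with an all-optional alternative; they need another pass."""
    return {
        sym
        for sym, alts in result.items()
        if sym != start and any(all(isinstance(i, tuple) for i in alt) for alt in alts)
    }


# A list rule in the style of BNFC's list macros: ListStm ::= (empty) | Stm ";" ListStm
example: Grammar = {
    "Program": [["ListStm"]],
    "ListStm": [[], ["Stm", ";", "ListStm"]],
    "Stm": [["ident"]],
}
rewritten = eliminate_one_layer(example, start="Program")
```

In this example the only symbol that becomes zero-width after the rewrite is the start symbol `Program`, so a single pass is enough. If a non-start rule such as `Block ::= ListStm` were present, `newly_zero_width` would report `Block`, which is the case the not-yet-implemented converging process would have to handle.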