-
-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unable to use tree sitter as parser for compiler #1631
Comments
Tree-sitter is designed for use with language tooling (highlighting, code folding, formatting, code navigation, etc.) rather than to be the backbone of a compiler or interpreter. Thus it is permissive and does not allow for error messages. |
That isn't going to cut it. You want a human-readable error string with a position. Ideally you want to be able to traverse the tree and get the missing nodes, so that you know exactly where the error happened, so you can do your best to resolve the rest of the parse tree and provide as many errors and warning as possible. As far as I know, there is no other parser generator available for rust which is GLR with parse error recovery and context sensitive keywords. |
Andrew Helwer ***@***.***> writes:
Tree-sitter is designed for use with language tooling (highlighting,
code folding, formatting, code navigation, etc.) rather than to be the
backbone of a compiler or interpreter. Thus it is permissive and does
not allow for error messages.
But one feature of language tooling is providing code diagnostics to
help the user fix syntax errors without having to run the full compiler.
LSP provides for that; most IDEs I've used support it (Android
Studio, Emacs).
So it would be useful for tree-sitter to provide some form of error
messages.
…--
-- Stephe
|
Considering the tree sitter runtime is written in C, I think it might be time for a new pure rust runtime with a better api. |
Usually tree-sitter is used as a fast first-pass parser before the slower actual parser comes up with the syntax errors. I think there are benefits to keeping tree-sitter simple so grammars stay easy to write. Most things that consume tree-sitter grammars download the C files and compile them locally, which simplifies the release process. Nearly everything has a C compiler installed. Less so for rust. The minimal runtime requirements are a feature. |
As far as I can make out, tree-sitter grammars are complete GLR grammars. What else would a "slower actual parser" need which is not in a tree-sitter grammar? Conversely what makes the tree-sitter parser inherently faster (apart from an optimized implementation)?
That is true, but it is also true that C is not particularly secure. Also, the tree sitter C files are not easy to understand. That's nothing to do with the C language. I suspect it's very optimized (at the expensive of readability). |
Most fully-featured language parsers have limited error recovery capabilities and probably not incremental parsing. Tree-sitter is fast because it has incremental parsing (only reparsing the parts of the file that have changed), usually fast enough to re-parse the file on every keystroke as the user edits the file. Tree-sitter grammars are mostly LR(1) and fall back to GLR in the case of LR(1) conflicts. Tree-sitter is also a syntax-only parser and does not have any capability for adding semantic analysis like variables not being defined or incorrect function arity; full parsers include these things with their error messages. Tree-sitter grammars are also permissive; it is intended that they accept syntax which is not fully correct, to facilitate ease of writing the parser & efficiency of parsing while satisfying their use case. The C file generated by tree-sitter defines a gigantic LR(1) state machine and lexer with multiple entry points. It is not really intended to be human readable but you can pick out the functionality if you have experience writing an external scanner for your tree-sitter grammar. |
This is a very nice feature of tree sitter, which would also be super helpful for a compiler, when it is being used in a code editor as a language server (solang can do this for example). You want the compiler to parse and do semantic analysis as fast as possible, so the developer gets fast feedback as they type code, even when their code is broken (which it is most of the time).
That is exactly the definition of GLR. Again something you want in "full parsers".
Tree sitter is a parser generator, a parser does not do semantic analysis. Again that's no different from any other parser generator.
The parser for a compiler should also accept syntax that is not fully correct, in order to give as many warnings/error messages as possible. For example if you have So you're describing a feature you want in "full parser" (as in parser + semantic analysis). Tree-sitter would be ideal in this situation too: unfortunately it cannot be used because of a few elementary features.
I mean this library https://github.com/tree-sitter/tree-sitter/tree/master/lib/src I think by full parser you mean a parser + semantic analysis. My hope is to use tree sitter for the parser. I still believe that with a few tweaks tree-sitter could be a fantastic parser generator for compilers, and it would a huge improvement on what's available currently (at least for rust, other languages too). |
Yeah, I'd still like to add support for generating error messages in Tree-sitter. I think it would "fit" in fine to the current library; it'd probably be a function that takes an |
@maxbrunsfeld I think that would be perfect, thanks! |
Max Brunsfeld ***@***.***> writes:
Yeah, I'd still like to add support for generating error messages in
Tree-sitter. I think it would "fit" in fine to the current library;
it'd probably be a function that takes an `ERROR` node and returns
some structured information about the location where the error was
first detected, and which *state* the parser was in when the error was
detected. From there, you could call another function to iterate over
all of the *valid* symbols in that parse state.
+1. That's what wisitoken does.
…--
-- Stephe
|
Are there any new thoughts on this issue? For my custom language, I'm currently maintaining implementations of lexers, parsers, syntax highlighters (textmate, monarch). When I heard about tree-sitter I had hoped I might be able to consolidate these efforts, but it is sounding like tree sitter is only appropriate for tooling, and less so for producing ASTs for compilers/interpreters. |
There's a lot of effort going into debating whether X tool is this thing or is that. The reality is, not a soul on this earth wants to write nor maintain more than 1 (one) parser for a given language. Tree Sitter has an excellent opportunity to be the tool which provides language authors the ability to base everything they need to off of one parser implementation, I fail to comprehend any reason Tree Sitter shouldn't fill that space. |
For whatever it's worth (sort of related, sort of tangential), we do use tree-sitter as a feed to our interpreter, LSP-based language server, etc. We make this work like this:
It's a bit inelegant, but it means we only have one parser to deal with. There are also a few ways we could simplify some things. we'll get to some day. None of this has fast re-parsing (using previously-parsed thing) but in our use case, we basically never have to parse a lot of code at once, so that's OK. so I think the path of:
is definitely doable, and just wanted to give my 2c there. My biggest gripe with this solution is that the parser isn't composable but that's kinda tangential here. Some honest related thoughts/reservations on our unconventional approach, though: darklang/dark#5259 (comment) |
I've attempted to use tree-sitter for the Solang Solidity Compiler, however I found it impossible to use because:
As fas as I can see, tree sitter has a great properties (GLR, great error recovery, context sensitive keywords) but it is not usable in its current form, unless you can assume that the input is already valid.
The text was updated successfully, but these errors were encountered: