Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unable to use tree sitter as parser for compiler #1631

Open
seanyoung opened this issue Feb 1, 2022 · 15 comments
Open

Unable to use tree sitter as parser for compiler #1631

seanyoung opened this issue Feb 1, 2022 · 15 comments

Comments

@seanyoung
Copy link

I've attempted to use tree-sitter for the Solang Solidity Compiler, however I found it impossible to use because:

  • When walking the parse tree, only named or visible nodes can found. missing nodes cannot, which makes it possible to find out what nodes are missing in an error node (needed for nice error message, more than just "parse error").
  • When there is a parse error, we need to know what tokens would be acceptable, so we can give a nice compiler message saying.

As fas as I can see, tree sitter has a great properties (GLR, great error recovery, context sensitive keywords) but it is not usable in its current form, unless you can assume that the input is already valid.

@ahelwer
Copy link
Contributor

ahelwer commented Feb 1, 2022

Tree-sitter is designed for use with language tooling (highlighting, code folding, formatting, code navigation, etc.) rather than to be the backbone of a compiler or interpreter. Thus it is permissive and does not allow for error messages.

@razzeee
Copy link
Contributor

razzeee commented Feb 1, 2022

#255 (comment)

@seanyoung
Copy link
Author

#255 (comment)

That isn't going to cut it. You want a human-readable error string with a position. Ideally you want to be able to traverse the tree and get the missing nodes, so that you know exactly where the error happened, so you can do your best to resolve the rest of the parse tree and provide as many errors and warning as possible.

As far as I know, there is no other parser generator available for rust which is GLR with parse error recovery and context sensitive keywords.

@stephe-ada-guru
Copy link

stephe-ada-guru commented Feb 2, 2022 via email

@seanyoung
Copy link
Author

Considering the tree sitter runtime is written in C, I think it might be time for a new pure rust runtime with a better api.

@ahelwer
Copy link
Contributor

ahelwer commented Feb 2, 2022

Usually tree-sitter is used as a fast first-pass parser before the slower actual parser comes up with the syntax errors. I think there are benefits to keeping tree-sitter simple so grammars stay easy to write.

Most things that consume tree-sitter grammars download the C files and compile them locally, which simplifies the release process. Nearly everything has a C compiler installed. Less so for rust. The minimal runtime requirements are a feature.

@seanyoung
Copy link
Author

Usually tree-sitter is used as a fast first-pass parser before the slower actual parser comes up with the syntax errors. I think there are benefits to keeping tree-sitter simple so grammars stay easy to write.

As far as I can make out, tree-sitter grammars are complete GLR grammars. What else would a "slower actual parser" need which is not in a tree-sitter grammar? Conversely what makes the tree-sitter parser inherently faster (apart from an optimized implementation)?

Most things that consume tree-sitter grammars download the C files and compile them locally, which simplifies the release process. Nearly everything has a C compiler installed. Less so for rust. The minimal runtime requirements are a feature.

That is true, but it is also true that C is not particularly secure. Also, the tree sitter C files are not easy to understand. That's nothing to do with the C language. I suspect it's very optimized (at the expensive of readability).

@ahelwer
Copy link
Contributor

ahelwer commented Feb 2, 2022

Most fully-featured language parsers have limited error recovery capabilities and probably not incremental parsing. Tree-sitter is fast because it has incremental parsing (only reparsing the parts of the file that have changed), usually fast enough to re-parse the file on every keystroke as the user edits the file. Tree-sitter grammars are mostly LR(1) and fall back to GLR in the case of LR(1) conflicts. Tree-sitter is also a syntax-only parser and does not have any capability for adding semantic analysis like variables not being defined or incorrect function arity; full parsers include these things with their error messages. Tree-sitter grammars are also permissive; it is intended that they accept syntax which is not fully correct, to facilitate ease of writing the parser & efficiency of parsing while satisfying their use case.

The C file generated by tree-sitter defines a gigantic LR(1) state machine and lexer with multiple entry points. It is not really intended to be human readable but you can pick out the functionality if you have experience writing an external scanner for your tree-sitter grammar.

@seanyoung
Copy link
Author

Most fully-featured language parsers have limited error recovery capabilities and probably not incremental parsing. Tree-sitter is fast because it has incremental parsing (only reparsing the parts of the file that have changed), usually fast enough to re-parse the file on every keystroke as the user edits the file.

This is a very nice feature of tree sitter, which would also be super helpful for a compiler, when it is being used in a code editor as a language server (solang can do this for example). You want the compiler to parse and do semantic analysis as fast as possible, so the developer gets fast feedback as they type code, even when their code is broken (which it is most of the time).

Tree-sitter grammars are mostly LR(1) and fall back to GLR in the case of LR(1) conflicts.

That is exactly the definition of GLR. Again something you want in "full parsers".

Tree-sitter is also a syntax-only parser and does not have any capability for adding semantic analysis like variables not being defined or incorrect function arity; full parsers include these things with their error messages.

Tree sitter is a parser generator, a parser does not do semantic analysis. Again that's no different from any other parser generator.

Tree-sitter grammars are also permissive; it is intended that they accept syntax which is not fully correct, to facilitate ease of writing the parser & efficiency of parsing while satisfying their use case.

The parser for a compiler should also accept syntax that is not fully correct, in order to give as many warnings/error messages as possible. For example if you have for while (;;) in one function you don't want the parser to bail out, because then you're missing all the warnings/errors for all your other functions.

So you're describing a feature you want in "full parser" (as in parser + semantic analysis). Tree-sitter would be ideal in this situation too: unfortunately it cannot be used because of a few elementary features.

The C file generated by tree-sitter defines a gigantic LR(1) state machine and lexer with multiple entry points. It is not really intended to be human readable but you can pick out the functionality if you have experience writing an external scanner for your tree-sitter grammar.

I mean this library https://github.com/tree-sitter/tree-sitter/tree/master/lib/src

I think by full parser you mean a parser + semantic analysis. My hope is to use tree sitter for the parser.

I still believe that with a few tweaks tree-sitter could be a fantastic parser generator for compilers, and it would a huge improvement on what's available currently (at least for rust, other languages too).

@maxbrunsfeld
Copy link
Contributor

Yeah, I'd still like to add support for generating error messages in Tree-sitter. I think it would "fit" in fine to the current library; it'd probably be a function that takes an ERROR node and returns some structured information about the location where the error was first detected, and which state the parser was in when the error was detected. From there, you could call another function to iterate over all of the valid symbols in that parse state.

@seanyoung
Copy link
Author

@maxbrunsfeld I think that would be perfect, thanks!

@stephe-ada-guru
Copy link

stephe-ada-guru commented Feb 4, 2022 via email

@mattfysh
Copy link

Are there any new thoughts on this issue?

For my custom language, I'm currently maintaining implementations of lexers, parsers, syntax highlighters (textmate, monarch). When I heard about tree-sitter I had hoped I might be able to consolidate these efforts, but it is sounding like tree sitter is only appropriate for tooling, and less so for producing ASTs for compilers/interpreters.

@WillsterJohnson
Copy link

There's a lot of effort going into debating whether X tool is this thing or is that.

The reality is, not a soul on this earth wants to write nor maintain more than 1 (one) parser for a given language. Tree Sitter has an excellent opportunity to be the tool which provides language authors the ability to base everything they need to off of one parser implementation, I fail to comprehend any reason Tree Sitter shouldn't fill that space.

@StachuDotNet
Copy link
Contributor

StachuDotNet commented Aug 5, 2024

For whatever it's worth (sort of related, sort of tangential), we do use tree-sitter as a feed to our interpreter, LSP-based language server, etc.

We make this work like this:

It's a bit inelegant, but it means we only have one parser to deal with. There are also a few ways we could simplify some things. we'll get to some day.

None of this has fast re-parsing (using previously-parsed thing) but in our use case, we basically never have to parse a lot of code at once, so that's OK.

so I think the path of:

  • parsing with tree-sitter
  • feeding that parsed Node/tree and mapping to AST stuff for 'general purpose'

is definitely doable, and just wanted to give my 2c there.

My biggest gripe with this solution is that the parser isn't composable but that's kinda tangential here.

Some honest related thoughts/reservations on our unconventional approach, though: darklang/dark#5259 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants