Unable to use tree sitter as parser for compiler #1631

seanyoung · 2022-02-01T10:23:42Z

I've attempted to use tree-sitter for the Solang Solidity Compiler, but I found it impossible to use because:

- When walking the parse tree, only named or visible nodes can be found. Missing nodes cannot, which makes it impossible to find out what nodes are missing in an error node (needed for a nice error message, more than just "parse error").
- When there is a parse error, we need to know what tokens would be acceptable, so we can give a nice compiler message saying what was expected.

As far as I can see, tree-sitter has great properties (GLR, great error recovery, context-sensitive keywords), but it is not usable in its current form unless you can assume that the input is already valid.

ahelwer · 2022-02-01T13:17:19Z

Tree-sitter is designed for use with language tooling (highlighting, code folding, formatting, code navigation, etc.) rather than to be the backbone of a compiler or interpreter. Thus it is permissive and does not allow for error messages.

razzeee · 2022-02-01T13:28:25Z

#255 (comment)

seanyoung · 2022-02-01T13:40:54Z

> #255 (comment)

That isn't going to cut it. You want a human-readable error string with a position. Ideally you want to be able to traverse the tree and get the missing nodes, so that you know exactly where the error happened and can do your best to resolve the rest of the parse tree and provide as many errors and warnings as possible.

As far as I know, there is no other parser generator available for Rust which is GLR with parse error recovery and context-sensitive keywords.

stephe-ada-guru · 2022-02-02T11:57:09Z

Andrew Helwer writes:

> Tree-sitter is designed for use with language tooling (highlighting, code folding, formatting, code navigation, etc.) rather than to be the backbone of a compiler or interpreter. Thus it is permissive and does not allow for error messages.

But one feature of language tooling is providing code diagnostics to help the user fix syntax errors without having to run the full compiler. LSP provides for that; most IDEs I've used support it (Android Studio, Emacs). So it would be useful for tree-sitter to provide some form of error messages.

…

-- -- Stephe

seanyoung · 2022-02-02T12:05:07Z

Considering the tree-sitter runtime is written in C, I think it might be time for a new pure Rust runtime with a better API.

ahelwer · 2022-02-02T13:25:06Z

Usually tree-sitter is used as a fast first-pass parser before the slower actual parser comes up with the syntax errors. I think there are benefits to keeping tree-sitter simple so grammars stay easy to write.

Most things that consume tree-sitter grammars download the C files and compile them locally, which simplifies the release process. Nearly everything has a C compiler installed. Less so for Rust. The minimal runtime requirements are a feature.

seanyoung · 2022-02-02T14:53:14Z

> Usually tree-sitter is used as a fast first-pass parser before the slower actual parser comes up with the syntax errors. I think there are benefits to keeping tree-sitter simple so grammars stay easy to write.

As far as I can make out, tree-sitter grammars are complete GLR grammars. What else would a "slower actual parser" need which is not in a tree-sitter grammar? Conversely, what makes the tree-sitter parser inherently faster (apart from an optimized implementation)?

> Most things that consume tree-sitter grammars download the C files and compile them locally, which simplifies the release process. Nearly everything has a C compiler installed. Less so for Rust. The minimal runtime requirements are a feature.

That is true, but it is also true that C is not particularly secure. Also, the tree-sitter C files are not easy to understand. That has nothing to do with the C language. I suspect it's very optimized (at the expense of readability).

ahelwer · 2022-02-02T15:10:44Z

Most fully-featured language parsers have limited error recovery capabilities and probably not incremental parsing. Tree-sitter is fast because it has incremental parsing (only reparsing the parts of the file that have changed), usually fast enough to re-parse the file on every keystroke as the user edits the file. Tree-sitter grammars are mostly LR(1) and fall back to GLR in the case of LR(1) conflicts. Tree-sitter is also a syntax-only parser and does not have any capability for adding semantic analysis like variables not being defined or incorrect function arity; full parsers include these things with their error messages. Tree-sitter grammars are also permissive; it is intended that they accept syntax which is not fully correct, to facilitate ease of writing the parser & efficiency of parsing while satisfying their use case.
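As a rough illustration of the "only reparse what changed" idea, here is a toy sketch in Rust. It is not tree-sitter's algorithm (which reuses subtrees of the previous syntax tree); it simply caches a per-line result and re-parses only lines whose text differs. `LineCache`, `parse_line`, and the token-count "parse result" are all hypothetical names invented for this example.

```rust
/// Toy stand-in for an expensive parse: counts whitespace-separated tokens.
fn parse_line(line: &str) -> usize {
    line.split_whitespace().count()
}

/// Caches one "parse result" per source line (hypothetical, for illustration only).
pub struct LineCache {
    lines: Vec<(String, usize)>,
}

impl LineCache {
    pub fn new() -> Self {
        LineCache { lines: Vec::new() }
    }

    /// Re-parses `source`, reusing the cached result for every line whose text
    /// is unchanged at the same index. Returns how many lines were re-parsed.
    pub fn update(&mut self, source: &str) -> usize {
        let mut reparsed = 0;
        let mut next = Vec::new();
        for (i, line) in source.lines().enumerate() {
            match self.lines.get(i) {
                // Unchanged line: reuse the previous result.
                Some((old, result)) if old == line => next.push((old.clone(), *result)),
                // New or edited line: parse it again.
                _ => {
                    reparsed += 1;
                    next.push((line.to_string(), parse_line(line)));
                }
            }
        }
        self.lines = next;
        reparsed
    }

    /// The cached result for each line, in order.
    pub fn token_counts(&self) -> Vec<usize> {
        self.lines.iter().map(|(_, n)| *n).collect()
    }
}
```

Editing one line of a two-line input and calling `update` again re-parses a single line; a real incremental parser achieves the same effect at the granularity of syntax subtrees rather than lines.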

The C file generated by tree-sitter defines a gigantic LR(1) state machine and lexer with multiple entry points. It is not really intended to be human readable but you can pick out the functionality if you have experience writing an external scanner for your tree-sitter grammar.

seanyoung · 2022-02-02T15:36:12Z

> Most fully-featured language parsers have limited error recovery capabilities and probably not incremental parsing. Tree-sitter is fast because it has incremental parsing (only reparsing the parts of the file that have changed), usually fast enough to re-parse the file on every keystroke as the user edits the file.

This is a very nice feature of tree-sitter, which would also be super helpful for a compiler when it is being used in a code editor as a language server (Solang can do this, for example). You want the compiler to parse and do semantic analysis as fast as possible, so the developer gets fast feedback as they type code, even when their code is broken (which it is most of the time).

> Tree-sitter grammars are mostly LR(1) and fall back to GLR in the case of LR(1) conflicts.

That is exactly the definition of GLR. Again something you want in "full parsers".

> Tree-sitter is also a syntax-only parser and does not have any capability for adding semantic analysis like variables not being defined or incorrect function arity; full parsers include these things with their error messages.

Tree-sitter is a parser generator; a parser does not do semantic analysis. Again, that's no different from any other parser generator.

> Tree-sitter grammars are also permissive; it is intended that they accept syntax which is not fully correct, to facilitate ease of writing the parser & efficiency of parsing while satisfying their use case.

The parser for a compiler should also accept syntax that is not fully correct, in order to give as many warnings/error messages as possible. For example, if you have `for while (;;)` in one function, you don't want the parser to bail out, because then you're missing all the warnings/errors for all your other functions.

So you're describing a feature you want in a "full parser" (as in parser + semantic analysis). Tree-sitter would be ideal in this situation too: unfortunately it cannot be used because of a few elementary features.
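To make the "don't bail out on the first error" point concrete, here is a minimal sketch in Rust using classic panic-mode recovery: on a syntax error the parser records a diagnostic, skips to just past the next `;`, and carries on. This is a swapped-in textbook technique, not how tree-sitter recovers, and the tiny `let <name> ;` grammar, `Diagnostic`, and `parse_with_recovery` are all made up for the example.

```rust
#[derive(Debug, PartialEq)]
pub struct Diagnostic {
    pub token_index: usize,
    pub message: String,
}

/// Parses one `let <name> ;` statement starting at `start`.
/// Returns the declared name and the index of the next token.
fn parse_statement(tokens: &[&str], start: usize) -> Result<(String, usize), Diagnostic> {
    let error = |index: usize, expected: &str| Diagnostic {
        token_index: index,
        message: format!(
            "expected {}, found {}",
            expected,
            tokens
                .get(index)
                .map_or("end of input".to_string(), |t| format!("`{}`", t))
        ),
    };
    if tokens.get(start) != Some(&"let") {
        return Err(error(start, "`let`"));
    }
    let name = match tokens.get(start + 1) {
        Some(t) if t.chars().all(|c| c.is_ascii_alphabetic()) => *t,
        _ => return Err(error(start + 1, "a name")),
    };
    if tokens.get(start + 2) != Some(&";") {
        return Err(error(start + 2, "`;`"));
    }
    Ok((name.to_string(), start + 3))
}

/// Parses every statement it can and collects all diagnostics instead of
/// stopping at the first one.
pub fn parse_with_recovery(source: &str) -> (Vec<String>, Vec<Diagnostic>) {
    let tokens: Vec<&str> = source.split_whitespace().collect();
    let mut names = Vec::new();
    let mut diagnostics = Vec::new();
    let mut pos = 0;

    while pos < tokens.len() {
        match parse_statement(&tokens, pos) {
            Ok((name, next)) => {
                names.push(name);
                pos = next;
            }
            Err(diag) => {
                // Panic-mode recovery: resume just past the next `;`
                // at or after the position where the error was detected.
                let from = diag.token_index.min(tokens.len());
                pos = match tokens[from..].iter().position(|t| *t == ";") {
                    Some(offset) => from + offset + 1,
                    None => tokens.len(),
                };
                diagnostics.push(diag);
            }
        }
    }
    (names, diagnostics)
}
```

With the input `let a ; let ; let b ; while c ; let d ;`, this reports two diagnostics and still returns the three valid declarations, which is the behaviour a compiler front end wants.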

> The C file generated by tree-sitter defines a gigantic LR(1) state machine and lexer with multiple entry points. It is not really intended to be human readable but you can pick out the functionality if you have experience writing an external scanner for your tree-sitter grammar.

I mean this library https://github.com/tree-sitter/tree-sitter/tree/master/lib/src

I think by full parser you mean a parser + semantic analysis. My hope is to use tree sitter for the parser.

I still believe that with a few tweaks tree-sitter could be a fantastic parser generator for compilers, and it would be a huge improvement on what's available currently (at least for Rust, other languages too).

maxbrunsfeld · 2022-02-02T17:18:08Z

Yeah, I'd still like to add support for generating error messages in Tree-sitter. I think it would "fit" in fine to the current library; it'd probably be a function that takes an ERROR node and returns some structured information about the location where the error was first detected, and which state the parser was in when the error was detected. From there, you could call another function to iterate over all of the valid symbols in that parse state.
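The shape of such an API might look roughly like the following Rust sketch. Everything here is hypothetical: `ErrorInfo`, `ParseTable`, `valid_symbols`, and `describe_error` are invented names backed by a toy table, not functions that tree-sitter provides. The sketch only shows how "location + parse state" and "iterate over the valid symbols in that state" combine into a readable message.

```rust
use std::collections::BTreeMap;

/// Hypothetical structured error information: where the error was first
/// detected and which state the parser was in at that point.
#[derive(Debug, Clone, PartialEq)]
pub struct ErrorInfo {
    pub row: usize,
    pub column: usize,
    pub parse_state: u16,
}

/// Toy stand-in for a parse table: maps a state to the symbols valid there.
pub struct ParseTable {
    valid: BTreeMap<u16, Vec<&'static str>>,
}

impl ParseTable {
    pub fn new() -> Self {
        ParseTable { valid: BTreeMap::new() }
    }

    pub fn add_state(&mut self, state: u16, symbols: &[&'static str]) {
        self.valid.insert(state, symbols.to_vec());
    }

    /// Iterates over every symbol that would have been accepted in `state`.
    /// Unknown states yield an empty iterator.
    pub fn valid_symbols(&self, state: u16) -> impl Iterator<Item = &'static str> + '_ {
        self.valid.get(&state).into_iter().flatten().copied()
    }
}

/// Builds a compiler-style message from the error info and the table.
/// Rows and columns are stored zero-based and printed one-based.
pub fn describe_error(table: &ParseTable, info: &ErrorInfo, found: &str) -> String {
    let expected: Vec<&str> = table.valid_symbols(info.parse_state).collect();
    format!(
        "{}:{}: unexpected `{}`, expected one of: {}",
        info.row + 1,
        info.column + 1,
        found,
        expected.join(", ")
    )
}
```

A caller would obtain an `ErrorInfo` from an error node, then format it with `describe_error`, giving the "what tokens would be acceptable" message requested at the top of the thread.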

seanyoung · 2022-02-03T10:02:24Z

@maxbrunsfeld I think that would be perfect, thanks!

stephe-ada-guru · 2022-02-04T17:35:53Z

Max Brunsfeld writes:

> Yeah, I'd still like to add support for generating error messages in Tree-sitter. I think it would "fit" in fine to the current library; it'd probably be a function that takes an `ERROR` node and returns some structured information about the location where the error was first detected, and which *state* the parser was in when the error was detected. From there, you could call another function to iterate over all of the *valid* symbols in that parse state.

+1. That's what wisitoken does.

…

-- -- Stephe

mattfysh · 2023-07-23T09:20:35Z

Are there any new thoughts on this issue?

For my custom language, I'm currently maintaining implementations of lexers, parsers, syntax highlighters (textmate, monarch). When I heard about tree-sitter I had hoped I might be able to consolidate these efforts, but it is sounding like tree sitter is only appropriate for tooling, and less so for producing ASTs for compilers/interpreters.

WillsterJohnson · 2023-10-25T14:50:12Z

There's a lot of effort going into debating whether X tool is this thing or is that.

The reality is, not a soul on this earth wants to write or maintain more than 1 (one) parser for a given language. Tree-sitter has an excellent opportunity to be the tool which gives language authors the ability to base everything they need on one parser implementation. I fail to comprehend any reason Tree-sitter shouldn't fill that space.

StachuDotNet · 2024-08-05T15:36:03Z

For whatever it's worth (sort of related, sort of tangential), we do use tree-sitter as a feed to our interpreter, LSP-based language server, etc.

We make this work like this:

- we have our grammar (https://github.com/darklang/dark/blob/main/tree-sitter-darklang/grammar.js)
- we wrote a bit of code that maps a tree-sitter node to a simplified view of that node (https://github.com/darklang/dark/blob/main/packages/darklang/languageTools/parser/core.dark#L5-L27 for simple ParsedNode def, https://github.com/darklang/dark/blob/main/backend/src/BuiltinExecution/Libs/Parser.fs#L24-L108)
- some code maps the ParsedNode to a properly-typed AST thing (we call that WrittenTypes), heavily relying on typ in that^ (https://github.com/darklang/dark/tree/main/packages/darklang/languageTools/parser throughout here)
- semantic tokenization based on the WrittenTypes stuff (this turns out to be really easy -- https://github.com/darklang/dark/blob/main/packages/darklang/languageTools/lsp-server/semanticTokens.dark)
- (for interpreter, etc) we map from WrittenTypes to a ProgramTypes for our 'real' AST (this includes name-resolution etc)
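The middle steps of that pipeline (untyped node view to typed AST) can be sketched in Rust roughly as follows. This is not Darklang's code: only the names `ParsedNode` and `typ` come from the description above, while the `text` and `children` fields, the `Expr` enum, the node type strings, and `to_expr` are assumptions made for the example.

```rust
/// Simplified, untyped view of a syntax node. Only `typ` is named in the
/// description above; the other fields are assumptions for this sketch.
#[derive(Debug, Clone)]
pub struct ParsedNode {
    pub typ: String,
    pub text: String,
    pub children: Vec<ParsedNode>,
}

impl ParsedNode {
    pub fn leaf(typ: &str, text: &str) -> Self {
        ParsedNode { typ: typ.to_string(), text: text.to_string(), children: Vec::new() }
    }

    pub fn branch(typ: &str, children: Vec<ParsedNode>) -> Self {
        ParsedNode { typ: typ.to_string(), text: String::new(), children }
    }
}

/// Hypothetical properly-typed AST for a tiny expression language.
#[derive(Debug, PartialEq)]
pub enum Expr {
    Int(i64),
    Add(Box<Expr>, Box<Expr>),
}

/// Maps the untyped node to the typed AST by dispatching on `typ`.
/// Anything unexpected (including error nodes) becomes an `Err`.
pub fn to_expr(node: &ParsedNode) -> Result<Expr, String> {
    match node.typ.as_str() {
        "int_literal" => node
            .text
            .parse::<i64>()
            .map(Expr::Int)
            .map_err(|e| format!("bad integer `{}`: {}", node.text, e)),
        "add" => match node.children.as_slice() {
            [lhs, rhs] => Ok(Expr::Add(Box::new(to_expr(lhs)?), Box::new(to_expr(rhs)?))),
            other => Err(format!("`add` expects 2 children, found {}", other.len())),
        },
        other => Err(format!("unknown node type `{}`", other)),
    }
}
```

Dispatching on the node's type string is what keeps the grammar as the single source of truth: later stages (semantic tokens, name resolution) only ever see the typed `Expr`.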

It's a bit inelegant, but it means we only have one parser to deal with. There are also a few ways we could simplify some things, which we'll get to some day.

None of this has fast re-parsing (reusing the previous parse), but in our use case we basically never have to parse a lot of code at once, so that's OK.

So I think the path of:

- parsing with tree-sitter
- feeding that parsed Node/tree and mapping to AST stuff for 'general purpose'

is definitely doable, and just wanted to give my 2c there.

My biggest gripe with this solution is that the parser isn't composable but that's kinda tangential here.

Some honest related thoughts/reservations on our unconventional approach, though: darklang/dark#5259 (comment)

milahu mentioned this issue Mar 18, 2022: Note on possible lossless parsers tweag/nickel#658 (Merged)

mtiller mentioned this issue Apr 1, 2022: Partial validation through highlighting Menci/monaco-tree-sitter#14 (Open)

ahlinc mentioned this issue Jul 17, 2023: Api extensions: previous sibling, last child, lookahead iterator #2324 (Merged)

archseer mentioned this issue Jul 31, 2023: Guile Scheme support 6cdh/tree-sitter-scheme#6 (Closed)
