Added Shrubbery parser as a sample #110

EliasC · 2024-04-26T17:01:14Z

This PR adds a sample that parses Shrubbery Notation. It complements the infix example by having a slightly more involved parsing pass (whereas the three rewriting passes are quite straightforward). In particular, it gives an example of handling indentation sensitivity. There is a README that gives an overview of Shrubbery parsing and the implementation. The source code has explanatory comments, but assumes that the reader has done the infix tutorial and therefore understands the basics of parsing and pattern matching.

The implementation relies on #90.

EliasC · 2024-04-26T17:06:08Z

@microsoft-github-policy-service agree

matajoh · 2024-06-03T08:14:36Z

Small question: where do the test files come from? If they have been copied in from another repo, could we potentially clone that repo as part of the build. to avoid having all of these in the repo? Not a big deal if not, just curious. You can see an example of what I'm talking about in the parser directory:

if(TRIESTE_BUILD_PARSER_TESTS)
    enable_testing()
    add_subdirectory(test)

    if(NOT IS_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}/JSONTestSuite)
    execute_process(COMMAND ${GIT_EXECUTABLE} clone --depth=1 https://github.com/nst/JSONTestSuite
                    WORKING_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}
                    OUTPUT_QUIET)
    endif()

    if(NOT IS_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}/yaml-test-suite)
    execute_process(COMMAND ${GIT_EXECUTABLE} clone --depth=1 --branch data-2022-01-17 https://github.com/yaml/yaml-test-suite
                    WORKING_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}
                    OUTPUT_QUIET)
    endif()
endif()

It would also be neat for these tests to be added either via a test driver like json_test.cc or by adding them in CMake:

add_test(NAME json_trieste COMMAND json_trieste test -f WORKING_DIRECTORY $<TARGET_FILE_DIR:json_trieste>)

Anything to have more robust CI :)

matajoh

Most of these comments are non-blocking, i.e., more suggestions for improvements that could be done in the future.

I would like (if possible) to not host the test files in this repo, though.

matajoh · 2024-06-03T08:26:08Z

samples/shrubbery/parse.cc

+  // on the previous line. When parsing the program above, the Indent will be
+  // ((0,0), 2), indicating that the established indentation level is line 0
+  // colum 0, and that this line is continued on column 2.
+  struct Indent {


As a general note on Trieste usage, there are two strategies that may make your life considerably easier as regards meaningful whitespace. The first, and perhaps simplest, is to make the parser even dumber and have it explicitly capture whitespace as Whitespace and NewLine tokens, i.e.:

R"(\r?\n)" >> [](auto& m){ m.add(NewLine); } R"([ \t]+)" >> [](auto& m){ m.add(Whitespace); }

You can then defer decisions about indents etc. to a dedicated pass in Trieste, where you'll have much more flexibility and context to make your decisions. This works rather well. Another option is to use much the same pattern, but to use pattern matching to measure the indent explicitly while processing the end of the previous line, like so:

R"(\r?\n([ \t]*))" >> [](auto& m){ size_t indent = m.match(1).len; // your indent logic }

This lookahead can save a bunch of bookkeeping (you can preemptively push or pop indent blocks).

I was thinking about a version with the explicit whitespace tokens, but tried writing the Shrubbery parser as a way to stress the parser module more than the existing samples (when this was developed, infix was the only other sample in the repo). It is also nice how, modulo whitespace sensitivity, the Trieste parser sort of looks like a generalized shrubbery parser. The second option does look a lot nicer than what I am doing right now though, so I will give that a spin!

matajoh · 2024-06-03T08:29:23Z

samples/shrubbery/shrubbery.h

@@ -0,0 +1,26 @@
+#pragma once
+
+#include <trieste/driver.h>


Best practice is now to include trieste/trieste.h.

matajoh · 2024-06-03T08:30:22Z

samples/shrubbery/shrubbery.h

+  inline const auto Id = TokenDef("id");
+  inline const auto Lhs = TokenDef("lhs");
+  inline const auto Rhs = TokenDef("rhs");
+


Best practice is to export your final WF definition here, as otherwise people will not be able to easily use the ASTs you produce downstream.

matajoh · 2024-06-03T08:32:46Z

samples/shrubbery/reader.cc

+    return {
+      "merge alternatives",
+      wf_alternatives,
+      dir::topdown,


I think this could be turned into a dir::once pass by adding ++ to the captures below? I'll give an example of what I mean.

matajoh · 2024-06-03T08:34:51Z

samples/shrubbery/reader.cc

+          [](Match& _) { return Alt << (Block << *_[Alt]); },
+
+        // Merge sequence of alternatives into a single alternative
+        (T(Alt)[Alt] * (T(Alt) << T(Block)[Block])) >>


Potential alternative to allow dir::once?

(T(Alt)[Alt] * (T(Alt) << T(Block)))++[Group] >> [](Match& _) { Node alt = _(Alt); for(auto& node : _[Group]){ alt << node->front(); } return alt; }

(NB this requires my PR which changes NodeRange to a span)

Having foreach loops for NodeRange looks very nice! Are the parentheses in the pattern off by one? Reading the code, I would have guessed that the Group capture should not include the Alt capture, but only the subsequent nodes (so that we can add their blocks under the alt node in the effect.

matajoh · 2024-06-03T08:35:38Z

samples/shrubbery/reader.cc

+    return {
+      "drop separators",
+      wf_no_semis_or_commas,
+      dir::topdown,


This can be a dir::bottomup|dir::once pass.

Are dir::bottomup passes more efficient than dir::topdown passes? I can see how dir::once is preferrable.

matajoh · 2024-06-03T08:36:00Z

samples/shrubbery/reader.cc

+    return {
+      "check parsing",
+      wf_check_parser,
+      dir::topdown,


This can be a dir::bottomup|dir::once pass I believe

EliasC · 2024-06-03T10:11:36Z

Small question: where do the test files come from? If they have been copied in from another repo, could we potentially clone that repo as part of the build. to avoid having all of these in the repo?

I wrote them myself. Some of them come from the documentation of shrubbery notation. I have been looking for an existing test suite without success (also there would need to be a filtering process since I don't support all of Shrubbery notation).

matajoh · 2024-06-03T13:53:40Z

Small question: where do the test files come from? If they have been copied in from another repo, could we potentially clone that repo as part of the build. to avoid having all of these in the repo?

I wrote them myself. Some of them come from the documentation of shrubbery notation. I have been looking for an existing test suite without success (also there would need to be a filtering process since I don't support all of Shrubbery notation).

OK in that case I suppose we can put them here for now. Maybe one potential improvement down the line would be to put them in a YAML file and write a test driver to process them.

EliasC · 2024-06-27T09:10:58Z

@matajoh I have addressed most of your comments now.

Every rewrite pass is now bottomup | once and generally looks nicer.
I removed all the examples for now and will keep them around locally until I figure out a way to add them in a nice way or find an external test suite. They are quite weak as test cases as you still need to manually look at the produced trees to see if they did the right thing.
I tried experimenting with your suggested solutions for whitespace, but didn't figure out a good to do it. The problem with counting whitespace after newlines is that a line could start with spaces but continue with a comment (or nothing) meaning it shouldn't be counted. Since there is no lookahead in the parsing step, I don't know whether to check indentation levels until I actually see a term that should be added. For adding explicit whitespace tokens, it becomes quite hairy for input like
```
one: two: three: four:
     two again
```
where two again should belong to the same block as two. It's doable, but requires non-trivial logic that in the end isn't a lot cleaner than what I'm doing now.

I also added a final rewriting pass that takes the tree to a shape that matches the grammar of shrubbery notation (this is presented in shrubbery.h). Later it might make sense to add a writer that outputs the parsed program as an S-expression (or indeed as shrubbery notation), but for now I think it works as an example to learn from.

mjp41 · 2024-09-16T14:16:02Z

samples/infix/.DS_Store

Don't commit this file.

These changes have been addressed.

matajoh previously requested changes Jun 3, 2024

View reviewed changes

EliasC force-pushed the sample/shrubbery branch from f2ec8fc to fd37185 Compare June 27, 2024 08:49

EliasC added 12 commits September 18, 2024 14:51

Added shrubbery sample

d87c3cd

Clarifications about operators

82752b1

Added disclaimer

947fd88

Minor cleanup

fd94e54

Fix CMakeList

6e5daac

Removed unused wf specification

4a2fa12

Plans for the future

d11a656

Final group structure

2468858

Remove unneccesary passes and fix bug

ca00ab6

Minor cleanup

b04c170

Remove examples

c168c2d

Remove silly .DS_Store

bf5b1c2

EliasC force-pushed the sample/shrubbery branch from c32055d to bf5b1c2 Compare September 18, 2024 12:52

mjp41 approved these changes Sep 23, 2024

View reviewed changes

samples/infix/.DS_Store Outdated

Copy link

Member

mjp41 Sep 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't commit this file.

mjp41 merged commit f5874e6 into microsoft:main Sep 23, 2024
21 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added Shrubbery parser as a sample #110

Added Shrubbery parser as a sample #110

EliasC commented Apr 26, 2024 •

edited

Loading

EliasC commented Apr 26, 2024

matajoh commented Jun 3, 2024

matajoh left a comment

matajoh Jun 3, 2024

EliasC Jun 3, 2024

matajoh Jun 3, 2024

matajoh Jun 3, 2024

matajoh Jun 3, 2024

matajoh Jun 3, 2024

matajoh Jun 3, 2024

EliasC Jun 3, 2024

matajoh Jun 3, 2024

EliasC Jun 3, 2024

matajoh Jun 3, 2024

EliasC commented Jun 3, 2024

matajoh commented Jun 3, 2024

EliasC commented Jun 27, 2024

mjp41 Sep 16, 2024

Added Shrubbery parser as a sample #110

Added Shrubbery parser as a sample #110

Conversation

EliasC commented Apr 26, 2024 • edited Loading

EliasC commented Apr 26, 2024

matajoh commented Jun 3, 2024

matajoh left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

EliasC commented Jun 3, 2024

matajoh commented Jun 3, 2024

EliasC commented Jun 27, 2024

Choose a reason for hiding this comment

EliasC commented Apr 26, 2024 •

edited

Loading