Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added Shrubbery parser as a sample #110

Merged
merged 12 commits into from
Sep 23, 2024
Merged

Conversation

EliasC
Copy link
Contributor

@EliasC EliasC commented Apr 26, 2024

This PR adds a sample that parses Shrubbery Notation. It complements the infix example by having a slightly more involved parsing pass (whereas the three rewriting passes are quite straightforward). In particular, it gives an example of handling indentation sensitivity. There is a README that gives an overview of Shrubbery parsing and the implementation. The source code has explanatory comments, but assumes that the reader has done the infix tutorial and therefore understands the basics of parsing and pattern matching.

The implementation relies on #90.

@EliasC
Copy link
Contributor Author

EliasC commented Apr 26, 2024

@microsoft-github-policy-service agree

@matajoh
Copy link
Member

matajoh commented Jun 3, 2024

Small question: where do the test files come from? If they have been copied in from another repo, could we potentially clone that repo as part of the build. to avoid having all of these in the repo? Not a big deal if not, just curious. You can see an example of what I'm talking about in the parser directory:

if(TRIESTE_BUILD_PARSER_TESTS)
    enable_testing()
    add_subdirectory(test)

    if(NOT IS_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}/JSONTestSuite)
    execute_process(COMMAND ${GIT_EXECUTABLE} clone --depth=1 https://github.com/nst/JSONTestSuite
                    WORKING_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}
                    OUTPUT_QUIET)
    endif()

    if(NOT IS_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}/yaml-test-suite)
    execute_process(COMMAND ${GIT_EXECUTABLE} clone --depth=1 --branch data-2022-01-17 https://github.com/yaml/yaml-test-suite
                    WORKING_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}
                    OUTPUT_QUIET)
    endif()
endif()

It would also be neat for these tests to be added either via a test driver like json_test.cc or by adding them in CMake:

add_test(NAME json_trieste COMMAND json_trieste test -f WORKING_DIRECTORY $<TARGET_FILE_DIR:json_trieste>)

Anything to have more robust CI :)

matajoh
matajoh previously requested changes Jun 3, 2024
Copy link
Member

@matajoh matajoh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of these comments are non-blocking, i.e., more suggestions for improvements that could be done in the future.

I would like (if possible) to not host the test files in this repo, though.

// on the previous line. When parsing the program above, the Indent will be
// ((0,0), 2), indicating that the established indentation level is line 0
// colum 0, and that this line is continued on column 2.
struct Indent {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a general note on Trieste usage, there are two strategies that may make your life considerably easier as regards meaningful whitespace. The first, and perhaps simplest, is to make the parser even dumber and have it explicitly capture whitespace as Whitespace and NewLine tokens, i.e.:

  R"(\r?\n)" >> [](auto& m){
    m.add(NewLine);
  }

  R"([ \t]+)" >> [](auto& m){
    m.add(Whitespace);
  }

You can then defer decisions about indents etc. to a dedicated pass in Trieste, where you'll have much more flexibility and context to make your decisions. This works rather well. Another option is to use much the same pattern, but to use pattern matching to measure the indent explicitly while processing the end of the previous line, like so:

  R"(\r?\n([ \t]*))" >> [](auto& m){
    size_t indent = m.match(1).len;
    // your indent logic
  }

This lookahead can save a bunch of bookkeeping (you can preemptively push or pop indent blocks).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about a version with the explicit whitespace tokens, but tried writing the Shrubbery parser as a way to stress the parser module more than the existing samples (when this was developed, infix was the only other sample in the repo). It is also nice how, modulo whitespace sensitivity, the Trieste parser sort of looks like a generalized shrubbery parser. The second option does look a lot nicer than what I am doing right now though, so I will give that a spin!

@@ -0,0 +1,26 @@
#pragma once

#include <trieste/driver.h>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Best practice is now to include trieste/trieste.h.

inline const auto Id = TokenDef("id");
inline const auto Lhs = TokenDef("lhs");
inline const auto Rhs = TokenDef("rhs");

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Best practice is to export your final WF definition here, as otherwise people will not be able to easily use the ASTs you produce downstream.

return {
"merge alternatives",
wf_alternatives,
dir::topdown,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could be turned into a dir::once pass by adding ++ to the captures below? I'll give an example of what I mean.

[](Match& _) { return Alt << (Block << *_[Alt]); },

// Merge sequence of alternatives into a single alternative
(T(Alt)[Alt] * (T(Alt) << T(Block)[Block])) >>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential alternative to allow dir::once?

(T(Alt)[Alt] * (T(Alt) << T(Block)))++[Group] >>
  [](Match& _) {
    Node alt = _(Alt);
    for(auto& node : _[Group]){
      alt << node->front();
    }
    return alt;
  }

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(NB this requires my PR which changes NodeRange to a span)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having foreach loops for NodeRange looks very nice! Are the parentheses in the pattern off by one? Reading the code, I would have guessed that the Group capture should not include the Alt capture, but only the subsequent nodes (so that we can add their blocks under the alt node in the effect.

return {
"drop separators",
wf_no_semis_or_commas,
dir::topdown,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be a dir::bottomup|dir::once pass.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are dir::bottomup passes more efficient than dir::topdown passes? I can see how dir::once is preferrable.

return {
"check parsing",
wf_check_parser,
dir::topdown,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be a dir::bottomup|dir::once pass I believe

@EliasC
Copy link
Contributor Author

EliasC commented Jun 3, 2024

Small question: where do the test files come from? If they have been copied in from another repo, could we potentially clone that repo as part of the build. to avoid having all of these in the repo?

I wrote them myself. Some of them come from the documentation of shrubbery notation. I have been looking for an existing test suite without success (also there would need to be a filtering process since I don't support all of Shrubbery notation).

@matajoh
Copy link
Member

matajoh commented Jun 3, 2024

Small question: where do the test files come from? If they have been copied in from another repo, could we potentially clone that repo as part of the build. to avoid having all of these in the repo?

I wrote them myself. Some of them come from the documentation of shrubbery notation. I have been looking for an existing test suite without success (also there would need to be a filtering process since I don't support all of Shrubbery notation).

OK in that case I suppose we can put them here for now. Maybe one potential improvement down the line would be to put them in a YAML file and write a test driver to process them.

@EliasC EliasC force-pushed the sample/shrubbery branch from f2ec8fc to fd37185 Compare June 27, 2024 08:49
@EliasC
Copy link
Contributor Author

EliasC commented Jun 27, 2024

@matajoh I have addressed most of your comments now.

  • Every rewrite pass is now bottomup | once and generally looks nicer.
  • I removed all the examples for now and will keep them around locally until I figure out a way to add them in a nice way or find an external test suite. They are quite weak as test cases as you still need to manually look at the produced trees to see if they did the right thing.
  • I tried experimenting with your suggested solutions for whitespace, but didn't figure out a good to do it. The problem with counting whitespace after newlines is that a line could start with spaces but continue with a comment (or nothing) meaning it shouldn't be counted. Since there is no lookahead in the parsing step, I don't know whether to check indentation levels until I actually see a term that should be added. For adding explicit whitespace tokens, it becomes quite hairy for input like
    one: two: three: four:
         two again
    
    where two again should belong to the same block as two. It's doable, but requires non-trivial logic that in the end isn't a lot cleaner than what I'm doing now.

I also added a final rewriting pass that takes the tree to a shape that matches the grammar of shrubbery notation (this is presented in shrubbery.h). Later it might make sense to add a writer that outputs the parsed program as an S-expression (or indeed as shrubbery notation), but for now I think it works as an example to learn from.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't commit this file.

@mjp41 mjp41 dismissed matajoh’s stale review September 23, 2024 17:00

These changes have been addressed.

@mjp41 mjp41 merged commit f5874e6 into microsoft:main Sep 23, 2024
21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants