How to parse “anything that is not this Token or that Token” #1

Open
notramo opened this issue Nov 16, 2022 · 3 comments

Comments

@notramo

notramo commented Nov 16, 2022

There are several tokens and Parsers, e.g. single_quoted_string, double_quoted_string, semicolon, etc.
How to parse any char that is not the start of these?

It would be basically #none_of, but with Parser, not Token. Or maybe some combination of #not_ahead?

@ThatsJustCheesy
Owner

ThatsJustCheesy commented Nov 16, 2022

Currently, there's no way to "peek" into parsers and check the first character; either the parser runs, or it doesn't. (The implementation is entirely closures/Procs, which, of course, can't be inspected after construction.)

If you're OK with it running the entire parser for the error case, then yes, #not_ahead would be the way to do it:

single_quoted_string = token('\'') >> not('\'').repeat(..) << token('\'')
double_quoted_string = token('"') >> not('"').repeat(..) << token('"')
semicolon = token(';')

forbidden = single_quoted_string | double_quoted_string | semicolon
allowed_char = forbidden.not_ahead >> any(Char)

language = allowed_char.repeat(..).join

puts language.parse("this is ok")
puts language.parse("'invalid'")
puts language.parse("still \"invalid\"")
puts language.parse("in;valid")

Output:

this is ok
ParseError: expected end of input, but found 'invalid'
ParseError: expected end of input, but found "invalid"
ParseError: expected end of input, but found ;valid

I don't consider this that bad, since there's only a performance hit on erroneous input.
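To illustrate why a closure-based parser has to run in full before a negative lookahead can answer, here is a minimal sketch in plain Crystal. The `MiniParser` alias and the `one_char`, `until_char`, `seq`, and `not_ahead` helpers are hypothetical stand-ins written for this illustration; they are not Parsem's implementation.

```crystal
# Hypothetical mini-combinators for illustration; NOT Parsem's actual code.
# A parser is a Proc from (input, position) to the new position, or nil on failure.
alias MiniParser = Proc(String, Int32, Int32 | Nil)

# Matches exactly one given character.
def one_char(c : Char) : MiniParser
  ->(input : String, pos : Int32) {
    if pos < input.size && input[pos] == c
      pos + 1
    end
  }
end

# Skips ahead to the next `stop` character (or to the end of input); never fails.
def until_char(stop : Char) : MiniParser
  ->(input : String, pos : Int32) {
    (input.index(stop, pos) || input.size).as(Int32 | Nil)
  }
end

# Runs `first`, then runs `second` from where `first` stopped.
def seq(first : MiniParser, second : MiniParser) : MiniParser
  ->(input : String, pos : Int32) {
    if mid = first.call(input, pos)
      second.call(input, mid)
    end
  }
end

# Succeeds without consuming input only when `parser` fails at this position.
# The closure is opaque, so the only way to find out is to call it.
def not_ahead(parser : MiniParser) : MiniParser
  ->(input : String, pos : Int32) {
    if parser.call(input, pos)
      nil
    else
      pos
    end
  }
end

quoted = seq(seq(one_char('\''), until_char('\'')), one_char('\''))
guard = not_ahead(quoted)

p quoted.call("'closed'", 0) # => 8   (the whole literal was scanned)
p guard.call("'closed'", 0)  # => nil (guard rejects, after that full scan)
p guard.call("plain", 0)     # => 0   (guard passes after one failed comparison)
```

In this sketch the guard rejects `'closed'` only after the inner parser has walked to the closing quote, which is the cost on erroneous input described above. On ordinary input the inner parser fails on its first character, so the guard is cheap.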

But if possible, I would suggest restructuring your parsers so you don't need this at all. e.g., instead of this:

string = single_quoted_string | double_quoted_string

var_name_char = string.not_ahead >> any(Char)
var_name = var_name_char.repeat(..).join

language = var_name | string

Flip the choice order for language:

string = single_quoted_string | double_quoted_string

var_name = any(Char).repeat(..).join

language = string | var_name

| effectively checks the first token and switches to var_name when necessary, without additional checks in var_name.
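A runnable version of the flipped choice, as a sketch. It reuses only calls that appear elsewhere in this thread. The `.join` on the quoted-string parsers is an assumption added here so that both branches of `|` yield a String, and the sample inputs are made up.

```crystal
require "parsem"
include Parsem

# Quoted strings, joined so they yield a String like var_name does (assumption).
single_quoted_string = token('\'') >> not('\'').repeat(..).join << token('\'')
double_quoted_string = token('"') >> not('"').repeat(..).join << token('"')
string = single_quoted_string | double_quoted_string

# No lookahead guard here: `string` is tried first by the choice below.
var_name = any(Char).repeat(..).join

language = string | var_name

puts language.parse("'quoted'")
puts language.parse("plain_name")
```

The first input starts with a quote, so the `string` branch handles it. The second does not, so the choice falls through to `var_name`.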

If you still really need the checks, but don't want the performance penalty for the error case, you'll have to manually construct a new parser, likely with none_of:

single_quoted_string = token('\'') >> not('\'').repeat(..) << token('\'')
double_quoted_string = token('"') >> not('"').repeat(..) << token('"')
semicolon = token(';')

allowed_char = none_of(['\'', '"', ';'])

language = allowed_char.repeat(..).join

puts language.parse("this is ok")
puts language.parse("'invalid'")
puts language.parse("still \"invalid\"")
puts language.parse("in;valid")

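This has the same output as the first example, although the semantics are slightly different: the first example actually succeeds if, e.g., a string is never closed, because the quoted-string parser then fails and not_ahead lets the opening quote through as an ordinary character.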
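The difference can be checked side by side with a sketch like the following. It only recombines the definitions from the two examples above; the names `lookahead_language` and `strict_language` and the sample input are made up for this illustration.

```crystal
require "parsem"
include Parsem

single_quoted_string = token('\'') >> not('\'').repeat(..) << token('\'')
double_quoted_string = token('"') >> not('"').repeat(..) << token('"')
semicolon = token(';')

# Variant 1: reject a character only if a complete forbidden construct starts here.
forbidden = single_quoted_string | double_quoted_string | semicolon
lookahead_language = (forbidden.not_ahead >> any(Char)).repeat(..).join

# Variant 2: reject the starting characters themselves.
strict_language = none_of(['\'', '"', ';']).repeat(..).join

input = "'never closed" # hypothetical input with an unterminated quote

puts lookahead_language.parse(input)
puts strict_language.parse(input)
```

Going by the note above, the first variant should accept this input, since no complete quoted string starts at the quote, while the second should report a parse error at the quote.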

@notramo
Author

notramo commented Nov 16, 2022

The full concept is that I want to use this library for a shell interpreter. It would have constructs like $variables, subshell commands: (exa -l -h), output redirection: > out.txt, command separator: ;, and so on. Then there would be the bareword, which (to make things simpler for the user) would be any character that is not a space or a special language construct. String concatenation would work as in the Elvish shell, by writing two operands contiguously without a space (not with the ^ character, like in Bash). This would enable the following syntax:

# variable interpolation: close bareword token with a $variable start
mv /tmp/$filename[0..-4] output/$filename".jxl"

# multiple commands: close bareword token with semicolon
exa -l -h; pijul status

# subshell: close bareword with subshell open (outer command), and subshell close (inner command)
kak (fd -t f src/) 

Basically the parser would look like:

any_token = whitespace | single_string | double_string | variable | semicolon | subshell | closure # and any other tokens except bareword

bareword = # How to do it? 

source_code = bareword | any_token

I wrote more detailed code for this (currently only various strings, no subshell or variables), but it somehow doesn't work, and I don't know what I should change. It seems like the Crystal type system doesn't like more complex parsers.

require "parsem"

include Parsem

single_quote = token '\''
double_quote = token '"'

quote = single_quote | double_quote

# quoted string literals
single_string = single_quote >> not(single_quote).repeat(..).join << single_quote
double_string = double_quote >> not(double_quote).repeat(..).join << double_quote
quoted_string = single_string | double_string

# anything that is not bareword
any_other = quoted_string | whitespace

bareword_string = (any_other.not_ahead >> any(Char)).repeat(..).join

string_literal = bareword_string | quoted_string

sourcecode = (string_literal << whitespace).repeat(..).extend <=> string_literal

pp sourcecode.parse %(bareword "this is a string" "this is another" )

@ThatsJustCheesy
Owner

Some notes about why your code wasn't compiling:

  • not requires a Token as input, but you've passed a Parser(Token, Token)
    • i.e. write not('\'') instead of not(token '\'')
  • Looks like not_ahead has a bug! It's not compiling when the parser's output has already been transformed into non-tokens. I'll fix that soon

Also: whitespace is a single-character parser only. I suggest using ws, which is just a shortcut for whitespace.repeat(..). Perhaps the naming could be improved here.

Here is something that I think does roughly what you want:

SINGLE_QUOTE = '\''
DOUBLE_QUOTE = '"'

# quoted string literals
single_string = token(SINGLE_QUOTE) >> not(SINGLE_QUOTE).repeat(..).join << token(SINGLE_QUOTE)
double_string = token(DOUBLE_QUOTE) >> not(DOUBLE_QUOTE).repeat(..).join << token(DOUBLE_QUOTE)
quoted_string = single_string | double_string

bareword_char = none_of [SINGLE_QUOTE, DOUBLE_QUOTE, *" \t\r\n".chars]
bareword_string = bareword_char.repeat(1..).join

string_literal = quoted_string | bareword_string

# Not using the pattern from the CSV parser
# because the delimiter (whitespace) is valid at the end as well.
# If there's any whitespace after the last string, that pattern
# eats the final whitespace, then requires another string after,
# which, of course, fails.
sourcecode = ws >> (string_literal << ws).repeat(..) << ws

puts sourcecode.parse %()
puts sourcecode.parse %(bareword "this is a string" "this is another" )
puts sourcecode.parse %( lots of bareword 'single quoted')

Unfortunate that I've had to write out all the whitespace chars. This could also work:

bareword_char = whitespace.not_ahead >> none_of [SINGLE_QUOTE, DOUBLE_QUOTE]

although it's ever so slightly less efficient.

Also, please note that I've just released version 1.1.2, which fixes an infinite loop bug in repeat that I found while writing this code. I'll work on the other fixes/improvements this issue helped identify at some later time.
