How to parse “anything that is not this Token or that Token” #1
Comments
Currently, there's no way to "peek" into parsers and check the first character; either the parser runs, or it doesn't. (The implementation is entirely closures/ If you're OK with it running the entire parser for the error case, then yes,
single_quoted_string = token('\'') >> not('\'').repeat(..) << token('\'')
double_quoted_string = token('"') >> not('"').repeat(..) << token('"')
semicolon = token(';')
forbidden = single_quoted_string | double_quoted_string | semicolon
allowed_char = forbidden.not_ahead >> any(Char)
language = allowed_char.repeat(..).join
puts language.parse("this is ok")
puts language.parse("'invalid'")
puts language.parse("still \"invalid\"")
puts language.parse("in;valid")
Output:
I don't consider this that bad, since there's only a performance hit on erroneous input. But if possible, I would suggest restructuring your parsers so you don't need this at all, e.g., instead of this:
string = single_quoted_string | double_quoted_string
var_name_char = string.not_ahead >> any(Char)
var_name = var_name_char.repeat(..).join
language = var_name | string
Flip the choice order:
string = single_quoted_string | double_quoted_string
var_name = any(Char).repeat(..).join
language = string | var_name
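To illustrate why flipping the choice order removes the need for a lookahead, here is a minimal sketch in plain Crystal. It does not use parsem; `quoted_body` and `parse_language` are hypothetical names, and the sketch only looks at a quoted string at the very start of the input.

```crystal
# Hypothetical helper (not part of parsem): returns the body of a quoted
# string that starts at the beginning of `input`, or nil if there is none.
def quoted_body(input : String, quote : Char) : String?
  return nil unless input.size >= 2 && input[0] == quote
  close = input.index(quote, 1)
  return nil unless close # opening quote was never closed
  input[1...close]
end

# Ordered choice: try the more specific alternative (a string) first and
# fall back to the catch-all (a variable name) only when it fails.
def parse_language(input : String) : Tuple(Symbol, String)
  if body = quoted_body(input, '\'') || quoted_body(input, '"')
    {:string, body}
  else
    {:var_name, input}
  end
end

p parse_language("'abc'")
p parse_language("abc")
```

Because the string alternative is tried first, the catch-all branch never needs to check that it is not looking at a string.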
If you still really need the checks, but don't want the performance penalty for the error case, you'll have to manually construct a new parser, likely with
single_quoted_string = token('\'') >> not('\'').repeat(..) << token('\'')
double_quoted_string = token('"') >> not('"').repeat(..) << token('"')
semicolon = token(';')
allowed_char = none_of(['\'', '"', ';'])
language = allowed_char.repeat(..).join
puts language.parse("this is ok")
puts language.parse("'invalid'")
puts language.parse("still \"invalid\"")
puts language.parse("in;valid")
This has the same output as the first example. (Although the semantics are slightly different: the first example actually succeeds if, e.g., the string is never closed.)
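The semantic difference mentioned above can be made concrete with a small hand-written sketch in plain Crystal. It does not use parsem and says nothing about how parsem reports results; `quoted_end`, `allowed_at?`, `prefix_with_lookahead`, and `prefix_with_none_of` are hypothetical names that return the prefix each strategy would accept.

```crystal
FORBIDDEN = ['\'', '"', ';']

# Hypothetical helper: index just past a complete quoted string that
# starts at `pos`, or nil if no complete quoted string starts there.
def quoted_end(input : String, pos : Int32, quote : Char) : Int32?
  return nil unless input[pos] == quote
  close = input.index(quote, pos + 1)
  return nil unless close # opening quote was never closed
  close + 1
end

# Negative lookahead: runs the whole "forbidden" match, consumes nothing.
def allowed_at?(input : String, pos : Int32) : Bool
  return false if input[pos] == ';'
  quoted_end(input, pos, '\'').nil? && quoted_end(input, pos, '"').nil?
end

# Accepts characters while no complete forbidden construct starts here.
def prefix_with_lookahead(input : String) : String
  pos = 0
  while pos < input.size && allowed_at?(input, pos)
    pos += 1
  end
  input[0...pos]
end

# Accepts characters while the current one is not in the forbidden set.
def prefix_with_none_of(input : String) : String
  pos = 0
  while pos < input.size && !FORBIDDEN.includes?(input[pos])
    pos += 1
  end
  input[0...pos]
end
```

For input with an unclosed quote, such as `it's ok`, the lookahead version accepts the whole input because no complete quoted string ever starts, while the character-set version stops at the quote.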
The full concept is that I want to use this library for a shell interpreter. It would have constructs like
Basically the parser would look like:
any_token = whitespace | single_string | double_string | variable | semicolon | subshell | closure # and any other tokens except bareword
bareword = # How to do it?
source_code = bareword | any_token
There is more detailed code I wrote for this (currently only various strings, but no subshell or variables), but it somehow doesn't work, and I don't know what I should change. It seems the Crystal type system doesn't like more complex parsers.
require "parsem"
include Parsem
single_quote = token '\''
double_quote = token '"'
quote = single_quote | double_quote
# quoted string literals
single_string = single_quote >> not(single_quote).repeat(..).join << single_quote
double_string = double_quote >> not(double_quote).repeat(..).join << double_quote
quoted_string = single_string | double_string
# anything that is not bareword
any_other = quoted_string | whitespace
bareword_string = (any_other.not_ahead >> any(Char)).repeat(..).join
string_literal = bareword_string | quoted_string
sourcecode = (string_literal << whitespace).repeat(..).extend <=> string_literal
pp sourcecode.parse %(bareword "this is a string" "this is another" )
Some notes about why your code wasn't compiling:
Also: Here is something that I think does roughly what you want:
SINGLE_QUOTE = '\''
DOUBLE_QUOTE = '"'
# quoted string literals
single_string = token(SINGLE_QUOTE) >> not(SINGLE_QUOTE).repeat(..).join << token(SINGLE_QUOTE)
double_string = token(DOUBLE_QUOTE) >> not(DOUBLE_QUOTE).repeat(..).join << token(DOUBLE_QUOTE)
quoted_string = single_string | double_string
bareword_char = none_of [SINGLE_QUOTE, DOUBLE_QUOTE, *" \t\r\n".chars]
bareword_string = bareword_char.repeat(1..).join
string_literal = quoted_string | bareword_string
# Not using the pattern from the CSV parser
# because the delimiter (whitespace) is valid at the end as well.
# If there's any whitespace after the last string, that pattern
# eats the final whitespace, then requires another string after,
# which, of course, fails.
sourcecode = ws >> (string_literal << ws).repeat(..) << ws
puts sourcecode.parse %()
puts sourcecode.parse %(bareword "this is a string" "this is another" )
puts sourcecode.parse %( lots of bareword 'single quoted')
Unfortunate that I've had to write out all the whitespace chars. This could also work:
bareword_char = whitespace.not_ahead >> none_of [SINGLE_QUOTE, DOUBLE_QUOTE]
although it's ever so slightly less efficient.
Also, please note that I've just released version 1.1.2, which fixes an infinite loop bug in
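The "leading whitespace, then items each followed by optional whitespace" shape from the code comments above can be sketched by hand in plain Crystal. This is an illustration only, not parsem code: `skip_ws`, `read_literal`, and `tokenize` are hypothetical names, and the sketch simply stops at an unclosed quote instead of reporting an error.

```crystal
# Skips whitespace starting at `pos` and returns the new position.
def skip_ws(input : String, pos : Int32) : Int32
  while pos < input.size && input[pos].whitespace?
    pos += 1
  end
  pos
end

# Reads one quoted string or bareword at `pos`.
# Returns the text and the position after it, or nil if nothing matches.
def read_literal(input : String, pos : Int32) : Tuple(String, Int32)?
  return nil unless pos < input.size
  char = input[pos]
  if char == '\'' || char == '"'
    close = input.index(char, pos + 1)
    return nil unless close # unclosed quote: no literal here
    {input[(pos + 1)...close], close + 1}
  else
    stop = pos
    while stop < input.size && !input[stop].whitespace? && input[stop] != '\'' && input[stop] != '"'
      stop += 1
    end
    return nil if stop == pos # a bareword needs at least one character
    {input[pos...stop], stop}
  end
end

# ws, then zero or more (literal followed by ws).
def tokenize(input : String) : Array(String)
  tokens = [] of String
  pos = skip_ws(input, 0)
  while result = read_literal(input, pos)
    text, pos = result
    tokens << text
    pos = skip_ws(input, pos)
  end
  tokens
end

p tokenize(%(bareword "this is a string" "this is another" ))
```

Since whitespace is consumed after each item rather than required between items, trailing whitespace never forces another item to follow.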
There are several tokens and Parsers, e.g. single_quoted_string, double_quoted_string, semicolon, etc. How to parse any char that is not the start of these?
It would be basically #none_of, but with Parser, not Token. Or maybe some combination of #not_ahead?