-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement a way to bind match results in patterns #9
Comments
Here's another use case. Rust integers can be lexed like this: rule Init {
...
("0b" | "0o" | "0x")? ($digit | '_')* $id? =? |lexer| {
let match_ = lexer.match_();
lexer.return_(check_int(match_))
},
}
// - Check that digits match the prefix (e.g. if the match starts with "0b", digits must be 1 or 0)
// - Check that that is at least one digit
// - Check that the suffix is valid ("i32", "u64", etc.)
fn check_int<'input>(match_: &'input str) -> Result<Token, CustomError> { ... } With this implementation, With the proposed feature, it could be implemented as: rule Init {
...
<prefix:("0b" | "0o" | "0x")?> <digits:($digit | '_')*> <suffix:$id?> =? |lexer| {
let match_ = lexer.match_();
lexer.return_(check_int(prefix, digits, suffix))
},
}
// - Check that digits match the prefix (e.g. if the match starts with "0b", digits must be 1 or 0)
// - Check that that is at least one digit
// - Check that the suffix is valid ("i32", "u64", etc.)
fn check_int<'input>(prefix: &'input str, digits: &'input str, suffix: &'input str) -> Result<Token, CustomError> { ... } |
I'm not sure how to implement this feature yet, but here's an idea. Suppose I have a rule with these regexes:
In the generated code, after seeing 4
After 4 Conceptually, it seems like we could annotate NFA nodes with "start match" and "end match". In the example above, initially we will have 2 NFAs:
We add the captures to these NFAs like this:
We will also need the regex number for these "start match" and "end match" annotations, to be able to update the right match stack in the match stack list when starting/ending a match. Once we have these, we can generate the DFA like this:
(Notation is a bit strange, but hopefully conveys the idea) Now in state 4 if we see an
Now we bind the variables in the first regex to the matches in the second match stack and run the semantic action. Same thing if we see an Regexes that don't bind variables do not need a match stack, so the max. number of stacks will be quite small for real-world examples. I doubt we will see more than 5-6 stacks at a time. We know size of the each match stack in compile time, so we can allocate each match stack with the right capacity. Lexers that don't use this feature will still be allocation-free. Others will allocate a vector every time we start matching a regex that uses this feature. In that sense this feature is "zero cost" (i.e. you don't pay a price if you don't use it). Match stacks are reset when we run a semantic action. We remove the match stack used in the semantic action and clear the list. TODO: Need to figure out how to implement backtracking. |
It seems difficult to support this syntax with full generality. For example
Here type of No idea how to generalize the idea in my previous comment to handle this case. |
Suppose I want to parse
\xHH
syntax in a rule (H
is a hex digit). Currently it's quite painful to get the chars for those digits in a larger rule:The problem is the match will have the input since the beginning of the current match, so for example, if this is in a string lexing rule and the string is
"blah blah \xAA"
then match will be the entire string except the closing"
.One workaround is to add another rule for the hex digits and push them to a buffer in each match, but that's also verbose.
If we had a way to bind regexes inside patterns we could do:
where
d1
andd2
would bechar
s.The text was updated successfully, but these errors were encountered: