
rx, a program for compiling sets of regular expressions #488

Merged: 41 commits, Sep 5, 2024

Conversation

katef (Owner) commented Aug 19, 2024

From the manpage:

Input files have one pattern per line. Each pattern has an associated id. ids are assigned depending on the number of input files. For a single file, ids are assigned per pattern (that is, the id is the line number within the file). For multiple files, ids are assigned per file (that is, the same id is shared by all patterns within a file).
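
For example (hypothetical input, not taken from the PR): with a single input file whose lines are

foo
bar[0-9]+
qux

each pattern gets its own id from its line number, so a match can report which of the three patterns matched. With two input files, say words.txt and numbers.txt, every pattern in words.txt shares one id and every pattern in numbers.txt shares another, so a match only identifies which file the matching pattern came from.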

Pattern ids are made available to the generated code when successfully matching a set of one or more patterns. You can see these with -l dot output. It is possible for a given text string to match patterns associated with different ids. There are several ways to deal with this; which of these is appropriate depends on the application (a small example follows the list):

  1. Error about it at compile time. This is the default for rx(1). To use this mode, ensure your patterns don't overlap. In particular, you can use rx -q as a lint to find conflicts.

  2. Give conflicting patterns the same id in the first place. This would be the case for a lexer, where you might have multiple spellings that produce the same token.

  3. Allow ambiguous patterns, and the generated API returns a set of ids. See -u.

  4. Earliest line number (lower id) wins. This would suit a firewall-like application where it doesn't matter which pattern matched. See -t.

  5. Longest match or most specific regex wins. This doesn't work for DFA and so is not provided by rx(1).
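
As a small, hypothetical illustration of such a conflict (not taken from the PR), consider a file containing

cat[0-9]+
cat1

The string "cat1" matches both lines, which have different ids. Under the default mode rx(1) rejects this at compile time (rx -q can be used as a lint to surface it); the alternatives above resolve it by giving both lines the same id, returning the whole set of matching ids (-u), or letting the lower id win (-t).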

You can get some resource stats with -Q:

; ./build/bin/rx -Q -r literal -Fb -k str -l llvm /usr/share/dict/words > /tmp/w.ll
charset: [(none)]
reject: []
flags: 0x40
literals[0].count = 0
literals[1].count = 0
literals[2].count = 0
literals[3].count = 104334
literals (unanchored): 0 patterns, 2 states
literals (^left): 0 patterns, 2 states
literals (right$): 0 patterns, 1 states
literals (^both$): 104334 patterns, 238103 states
general: 0 patterns (limit 18446744073709551615)
declined: 0 patterns
fsm_count = 4 FSMs prior to union
nfa: 238111 states
dfa: 238104 states
rusage.utime: 8.37991
rusage.stime: 0.30796
rusage.maxrss: 524 MiB
;
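
Reading the figures above: all 104334 dictionary words end up in the anchored-literals bucket (literals (^both$)), none fall through to the general regex path or are declined, and the union of the per-category FSMs determinises to a 238104-state DFA.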

There are a few small fixes and things on this branch that superficially have nothing to do with rx. That's because I originally had much more groundwork here, which I've pulled out to separate PRs (especially #485 and #486, but also others). I want to keep the history for rx itself intact, rather than rebase away the stuff I moved out to other PRs. So I've merged over from main, and left the few seemingly-unrelated fixes without rebasing them out.

rx was named by @averymcnab

katef added 30 commits June 19, 2024 12:30
…ng at \n-terminated patterns in-situ. This is just so much less complicated.
I think at the time of writing, fsm_example() is broken, because I get no output here. Possibly related to #438
…ray.

This sets the scene for multi-file pattern lists, later on. But also seems to keep scopes tighter and make things simpler as a byproduct.
Now this is always equivalent to the count passed in, when the return status is 1. And the return status is always 1 when the count is enough. In all situations we know the count is enough.
This allows for eventually iterating over multiple input files.
When only one file is given, the endid is per pattern (that is, the line number) in the input file.

When multiple files are given, the endid is the argv[] index of the file. That is, all patterns within each file share the same endid.
This should find the same result anyway; there's just no need to go to the trouble of constructing an AST here, when we already know the pattern is a literal.
I'm not particularly thrilled about the handling for different AMBIG_ modes here. I'm thinking eventually it might make sense to move this stuff into libfsm proper and share it with the other cli tools.
silentbicycle (Collaborator) left a comment

This makes sense to me.

I already mentioned the error where it generates C code that attempts to write into a const unless -u is set. That seems like it came from an earlier PR, though.

assert(!fsm_empty(fsm));

if (!fsm_setendid(fsm, id)) {
silentbicycle (Collaborator)

What's the intent for setting this after minimisation?

katef (Owner, Author)

That there's only one end id for this fsm (as the comment just above says), so there's no need to deal with endids through the various transformations as we construct the fsm. I'm setting the id after construction just because it's unnecessary to set it any earlier.
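
A minimal sketch of that ordering, for illustration only: build_fsm_for_file() is a hypothetical helper standing in for compiling and unioning one input file's patterns, while fsm_determinise(), fsm_minimise(), fsm_empty() and fsm_setendid() are existing libfsm calls. This is not rx's actual code.

#include <assert.h>
#include <stddef.h>
#include <fsm/fsm.h>

/* hypothetical helper: compile one input file's patterns and union them */
struct fsm *build_fsm_for_file(const char *path);

static int
compile_one_file(struct fsm **out, const char *path, unsigned id)
{
	struct fsm *fsm = build_fsm_for_file(path);
	if (fsm == NULL) return 0;

	/* transform first: there is only one end id per file, so nothing
	 * needs to be carried through determinisation and minimisation */
	if (!fsm_determinise(fsm)) return 0;
	if (!fsm_minimise(fsm)) return 0;

	/* set the single end id once, after construction */
	assert(!fsm_empty(fsm));
	if (!fsm_setendid(fsm, id)) return 0;

	*out = fsm;
	return 1;
}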

fprintf(stderr, "overriding dialect by extension for %s: %s\n",
	argv[arg], ext);
dialect = dialect_name(ext);
if (override_dialect) {
silentbicycle (Collaborator)

It seems like a good idea to make this require a flag, rather than being fully automatic behavior.

katef (Owner, Author)

It is! -s

@katef katef merged commit dc9721f into main Sep 5, 2024
346 checks passed
@katef katef deleted the kate/rx branch September 5, 2024 14:55