
rx, a program for compiling sets of regular expressions #488

Merged: 41 commits, Sep 5, 2024

Conversation

katef (Owner) commented Aug 19, 2024

From the manpage:

Input files have one pattern per line. Each pattern has an associated id. ids are assigned depending on the number of input files. For a single file, ids are assigned per pattern (that is, the id is the line number within the file). For multiple files, ids are assigned per file (that is, the same id is shared by all patterns within a file).
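
For example (hypothetical input, not taken from the PR): with a single input file whose lines are

foo
bar[0-9]+
qux

each pattern gets its own id from its line number, so a match can report which of the three patterns matched. With two input files, say words.txt and numbers.txt, every pattern in words.txt shares one id and every pattern in numbers.txt shares another, so a match only identifies which file the matching pattern came from.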

Pattern ids are made available to the generated code when successfully matching a set of one or more patterns. You can see these with -l dot output. It is possible for a given text string to match patterns associated with different ids. There are several ways to deal with this; which of these is appropriate depends on the application (a small example follows the list):

  1. Error about it at compile time. This is the default for rx(1). To use this mode, ensure your patterns don't overlap. In particular, you can use rx -q as a lint to find conflicts.

  2. Give conflicting patterns the same id in the first place. This would be the case for a lexer, where you might have multiple spellings that produce the same token.

  3. Allow ambiguous patterns, and the generated API returns a set of ids. See -u.

  4. Earliest line number (lower id) wins. This would suit a firewall-like application where it doesn't matter which pattern matched. See -t.

  5. Longest match or most specific regex wins. This doesn't work for DFA and so is not provided by rx(1).
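
As a small, hypothetical illustration of such a conflict (not taken from the PR), consider a file containing

cat[0-9]+
cat1

The string "cat1" matches both lines, which have different ids. Under the default mode rx(1) rejects this at compile time (rx -q can be used as a lint to surface it); the alternatives above resolve it by giving both lines the same id, returning the whole set of matching ids (-u), or letting the lower id win (-t).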

You can get some resource stats with -Q:

; ./build/bin/rx -Q -r literal -Fb -k str -l llvm /usr/share/dict/words > /tmp/w.ll
charset: [(none)]
reject: []
flags: 0x40
literals[0].count = 0
literals[1].count = 0
literals[2].count = 0
literals[3].count = 104334
literals (unanchored): 0 patterns, 2 states
literals (^left): 0 patterns, 2 states
literals (right$): 0 patterns, 1 states
literals (^both$): 104334 patterns, 238103 states
general: 0 patterns (limit 18446744073709551615)
declined: 0 patterns
fsm_count = 4 FSMs prior to union
nfa: 238111 states
dfa: 238104 states
rusage.utime: 8.37991
rusage.stime: 0.30796
rusage.maxrss: 524 MiB
;
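
Reading the figures above: all 104334 dictionary words end up in the anchored-literals bucket (literals (^both$)), none fall through to the general regex path or are declined, and the union of the per-category FSMs determinises to a 238104-state DFA.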

There are a few small fixes and things on this branch that superficially have nothing to do with rx. That's because I originally had much more groundwork here, which I've pulled out to separate PRs (especially #485 and #486, but also others). I want to keep the history for rx itself intact, rather than rebase away the stuff I moved out to other PRs. So I've merged over from main, and left the few seemingly-unrelated fixes without rebasing them out.

rx was named by @averymcnab

katef added 30 commits June 19, 2024 12:30
…ng at \n-terminated patterns in-situ. This is just so much less complicated.
I think at the time of writing, fsm_example() is broken, because I get no output here. Possibly related to #438
…ray.

This sets the scene for multi-file pattern lists, later on. But also seems to keep scopes tighter and make things simpler as a byproduct.
Now this is always equivalent to the count passed in, when the return status is 1. And the return status is always 1 when the count is enough. In all situations we know the count is enough.
This allows for eventually iterating over multiple input files.
When only one file is given, the endid is per pattern (that is, the line number) in the input file.

When multiple files are given, the endid is the argv[] index of the file. That is, all patterns within each file share the same endid.
This should find the same result anyway; there's just no need to go to the trouble of constructing an AST here, when we already know the pattern is a literal.
I'm not particularly thrilled about the handling for different AMBIG_ modes here. I'm thinking eventually it might make sense to move this stuff into libfsm proper and share it with the other cli tools.
silentbicycle (Collaborator) left a comment

This makes sense to me.

I already mentioned the error where it generates C code that attempts to write into a const unless -u is set. That seems like it came from an earlier PR, though.

assert(!fsm_empty(fsm));

if (!fsm_setendid(fsm, id)) {
silentbicycle (Collaborator)

What's the intent for setting this after minimisation?

katef (Owner, Author)

That there's only one end id for this fsm (as the comment just above says), so there's no need to deal with endids through the various transformations as we construct the fsm. I'm setting the id after construction just because it's unnecessary to set it any earlier.
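
A minimal sketch of that ordering, for illustration only: build_fsm_for_file() is a hypothetical helper standing in for compiling and unioning one input file's patterns, while fsm_determinise(), fsm_minimise(), fsm_empty() and fsm_setendid() are existing libfsm calls. This is not rx's actual code.

#include <assert.h>
#include <stddef.h>
#include <fsm/fsm.h>

/* hypothetical helper: compile one input file's patterns and union them */
struct fsm *build_fsm_for_file(const char *path);

static int
compile_one_file(struct fsm **out, const char *path, unsigned id)
{
	struct fsm *fsm = build_fsm_for_file(path);
	if (fsm == NULL) return 0;

	/* transform first: there is only one end id per file, so nothing
	 * needs to be carried through determinisation and minimisation */
	if (!fsm_determinise(fsm)) return 0;
	if (!fsm_minimise(fsm)) return 0;

	/* set the single end id once, after construction */
	assert(!fsm_empty(fsm));
	if (!fsm_setendid(fsm, id)) return 0;

	*out = fsm;
	return 1;
}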

fprintf(stderr, "overriding dialect by extension for %s: %s\n",
	argv[arg], ext);
dialect = dialect_name(ext);
if (override_dialect) {
silentbicycle (Collaborator)

It seems like a good idea to make this require a flag, rather than being fully automatic behavior.

katef (Owner, Author)

It is! -s

@katef katef merged commit dc9721f into main Sep 5, 2024
346 checks passed
@katef katef deleted the kate/rx branch September 5, 2024 14:55