Backported pcre-anchor tests and fixes, found while fuzzing on capture branch. #453

Merged 18 commits on Jan 3, 2024

silentbicycle
Collaborator

This PR adds tests/pcre-anchor/ tests 82-100, which are hand-reduced versions of incorrectly handled regex inputs found during fuzzing.

Most of these have to do with:

  • non-local interactions between one or more anchors and the rest of the regex (e.g. how the ^ anchor in (a|$)(b|^) should not link to the unanchored start loop ("pincer anchors"))
  • repeat{0,_} nodes with unsatisfiable, unsupported, etc. subtrees that should be pruned by setting their max count to 0 rather than bubbling their error up further
  • incorrect handling of PCRE's anchors ignoring a single trailing newline
  • returning an unsupported PCRE behavior error for some cases that are too obscure and too tricky to be worth handling properly, such as how $[^a] should not match "x" but should match "x\n" according to PCRE
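
The anchor behaviours listed above can be spot-checked with any engine that shares PCRE's default `$` semantics. The sketch below uses Python's `re` module as a stand-in. That is an assumption: it is neither PCRE nor this project's engine, but without `re.MULTILINE` its `$` also matches at the end of the input or just before a single trailing newline. The helper name `matches` is made up for illustration.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Unanchored search: the pattern may match anywhere in the text."""
    return re.search(pattern, text) is not None

# `$` ignores a single trailing newline, but not two of them.
print(matches(r"x$", "x"), matches(r"x$", "x\n"), matches(r"x$", "x\n\n"))

# `$[^a]` can only succeed by consuming that trailing newline.
print(matches(r"$[^a]", "x"), matches(r"$[^a]", "x\n"))

# "Pincer anchors": the `^` in the second group is only reachable
# when the first group matched `$` at offset 0.
for text in ["", "ab", "a", "xb"]:
    print(repr(text), matches(r"(a|$)(b|^)", text))
```

In this stand-in, `(a|$)(b|^)` accepts `""` and `"ab"` but rejects `"a"` and `"xb"`, which is consistent with the `^` not being reachable from an unanchored start.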

These are anchoring edge cases found by fuzzing. They have to do
with incomplete pruning of repeated nodes that are unsatisfiable due to
anchoring, nesting of anchors inside groups and/or ALT subtrees, etc.

The fixes will go in their own commit.

There are extra error codes in #440 for regexes that aren't UNSATISFIABLE
per se, but depend on particular corner cases in PCRE that probably
aren't worth supporting in an automata-based implementation.

Add a test case for one, tests/pcre/in48.re: ^a|$[^x]b*

This is a tricky one to handle properly; according to PCRE it should
match either "a<anything...>" OR "\n", but nothing else. The newline
match is because $ is a non-input-consuming check that evaluation is
either at the end of input, or at a newline immediately before the end.
In this case `$[^x]b*` matches exactly one newline; it's equivalent to
"$\n". This probably isn't worth supporting, but we can detect cases
where a potential newline match appears after a $ and reject them as
an unsupported PCRE behavior.
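
A minimal sketch of that detection idea, assuming non-multiline mode and a character class that must consume input immediately after `$`. The `CharClass` type, the `classify_after_end_anchor` function and the two result strings are hypothetical names for illustration, not the project's actual code or error codes.

```python
from dataclasses import dataclass

@dataclass
class CharClass:
    members: str   # characters listed inside the brackets (no ranges here)
    negated: bool  # True for [^...]

    def accepts(self, ch: str) -> bool:
        # A negated class accepts exactly the characters it does not list.
        return (ch in self.members) != self.negated

def classify_after_end_anchor(cls: CharClass) -> str:
    """Classify a class that has to consume a character right after `$`."""
    if cls.accepts("\n"):
        # Only the trailing newline could be consumed here: the PCRE
        # corner case that is rejected rather than supported.
        return "unsupported"
    # Nothing except a trailing newline can follow `$`, so no input matches.
    return "unsatisfiable"

print(classify_after_end_anchor(CharClass("x", negated=True)))   # the [^x] in $[^x]b*
print(classify_after_end_anchor(CharClass("x", negated=False)))  # a plain [x] after $
```

The first call models the `[^x]` in `$[^x]b*`, which can accept a newline and so falls into the rejected corner case; the second models `$[x]`, which can never match.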

`^(?:$^|a)(?:$^|^)*`: This should be equivalent to `^(?:^$|a)` because
the first group either is fully anchored (making the second group
redundant) or must match 'a' (making it impossible for the second group
to be anchored at the start), but that information wasn't being passed
properly to the second group.
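
The claimed equivalence can be spot-checked on a few inputs. This is a sketch using Python's `re` module as a stand-in for PCRE (an assumption; its default `^` and `$` behave the same way for these cases). It is a sanity check on hand-picked samples, not a proof, and `agree` and `samples` are illustrative names.

```python
import re

ORIGINAL = re.compile(r"^(?:$^|a)(?:$^|^)*")
REDUCED = re.compile(r"^(?:^$|a)")

def agree(text: str) -> bool:
    """True if both patterns give the same match / no-match verdict."""
    return (ORIGINAL.search(text) is None) == (REDUCED.search(text) is None)

samples = ["", "a", "ab", "b", "ba", "\n", "a\n"]
for text in samples:
    # Columns: input, whether the reduced form matches, whether both agree.
    print(repr(text), REDUCED.search(text) is not None, agree(text))
```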

`$[^a]*` was returning AST_ANALYSIS_ERROR_UNSUPPORTED_PCRE while
inside the repeat node (`[]*`), so it didn't flag the `^` immediately
after. If a repeat node that can match 0 times is unsupported,
unsatisfiable, etc., just set its max count to 0 and continue. With
this change analysis finds the `^` and treats the whole regex overall
like `$^|$^`, which reduces to `^$`.

The fixes for both of these will be integrated in later commits -- I'm
integrating changes from a working branch that had several intermediate
steps committed to push to another computer for fuzzing.

If a repeat node's subtree returns an error but has a min count of
0, then prune it by setting its max count to 0. If its error is
returned directly then the analysis won't see nodes that appear
after it.
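
The pruning rule can be sketched on a toy tree. Everything here (the `Node` type, the node kinds, the `analyze` and `build` functions, the result strings) is hypothetical and only illustrates the idea; it is not the project's analysis code.

```python
from dataclasses import dataclass, field
from typing import List, Optional

OK = "ok"
UNSATISFIABLE = "unsatisfiable"

@dataclass
class Node:
    kind: str                      # "literal", "unsat", "concat" or "repeat"
    children: List["Node"] = field(default_factory=list)
    min: int = 1
    max: Optional[int] = 1         # None means unbounded

def analyze(node: Node, seen: List[str]) -> str:
    """Walk the tree, recording which node kinds the analysis reaches."""
    seen.append(node.kind)
    if node.kind == "unsat":
        return UNSATISFIABLE
    if node.kind == "repeat":
        result = analyze(node.children[0], seen)
        if result != OK and node.min == 0:
            node.max = 0           # prune: the subtree now matches zero times
            return OK
        return result
    if node.kind == "concat":
        for child in node.children:
            result = analyze(child, seen)
            if result != OK:
                return result      # later siblings are never analysed
    return OK

def build(min_count: int) -> Node:
    """concat(repeat{min_count,}(unsat), literal)"""
    repeat = Node("repeat", [Node("unsat")], min=min_count, max=None)
    return Node("concat", [repeat, Node("literal")])
```

With `build(0)` the repeat is pruned and the analysis goes on to reach the trailing literal; with `build(1)` the error is returned directly and the literal is never visited.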

This should normally be set when allocation fails, but when built
for fuzzing, analysis can pre-emptively reject inputs that would
need excessive memory (due to large repetition counts, for example),
so in that case make sure an error code is set to avoid triggering
an assertion later on.

Lots of other bugs in linking have been indirect consequences of an
ancestor ALT node that changed its top-down x,y linkage to global
or the global start loop instead. Avoiding modification here eliminates
a lot of noise. While trying to simultaneously satisfy the pcre-anchor
tests 92, 99, and 100 I took a step back, thought through the system as
a whole, and realized this was where their linkage was getting messed up.

Rather than modifying this directly, add a function, which has
its own asserts and logging.
@katef katef merged commit 0f5a966 into main Jan 3, 2024
322 checks passed
@katef katef deleted the sv/backported-fixes-from-fuzzing-on-capture-branch branch January 3, 2024 14:53