Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update regex_suffix.c #1095

Closed
wants to merge 1 commit into from
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion libclamav/regex_suffix.c
Original file line number Diff line number Diff line change
Expand Up @@ -246,7 +246,7 @@ static uint8_t *parse_char_class(const uint8_t *pat, size_t patSize, size_t *pos
for (c = range_start + 1; c <= range_end; c++)
bitmap[c >> 3] ^= 1 << (c & 0x7);
hasprev = 0;
} else if (pat[*pos] == '[' && pat[*pos] == ':') {
} else if (pat[*pos] == '[' && pat[*pos + 1] == ':') {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm unsure if this is the correct fix, or if it was supposed to be || instead of &&. Were you able to identify based on your understanding of regex character classes what this is trying to do and which would be the correct fix? I'm hesitant to make either change without knowing what was originally intended. Ideally we'd be able to test the fix also makes something work that previously wouldn't work. E.g. if there is some regex pattern that wouldn't work before and would now work after this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm making suggestions based on static code review so you are right to question them.

Searching for "[:" makes sense because there are predefined character classes in POSIX ( https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions#Character_classes ).

If using a regex with one of those character classes works with the fix but not without, that would show that the fix is correct.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The caller of this function is here in parse_regex():

            case '[':
                ++*last;
                right = make_charclass(parse_char_class(p, pSize, last));
                if (!right) {
                    destroy_tree(v);
                    return NULL;
                }
                v = make_node(concat, v, right);
                if (!v) {
                    destroy_tree(right);
                    return NULL;
                }
                ++*last;
                break;

So it has already found the leading [. So it would make sense if this were looking for something like [[:lower:]]

Note this regex parsing code is only used in the phishing signature regex parsing code for PDB and WDB signatures.

To test this, I made this PDB signature:

H:bankofamerica.com

to monitor the display domain name "bankofamerica.com" found in emails for suspicious real-URL's.
Then I made this WDB signature to test the regex features:

X:.+\.eba[[:digit:]]\.com:.+\.bankofamerica\.com

This silly signature will allow the real-URL to be eba5.com (or any other digit after eba), but anything other than that (such as ebay.com, or ebat.com) will not be ignored and will alert as suspicious.

I tested by scanning this email, with variations of the link:

From: "Somebody -X (username - Contractor at Example)"
	<[email protected]>
To: "Your Name (Your Name)" <[email protected]>
Subject: RE: Your Money
Date: Fri, 12 Jun 2013 07:04:52 +0000
Content-Language: en-US
Content-Type: multipart/alternative;
	boundary="_blah_"
MIME-Version: 1.0

--_blah_
Content-Type: text/plain; charset="Windows-1252"
Content-Transfer-Encoding: quoted-printable

<html>
<head>
<meta http-equiv=3D"Content-Type" content=3D"text/html; charset=3DWindows-1=252">
</head>
<body>

<br>
Please click this link:
<br>

<a href="http://www.eba5.com">https://www.bankofamerica.com</a>

<br>
</body>
</html>

--_blah_--

I tested ClamAV 1.2.0 to see if using the [[:digit:]] character class works.

E.g. I ran commands like this:

./install/bin/clamscan -d ../test.pdb -d ../test.wdb ../test.eml
Loading:     0s, ETA:   0s [========================>]        1/1 sigs
Compiling:   0s, ETA:   0s [========================>]       10/10 tasks

LibClamAV info: Suspicious link found!
LibClamAV info:   Real URL:    http://www.ebay.com
LibClamAV info:   Display URL: https://www.bankofamerica.com
/home/micah/workspace/clamav-micah-2/test.eml: Heuristics.Phishing.Email.SSL-Spoof FOUND

It appears to work already, though I didn't debug step through line by line to see how it works. But I find that strange, because I did test in a debugger with this PR and was able to confirm it working, and also running the "new" code:

image

In short, it's not clear to me if this code is really needed. Though it does seem to work okay with your change. But it also works okay like this:

            hasprev = 0;
        // } else if (pat[*pos] == '[' && pat[*pos + 1] == ':') {
        //     /* char class */
        //     FREE(bitmap);
        //     while (pat[*pos] != ']') INC_POS(pos, patSize);
        //     INC_POS(pos, patSize);
        //     while (pat[*pos] != ']') INC_POS(pos, patSize);
        //     return dot_bitmap;
        } else {
            bitmap[pat[*pos] >> 3] ^= 1 << (pat[*pos] & 0x7);

As a side note: testing character classes relating to upper/lower character changes will not work as expected because the text is all converted to lowercase before matching.

/* char class */
FREE(bitmap);
while (pat[*pos] != ']') INC_POS(pos, patSize);
Expand Down
Loading