Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Regexp Matching & Version Capture #123

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

scottdharvey
Copy link

@scottdharvey scottdharvey commented Feb 14, 2025

BLUF: These changes stopped 206 regexp.Parse() errors, and presumably added 206 cases where versions can now be captured where they could not be previously.

I noticed that under the compileFingerprint() function, errors from ParsePattern are silenced, and we move on to the next pattern. While this is reasonable for loading signatures that change regularly, I wanted to inspect these errors and see what could be improved.

I printed these errors and found 226 patterns that were failing to compile. Further investigation revealed that the majority of these failures were due to two specific regex patterns for capturing versions.

Version Capture Pattern 1

((?:\\d+\\.)+\\d+)
102 occurrences

Version Capture Pattern 2

(\\d+(?:\\.\\d+)+))
104 occurrences

ParsePattern() applies some cleanup to the expressions to prepare them for regexp.Compile().

regexPattern = strings.ReplaceAll(regexPattern, "/", "\\/")
regexPattern = strings.ReplaceAll(regexPattern, "\\+", "__escapedPlus__")
regexPattern = strings.ReplaceAll(regexPattern, "+", "{1,250}")
regexPattern = strings.ReplaceAll(regexPattern, "*", "{0,250}")
regexPattern = strings.ReplaceAll(regexPattern, "__escapedPlus__", "\\+")

Example

in fingerprints_data.json

embed-any-document(?:\\/js)?(?:\\/embed-public)?(?:\\/pdfobject)?(?:\\.min)?\\.js(?:\\?v(?:er)?=((?:\\d+\\.)+\\d+))?\\;version:\\1

after json parse

embed-any-document(?:\/js)?(?:\/embed-public)?(?:\/pdfobject)?(?:\.min)?\.js(?:\?v(?:er)?=((?:\d+\.)+\d+))?

after ParsePattern() changes

embed-any-document(?:\\/js)?(?:\\/embed-public)?(?:\\/pdfobject)?(?:\.min)?\.js(?:\?v(?:er)?=((?:\d{1,250}\.){1,250}\d{1,250}))?

A Version Capture pattern gets changed in this process:
((?:\\d+\\.)+\\d+)
((?:\d{1,250}\.){1,250}\d{1,250}))

error

error parsing regexp: invalid repeat count: '{1,250}'

Changes

1 - Remove erroneous pattern change

regexPattern = strings.ReplaceAll(regexPattern, "/", "\\/")

This change results in what I believe was in unintended change to patterns. For example:

in file

(?:((?:\\d+\\.)+\\d+)\\/)?chroma(?:\\.min)?\\.js\\;version:\\1

after json parse

(?:((?:\d{1,250}\.){1,250}\d{1,250})\\/)?chroma(?:\.min)?\.js\\;version:\\1

You'll notice that right before )?chroma the pattern is \\/
If we replace / with \\/ then we end up with \\\\/ which gets changed back to \\/

This would require a string like /ajax/libs/chroma-js/2.4.2\/chroma.min.js (notice the \/chroma)

I removed this ReplaceAll line. If it is put back, you will notice that some of the added test cases under TestExtractVersion will fail

2 - Preserve and modify version capture strings

I preserved the two version capture strings listed above using a pattern similar to "escapedPlus". After the existing string replacements to the pattern, I restored these two patterns with slight modifications. I thought 250 integers per section in a version was excessive, so I limited to 20 (probably still excessive).

// save version capture groups
regexPattern = strings.ReplaceAll(regexPattern, verCap1, verCap1Fill)
regexPattern = strings.ReplaceAll(regexPattern, verCap2, verCap2Fill)

regexPattern = strings.ReplaceAll(regexPattern, "\\+", "__escapedPlus__")
regexPattern = strings.ReplaceAll(regexPattern, "+", "{1,250}")
regexPattern = strings.ReplaceAll(regexPattern, "*", "{0,250}")
regexPattern = strings.ReplaceAll(regexPattern, "__escapedPlus__", "\\+")

// restore version capture groups
regexPattern = strings.ReplaceAll(regexPattern, verCap1Fill, verCap1Limited)
regexPattern = strings.ReplaceAll(regexPattern, verCap2Fill, verCap2Limited)

Old: (?:((?:\d{1,250}\.){1,250}\d{1,250})\\/)?chroma(?:\.min)?\.js
New: (?:((?:\d{1,20}\.){1,20}\d{1,20})\/)?chroma(?:\.min)?\.js

These changes stopped 206 regexp.Parse() errors, and presumably added 206 cases where versions can now be captured where they could not be previously.

The remaining 20 errors are as follows, divided by general parse error:

after json parse: '^(?!.*player).*aniview\.com/', after change: '^(?!.{0,250}player).{0,250}aniview\.com\/' - error parsing regexp: invalid or unsupported Perl syntax: `(?!`
after json parse: '^(?!.*player).*aniview\.com/', after change: '^(?!.{0,250}player).{0,250}aniview\.com\/' - error parsing regexp: invalid or unsupported Perl syntax: `(?!`
after json parse: '^(?:(?!psecn).)*$', after change: '^(?:(?!psecn).){0,250}$' - error parsing regexp: invalid or unsupported Perl syntax: `(?!`
after json parse: '/sites/(?!(?:default|all)/).*/(?:files|themes|modules)/', after change: '\/sites\/(?!(?:default|all)\/).{0,250}\/(?:files|themes|modules)\/' - error parsing regexp: invalid or unsupported Perl syntax: `(?!`
after json parse: '/sites/(?!(?:default|all)/).*/(?:files|themes|modules)/', after change: '\/sites\/(?!(?:default|all)\/).{0,250}\/(?:files|themes|modules)\/' - error parsing regexp: invalid or unsupported Perl syntax: `(?!`
after json parse: '/sites/(?!(?:default|all)/).*/(?:files|themes|modules)/', after change: '\/sites\/(?!(?:default|all)\/).{0,250}\/(?:files|themes|modules)\/' - error parsing regexp: invalid or unsupported Perl syntax: `(?!`
after json parse: '/sites/(?!(?:default|all)/).*/(?:files|themes|modules)/', after change: '\/sites\/(?!(?:default|all)\/).{0,250}\/(?:files|themes|modules)\/' - error parsing regexp: invalid or unsupported Perl syntax: `(?!`
after json parse: '/sites/(?!(?:default|all)/).*/(?:files|themes|modules)/', after change: '\/sites\/(?!(?:default|all)\/).{0,250}\/(?:files|themes|modules)\/' - error parsing regexp: invalid or unsupported Perl syntax: `(?!`
after json parse: '(?!cdn-loyalty)\.yotpo\.com/', after change: '(?!cdn-loyalty)\.yotpo\.com\/' - error parsing regexp: invalid or unsupported Perl syntax: `(?!`
after json parse: 'leaflet.{0,32}\.js(?!.+shopify)', after change: 'leaflet.{0,32}\.js(?!.{1,250}shopify)' - error parsing regexp: invalid or unsupported Perl syntax: `(?!`
after json parse: '(?!angular\.io)\bangular.{0,32}\.js', after change: '(?!angular\.io)\bangular.{0,32}\.js' - error parsing regexp: invalid or unsupported Perl syntax: `(?!`
after json parse: '\.acquire\.io/(?!cobrowse)', after change: '\.acquire\.io\/(?!cobrowse)' - error parsing regexp: invalid or unsupported Perl syntax: `(?!`

after json parse: '<[^>]+/binaries/(?:[^/]+/)*content/gallery/', after change: '<[^>]{1,250}\/binaries\/(?:[^\/]{1,250}\/){0,250}content\/gallery\/' - error parsing regexp: invalid repeat count: `{0,250}`
after json parse: '([\d]+(?:\.[\d]+)*)', after change: '([\d]{1,250}(?:\.[\d]{1,250}){0,250})' - error parsing regexp: invalid repeat count: `{0,250}`
after json parse: 'api\.quixchat\.com/assets/js/quixchat\.js\?ver=(\d+\.\d+)(?:\.\d+)*', after change: 'api\.quixchat\.com\/assets\/js\/quixchat\.js\?ver=(\d{1,250}\.\d{1,250})(?:\.\d{1,250}){0,250}' - error parsing regexp: invalid repeat count: `{0,250}`
after json parse: '/([\d.]+(?:-?rc[.\d]*)*)/angular(?:\.min)?\.js', after change: '\/([\d.]{1,250}(?:-?rc[.\d]{0,250}){0,250})\/angular(?:\.min)?\.js' - error parsing regexp: invalid repeat count: `{0,250}`
after json parse: 'x-frc-client","js-(\d+(\.\d+)+)', after change: 'x-frc-client","js-(\d{1,250}(\.\d{1,250}){1,250})' - error parsing regexp: invalid repeat count: `{1,250}`


after json parse: 'pubtech-cmp-v(.+?)(?:-esm)?\.js;version:\1', after change: 'pubtech-cmp-v(.{1,250}?)(?:-esm)?\.js;version:\1' - error parsing regexp: invalid escape sequence: `\1`
after json parse: 'v([\d\.-\w]+)', after change: 'v([\d\.-\w]{1,250})' - error parsing regexp: invalid escape sequence: `\w`

after json parse: '(?<!elo\.io)/cargo\.', after change: '(?<!elo\.io)\/cargo\.' - error parsing regexp: invalid named capture: `(?<!elo\.io)\/cargo\.`

@GeorginaReeder
Copy link

Thanks for your contributions @scottdharvey , we appreciate it!

@ehsandeep ehsandeep requested a review from Ice3man543 February 14, 2025 13:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants