Improve Regexp Matching & Version Capture #123
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
BLUF: These changes stopped 206 regexp.Parse() errors, and presumably added 206 cases where versions can now be captured where they could not be previously.
I noticed that under the
compileFingerprint()
function, errors fromParsePattern
are silenced, and we move on to the next pattern. While this is reasonable for loading signatures that change regularly, I wanted to inspect these errors and see what could be improved.I printed these errors and found 226 patterns that were failing to compile. Further investigation revealed that the majority of these failures were due to two specific regex patterns for capturing versions.
Version Capture Pattern 1
((?:\\d+\\.)+\\d+)
102 occurrences
Version Capture Pattern 2
(\\d+(?:\\.\\d+)+))
104 occurrences
ParsePattern()
applies some cleanup to the expressions to prepare them forregexp.Compile()
.Example
in fingerprints_data.json
embed-any-document(?:\\/js)?(?:\\/embed-public)?(?:\\/pdfobject)?(?:\\.min)?\\.js(?:\\?v(?:er)?=((?:\\d+\\.)+\\d+))?\\;version:\\1
after json parse
embed-any-document(?:\/js)?(?:\/embed-public)?(?:\/pdfobject)?(?:\.min)?\.js(?:\?v(?:er)?=((?:\d+\.)+\d+))?
after ParsePattern() changes
embed-any-document(?:\\/js)?(?:\\/embed-public)?(?:\\/pdfobject)?(?:\.min)?\.js(?:\?v(?:er)?=((?:\d{1,250}\.){1,250}\d{1,250}))?
A Version Capture pattern gets changed in this process:
((?:\\d+\\.)+\\d+)
((?:\d{1,250}\.){1,250}\d{1,250}))
error
error parsing regexp: invalid repeat count: '{1,250}'
Changes
1 - Remove erroneous pattern change
This change results in what I believe was in unintended change to patterns. For example:
in file
(?:((?:\\d+\\.)+\\d+)\\/)?chroma(?:\\.min)?\\.js\\;version:\\1
after json parse
(?:((?:\d{1,250}\.){1,250}\d{1,250})\\/)?chroma(?:\.min)?\.js\\;version:\\1
You'll notice that right before
)?chroma
the pattern is\\/
If we replace
/
with\\/
then we end up with\\\\/
which gets changed back to\\/
This would require a string like
/ajax/libs/chroma-js/2.4.2\/chroma.min.js
(notice the\/chroma
)I removed this
ReplaceAll
line. If it is put back, you will notice that some of the added test cases underTestExtractVersion
will fail2 - Preserve and modify version capture strings
I preserved the two version capture strings listed above using a pattern similar to "escapedPlus". After the existing string replacements to the pattern, I restored these two patterns with slight modifications. I thought 250 integers per section in a version was excessive, so I limited to 20 (probably still excessive).
Old:
(?:((?:\d{1,250}\.){1,250}\d{1,250})\\/)?chroma(?:\.min)?\.js
New:
(?:((?:\d{1,20}\.){1,20}\d{1,20})\/)?chroma(?:\.min)?\.js
These changes stopped 206
regexp.Parse()
errors, and presumably added 206 cases where versions can now be captured where they could not be previously.The remaining 20 errors are as follows, divided by general parse error: