Use regex for ALL search: 30% faster #28
base: master
Conversation
Hi, and thank you so much for this!

Please make sure you run `npm test` and fix up the failing unit tests. We want to be sure that everything is passing before merging in.

This module has a benchmark suite, and testing master vs this PR shows that this PR is overall slower, especially when there are lots of special characters to escape. You can run `npm run bench`.
master:
```
no special characters    x 17,654,390 ops/sec ±1.35% (185 runs sampled)
single special character x  5,221,477 ops/sec ±1.84% (178 runs sampled)
many special characters  x  2,653,137 ops/sec ±0.88% (182 runs sampled)
```
PR #28:
```
no special characters    x 17,720,386 ops/sec ±1.04% (187 runs sampled)
single special character x  4,523,898 ops/sec ±1.03% (186 runs sampled)
many special characters  x    668,211 ops/sec ±0.97% (186 runs sampled)
```
If there is an edge case or two those can be fixed... but first we should sort out the performance discrepancy.

I'm the maintainer of Highlight.js, so I was working within that context. My test sample (for benching) was a 1.2mb Javascript file of Highlight.js compiled with all languages. This means there are plenty of special characters to escape. The Prism workload was 75% faster and the Highlight.js workload 30% faster. 30% is an impressive number or I never would have opened this PR. I'm finding it hard to comprehend the test results (so opposite my own)... which leads me to notice a few things and ask a few questions. I worry we're comparing apples to oranges, and first we need to agree on what the fruit is. :-)
I feel like once we have more representative samples we may have to prioritize which test cases are more important. Encoding actual real HTML or text might be way faster with this algorithm while encoding payloads with 100% special characters is much slower (since that is a worst-case scenario). And of course it makes sense to benchmark the worst-case scenario, but not to make all decisions based on it. I feel like perhaps we first need to agree on what representative test samples are.
Perhaps we need some tests of 1k samples, 10k samples, 100k samples? Repeating the same test 100 times is not at all the same as testing a string 100x longer. Perhaps some ACTUAL real-life samples of source code, HTML files, etc.? From what I've seen, it looks like those results may be VERY different from the results of the current benchmark run you posted in response. Thoughts?
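For example, size-varied cases might look something like this (a rough sketch, not actual repo code - it assumes the same benchmark.js setup the existing suite uses and a hypothetical sample file at bench/samples/page.html):

```js
// Sketch only: size-varied benchmark cases built from one hypothetical HTML sample.
var fs = require('fs');
var Benchmark = require('benchmark');
var escapeHtml = require('..');

var page = fs.readFileSync('bench/samples/page.html', 'utf8');
var samples = {
  'html 1k':   page.slice(0, 1 * 1024),
  'html 10k':  page.slice(0, 10 * 1024),
  'html 100k': page.slice(0, 100 * 1024)
};

var suite = new Benchmark.Suite();

Object.keys(samples).forEach(function (name) {
  var str = samples[name];
  suite.add(name, function () {
    escapeHtml(str);
  });
});

suite.on('cycle', function (event) {
  console.log(String(event.target));
}).run();
```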
Thank you for all those questions. One thing I wanted to mention, though, is that I think you are incorrectly assuming I actually wrote the benchmarks and the current code; both of those landed as PRs, just the same as you are making a PR to get changes landed. So I cannot really answer any of your questions about why I did x or y, as I likely did neither, and so would have no answer for you. I think perhaps you may need to rewrite the above to remove anything asking why I did something one way or another, and instead work towards asking objective questions rather than subjective ones so I can actually provide answers.
This test specifically makes me feel the string size is skewing things... because for a single character what happens is so simple:
It seems impossible to believe that "loop over the whole string checking every character one at a time" in JS is faster than the regex engine running once (compiled C++ code). And indeed with a 1.2mb sample of actual code, it's not at all; it's demonstrably slower. I'm betting the existing tests are measuring some "start up" penalty of the regex engine... (making every single one of them truly a "worst case" test, not a realistic one) so if there are only 10 characters, a tight loop wins... but if there are 100, 1000, 10000 characters... the question is where that line is. The thing that led me down this rabbit hole is: which is faster, regex or a for loop? And if it's clear - why use both? Why use the regex to find only the first match? If regex is slow, why use it at all; just walk the whole string. I have a sneaking suspicion that for these tiny strings it might be even faster if you removed the regex completely. It's the first thing I tried - but for larger payloads it's way worse - because the regex engine is much faster. Sorry for writing a book. :)
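To make the comparison concrete, here's roughly the shape of the two approaches being discussed - simplified sketches, not the exact code of either the current module or this PR:

```js
// Simplified sketches of both shapes. Not the real implementations.
var matchHtmlRegExp = /["'&<>]/;      // used once, to find the first special character
var matchAllHtmlRegExp = /["'&<>]/g;  // used repeatedly, to find every special character

var replacements = {
  '"': '&quot;', "'": '&#39;', '&': '&amp;', '<': '&lt;', '>': '&gt;'
};

// Shape of the current approach: one regex call for the short-circuit / first
// match, then step over the rest of the string one character at a time.
function escapeCharByChar(str) {
  var match = matchHtmlRegExp.exec(str);
  if (!match) return str;                      // nothing to escape

  var html = '';
  var lastIndex = 0;
  for (var i = match.index; i < str.length; i++) {
    var escape = replacements[str[i]];
    if (escape === undefined) continue;        // keep walking, one char at a time
    if (lastIndex !== i) html += str.substring(lastIndex, i);
    html += escape;
    lastIndex = i + 1;
  }
  return lastIndex !== str.length ? html + str.substring(lastIndex) : html;
}

// Shape of this PR: let the regex engine find every special character and only
// copy the (usually long) clean spans between matches.
function escapeRegexAll(str) {
  if (!matchHtmlRegExp.test(str)) return str;  // same short-circuit

  var html = '';
  var lastIndex = 0;
  var match;
  matchAllHtmlRegExp.lastIndex = 0;
  while ((match = matchAllHtmlRegExp.exec(str)) !== null) {
    html += str.substring(lastIndex, match.index) + replacements[match[0]];
    lastIndex = match.index + 1;
  }
  return html + str.substring(lastIndex);
}
```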
Hi, thank you for your response. It does not seem like you are interested in having an objective discussion on this, even bolding the part asking why I specifically did something. If you cannot have a civil, objective discussion, there is likely nothing further to discuss here. Please let me know whether you are going to update your comments to address this, so I know if there is a point in going forward.
I totally am. sighs 😢 Sometimes I really really hate the internet's complete lack of tone.
I bolded "And if it's clear - why are use both?" I specifically removed the "you" (and forgot and left the "are") when proofing, to AVOID it sounding accusatory. So it should have said "why use both?" It's just a question (a key question, hence the bold), not an accusation.
A civil objective discussion is exactly what I wanted. I think I did assume you had written the code but nothing I said was intended to blame or slander - even if you had written it all.
What exactly do you want me to change, and I will? Going back to reread it now. (edited to replace bold with italics)
Do you just want me to remove all the "you" and "your" and replace them with "this code"? Done. Again, no slight was implied or intended. To me "your library", "my patch", "your code", "my code", etc. is just the most natural way to refer to what I'm talking about. I do understand how it implies ownership, but it was not intended to be offensive and I truly apologize.
Meanwhile over in the SSR thread on this topic:
I'm not sure how much complexity you want to allow in the library, but it could be that we could have our cake AND eat it too by using one algorithm for short snippets and a regex-powered solution for longer ones.
Ok, so now I'm using the included benchmark suite (with more tests) and I switched to a likely more representative sample (a large page of HTML from GitHub vs Javascript code). I'd think HTML would be a realistic real-world scenario for an HTML escape, yes? Let's walk through a few comparisons. What are we comparing:
I don't always include the latter as it's really only interesting vs the original implementation in one case. The results:
First 650kb of HTML from a Github discussion thread... Using regex is 13% faster.
That same webpage but truncated to 30kb... and now regex is 11% faster.
I'm not sure why this is reversed from your original results (perhaps because I switched to ...).
The full-on regex is now winning handily at finding a single needle in a haystack: 3.7x faster. This makes sense since we're avoiding iterating through the last half of the string one char at a time (the needle was inserted in the middle). If it's placed at the beginning, the gap gets much bigger (the for loop has to walk the entire string), and if it's at the very end then we'll start to approach the speed of "no special character", since the for loop will iterate so few times. Now let's look at "many special characters", which is indeed where regex falls on its face... even the original is slow here compared to not using regex at all:
As I guessed, almost 25% faster to not use regex at all... and using regex exclusively now costs us, so we run at only 35% of the speed of iterating a single character at a time.
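For reference, the "no regex at all" contender is just the same walk with the short-circuit removed - roughly this shape (a sketch using the replacements table from my earlier snippet, not necessarily the exact escapeHtmlNoRegex in the PR):

```js
// Sketch only: walk every character, no regex short-circuit at all.
function escapeHtmlNoRegex(str) {
  var html = '';
  var lastIndex = 0;
  for (var i = 0; i < str.length; i++) {
    var escape = replacements[str[i]];   // same lookup table as the earlier sketch
    if (escape === undefined) continue;
    if (lastIndex !== i) html += str.substring(lastIndex, i);
    html += escape;
    lastIndex = i + 1;
  }
  if (lastIndex === 0) return str;       // nothing needed escaping
  return html + str.substring(lastIndex);
}
```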
So of course this bears out with larger samples as well. Note: I also purposely excluded "no special characters" from discussion here since both implementations use regex and should be effectively the same speed (as the original benchmark you ran pretty much showed). So I think it boils down to what type of input data is the most common or realistic use case...
Thoughts? Also: all tests are green now (on my machine). I pushed EVERYTHING so you'd be able to run the same tests. If we agree on a path forward I can of course clean up the PR to remove all the extra code, etc. Though if the HTML test is reasonable we need a location to keep it, etc.
I'm going to weigh in because I use this lib. I think the current balance is spot on: a faster-than-iteration short-circuit for fragments that don't need escaping, and faster-than-multi-match-Regexp iteration for fragments containing a realistic percentage of real-world entities in a typical document (both prose and HTML). Please don't change it!
@@ -0,0 +1,11747 @@
I don't think we can include this content here according to the posted policy at https://docs.github.com/en/free-pro-team@latest/github/site-policy/github-terms-of-service#f-copyright-infringement-and-dmca-policy
> You may not duplicate, copy, or reuse any portion of the HTML/CSS, Javascript, or visual design elements or concepts without express written permission from GitHub.
Good point. :-) Didn't even think of this. Can you think of another site that perhaps wouldn't care if we borrowed an HTML sample without just making one up? I'm not sure GitHub would come after us for this, but if you want to err on the side of caution I also understand that. :)
If we were to add a "source code" sample test, would any large OSS project with a permissive license work? (e.g. Vue, React, etc.)
I'm a Svelte maintainer and would be happy if you want to use svelte.dev or kit.svelte.dev. Our sites are MIT licensed.
```diff
@@ -20,7 +20,7 @@ var matchHtmlRegExp = /["'&<>]/
  * @public
  */

-module.exports = escapeHtml
+module.exports = { escapeHtml, escapeHtmlFast, escapeHtmlNoRegex }
```
Please make sure your syntax works on the Node.js versions declared in our package.json; generally ES5. If there are now multiple exports, they of course should all be documented in the README.

I also see there is no longer any default export, definitely making this a major version bump, at least.
That's why I said:
> If we agree on a path forward I can of course clean up the PR to remove the extra code, etc.
I pushed what I had so that anyone could easily test it exactly as I had. (see both results side by side, tweak both, etc) Unless we decide to change the API these changes don't need to happen. I'm just first trying to get agreement on what we're doing here.
> generally ES5.
Ah, I'm spoiled by our ES6 codebase, but I can fix. (or again, simply revert)
> I also see there is no longer any default export, definitely making this a major version bump, at least.
I'd agree if we are going to have multiple functions.
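For what it's worth, one common ES5-friendly shape that adds named variants without dropping the default export (purely illustrative, not something currently in the PR) would be:

```js
// Illustrative only: keep the function itself as the default export and hang
// any extra variants off it, so existing require('escape-html') callers keep working.
module.exports = escapeHtml;
module.exports.escapeHtml = escapeHtml;
module.exports.escapeHtmlFast = escapeHtmlFast;
module.exports.escapeHtmlNoRegex = escapeHtmlNoRegex;
```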
Thank you so much for weighing in @makeable! I also use this module all over the place in various other npm modules, all of which pass very short strings through it, most of which contain nothing to escape or only a few characters to escape. Certainly keeping that use-case fast would be a priority for me as well, but I think if we can improve the performance of very large inputs, we should do that too. Looking over the pull request, though, I don't think we should do that at the expense of an API requiring the user to choose which method to use. Just like how the V8 engine uses two different sorting algorithms for `Array.prototype.sort` depending on the input, any switching should happen internally rather than being exposed in the API.
The short-circuit isn't going anywhere - both implementations short-circuit. Could you weigh in on what you mean by "prose" exactly? As I've shown, escaping actual HTML (I think a complex GitHub page is a pretty good reference sample) is 10-13% faster. Perhaps a huge section of Brainfuck could be slower with this new way... but:
It's possible there is a use case that contains a HUGE % of escapable characters, but no one has mentioned what it might be yet. :-) We don't disagree. :-) The algorithm (unless it can detect and run two paths) should be optimized for "a realistic percentage of real-world entities". The question right now is what exactly "a realistic percentage of real-world entities" is for the "common" 80/20 case... If the primary goal is escaping actual HTML, it looks like right now the new algorithm has an edge.
Just to be clear, the latest benchmark does not seem to show a penalty for small inputs against the original method. Its speed seems consistent, short or long (making my original thinking wrong). So the "mostly text, short input" case should still be equally fast. There is definitely a huge win for regex with larger inputs though. As the string gets larger, the regex method gets faster and faster (since regex is better at covering vast amounts of ground) - again assuming an "HTML" level of characters needing to be encoded (or less). The only downside I'm seeing now is with what seems to me slightly "artificial" input containing a huge number of characters needing to be escaped: "<<<<<<<<<<<<<<<<<<<". Unless someone pipes up with real-life examples showing this is a very common use case, I'd suggest it's the edge case. :-)
I was never recommending this, although I'm also not 100% opposed. I'd name them differently though, separating ...
This is what I considered at first - when I thought the issue was length. If the performance were skewed by size then we could just figure out where the "boundary" was and add a size cutoff. The best we can do is a regex short-circuit ensuring content with 0 characters to be encoded is as fast as can be - which both algorithms already do. :)
Oh please no. :) I was never suggesting that "loop over EVERY character without regex" be a serious contender - only trying to make the point that a lot of different algorithms have different performance - all depending on the characteristics of the input.
If we can reach an agreement (that escaping actual HTML, code, or textual-ish content is the primary use case) then this PR can be simplified to a tiny, tiny PR that just swaps out the core function with the faster one (i.e. like it was originally - but with all tests and the linter green). Again, the current "state" of the repo is just to allow anyone who wants to reproduce my results to do so easily (or expand the test cases). This really boils down to: does this library aim to solve the 80/20 or 90/10 case super-well, or is the ambition larger - to provide a hammer, a screwdriver, and a pair of pliers all at the same time - to cover every conceivable scenario? ;-)
I guess I should mention a few of the use cases I'm thinking of (or have personally):
All these cases would use something like ...
By prose, I mean "non-HTML content". <-- this very string, for example! I'm confused by your use cases, to be honest. Why do you want to encode all entities in a 1.2MB HTML document? Then it is no longer HTML. If it is for display in a code block, for example, then you aren't going to be using HUGE documents. Even so, HTML contains many entities to be encoded, so the multi-match mechanism is inefficient compared to simple iteration. If it is for safely outputting user-generated content, then it is generally working on textual content, which will be sparsely populated with entities (if at all), and generally only a few paragraphs. In this case the short-circuit works well, as much prose doesn't contain entities, and for those that do, it will be sparsely populated, and generally in pairs (quotes for example), meaning the multi-match mechanism is less efficient. From the benchmarks given, I am not seeing any benefit over what already exists. Maybe it is worth putting together a jsperf or something, to demonstrate for a number of use cases, rather than trying to submit this as a PR?
That's the best case (nothing to escape) and it is equally fast in both implementations. Although it can quickly slow down with the current implementation as soon as the content starts including any special characters at all. The new implementation performs best the sparser the content, since that's when regex really shines.
It was just the first test case I had laying around. :-) Don't get too focused on that - as I show above the performance gains seem to hold for small snippets. The new implementation just gets faster and faster the larger the content. Stepping over a string one character at a time is very inefficient (unless you need to replace almost all the characters).
See sites like GitHub - sometimes you do indeed have a LARGE file with syntax coloring... But let's forget 1.2mb... The benchmarks clearly show HUGE (3.7x) speedups for only 13kb of data in optimal cases.
Right now `RegExp.test` seems 10-12% faster for HTML. See the benchmarks which I cited above and provided (so that you can easily reproduce, should you so choose). I'll inline it again for convenience:
HTML is processed 13% faster with the implementation proposed in this PR. If you have an opposing benchmark or some issue with the benchmark itself (as provided in the PR) please be specific.
Based on what are you saying it's less efficient? This statement simply does not agree with the benchmark data. The more sparsely populated the content, the faster regex gets and the slower character-by-character gets. For "single character", the new algorithm currently has a slight edge for small snippets, and the larger the snippets get, the bigger the performance gains. Again: 3.7x faster for a small, optimal 13kb snippet.
Are we looking at the same benchmarks? Again to summarize what the benchmarks above show:
The attached benchmarks show this clearly - I shared the numbers. If you believe I'm reading them wrong it would be helpful for you to mention a specific case and the issue you're having with the benchmark itself.
That's what the benchmarks included in this PR are meant to show.

Summary: again, from what I see right now the existing algorithm seems to only be superior (vs the new one) when given content where almost every letter needs to be replaced. In all other cases (as shown by the benchmarks), the new algorithm seems equivalent or faster - either just a little faster, or in some cases a LOT faster. If I'm missing something I'm happy to have someone point to it specifically or provide an opposing benchmark.
Ok, that's entirely wrong - my bad. So I should clarify. Turns out the "density" of special characters in HTML is just very low. The large HTML sample from GitHub is 6% special characters (of the entire file), and the new algorithm seems to outperform the current one when the density is less than 10% (for small samples).
The larger the data size, the better... so here, with 10k of data, we're still 40% faster even when the density has risen to 14%.
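By "density" I just mean the share of characters that would be escaped - roughly this, where `sample` stands in for whatever text is being measured:

```js
// Illustrative only: density = escapable characters / total characters.
function specialCharDensity(sample) {
  var specials = sample.match(/["'&<>]/g);
  return sample.length ? (specials ? specials.length : 0) / sample.length : 0;
}
```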
Actually... I am going to have to concede. I ran your multi-match against 300k user-generated statements (generally tweet-sized). It is 30% faster for my use case in both Node and the browser. I would be happy to see this change, but I do not like the idea of choosing between 3 different strategies. If anything, this should simply replace the existing one. My suggestion would be to create a PR with just changes to the benchmark, to include real-world scenarios (tweets, chat statements, small code snippets, textual blog content, etc.). Randomly generating data (and the current tiny-strings approach) is not a fair assessment - hence my initial apprehension. Then the maintainer can make an objective assessment of the benefits. But... I do like it - sorry for being a blocker 👍
Actually, you could maybe have your cake and eat it too (for larger snippets)... though I'm not sure I have time to pursue that angle. But the idea: scan a window at the start of the string, estimate how dense the special characters are, and pick the algorithm based on that.
This would mean you'd never get the performance improvements for small snippets (less than whatever the analysis window is) though, only larger ones.
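A rough sketch of that idea, reusing the two sketch functions from my earlier comment (the window size and density cutoff here are made-up placeholders, not tuned values):

```js
// Sketch of the "analyze a window first" idea.
var ANALYSIS_WINDOW = 1024;   // placeholder
var DENSITY_CUTOFF = 0.10;    // placeholder

function escapeHtmlAdaptive(str) {
  if (str.length <= ANALYSIS_WINDOW) {
    // Below the window, keep the existing behaviour.
    return escapeCharByChar(str);
  }
  var prefix = str.slice(0, ANALYSIS_WINDOW);
  var specials = prefix.match(/["'&<>]/g);
  var density = (specials ? specials.length : 0) / prefix.length;

  // Dense input: character stepping wins. Sparse input: let the regex engine
  // skip the long clean spans.
  return density > DENSITY_CUTOFF ? escapeCharByChar(str) : escapeRegexAll(str);
}
```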
Awesome! HTML seems to be a real sweet spot for the new code.
Yes, that's the hope. The PR is only in its current state to make benching the multiple algorithms easier. We should only have a single happy path (unless someone wants to pursue the idea I just posted). If the maintainer agrees this is the right direction I can update the PR to just change the existing function and little else.
I've already spent way more time than I intended on this. If this needs a LOT of additional validation/research it may have to wait for someone else to pick it up on a different day. I do agree though, a folder of real-world benchmark samples (HTML pages, code, chat-style text, etc.) would be a good idea.
To confirm - my use case isn't escaping HTML. It is escaping text (which occasionally contains quotes, etc.) for display in HTML. I would expect this to be a much more prevalent use case than actually escaping HTML code. For example, in a web view for a chat application, you could expect in excess of a hundred instances of escaping per second (one per chat statement) while rapidly scrolling through a chat history (escaping those 100 instances accounts for 0.1ms on low-end devices with the current solution). It seems any use case for HTML escaping would generally be limited to displaying code, which I would assume occurs a lot less frequently. (Your speedup mentioned above for the 650KB document accounts for less than 0.25ms, for example.) Anyway - as mentioned, for my text-based requirements escaping via the regex multi-match works well - so I'm happy for a 30% performance boost - although we are talking a 0.3μs difference, and it occurs on the client, so I don't care THAT much. But I think you need to realise what the primary use case for this lib is - and I don't think it's actually escaping HTML (despite the name). Also, please keep it ES5 - it's more portable that way.
No intention to change ES5 support. |
Here's an almost standalone benchmark. You can run this from within the repo. I added in a naive adaptive version, which switches based on the length of the string. The adaptive approach "underperforms" when there is a string which is over the length threshold and contains a high density of special chars. The default approach "underperforms" when there is a long string with a very low density of special chars. Overall the default one is quite fast. I don't think any of the benchmarks so far are conclusive as to an obviously better algorithm. Leaving this comment in case anyone else stumbles into it and wants to kick the tires or create better benchmarks.
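For anyone skimming, the general shape of such a length-based switch is just this (not the actual code from that benchmark; the cutoff is a placeholder, and the two helpers are the sketch functions from earlier in the thread):

```js
// Shape of a naive length-based switch (illustrative only).
var LENGTH_CUTOFF = 64; // placeholder, would need tuning

function escapeHtmlLengthAdaptive(str) {
  return str.length < LENGTH_CUTOFF
    ? escapeCharByChar(str)  // short strings: plain character stepping
    : escapeRegexAll(str);   // long strings: regex-driven scan
}
```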
If regex is so fast that it's worth using at the beginning of the function, why not use it for the stepping?