[Request] Faster HTML escaping? #2885
And complexity. And potential for the performance to be pulled out from under you in the future. :-) I have some strong opinions on things like this (big picture). I tend to be a skeptic. :-)
I worry the win wouldn't be nearly as profound for us (in terms of total time). We spend a lot more time parsing (WAY more for auto-detect) than we do escaping HTML... and actually there are other huge wins we could consider IF we cared more about performance - before something like this that increases the bundle size, adds complexity to what should be a trivial function, and can mortgage your future to save your present.

But for a second let's pretend optimizing the escaping was highly relevant... If the metric was only "time spent escaping", I've just made a very common use case 189x faster, seriously! (Actually even faster, since I "cheated" and also removed HTML generation.) And no need for gnarly replaceHTML code. :-) Of course this would scale linearly with the number of languages one is auto-detecting... the more languages, the bigger the win. That's actually very much something we should consider. :-)
At first glance I dislike everything about this kind of patch. :-) Replacing items in a string should be a one-liner: either chained `replace` calls with plain strings, or a single regex `replace` with a callback.

The first might be faster (strings vs regex)... the latter might be, but I wouldn't bet on it, because now the speed of JS function callbacks (and the table lookup you need) might hurt you. The only thing I could imagine that should/could possibly be faster is a single-pass regex `replace` over a character class. So anytime I see something like this I ask myself: what is so broken with the built-in `replace`? The why matters. Let's say `replace` is slow... so we use this crazy convoluted code. We get some win now, but at the cost of tons of complexity and size. Then someone finally fixes `replace` in the engines and we're left carrying all that complexity for nothing.

I see this kind of talk a lot with the new ES6/2018 stuff. "Iterators are so slow, don't use them." I believe they will get faster; the compilation engines will improve. Engines have had a LONG time to optimize the hell out of ES5 code... so if an iterator is a much simpler solution, I'd rather use that and trust the engines to catch up later. So I tend to worry when no one can explain WHY something is faster, when it seems very intuitive that the library functions should be fastest... and I also fear there could be differences here across engines - that should be tested. Is it possible the result flips entirely on a different engine?

On a smaller note, things like this set off my spidey sense really fast:

function escapeHTML(...)
const match = (attr ? ATTR_REGEX : CONTENT_REGEX).exec(html);
if (!match) return html;

Anytime I see people guard against CPU burn by spending CPU burn... so many times this is the wrong call - or at least very input dependent - and a lot of people never stop to ask those questions. If 99% of highlighted code includes one of those symbols that needs to be escaped, then this becomes an entirely wasted check - we might as well just do the replace operation. On the other hand, if most code does not, then it's likely a win. But OTTOMH I'd say a lot of code snippets (esp. medium or large ones) are likely to contain a character needing escaping... given random input, though, this is a complete coin flip - and if that type of optimization were a good idea (as a general 50/50 bet), then I'd expect the engines to already do it inside `replace`.

Ok, point 2 probably covered a bit much. :-) It'd be fun to focus more on performance with Highlight.js, but to do it "right" I think that would require some dedicated CI servers (at the very least CPU-locked VPSs) and a very specific suite of performance tests - to make sure we're actually moving forward, avoiding regressions, etc... short of that I just kind of keep an eye on the time the test suite takes to run, and as long as it's "sorta the same" I'm happy.
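To make the "guard" discussion concrete, here is a minimal sketch of the two approaches being weighed - an escaper that pre-scans with a regex and bails out early, vs one that replaces unconditionally. The names and the exact character set here are illustrative, not Highlight.js's actual code:

```javascript
// Illustrative only - not the actual library escaper.
const HAS_SPECIALS = /[&<>]/;

// "guarded": pay for an extra scan up front, skip the replaces if clean
function escapeGuarded(html) {
  if (!HAS_SPECIALS.test(html)) return html;
  return html.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

// "direct": no guard, just do the replaces every time
function escapeDirect(html) {
  return html.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}
```

Whether the guard pays off depends entirely on how often inputs actually contain `&`, `<`, or `>` - which, for code snippets, is most of the time.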
Now it's possible there are answers to WHY. It's possible the ECMAScript spec requires very specific behavior from `replace` (though I don't know why it would preclude any beneficial performance optimizations) that "ties its hands", so to speak. If that were determined, then this kind of thing would be slightly more interesting... but still not as much as you might think... it seems like the kind of thing better suited to a tiny (well-maintained) JS library. I.e., replaceHTML is really more "utility" than our bread and butter. I'm sure such libraries already exist, I just don't know if any are small and performance-focused. :) But honestly, speed has never been a super high priority vs reliability and consistency. In many common use cases we're way, way past "fast enough", so micro-tuning at the cost of code complexity is a poor tradeoff.
I just tried removing all the wasted extra HTML generation and HTML encoding from our auto-detection. Our auto-detect test suite runs over 36,000 combinations of highlighting. So we're looking for that 189x-faster encoding and HTML generation... and, drum roll: no measurable difference.

So even if that were a big win, it's such a small part of our total time spent as to be irrelevant in the big picture. Oh, but it's worse. :-) Our auto-detect tests IGNORE the output (they don't care about HTML at all)... they are only checking "I expected this to be JS, was it detected as JS?"... so actually I just eliminated 100% of the HTML generation and encoding for the auto-detect test suite with 0 measurable impact. :-/ Your library must be insanely fast at parsing if you can find 13% wins in the HTML encoding alone.
Seems like faster HTML escaping doesn't help Highlight.js after all. Just wanted to let you know in case it did. As you correctly said, maintainability is a big factor as well, but sometimes the tradeoff is worth it ;)
Oh yes, they do. I'll close the issue as it doesn't seem like this will benefit Highlight.js.
Did you have any comments/thoughts/insight on WHY it's faster? It seems very strange to me that hand-rolled code would beat the built-in `replace`.
Functions that only use basic control structures (plain loops, conditionals, and index-based string access) can be optimized very well by JS engines, which may be part of why the hand-written version is fast.
Ok, we have 5 calls - so we are iterating over the string 5 times. I get that. It's not 100% apparent to me that doing it once would be faster (C++ code vs JS code) - which is why I didn't change it (I've thought about it). But even if so, `replace` can already do that. Off the top of my head:

code.replace(/[&<>"']/g, (m) => RAW_to_ESCAPED[m])

Actually (I just looked again), that's exactly what they were doing initially - a single pass, but with switch/case instead of an object lookup. I'm not sure your benchmark was comparing apples to apples... If you have the time (or inclination) I'd be very curious if you could benchmark just:

// chained replaces
.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/\u00a0/g, ' ')

vs

// use a lookup table and callback
.replace(/[&<\u00a0]/g, (m) => RAW_to_ESCAPED[m])

vs the crazy convoluted version.

My strong suspicion is the middle version wins. #benchmarkingishard My worry is that perhaps you were really only benchmarking "do it 3 times" vs "do it once".
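For anyone who wants to try this at home, here is a rough, self-contained micro-benchmark sketch of the first two variants (the "crazy" character-loop version is omitted). `RAW_TO_ESCAPED`, `bench`, and the sample input are made up for illustration; absolute numbers will vary wildly per engine:

```javascript
// Rough micro-benchmark sketch (Node.js). Names and sample input are
// illustrative only; real results vary per engine and per input.
const RAW_TO_ESCAPED = { '&': '&amp;', '<': '&lt;', '\u00a0': ' ' };

// variant 1: three chained replaces (three passes over the string)
function chained(code) {
  return code.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/\u00a0/g, ' ');
}

// variant 2: one pass with a character class and a lookup-table callback
function singlePass(code) {
  return code.replace(/[&<\u00a0]/g, (m) => RAW_TO_ESCAPED[m]);
}

function bench(name, fn, input, iterations = 20000) {
  const start = process.hrtime.bigint();
  for (let i = 0; i < iterations; i++) fn(input);
  const ms = Number(process.hrtime.bigint() - start) / 1e6;
  console.log(`${name}: ${ms.toFixed(1)}ms`);
}

const sample = 'if (a < b && c < d) {\u00a0return a;\u00a0}'.repeat(20);
bench('chained', chained, sample);
bench('single-pass', singlePass, sample);
```

Both variants produce identical output, so any timing difference is purely about how the work is done.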
Well damn. Now off I go down the rabbit hole of WHY.
@RunDevelopment Well, thanks for that little diversion, it was fun. :-) When you look closely, you see they use the regex as a sort of "guard" to find the first replacement position, but then they loop over the string one character at a time. Which immediately led me to all sorts of good questions.
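For reference, the pattern described above - a regex "guard" to find the first character needing escaping, followed by a manual character-by-character loop - looks roughly like this. This is an illustrative reconstruction with made-up names, not the actual library code:

```javascript
// Illustrative reconstruction of the regex-guard + character-loop escaper.
const ESCAPE_LOOKUP = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#x27;' };
const FIRST_SPECIAL = /[&<>"']/;

function escapeHtml(html) {
  // the regex "guard": find the first char needing escaping, or bail out
  const match = FIRST_SPECIAL.exec(html);
  if (!match) return html;

  let result = '';
  let lastIndex = 0;
  // manual per-character loop starting from the first match
  for (let i = match.index; i < html.length; i++) {
    const escaped = ESCAPE_LOOKUP[html[i]];
    if (escaped !== undefined) {
      // append the clean run since the last escape, then the entity
      result += html.slice(lastIndex, i) + escaped;
      lastIndex = i + 1;
    }
  }
  return result + html.slice(lastIndex);
}
```

Note the loop never builds the string one character at a time; it copies clean runs in slices, which is likely part of why this can compete with `replace` at all.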
So I'm more confused than ever about my original question... Why is a plain JS character loop beating the engine's native `replace`?
Right now, HTML escaping is implemented like this, but there are faster methods out there. This can speed up highlighting quite a bit. E.g. Prism is about 13% faster in total with the faster escaping, but Highlight.js should run its own performance tests.
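For context, the chained-`replace` style of escaping being discussed looks something like this (a sketch of the common pattern, not necessarily Highlight.js's exact source):

```javascript
// Sketch of the chained-replace escaping style under discussion.
// One full pass over the string per special character.
function escapeHTML(value) {
  return value
    .replace(/&/g, '&amp;')   // must run first, or it would re-escape entities
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#x27;');
}

console.log(escapeHTML('<a href="x">&</a>'));
// → &lt;a href=&quot;x&quot;&gt;&amp;&lt;/a&gt;
```

It is simple and obviously correct; the "faster methods" trade that simplicity for fewer passes over the string.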
I also want to point out that this is very much a tradeoff: execution speed vs bundle size.