-
-
Notifications
You must be signed in to change notification settings - Fork 230
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimise strings.ToLower() #1170
base: main
Are you sure you want to change the base?
Conversation
Benchmark:
|
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1170 +/- ##
==========================================
+ Coverage 82.72% 83.50% +0.78%
==========================================
Files 162 166 +4
Lines 9080 9632 +552
==========================================
+ Hits 7511 8043 +532
- Misses 1319 1338 +19
- Partials 250 251 +1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
// asciiToLowerInPlace converts ASCII characters in the string to lowercase | ||
// starting from the specified index. | ||
func asciiToLowerInPlace(s string, start int) string { | ||
res := []byte(s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if you have access to medium but this can be done more efficiently https://josestg.medium.com/140x-faster-string-to-byte-and-byte-to-string-conversions-with-zero-allocation-in-go-200b4d7105fc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"returned byte slice MUST NOT be modified since it shares the same backing array with the given string."
Not sure, if this is the best use here, i tried to copy the byte slice before modifying it, and it is causing unexpected behaviour.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not per se against this change (performance is a key concept), but I would like to get a better understanding of the tradeoff we have in order to gain this extra performance.
- Being just meant for ASCII chars, what are the transformations that we would not perform anymore? (First of all, I'm thinking at chars like
ÀÈÉÖ
). - Should we restrict the usage of the new
stringsutil.AsciiToLower
for comparisons where we expect to deal with ASCII and keep the standard one to deal with generic inputs? (but I get that it would not really be that beneficial in terms of performance gains) - Otherwise, we should consider it a breaking change, shouldn't we? Running the official ToLower tests with the new function might give more clarity about the behavior's change: https://github.com/golang/go/blob/master/src/strings/strings_test.go#L640-L650.
Thanks!
@M4tteoP I think the main assumption here is that variable keys and collection names will always be ascii. |
Hey @M4tteoP , coraza/internal/strings/strings_test.go Line 107 in dd8793d
As @jcchavezs mentioned, the main assumption here is that variable keys and collection names will always be ascii, and this should not cause coraza to miss any evasions as well. |
Hey @jcchavezs @M4tteoP , Thanks! |
@soujanyanmbri sorry for the back and forth. after a chat with anuraaga and reading golang/go#17859 I was wondering if this change is truly needed. On one hand I believe we should only apply to those places where we truly believe this is going to improve performance, on the other hand I think benchmarks must be done with real data e.g. real strings we are optimizing for rather than other kind of input to see real performance difference as the assumption was:
and not everywhere. How does that sound? |
My main concern here is if we can be sure about the assumption. Assuming only ASCII, we skip lowercase conversions for some UTF-8 characters that have lowercase representations. A trivial example that would lead to a change compared to current detection is the following:
|
Hey @M4tteoP and @jcchavezs , made a minor fix, Added a check where we rollback to strings.toLower() in case we encounter any unicode character. What do you guys think ?
|
Make sure that you've checked the boxes below before you submit PR:
Thanks for your contribution ❤️