Optimise strings.ToLower() #1170

soujanyanmbri · 2024-10-09T13:45:45Z

Thank you for contributing to Coraza WAF, your effort is greatly appreciated
Before submitting check if what you want to add to coraza list meets quality standards before sending pull request. Thanks!

Make sure that you've checked the boxes below before you submit PR:

My code includes positive and negative tests.
I have an appropriate description with correct grammar.
I have read Contribution guidelines, maintainers note and Quality standard.
My code is properly linted and passes pre-commit tests.

Thanks for your contribution ❤️

soujanyanmbri · 2024-10-09T13:49:29Z

Benchmark:

BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/ASCII_Fully_Lowercase_Sentence-10         	40185970	        29.50 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/ASCII_Fully_Uppercase_Sentence-10         	17557453	        66.52 ns/op	      64 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/ASCII_Mixed_Case_Sentence-10              	18643068	        64.06 ns/op	      64 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/ASCII_Non-Alphabetic_Characters-10        	46125682	        26.00 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Unicode_Greek_Alphabet-10                 	29133343	        41.43 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Unicode_Cyrillic_Alphabet-10              	26644413	        44.82 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Unicode_Mixed_Greek_and_ASCII-10          	18659374	        66.00 ns/op	      64 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Unicode_Emoji-10                          	19809511	        59.83 ns/op	      64 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Empty_String-10                           	570688939	         2.225 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Only_Punctuation-10                       	196423014	         6.113 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Only_Whitespace-10                        	233590377	         5.154 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Long_Mixed_Case_String-10                 	 6503064	       183.7 ns/op	     224 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Special_Turkish_Case-10                   	16458063	        72.57 ns/op	      80 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/ASCII_Fully_Lowercase_Sentence-10             	15182882	        80.65 ns/op	     128 B/op	       2 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/ASCII_Fully_Uppercase_Sentence-10             	14452549	        82.19 ns/op	     128 B/op	       2 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/ASCII_Mixed_Case_Sentence-10                  	14902519	        81.55 ns/op	     128 B/op	       2 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/ASCII_Non-Alphabetic_Characters-10            	16164508	        75.16 ns/op	     128 B/op	       2 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/Unicode_Greek_Alphabet-10                     	12417586	       103.0 ns/op	     192 B/op	       2 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/Unicode_Cyrillic_Alphabet-10                  	10973074	       103.6 ns/op	     192 B/op	       2 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/Unicode_Mixed_Greek_and_ASCII-10              	15417500	        78.68 ns/op	     128 B/op	       2 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/Unicode_Emoji-10                              	15793966	        77.65 ns/op	     128 B/op	       2 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/Empty_String-10                               	232960081	         5.279 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/Only_Punctuation-10                           	99651907	        12.02 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/Only_Whitespace-10                            	100000000	        11.05 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/Long_Mixed_Case_String-10                     	 6063015	       195.0 ns/op	     448 B/op	       2 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/Special_Turkish_Case-10                       	13682554	        99.41 ns/op	     160 B/op	       2 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/ASCII_Fully_Lowercase_Sentence-10            	20465416	        62.36 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/ASCII_Fully_Uppercase_Sentence-10            	 7061458	       172.2 ns/op	      64 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/ASCII_Mixed_Case_Sentence-10                 	 9851926	       129.9 ns/op	      64 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/ASCII_Non-Alphabetic_Characters-10           	24569835	        50.57 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Unicode_Greek_Alphabet-10                    	 1287247	       929.4 ns/op	      96 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Unicode_Cyrillic_Alphabet-10                 	 1000000	      1035 ns/op	      96 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Unicode_Mixed_Greek_and_ASCII-10             	 2851934	       421.8 ns/op	      64 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Unicode_Emoji-10                             	 4843082	       246.4 ns/op	      64 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Empty_String-10                              	568174978	         2.111 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Only_Punctuation-10                          	84454507	        14.13 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Only_Whitespace-10                           	100000000	        11.25 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Long_Mixed_Case_String-10                    	 3768148	       318.0 ns/op	     224 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Special_Turkish_Case-10                      	 3724746	       318.9 ns/op	      80 B/op	       1 allocs/op

codecov · 2024-10-09T13:49:48Z

Codecov Report

Attention: Patch coverage is 97.26027% with 2 lines in your changes missing coverage. Please review.

Project coverage is 83.50%. Comparing base (4ff1f76) to head (9b2743b).
Report is 153 commits behind head on main.

Files with missing lines	Patch %	Lines
internal/seclang/rule_parser.go	71.42%	0 Missing and 2 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1170      +/-   ##
==========================================
+ Coverage   82.72%   83.50%   +0.78%     
==========================================
  Files         162      166       +4     
  Lines        9080     9632     +552     
==========================================
+ Hits         7511     8043     +532     
- Misses       1319     1338      +19     
- Partials      250      251       +1

Flag	Coverage Δ
default	`83.50% <97.26%> (+5.66%)`	⬆️
examples	`83.50% <97.26%> (+57.07%)`	⬆️
ftw	`83.50% <97.26%> (+36.13%)`	⬆️
ftw-multiphase	`83.50% <97.26%> (+33.95%)`	⬆️
tinygo	`83.50% <97.26%> (+8.09%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

jcchavezs · 2024-10-09T18:57:50Z

internal/strings/strings.go

+// asciiToLowerInPlace converts ASCII characters in the string to lowercase
+// starting from the specified index.
+func asciiToLowerInPlace(s string, start int) string {
+	res := []byte(s)


Not sure if you have access to medium but this can be done more efficiently https://josestg.medium.com/140x-faster-string-to-byte-and-byte-to-string-conversions-with-zero-allocation-in-go-200b4d7105fc

"returned byte slice MUST NOT be modified since it shares the same backing array with the given string."
Not sure, if this is the best use here, i tried to copy the byte slice before modifying it, and it is causing unexpected behaviour.

internal/strings/strings.go

M4tteoP

I'm not per se against this change (performance is a key concept), but I would like to get a better understanding of the tradeoff we have in order to gain this extra performance.

Being just meant for ASCII chars, what are the transformations that we would not perform anymore? (First of all, I'm thinking at chars like ÀÈÉÖ).
Should we restrict the usage of the new stringsutil.AsciiToLower for comparisons where we expect to deal with ASCII and keep the standard one to deal with generic inputs? (but I get that it would not really be that beneficial in terms of performance gains)
Otherwise, we should consider it a breaking change, shouldn't we? Running the official ToLower tests with the new function might give more clarity about the behavior's change: https://github.com/golang/go/blob/master/src/strings/strings_test.go#L640-L650.

Thanks!

jcchavezs · 2024-10-10T10:25:33Z

@M4tteoP I think the main assumption here is that variable keys and collection names will always be ascii.

soujanyanmbri · 2024-10-10T10:35:02Z

Hey @M4tteoP ,
I've added a test case for unicode cases by the way:

coraza/internal/strings/strings_test.go

Line 107 in dd8793d

    
           {"Unicode Unchanged", "Привет Мир", "Привет Мир"},                   // Unicode characters remain unchanged

As @jcchavezs mentioned, the main assumption here is that variable keys and collection names will always be ascii, and this should not cause coraza to miss any evasions as well.

soujanyanmbri · 2024-10-23T02:19:01Z

Hey @jcchavezs @M4tteoP ,
Are there any additional checks i should be doing?

Thanks!

change of mind

jcchavezs · 2024-10-28T08:40:38Z

@soujanyanmbri sorry for the back and forth.

after a chat with anuraaga and reading golang/go#17859 I was wondering if this change is truly needed. On one hand I believe we should only apply to those places where we truly believe this is going to improve performance, on the other hand I think benchmarks must be done with real data e.g. real strings we are optimizing for rather than other kind of input to see real performance difference as the assumption was:

I think the main assumption here is that variable keys and collection names will always be ascii.

and not everywhere. How does that sound?

M4tteoP · 2024-10-28T09:33:36Z

My main concern here is if we can be sure about the assumption. Assuming only ASCII, we skip lowercase conversions for some UTF-8 characters that have lowercase representations. A trivial example that would lead to a change compared to current detection is the following:

SecRule REQUEST_BODY "@contains è" "id:101,phase:2,deny,t:lowercase,status:403,msg:'Invalid char',log,auditlog"

~/Repo/Coraza  18ae0d83(main) ✔
▶ curl -i 'localhost:8090' --data "È"
HTTP/1.1 403 Forbidden

[DEBUG] Evaluating operator: MATCH tx_id="HTxystXLQrrxvxSxjua" rule_id=101 variable="REQUEST_BODY" operator_function="@contains" operator_data="è" arg="è"

▶ git checkout soujanyanmbri/main
▶ curl -i 'localhost:8090' --data "È"
HTTP/1.1 200 OK

[DEBUG] Evaluating operator: NO MATCH tx_id="nmvdcNgWOFemGsrovLB" rule_id=101 variable="REQUEST_BODY" operator_function="@contains" operator_data="è" arg="È"

soujanyanmbri · 2024-11-08T13:30:34Z

Hey @M4tteoP and @jcchavezs , made a minor fix, Added a check where we rollback to strings.toLower() in case we encounter any unicode character. What do you guys think ?

goos: darwin
goarch: arm64
pkg: github.com/corazawaf/coraza/v3/internal/strings
cpu: Apple M1 Pro
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/ASCII_Fully_Lowercase_Sentence-10         	25895461	        44.97 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/ASCII_Fully_Uppercase_Sentence-10         	18426256	        65.55 ns/op	      64 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/ASCII_Mixed_Case_Sentence-10              	20168433	        59.82 ns/op	      64 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/ASCII_Non-Alphabetic_Characters-10        	33194751	        36.07 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Unicode_Greek_Alphabet-10                 	 1306746	       915.9 ns/op	      96 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Unicode_Cyrillic_Alphabet-10              	 1000000	      1043 ns/op	      96 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Unicode_Mixed_Greek_and_ASCII-10          	19852059	        61.32 ns/op	      64 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Unicode_Emoji-10                          	 5101520	       236.6 ns/op	      64 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Empty_String-10                           	482705482	         2.668 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Only_Punctuation-10                       	154285611	         7.949 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Only_Whitespace-10                        	198415596	         6.275 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Long_Mixed_Case_String-10                 	 7410427	       168.4 ns/op	     224 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Special_Turkish_Case-10                   	18378188	        67.35 ns/op	      80 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/ASCII_Fully_Lowercase_Sentence-10            	19861150	        60.44 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/ASCII_Fully_Uppercase_Sentence-10            	 5104068	       216.2 ns/op	      64 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/ASCII_Mixed_Case_Sentence-10                 	 9451922	       129.9 ns/op	      64 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/ASCII_Non-Alphabetic_Characters-10           	19565535	        56.18 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Unicode_Greek_Alphabet-10                    	 1294094	       933.0 ns/op	      96 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Unicode_Cyrillic_Alphabet-10                 	 1000000	      1020 ns/op	      96 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Unicode_Mixed_Greek_and_ASCII-10             	 3142808	       384.8 ns/op	      64 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Unicode_Emoji-10                             	 5080602	       234.9 ns/op	      64 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Empty_String-10                              	460176516	         2.603 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Only_Punctuation-10                          	72635012	        16.56 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Only_Whitespace-10                           	90726379	        13.16 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Long_Mixed_Case_String-10                    	 3842942	       316.8 ns/op	     224 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Special_Turkish_Case-10                      	 4221205	       285.5 ns/op	      80 B/op	       1 allocs/op
PASS
ok  	github.com/corazawaf/coraza/v3/internal/strings	38.532s

Optimise strings.lower()

aaed653

soujanyanmbri requested a review from a team as a code owner October 9, 2024 13:45

jcchavezs reviewed Oct 9, 2024

View reviewed changes

internal/strings/strings.go Outdated Show resolved Hide resolved

M4tteoP reviewed Oct 9, 2024

View reviewed changes

Resolve comment

dd8793d

soujanyanmbri requested review from jcchavezs and M4tteoP October 10, 2024 11:05

soujanyanmbri added 2 commits October 16, 2024 13:16

Merge branch 'main' into main

d01a687

Merge branch 'corazawaf:main' into main

45223de

This comment was marked as outdated.

Sign in to view

soujanyanmbri added 5 commits November 7, 2024 08:24

Merge branch 'corazawaf:main' into main

fe9ec06

Merge branch 'main' into main

e239473

Merge branch 'main' into main

3a6c375

Rollback to strings.Lower in the case of unicode

4c1f7e6

Merge branch 'main' of github.com:soujanyanmbri/coraza

9b2743b

soujanyanmbri requested a review from jcchavezs November 8, 2024 13:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimise strings.ToLower() #1170

Optimise strings.ToLower() #1170

soujanyanmbri commented Oct 9, 2024 •

edited

Loading

soujanyanmbri commented Oct 9, 2024

codecov bot commented Oct 9, 2024 •

edited

Loading

jcchavezs Oct 9, 2024

soujanyanmbri Oct 10, 2024

M4tteoP left a comment •

edited

Loading

jcchavezs commented Oct 10, 2024

soujanyanmbri commented Oct 10, 2024

soujanyanmbri commented Oct 23, 2024

This comment was marked as outdated.

jcchavezs commented Oct 28, 2024

M4tteoP commented Oct 28, 2024 •

edited

Loading

soujanyanmbri commented Nov 8, 2024

Optimise strings.ToLower() #1170

Are you sure you want to change the base?

Optimise strings.ToLower() #1170

Conversation

soujanyanmbri commented Oct 9, 2024 • edited Loading

soujanyanmbri commented Oct 9, 2024

codecov bot commented Oct 9, 2024 • edited Loading

Codecov Report

jcchavezs Oct 9, 2024

Choose a reason for hiding this comment

soujanyanmbri Oct 10, 2024

Choose a reason for hiding this comment

M4tteoP left a comment • edited Loading

Choose a reason for hiding this comment

jcchavezs commented Oct 10, 2024

soujanyanmbri commented Oct 10, 2024

soujanyanmbri commented Oct 23, 2024

This comment was marked as outdated.

jcchavezs commented Oct 28, 2024

M4tteoP commented Oct 28, 2024 • edited Loading

soujanyanmbri commented Nov 8, 2024

soujanyanmbri commented Oct 9, 2024 •

edited

Loading

codecov bot commented Oct 9, 2024 •

edited

Loading

M4tteoP left a comment •

edited

Loading

M4tteoP commented Oct 28, 2024 •

edited

Loading