Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(path): allow utf8 chars in path #178

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

joelwurtz
Copy link

@joelwurtz joelwurtz commented Aug 29, 2024

Ref #146

I'm not used to work with simd parser so i have done what i think is right but not sure if it's right in all use case.

Mainly the uri non compliant rfc3986 parser is the same as the current one but allow every char between 128 to 255 which sould make it utf8 compliant, if i recall correctly code point always are above or equal to 128

I did not make a feature but rather an option in parser, however i wonder if this should be true by default ? or if it should be a feature (enabled by default ?) ?

@joelwurtz
Copy link
Author

After playing with this, i think it should be a feature instead of a config, and it should be enabled by default.

Even if it's not really compliant with the RFC there seems to be some browsers that include utf8 in path of a HTTP request and without this it failed, having a feature allow someone to create a strict server but IMO default use case should handle most of web browser.

I also added a new Error if the given utf8 in path is not valid

@joelwurtz joelwurtz force-pushed the feat/allow_non_compliant_char branch from 47380b3 to 6ff4cd6 Compare September 3, 2024 09:05
src/lib.rs Outdated
@@ -963,6 +989,14 @@ pub fn parse_uri<'a>(bytes: &mut Bytes<'a>) -> Result<&'a str> {
return Err(Error::Token);
}

#[cfg(feature = "utf8_in_path")]
// SAFETY: all bytes up till `i` must have been `is_token` and therefore also utf-8.
return match str::from_utf8(unsafe { bytes.slice_skip(1) }) {
Copy link
Author

@joelwurtz joelwurtz Sep 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This induce an overhead in uri benches, so not sure if we want this ? but in this case we are not sure that the string is valid utf8 unless using it (see the relative test)

Copy link
Author

@joelwurtz joelwurtz Sep 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

group                    no_utf8                                utf8
-----                    -------                                ----
uri/uri_0001b            1.00      1.1±0.01ns  1786.2 MB/sec    4.58      4.9±0.09ns   390.1 MB/sec
uri/uri_0002b            1.00      1.4±0.01ns     2.1 GB/sec    4.45      6.0±0.34ns   474.7 MB/sec
uri/uri_0004b            1.00      1.9±0.02ns     2.5 GB/sec    4.36      8.1±0.08ns   590.5 MB/sec
uri/uri_0008b            1.00      1.2±0.08ns     6.8 GB/sec    4.98      6.2±0.10ns  1390.9 MB/sec
uri/uri_0016b            1.00      1.8±0.01ns     9.0 GB/sec    2.80      4.9±0.08ns     3.2 GB/sec
uri/uri_0032b            1.00      2.8±0.20ns    11.1 GB/sec    2.05      5.7±0.12ns     5.4 GB/sec
uri/uri_0064b            1.00      5.5±0.75ns    11.1 GB/sec    1.30      7.1±0.04ns     8.5 GB/sec
uri/uri_0256b            1.11     19.1±0.09ns    12.5 GB/sec    1.00     17.2±2.03ns    14.0 GB/sec
uri/uri_0512b            1.36     40.4±0.26ns    11.8 GB/sec    1.00     29.8±3.51ns    16.0 GB/sec
uri/uri_1024b            1.35     80.6±0.53ns    11.8 GB/sec    1.00     59.9±0.34ns    15.9 GB/sec
uri/uri_2048b            1.69    163.7±1.62ns    11.7 GB/sec    1.00     97.1±0.45ns    19.7 GB/sec
uri/uri_4096b            1.95   310.9±22.99ns    12.3 GB/sec    1.00    159.4±0.92ns    23.9 GB/sec

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also try to use simdutf8, but not sure we wants a dep for that, result are better on large string, but no really significant on small string :

group                    no_utf8                                simd_utf8                              utf8
-----                    -------                                ---------                              ----
uri/uri_0001b            1.00      1.1±0.01ns  1786.2 MB/sec    5.42      5.8±0.07ns   329.6 MB/sec    4.58      4.9±0.09ns   390.1 MB/sec
uri/uri_0002b            1.00      1.4±0.01ns     2.1 GB/sec    4.80      6.5±0.02ns   440.1 MB/sec    4.45      6.0±0.34ns   474.7 MB/sec
uri/uri_0004b            1.00      1.9±0.02ns     2.5 GB/sec    4.34      8.0±0.44ns   593.1 MB/sec    4.36      8.1±0.08ns   590.5 MB/sec
uri/uri_0008b            1.00      1.2±0.08ns     6.8 GB/sec    5.58      6.9±0.06ns  1240.3 MB/sec    4.98      6.2±0.10ns  1390.9 MB/sec
uri/uri_0016b            1.00      1.8±0.01ns     9.0 GB/sec    3.47      6.1±0.07ns     2.6 GB/sec    2.80      4.9±0.08ns     3.2 GB/sec
uri/uri_0032b            1.00      2.8±0.20ns    11.1 GB/sec    2.48      6.9±0.12ns     4.5 GB/sec    2.05      5.7±0.12ns     5.4 GB/sec
uri/uri_0064b            1.00      5.5±0.75ns    11.1 GB/sec    1.23      6.7±0.13ns     9.0 GB/sec    1.30      7.1±0.04ns     8.5 GB/sec
uri/uri_0128b            1.21     10.3±0.06ns    11.7 GB/sec    1.00      8.5±1.10ns    14.1 GB/sec    1.19     10.2±0.90ns    11.8 GB/sec
uri/uri_0256b            1.72     19.1±0.09ns    12.5 GB/sec    1.00     11.1±0.30ns    21.5 GB/sec    1.54     17.2±2.03ns    14.0 GB/sec
uri/uri_0512b            2.02     40.4±0.26ns    11.8 GB/sec    1.00     20.0±0.36ns    23.9 GB/sec    1.49     29.8±3.51ns    16.0 GB/sec
uri/uri_1024b            2.48     80.6±0.53ns    11.8 GB/sec    1.00     32.5±1.46ns    29.4 GB/sec    1.84     59.9±0.34ns    15.9 GB/sec
uri/uri_2048b            2.70    163.7±1.62ns    11.7 GB/sec    1.00     60.6±1.06ns    31.5 GB/sec    1.60     97.1±0.45ns    19.7 GB/sec
uri/uri_4096b            2.77   310.9±22.99ns    12.3 GB/sec    1.00    112.4±1.09ns    33.9 GB/sec    1.42    159.4±0.92ns    23.9 GB/sec

src/lib.rs Outdated Show resolved Hide resolved
@@ -2053,7 +2087,7 @@ mod tests {
assert_eq!(parse_chunk_size(b"567f8a\rfoo"), Err(crate::InvalidChunkSize));
assert_eq!(parse_chunk_size(b"567f8a\rfoo"), Err(crate::InvalidChunkSize));
assert_eq!(parse_chunk_size(b"567xf8a\r\n"), Err(crate::InvalidChunkSize));
assert_eq!(parse_chunk_size(b"ffffffffffffffff\r\n"), Ok(Status::Complete((18, std::u64::MAX))));
assert_eq!(parse_chunk_size(b"ffffffffffffffff\r\n"), Ok(Status::Complete((18, u64::MAX))));
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrelated, can remove it, but without this test without default features would fail (since we don't have std)

@joelwurtz joelwurtz changed the title feat(path): allow to config parser to allow non compliant rfc3986 support feat(path): allow utf8 chars in path Sep 3, 2024
@joelwurtz joelwurtz force-pushed the feat/allow_non_compliant_char branch 2 times, most recently from 37533c4 to df0651e Compare September 10, 2024 08:33
@seanmonstar
Copy link
Owner

I asked in the referenced issue to check with a few implementers if this is desirable :)

Copy link
Owner

@seanmonstar seanmonstar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I think we can do this, but some changes commented inline.

Also, to test all SIMD variants, we can probably just have a unit test that loops all possible bytes of URI_MAP, making a long enough vec with that byte, and then checking the match value is the same as without SIMD. I believe we have a test like that for one of the other byte maps.

Cargo.toml Outdated Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
src/lib.rs Show resolved Hide resolved
@joelwurtz joelwurtz force-pushed the feat/allow_non_compliant_char branch from a00e411 to 3f068fd Compare September 17, 2024 16:08
@joelwurtz
Copy link
Author

joelwurtz commented Sep 17, 2024

I have apply your comments and squash, will add a test later that will try to iterate over all utf8 char so we are sure to not miss something

EDIT : I don't think we can iterate over all uri_map since some bytes can be only one byte of a unicode char, so it will fail for some of them during the utf8 check, what i want to do is

  • iterate over all ascii char 0 - 127 and test it match the intended behavior
  • iterate over all utf8 char starting from 127 whether they have 1,2 or 3 bytes in them, and also check that not existing utf8 sequence char breaks an error

@joelwurtz
Copy link
Author

joelwurtz commented Sep 18, 2024

I added a test for this, is this what you want or did i misunderstood ?

@joelwurtz
Copy link
Author

Hey, do you need something more for this to be reviewed or merged ?

@earce-pulse
Copy link

+1 looking forward to progress on this!

@lovasoa
Copy link

lovasoa commented Nov 5, 2024

@seanmonstar , do you think this can be merged ? This would really help us over at SQLPage.

@earce-pulse
Copy link

@seanmonstar us as well

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants