Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix unicode dot #376

Merged
merged 9 commits into from
Feb 16, 2024
Merged

Fix unicode dot #376

merged 9 commits into from
Feb 16, 2024

Conversation

RustyYato
Copy link
Contributor

@RustyYato RustyYato commented Feb 16, 2024

Fixes #375

Even when using str, regex(".") cannot use the ASCII fast path because the input string may contain multibyte codepoints. Using the ASCII fast path would split the bytes of these codepoints up, which leads to UB (as soon as the illegal string is witnessed). This UB can be seen in several ways, from capturing the strings (like in #375), using Lexer::remainder, etc.
The only real fix is not to take the ASCII fast path.

I've added tests to ensure this isn't hit again in the refactor 🙂.

see maciejhirsz#375 for an example of undefined behavior because of this fast path.

TLDR: the ASCII fast path will stop matching on the first matching byte,
however this would split multi-byte codepoints. Combined with
`Lexer::remaining` (or even just capturing the string like in the issue),
this leads to non-utf8 strings escaping into user code. This is UNSOUND.
@RustyYato RustyYato marked this pull request as draft February 16, 2024 02:37
@RustyYato RustyYato marked this pull request as ready for review February 16, 2024 02:46
@RustyYato
Copy link
Contributor Author

regex_syntax has a ClassUniciode::is_ascii method and ClassUnicode::maximum_len, which seem ideal for this use-case.
I think it would be better to swap to those here, but chose the smaller code-change for now.
If you think updating to regex_sytnax's versions is OK, then I can make that update as well.

@jeertmans
Copy link
Collaborator

Looks great, thanks! Sadly, benchmarks show a dramatic decrease in performances :/

So, while this is a nice fix, I would be happy not to make Logos much slower ^^'
I don't know if that is an unavoidable performance loss, or if we can do better (maybe trying what you said above).

https://github.com/maciejhirsz/logos/actions/runs/7925237413?pr=376

group                                         before                                 changes
-----                                         ------                                 -------
count_ok/identifiers                          1.00    837.5±4.71ns   887.1 MB/sec    1.02    855.6±7.21ns   868.3 MB/sec
count_ok/keywords_operators_and_punctators    1.00      2.4±0.02µs   834.2 MB/sec    1.05      2.6±0.04µs   791.0 MB/sec
count_ok/strings                              1.00   516.9±25.99ns  1607.0 MB/sec    2.66  1375.7±239.67ns   603.8 MB/sec
iterate/identifiers                           1.00   781.6±48.86ns   950.4 MB/sec    1.08    845.1±4.53ns   879.0 MB/sec
iterate/keywords_operators_and_punctators     1.00      2.5±0.04µs   818.5 MB/sec    1.00      2.5±0.04µs   820.8 MB/sec
iterate/strings                               1.00    539.2±5.50ns  1540.4 MB/sec    2.78  1501.5±14.53ns   553.2 MB/sec

@maciejhirsz
Copy link
Owner

In general there is an issue between repeating matches and non-repeating matches, as well as shared logic between the dot . and exclusion ranges like [^"].

Consider these two:

  • [^"] - this will take one byte of input and will produce UB same as .
  • [^"]+ - is currently safe as it will eat all bytes that are not ascii ", including all multibyte codes

.+ is kind of nonsensical in Logos so it's not a problem here.

Now what this PR does is fall back on full unicode checking in both cases. If you can make the check for tail end of range being 0x0010_FFFF conditional on whether or not the pattern is repeating I believe the performance of the string benchmarks would be unaffected.

@jeertmans
Copy link
Collaborator

@RustyYato I did try your suggestion. On local benchmarks, the impact on performances was much less important, so let's see.

@maciejhirsz
Copy link
Owner

maciejhirsz commented Feb 16, 2024

@RustyYato I did try your suggestion. On local benchmarks, the impact on performances was much less important, so let's see.

Just speculating here, but unicode checking expands into a lot of code (more than it needs to and if I were a better programmer it would be smarter), so this is likely to swing a lot based on your icache.

@jeertmans
Copy link
Collaborator

@maciejhirsz indeed...

Weirdly enough, the benchmarks ran with much closer timings on my computer than in GitHub CI. My previous commit does not seem to have changed the benchmarks (or maybe I missed something).

On my computer:

group                                         changes                                before                                                                                                                
-----                                         -------                                ------                                                                                                                
count_ok/identifiers                          1.05   474.3±25.25ns  1566.2 MB/sec    1.00   452.5±18.12ns  1641.9 MB/sec                                                                                   
count_ok/keywords_operators_and_punctators    1.02  1284.0±14.11ns  1582.8 MB/sec    1.00  1253.0±17.62ns  1621.9 MB/sec                                                                                   
count_ok/strings                              1.23    620.5±3.93ns  1338.8 MB/sec    1.00    503.4±7.15ns  1649.9 MB/sec                                                                                   
iterate/identifiers                           1.05   468.6±16.73ns  1585.5 MB/sec    1.00    448.1±7.56ns  1657.8 MB/sec                                                                                   
iterate/keywords_operators_and_punctators     1.04  1315.6±35.99ns  1544.7 MB/sec    1.00  1259.5±38.74ns  1613.6 MB/sec                                                                                   
iterate/strings                               1.27   635.5±21.96ns  1307.0 MB/sec    1.00   501.7±17.36ns  1655.6 MB/sec

In CI:

group                                         before                                 changes
-----                                         ------                                 -------
count_ok/identifiers                          1.02    841.2±5.39ns   883.2 MB/sec    1.00   822.4±53.44ns   903.4 MB/sec
count_ok/keywords_operators_and_punctators    1.00      2.4±0.01µs   834.5 MB/sec    1.03      2.5±0.05µs   806.5 MB/sec
count_ok/strings                              1.00   543.7±20.57ns  1527.8 MB/sec    2.74   1490.1±5.38ns   557.4 MB/sec
iterate/identifiers                           1.12   838.7±16.82ns   885.8 MB/sec    1.00    748.1±5.80ns   993.1 MB/sec
iterate/keywords_operators_and_punctators     1.00      2.5±0.02µs   814.3 MB/sec    1.05      2.6±0.02µs   777.1 MB/sec
iterate/strings                               1.00   522.0±27.49ns  1591.4 MB/sec    2.97  1552.2±15.09ns   535.2 MB/sec

@jeertmans
Copy link
Collaborator

Now what this PR does is fall back on full unicode checking in both cases. If you can make the check for tail end of range being 0x0010_FFFF conditional on whether or not the pattern is repeating I believe the performance of the string benchmarks would be unaffected.

I'll try to do that :-)

logos-codegen/src/graph/regex.rs Outdated Show resolved Hide resolved
logos-codegen/src/graph/regex.rs Show resolved Hide resolved
@jeertmans
Copy link
Collaborator

jeertmans commented Feb 16, 2024

Looks like I managed (?) to fix the performances issues, while keeping the fix :-)

group                                         before                                 changes
-----                                         ------                                 -------
count_ok/identifiers                          1.00   839.1±14.04ns   885.3 MB/sec    1.01    848.7±5.73ns   875.4 MB/sec
count_ok/keywords_operators_and_punctators    1.00      2.5±0.06µs   822.6 MB/sec    1.02      2.5±0.04µs   805.2 MB/sec
count_ok/strings                              1.00   517.8±24.07ns  1604.1 MB/sec    1.00   520.0±25.26ns  1597.5 MB/sec
iterate/identifiers                           1.00   834.2±18.57ns   890.5 MB/sec    1.01    844.3±7.16ns   879.9 MB/sec
iterate/keywords_operators_and_punctators     1.00      2.5±0.02µs   818.9 MB/sec    1.01      2.5±0.05µs   813.6 MB/sec
iterate/strings                               1.03   548.3±15.68ns  1514.9 MB/sec    1.00   533.1±24.37ns  1558.0 MB/sec

Maybe it would be worth to double-check that, @RustyYato or @maciejhirsz.

I think I might also want to put the tests to another file

Copy link
Owner

@maciejhirsz maciejhirsz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM aside from the start of the range needing to be checked as per comment.

@jeertmans
Copy link
Collaborator

LGTM aside from the start of the range needing to be checked as per comment.

Just made necessary changes in my last commits :-)

@jeertmans jeertmans merged commit 81f923c into maciejhirsz:master Feb 16, 2024
18 checks passed
@jeertmans
Copy link
Collaborator

Merging this, thank you, @RustyYato and @maciejhirsz for your help!

@RustyYato RustyYato deleted the fix-unicode-dot branch February 16, 2024 13:49
@RustyYato
Copy link
Contributor Author

RustyYato commented Feb 16, 2024

Thanks! My suggestion wouldn't have affected the runtime performance of Logos, but it looks like you managed to fix that anyways.

akrantz01 referenced this pull request in akrantz01/antsi Aug 2, 2024
[![Mend
Renovate](https://app.renovatebot.com/images/banner.svg)](https://renovatebot.com)

This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [logos](https://logos.maciej.codes/)
([source](https://togithub.com/maciejhirsz/logos)) | dependencies |
patch | `0.14.0` -> `0.14.1` |

---

### Release Notes

<details>
<summary>maciejhirsz/logos (logos)</summary>

###
[`v0.14.1`](https://togithub.com/maciejhirsz/logos/releases/tag/v0.14.1):
0.14.1 - Debug feature and fixes

#### What's Changed

- fix(doc): reset logos2 to logos by
[@&#8203;jeertmans](https://togithub.com/jeertmans) in
[https://github.com/maciejhirsz/logos/pull/372](https://togithub.com/maciejhirsz/logos/pull/372)
- chore(book): add JSON-borrowed parser example by
[@&#8203;jeertmans](https://togithub.com/jeertmans) in
[https://github.com/maciejhirsz/logos/pull/373](https://togithub.com/maciejhirsz/logos/pull/373)
- Add Rc<T> and Arc<T> sources by
[@&#8203;InfiniteCoder01](https://togithub.com/InfiniteCoder01) in
[https://github.com/maciejhirsz/logos/pull/340](https://togithub.com/maciejhirsz/logos/pull/340)
- Fix unicode dot by [@&#8203;RustyYato](https://togithub.com/RustyYato)
in
[https://github.com/maciejhirsz/logos/pull/376](https://togithub.com/maciejhirsz/logos/pull/376)
- chore(docs): cleanup examples by
[@&#8203;jeertmans](https://togithub.com/jeertmans) in
[https://github.com/maciejhirsz/logos/pull/381](https://togithub.com/maciejhirsz/logos/pull/381)
- chore(lib): add debug feature by
[@&#8203;jeertmans](https://togithub.com/jeertmans) in
[https://github.com/maciejhirsz/logos/pull/382](https://togithub.com/maciejhirsz/logos/pull/382)
- Cleanup unused Source features by
[@&#8203;kmicklas](https://togithub.com/kmicklas) in
[https://github.com/maciejhirsz/logos/pull/335](https://togithub.com/maciejhirsz/logos/pull/335)
- chore(deps): bump peaceiris/actions-mdbook from 1 to 2 by
[@&#8203;dependabot](https://togithub.com/dependabot) in
[https://github.com/maciejhirsz/logos/pull/387](https://togithub.com/maciejhirsz/logos/pull/387)
- Fix `Lexer::clone` leak and UB + tests by
[@&#8203;Jakobeha](https://togithub.com/Jakobeha) in
[https://github.com/maciejhirsz/logos/pull/390](https://togithub.com/maciejhirsz/logos/pull/390)
- fix(lib): correctly handle miss for loop in loop by
[@&#8203;lukas-code](https://togithub.com/lukas-code) in
[https://github.com/maciejhirsz/logos/pull/393](https://togithub.com/maciejhirsz/logos/pull/393)
- chore(lib): remove error branch from LUT if it is unreachable by
[@&#8203;RustyYato](https://togithub.com/RustyYato) in
[https://github.com/maciejhirsz/logos/pull/386](https://togithub.com/maciejhirsz/logos/pull/386)
- fix(docs): typo by
[@&#8203;joerivanruth](https://togithub.com/joerivanruth) in
[https://github.com/maciejhirsz/logos/pull/396](https://togithub.com/maciejhirsz/logos/pull/396)
- chore(docs): Adds graph debug documentation to book by
[@&#8203;afreeland](https://togithub.com/afreeland) in
[https://github.com/maciejhirsz/logos/pull/379](https://togithub.com/maciejhirsz/logos/pull/379)
- chore: drop python linting frmo pre-commit-config by
[@&#8203;LeoDog896](https://togithub.com/LeoDog896) in
[https://github.com/maciejhirsz/logos/pull/403](https://togithub.com/maciejhirsz/logos/pull/403)
- refactor: don't use deprecated max_value() method by
[@&#8203;LeoDog896](https://togithub.com/LeoDog896) in
[https://github.com/maciejhirsz/logos/pull/404](https://togithub.com/maciejhirsz/logos/pull/404)
- chore(version): bump logos version to 0.14.1 by
[@&#8203;jeertmans](https://togithub.com/jeertmans) in
[https://github.com/maciejhirsz/logos/pull/409](https://togithub.com/maciejhirsz/logos/pull/409)
- fix(docs): change old 0.14.0 by
[@&#8203;jeertmans](https://togithub.com/jeertmans) in
[https://github.com/maciejhirsz/logos/pull/410](https://togithub.com/maciejhirsz/logos/pull/410)

#### New Contributors

- [@&#8203;InfiniteCoder01](https://togithub.com/InfiniteCoder01) made
their first contribution in
[https://github.com/maciejhirsz/logos/pull/340](https://togithub.com/maciejhirsz/logos/pull/340)
- [@&#8203;RustyYato](https://togithub.com/RustyYato) made their first
contribution in
[https://github.com/maciejhirsz/logos/pull/376](https://togithub.com/maciejhirsz/logos/pull/376)
- [@&#8203;Jakobeha](https://togithub.com/Jakobeha) made their first
contribution in
[https://github.com/maciejhirsz/logos/pull/390](https://togithub.com/maciejhirsz/logos/pull/390)
- [@&#8203;lukas-code](https://togithub.com/lukas-code) made their first
contribution in
[https://github.com/maciejhirsz/logos/pull/393](https://togithub.com/maciejhirsz/logos/pull/393)
- [@&#8203;joerivanruth](https://togithub.com/joerivanruth) made their
first contribution in
[https://github.com/maciejhirsz/logos/pull/396](https://togithub.com/maciejhirsz/logos/pull/396)
- [@&#8203;afreeland](https://togithub.com/afreeland) made their first
contribution in
[https://github.com/maciejhirsz/logos/pull/379](https://togithub.com/maciejhirsz/logos/pull/379)
- [@&#8203;LeoDog896](https://togithub.com/LeoDog896) made their first
contribution in
[https://github.com/maciejhirsz/logos/pull/403](https://togithub.com/maciejhirsz/logos/pull/403)

**Full Changelog**:
maciejhirsz/logos@v0.14...v0.14.1

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined),
Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you
are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the
rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update
again.

---

- [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check
this box

---

This PR was generated by [Mend
Renovate](https://www.mend.io/free-developer-tools/renovate/). View the
[repository job log](https://developer.mend.io/github/akrantz01/antsi).

<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNy40NDAuNyIsInVwZGF0ZWRJblZlciI6IjM3LjQ0MC43IiwidGFyZ2V0QnJhbmNoIjoibWFpbiIsImxhYmVscyI6W119-->

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Undefined behaviour using . matcher on unicode
3 participants