regex-utils: Add support for translating regex character sets into wildcards when possible. #493

Bill-hbrhbr · 2024-07-25T05:38:41Z

Description

Implements the following functionality:

Reduces a character set into a single character if possible.
- A trivial character set containing a single character or a single escaped metacharacter.
  - E.g. [a] into a, [\^] into ^
- If the case_insensitive_wildcard config is turned on, the translator can also reduce
  case-insensitive character set patterns into a single lowercase character:
  - E.g. [aA] into a, [Bb] into b, [xX][Yy][zZ] into xyz
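A minimal sketch of the reduction rules listed above, for illustration only. The function name `try_reduce_charset`, its signature, and the exact rejection rules (a lone `\` or `^`, and escaped alphanumerics such as `\d`) are assumptions and are not taken from the PR's code. The input is the text between `[` and `]`.

```cpp
#include <cctype>
#include <optional>
#include <string_view>

// Hypothetical helper (not the PR's actual API): tries to reduce the body of a
// character set (the text between '[' and ']') to a single character.
std::optional<char>
try_reduce_charset(std::string_view body, bool case_insensitive_wildcard) {
    auto const is_alpha
            = [](char c) { return 0 != std::isalpha(static_cast<unsigned char>(c)); };
    auto const is_alnum
            = [](char c) { return 0 != std::isalnum(static_cast<unsigned char>(c)); };
    auto const to_lower = [](char c) {
        return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    };

    if (1 == body.size()) {
        char const c = body[0];
        // Assumption: a lone '\' (dangling escape) or '^' (negation) is not reducible.
        if ('\\' == c || '^' == c) {
            return std::nullopt;
        }
        return c;  // e.g. [a] -> a
    }
    if (2 != body.size()) {
        return std::nullopt;
    }

    char const first = body[0];
    char const second = body[1];
    if ('\\' == first) {
        // Assumption: escaped alphanumerics (e.g. \d, \w) are classes, not literals.
        if (is_alnum(second)) {
            return std::nullopt;
        }
        return second;  // e.g. [\^] -> ^
    }
    // Case-insensitive pair such as [aA] or [Bb], only when the option is enabled.
    if (case_insensitive_wildcard && is_alpha(first) && is_alpha(second) && first != second
        && to_lower(first) == to_lower(second))
    {
        return to_lower(first);
    }
    return std::nullopt;
}
```

Applying the helper to each set in `[xX][Yy][zZ]` with the option enabled yields `x`, `y`, and `z` in turn. The sketch does not escape the result for wildcard syntax, so a reduced `*` or `?` would still need escaping by the caller.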

Validation performed

Verified in unit tests

LinZhihao-723

Before going to the detailed review, I had some comments about the way of tracking the charset. In the current implementation, we have an iterator in the TranslatorState and expose APIs to update/query the underlying iterator that points to the beginning of the charset. However, it's a little confusing and unsafe:

- The constructor also takes an iterator, which doesn't make sense if it is not the beginning of the charset.
- The names of m_it, get_marked_iterator, and mark_iterator don't really indicate their usage.
- When querying an iterator, we don't have a sanity check of whether it is set to the beginning of a charset properly.

I would propose adding a dedicated class to track the status of the charset. It should have a counter of chars processed (indicating the length of the charset), a flag indicating whether the character is an escape char, and storage for the character to process. The TranslatorState should hold a std::optional member of this charset tracker: emplace it when [ is found and reset it when ] is found. When you see a character in charset_state_transition, all the checks can be passed down to the tracker's member functions, so you no longer need the chained if-else with the length check. Defining sub-states inside the charset tracker class might also be useful. I think that would make the interface cleaner and more readable.
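One possible shape for this proposal, as a rough sketch rather than the code that was eventually merged. `CharsetTracker`, its sub-states, and all member function names below are hypothetical. The `TranslatorState` here is a stripped-down stand-in for the real class, and the case-insensitive reduction is left out for brevity.

```cpp
#include <cstddef>
#include <optional>

// Hypothetical tracker for a single charset; created on '[' and dropped on ']'.
class CharsetTracker {
public:
    // Sub-states replace a chained if-else on the charset length.
    enum class State { Empty, SingleChar, EscapePending, EscapedChar, Other };

    void process(char c) {
        ++m_num_chars;
        switch (m_state) {
            case State::Empty:
                if ('\\' == c) {
                    m_state = State::EscapePending;
                } else {
                    m_char = c;
                    m_state = State::SingleChar;
                }
                break;
            case State::EscapePending:
                m_char = c;
                m_state = State::EscapedChar;
                break;
            default:
                // Any further character means the set is not trivially reducible.
                m_state = State::Other;
                break;
        }
    }

    // Returns the single character the charset reduces to, if any.
    [[nodiscard]] std::optional<char> reduce() const {
        if (State::SingleChar == m_state || State::EscapedChar == m_state) {
            return m_char;
        }
        return std::nullopt;
    }

    [[nodiscard]] std::size_t num_chars() const { return m_num_chars; }

private:
    State m_state{State::Empty};
    std::size_t m_num_chars{0};
    char m_char{'\0'};
};

// Simplified stand-in for the translator state: it owns an optional tracker
// instead of exposing a raw iterator into the pattern.
class TranslatorState {
public:
    void begin_charset() { m_charset.emplace(); }

    void process_charset_char(char c) {
        if (m_charset.has_value()) {
            m_charset->process(c);
        }
    }

    // Sanity check: returns nullopt if no charset was started.
    std::optional<char> end_charset() {
        if (false == m_charset.has_value()) {
            return std::nullopt;
        }
        auto const reduced = m_charset->reduce();
        m_charset.reset();
        return reduced;
    }

    [[nodiscard]] bool in_charset() const { return m_charset.has_value(); }

private:
    std::optional<CharsetTracker> m_charset;
};
```

Because the tracker only exists between `[` and `]`, querying it outside a charset is detectable through the empty optional, which covers the missing sanity check raised above.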

components/core/src/clp/regex_utils/regex_translation_utils.cpp

Bill-hbrhbr added 3 commits July 25, 2024 01:29

Implement translator logic to reduce regex character sets into wildcards.

e74f043

Update README

148c751

Fix comment

a5090c1

Bill-hbrhbr requested review from davidlion and LinZhihao-723 July 25, 2024 05:38

LinZhihao-723 requested changes Jul 25, 2024

davidlion requested changes Jul 26, 2024

Bill-hbrhbr and others added 2 commits July 27, 2024 01:13

Apply suggestions from code review

6530eea

Co-authored-by: davidlion <[email protected]>

Use std::optional to store the iterator that marks regex charset start position.

185b6a7

LinZhihao-723 requested changes Jul 28, 2024

Address code review on std::optional usage

f92026d

Bill-hbrhbr requested review from LinZhihao-723 and davidlion July 30, 2024 03:34

Fix naming convensions

a5d1648

davidlion changed the title ~~Implement translator logic to translate regex character sets into wildcards by attempting to reduce them to single characters.~~ regex-utils: Add support for translating regex character sets into wildcards when possible. Jul 30, 2024

davidlion approved these changes Jul 30, 2024

LinZhihao-723 approved these changes Jul 30, 2024

Bill-hbrhbr merged commit 09fb0b7 into y-scope:main Jul 30, 2024
12 checks passed

Bill-hbrhbr deleted the regex-utils-charset branch July 30, 2024 05:12

jackluo923 pushed a commit to jackluo923/clp that referenced this pull request Dec 4, 2024

regex-utils: Add support for translating regex character sets into wildcards when possible. (y-scope#493)

1c6c925
