regex-utils: Add support for translating regex character sets into wildcards when possible. #493

Merged
merged 7 commits into from
Jul 30, 2024

Bill-hbrhbr
Contributor

Description

Implements the following functionality:

  • Reduces a character set into a single character if possible.
    • A trivial character set containing a single character or a single escaped metacharacter.
      • E.g. [a] into a, [\^] into ^
    • If the case_insensitive_wildcard config option is enabled, the translator also reduces a
      character set that pairs the two cases of one letter into that single lowercase character:
      • E.g. [aA] into a, [Bb] into b, [xX][Yy][zZ] into xyz
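The reduction rules above can be sketched roughly as follows. This is an illustration only, not the code merged in this PR: the helper name `reduce_charset`, its signature, and the list of characters treated as metacharacters are all assumptions made for the example. It takes the text between `[` and `]` and returns the single character the set reduces to, or nothing if no reduction applies.

```cpp
#include <cctype>
#include <optional>
#include <string_view>

// Hypothetical helper (not from the PR): tries to reduce the body of a
// character set, i.e. the text between '[' and ']', to a single character.
std::optional<char> reduce_charset(std::string_view body, bool case_insensitive_wildcard) {
    // Assumed list of regex metacharacters that may appear escaped in a set.
    constexpr std::string_view cMetachars{"^$.*+?()[]{}|\\-"};

    if (1 == body.size()) {
        // A lone '^' negates and a lone '\' is a dangling escape: not reducible.
        if ('^' == body[0] || '\\' == body[0]) {
            return std::nullopt;
        }
        return body[0];  // [a] -> a
    }

    if (2 != body.size()) {
        return std::nullopt;
    }

    if ('\\' == body[0]) {
        // [\^] -> ^ ; shorthand classes such as \d are deliberately rejected.
        if (std::string_view::npos != cMetachars.find(body[1])) {
            return body[1];
        }
        return std::nullopt;
    }

    if (case_insensitive_wildcard) {
        auto const first = static_cast<unsigned char>(body[0]);
        auto const second = static_cast<unsigned char>(body[1]);
        // [aA] or [Aa] -> a : the two cases of the same letter.
        if (std::isalpha(first) && std::isalpha(second) && first != second
            && std::tolower(first) == std::tolower(second))
        {
            return static_cast<char>(std::tolower(first));
        }
    }
    return std::nullopt;
}
```

Under these assumptions, applying the helper to each set in `[xX][Yy][zZ]` with the case-insensitive option enabled yields `x`, `y`, and `z` in turn, while a set such as `[ab]` or `[^a]` is left for the caller to handle some other way.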

Validation performed

Verified in unit tests

@Bill-hbrhbr Bill-hbrhbr changed the title from "Implement translator logic to translate regex character sets into wildcards by attempting to reduce them to a single characters." to "Implement translator logic to translate regex character sets into wildcards by attempting to reduce them to single characters." on Jul 25, 2024
Member

@LinZhihao-723 LinZhihao-723 left a comment

Before going into the detailed review, I have some comments about how the charset is tracked. In the current implementation, the TranslatorState holds an iterator and exposes APIs to update and query it; the iterator is meant to point to the beginning of the charset. However, this is a little confusing and unsafe:

  1. The constructor also takes an iterator, which doesn't make sense if it is not the beginning of the charset.
  2. The names of m_it, get_marked_iterator, and mark_iterator don't really indicate their usage.
  3. When querying an iterator, we don't have a sanity check of whether it is set to the beginning of a charset properly.

I would propose adding a dedicated class to track the status of the charset escape. It should have a counter of the chars processed (indicating the length of the charset), a flag indicating whether the current char is an escape char, and storage for the character being processed. The TranslatorState should hold a std::optional member of this charset status tracker, emplace it when [ is found, and reset it when ] is found. When you see a character in charset_state_transition, all the checks can be passed down to the charset tracker's member functions, so you don't need the chained if-else with the length check. Defining sub-states inside this charset tracker class might also be useful. I think that would make the interface cleaner and more readable.
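To make the proposal concrete, here is a minimal sketch of what such a tracker could look like. It is only one possible reading of the suggestion: the class name `CharsetTracker`, every member name, and the rule that an escaped character counts as a metacharacter when it is not alphanumeric are assumptions for illustration, and the `TranslatorState` shown is a stripped-down stand-in rather than the real class.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>

// Hypothetical tracker for the characters seen between '[' and ']'.
class CharsetTracker {
public:
    // Feeds one character of the charset body to the tracker.
    void process(char c) {
        if (m_escape_pending) {
            m_escape_pending = false;
            store(c, true);
        } else if ('\\' == c) {
            m_escape_pending = true;
        } else {
            store(c, false);
        }
    }

    // Returns the single character the charset reduces to, if any.
    [[nodiscard]] std::optional<char> reduce(bool case_insensitive) const {
        if (m_escape_pending) {
            return std::nullopt;  // dangling backslash
        }
        auto const first = static_cast<unsigned char>(m_first);
        auto const second = static_cast<unsigned char>(m_second);
        if (1 == m_num_chars) {
            if (m_first_escaped) {
                // Assumption: escaped alphanumerics (e.g. \d) are shorthand classes.
                return std::isalnum(first) ? std::nullopt : std::optional<char>{m_first};
            }
            return '^' == m_first ? std::nullopt : std::optional<char>{m_first};
        }
        if (2 == m_num_chars && case_insensitive && !m_first_escaped && !m_second_escaped
            && std::isalpha(first) && std::isalpha(second) && first != second
            && std::tolower(first) == std::tolower(second))
        {
            return static_cast<char>(std::tolower(first));
        }
        return std::nullopt;
    }

private:
    // Records a logical character; only the first two are kept, the rest are counted.
    void store(char c, bool escaped) {
        if (0 == m_num_chars) {
            m_first = c;
            m_first_escaped = escaped;
        } else if (1 == m_num_chars) {
            m_second = c;
            m_second_escaped = escaped;
        }
        ++m_num_chars;
    }

    std::size_t m_num_chars{0};
    bool m_escape_pending{false};
    bool m_first_escaped{false};
    bool m_second_escaped{false};
    char m_first{'\0'};
    char m_second{'\0'};
};

// Stripped-down stand-in for the translator state: the optional is engaged
// only while the translator is inside a charset.
class TranslatorState {
public:
    void begin_charset() { m_charset.emplace(); }  // on '['

    // Returns false if no charset is open, which serves as the sanity check.
    bool process_charset_char(char c) {
        if (false == m_charset.has_value()) {
            return false;
        }
        m_charset->process(c);
        return true;
    }

    // On ']': reduces the charset if possible and always resets the tracker.
    std::optional<char> end_charset(bool case_insensitive) {
        if (false == m_charset.has_value()) {
            return std::nullopt;
        }
        auto const reduced = m_charset->reduce(case_insensitive);
        m_charset.reset();
        return reduced;
    }

    [[nodiscard]] bool in_charset() const { return m_charset.has_value(); }

private:
    std::optional<CharsetTracker> m_charset;
};
```

With this shape, the state never exposes a raw iterator, and a query made outside a charset fails visibly instead of reading a stale position. The trade-off is that the tracker only remembers the characters needed for reduction, so a charset that cannot be reduced would still need another path to reach the original text.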

@davidlion davidlion changed the title from "Implement translator logic to translate regex character sets into wildcards by attempting to reduce them to single characters." to "regex-utils: Add support for translating regex character sets into wildcards when possible." on Jul 30, 2024
@Bill-hbrhbr Bill-hbrhbr merged commit 09fb0b7 into y-scope:main Jul 30, 2024
12 checks passed
@Bill-hbrhbr Bill-hbrhbr deleted the regex-utils-charset branch July 30, 2024 05:12
jackluo923 pushed a commit to jackluo923/clp that referenced this pull request Dec 4, 2024