regex-utils: Add support for translating regex character sets into wildcards when possible. #493
Before going into the detailed review, I have some comments about how the charset is tracked. In the current implementation, we keep an iterator in the `TranslatorState` and expose APIs to update/query the underlying iterator, which points to the beginning of the charset. However, this is a little confusing and unsafe:
- The constructor also takes an iterator, which doesn't make sense if it is not the beginning of the charset.
- The names `m_it`, `get_marked_iterator`, and `mark_iterator` don't really indicate their usage.
- When querying the iterator, we have no sanity check that it was properly set to the beginning of a charset.

I would propose adding a dedicated class to track the status of the charset escape. It should have a counter of chars processed (indicating the length of the charset), a flag indicating whether the current char is an escape char, and storage for the character to process. The `TranslatorState` should have a `std::optional` member of this charset status tracker, emplacing it when `[` is found and resetting it when `]` is found. All the checks can then be passed down to the tracker's member functions when you see a character in `charset_state_transition`, so you don't need the chained `if-else` with the length check. Defining sub-states inside this charset tracker class might also be useful. I think that would make the interface cleaner and more readable.
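To make the proposal concrete, here is a minimal sketch of what such a tracker could look like. Everything in it is an assumption for illustration rather than code from this PR: the names `CharsetTracker`, `TranslatorStateSketch`, and `translate_single_char_charset` are hypothetical, and it only handles a regex that consists of exactly one character set.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical tracker for the charset currently being parsed (illustrative names only).
class CharsetTracker {
public:
    // Consumes one character that appears between '[' and the closing ']'.
    void process_char(char ch) {
        if (false == m_escape_pending && '\\' == ch) {
            // An unescaped backslash only marks the next character as a literal.
            m_escape_pending = true;
            return;
        }
        m_escape_pending = false;
        m_last_char = ch;
        ++m_num_chars;
    }

    [[nodiscard]] auto is_escape_pending() const -> bool { return m_escape_pending; }

    [[nodiscard]] auto get_num_chars() const -> std::size_t { return m_num_chars; }

    [[nodiscard]] auto get_last_char() const -> char {
        // Sanity check: there must be at least one stored character to return.
        assert(m_num_chars > 0);
        return m_last_char;
    }

private:
    std::size_t m_num_chars{0};    // Length of the charset so far
    bool m_escape_pending{false};  // True right after an unescaped backslash
    char m_last_char{'\0'};        // Most recently stored literal character
};

// Hypothetical stand-in for the real TranslatorState: the tracker only exists while
// we are inside a charset, so "are we in a charset?" is simply `charset.has_value()`.
struct TranslatorStateSketch {
    std::optional<CharsetTracker> charset;
};

// Returns the literal for a regex made of exactly one single-character charset
// (e.g. "[a]" or "[\^]"), or std::nullopt for anything else.
auto translate_single_char_charset(std::string_view regex) -> std::optional<char> {
    TranslatorStateSketch state;
    std::optional<char> result;
    for (char const ch : regex) {
        if (false == state.charset.has_value()) {
            if ('[' != ch || result.has_value()) {
                return std::nullopt;
            }
            state.charset.emplace();  // '[' found: start tracking
            continue;
        }
        if (']' == ch && false == state.charset->is_escape_pending()) {
            if (1 != state.charset->get_num_chars()) {
                return std::nullopt;
            }
            result = state.charset->get_last_char();
            state.charset.reset();  // ']' found: stop tracking
            continue;
        }
        state.charset->process_char(ch);
    }
    if (state.charset.has_value()) {
        return std::nullopt;  // Unterminated charset
    }
    return result;
}
```

With this shape, the state-transition code never touches an iterator: it asks the optional whether a charset is open and delegates the per-character bookkeeping to the tracker.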
components/core/src/clp/regex_utils/regex_translation_utils.cpp
Co-authored-by: davidlion <[email protected]>
…ldcards when possible. (y-scope#493)
Description
Implements the following functionality:
- Reduces a character set containing a single character into that character: `[a]` into `a`, `[\^]` into `^`.
- When the `case_insensitive_wildcard` config is turned on, the translator can also reduce case-insensitive style character set patterns into a single lowercase character: `[aA]` into `a`, `[Bb]` into `b`, `[xX][Yy][zZ]` into `xyz`.
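The reductions listed above can be illustrated with a small standalone sketch. It is not the PR's implementation: `reduce_charset` and `translate_charsets` are hypothetical names, the `case_insensitive_wildcard` option is modeled as a plain `bool` parameter, and text outside a character set is simply copied through unchanged, whereas the real translator does more.

```cpp
#include <cctype>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: `chars` holds the contents of one charset with escapes already
// resolved. Returns the single character it reduces to, or std::nullopt if it can't.
auto reduce_charset(std::string_view chars, bool case_insensitive_wildcard)
        -> std::optional<char> {
    if (1 == chars.size()) {
        return chars[0];  // e.g. [a] -> a, [\^] -> ^
    }
    if (case_insensitive_wildcard && 2 == chars.size()) {
        auto const first = static_cast<unsigned char>(chars[0]);
        auto const second = static_cast<unsigned char>(chars[1]);
        // Accept only the upper/lower pair of the same letter, e.g. [aA] or [Bb].
        if (0 != std::isalpha(first) && first != second
            && std::tolower(first) == std::tolower(second))
        {
            return static_cast<char>(std::tolower(first));
        }
    }
    return std::nullopt;
}

// Hypothetical driver: replaces every reducible charset in `regex` with its single
// character. Returns std::nullopt if a charset is unterminated or not reducible.
auto translate_charsets(std::string_view regex, bool case_insensitive_wildcard)
        -> std::optional<std::string> {
    std::string translated;
    std::string charset_chars;
    bool in_charset{false};
    bool escaped{false};
    for (char const ch : regex) {
        if (false == in_charset) {
            if (escaped) {
                translated += ch;
                escaped = false;
            } else if ('\\' == ch) {
                translated += ch;  // Keep escapes outside a charset as-is
                escaped = true;
            } else if ('[' == ch) {
                in_charset = true;
                charset_chars.clear();
            } else {
                translated += ch;
            }
            continue;
        }
        if (escaped) {
            charset_chars += ch;  // Escaped character is stored literally
            escaped = false;
        } else if ('\\' == ch) {
            escaped = true;
        } else if (']' == ch) {
            auto const reduced = reduce_charset(charset_chars, case_insensitive_wildcard);
            if (false == reduced.has_value()) {
                return std::nullopt;
            }
            translated += reduced.value();
            in_charset = false;
        } else {
            charset_chars += ch;
        }
    }
    if (in_charset || escaped) {
        return std::nullopt;
    }
    return translated;
}
```

For example, under these assumptions `translate_charsets("[xX][Yy][zZ]", true)` yields `xyz`, while the same input with the flag set to `false` yields no translation.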
Validation performed
Verified in unit tests