Ideas for creating a universal regex translator #27

Open
slevithan opened this issue Jan 9, 2025 · 0 comments

This project (oniguruma-to-es) demonstrates that accurately/comprehensively converting between any two regex flavors is a hard task, and much more complicated than many people assume. (However, it should be noted that many conversions are not nearly as hard to get right as going from Oniguruma to JavaScript; e.g., comprehensively translating from JavaScript to Oniguruma would be dramatically easier, though not without its own challenges.)

Many people have imagined a universal regex translator that can convert any popular regex flavor to any other. The idea comes up fairly often in developer discussions about regexes, though usually from people who aren't aware of just how complex and varied regex edge cases and advanced features can be. But I'm not aware of any existing project that would be suitable for many use cases.

Existing regex translators

The closest existing things that come to mind are:

  • JGsoft tools like RegexBuddy, by Jan Goyvaerts. Jan is probably the world's foremost expert on differences between the world's regex flavors, and he was my coauthor on Regular Expressions Cookbook.
    • However, Jan's systems are proprietary, Windows-only (unless e.g. running on Linux using Wine), do not offer programmatic regex translation, and are available only via JGsoft applications like RegexBuddy (at cost). And despite being extremely good at understanding the differences between regex flavors, they also have their bugs and limitations (around edge cases and advanced features), are better at interpretation than translation (they only do direct feature mapping, and warn about but don't translate edge case differences), and can't benefit from open source contribution/extension.
  • Oniguruma itself has extensive compile-time options/flags that allow it to interpret regular expressions with various differences found across regex flavors like Python, Ruby, Perl, Java, etc.
    • However, this is limited in its flavor coverage and doesn't offer comprehensive emulation. It also does not offer translation between flavors, so it only works within Oniguruma.
  • Various other regex translators exist (none of which are broadly used today, AFAIK) but they are, as a rule, not rigorous (e.g. they don't handle various edge cases), don't cover advanced features well or at all, and can't be extended by third-party libraries to add flavors.
    • One such tool from the JavaScript world is regex-translator, but unfortunately it hasn't been updated in years, isn't scalable in its design, and isn't designed for full rigor/accuracy in its interpretation of flavors or in its translation between flavors. It would not be usable as a base for a project like oniguruma-to-es even if it supported Oniguruma (which it doesn't).

If there are other existing high-quality tools for this, please let me know!

Given the above, there is an opportunity to create just such a tool.

It wouldn't have to start out as a huge project, but using and building on it in oniguruma-to-es would be a pretty significant rewrite, so I'd need to make the decision to work on this thoughtfully. If any collaborators were eager to work together on such a project, that would help accelerate the decision to start the work. And if anyone wanted to build something like this on their own, I'd be rooting for them!

High-level ideas

  • I imagine the foundation would be a set of lightweight functions for constructing a generic regex AST (MetaRegexAst) that could be used for any-to-any flavor transformations.
    • In the initial version, it could be limited to AST node types needed by oniguruma-to-es to support the Oniguruma flavor, but they would be designed with knowledge of differences in the world's regex flavors that would allow for future, expanded support.
  • If, for any given regex flavor, you wrote a parser that outputs a MetaRegexAst, all transformers/generators that work with MetaRegexAst would then be able to translate from your new flavor. Conversely, if you wrote a transformer and/or generator that takes a MetaRegexAst as input, you could translate to your regex flavor from any flavor with an existing MetaRegexAst-compatible parser.
  • Once a meaningful set of flavor-to-MetaRegexAst parsers and MetaRegexAst-to-flavor generators were built, this would then be useful in a significant number of tools. As just a few examples:
    • Regex testers like regex101.com could offer translation between flavors, and offer (emulated) support for testing with flavors that they don't have an existing backend for.
    • Search tools (including grep-like tools) could let you choose your regex flavor without having to bake in multiple regex engines.
    • Code editors/extensions could let you paste a regex from one language and translate it to another, or see what your regex would look like in another flavor. As a sub-example, tools for TextMate grammar authors could allow JS developers to write regexes in JS and have them auto-converted to the same meaning in Oniguruma.
    • Projects that allow you to run code from one language in another could offer support for regexes without (poorly) reinventing the wheel.
  • The MetaRegexAst format should support regex flavor versions (for both input and output), but this probably doesn't need to be built into MetaRegexAst itself. However, in order to support that, the Unicode version would need to be specifiable in the MetaRegexAst.
  • To ensure usefulness in many use cases, it should support raw and pos properties on MetaRegexAst nodes, which oniguruma-to-es's OnigurumaAst format doesn't currently include/support (since it is purpose-built for oniguruma-to-es's use case).
  • Remaining lightweight (or very tree-shakable) would be critical so that projects that build on the library (like oniguruma-to-es) could maintain small bundle sizes for use in browsers.
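
To make the parser/generator split above more concrete, here is a minimal sketch of how it might fit together. Everything in it is an assumption for illustration: the node shapes, the `parsers`/`generators` registries, `translate`, and the toy flavor names `python-toy` and `js-toy` are hypothetical and don't exist in oniguruma-to-es or any published library. The toy parser only understands literal characters and Python-style `(?P<name>...)` groups (no escapes, quantifiers, or character classes), which is just enough to show `raw`/`pos` on nodes and any-to-any translation through a shared AST.

```typescript
// Hypothetical sketch; none of these names come from an existing library.

// Optional source info that any node may carry.
interface NodeBase {
  raw?: string; // source text the node was parsed from
  pos?: { start: number; end: number }; // offsets into the source pattern
}
interface CharNode extends NodeBase { type: "Char"; value: string }
interface GroupNode extends NodeBase { type: "Group"; name: string; body: MetaNode[] }
type MetaNode = CharNode | GroupNode;

interface MetaRegexAst {
  type: "Regex";
  unicodeVersion?: string; // assumption: the Unicode version lives on the root node
  body: MetaNode[];
}

type Parser = (pattern: string) => MetaRegexAst;
type Generator = (ast: MetaRegexAst) => string;

// Registries keyed by flavor name. Adding one parser makes that flavor a valid
// source for every generator, and adding one generator does the reverse.
const parsers: Record<string, Parser> = {};
const generators: Record<string, Generator> = {};

function translate(pattern: string, from: string, to: string): string {
  const parse = parsers[from];
  const generate = generators[to];
  if (!parse) throw new Error(`No parser registered for "${from}"`);
  if (!generate) throw new Error(`No generator registered for "${to}"`);
  return generate(parse(pattern));
}

// Toy parser: literal characters and (?P<name>...) groups only.
function parsePythonToy(pattern: string): MetaRegexAst {
  let i = 0;
  function parseSequence(): MetaNode[] {
    const nodes: MetaNode[] = [];
    while (i < pattern.length && pattern[i] !== ")") {
      const start = i;
      if (pattern.slice(i, i + 4) === "(?P<") {
        const nameEnd = pattern.indexOf(">", i);
        if (nameEnd === -1) throw new Error("Unterminated group name");
        const name = pattern.slice(i + 4, nameEnd);
        i = nameEnd + 1;
        const body = parseSequence();
        if (pattern[i] !== ")") throw new Error("Unclosed group");
        i++; // consume ")"
        nodes.push({ type: "Group", name, body, raw: pattern.slice(start, i), pos: { start, end: i } });
      } else {
        i++;
        nodes.push({ type: "Char", value: pattern[start], raw: pattern[start], pos: { start, end: i } });
      }
    }
    return nodes;
  }
  const body = parseSequence();
  if (i < pattern.length) throw new Error("Unmatched )");
  return { type: "Regex", body };
}

// Toy generator: emits JavaScript-style (?<name>...) groups.
function generateJsToy(ast: MetaRegexAst): string {
  const emit = (nodes: MetaNode[]): string =>
    nodes.map((n) => (n.type === "Group" ? `(?<${n.name}>${emit(n.body)})` : n.value)).join("");
  return emit(ast.body);
}

parsers["python-toy"] = parsePythonToy;
generators["js-toy"] = generateJsToy;

const translated = translate("a(?P<year>bc)d", "python-toy", "js-toy");
console.log(translated); // a(?<year>bc)d
```

A real design would need far more node types, and a transformer stage between parsing and generation to handle features that have no direct equivalent in the target flavor. The sketch only covers the trivial direct-mapping case.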

Use in oniguruma-to-es

  • The Oniguruma tokenizer and parser could be removed from this library and moved to the universal regex translator library, with the parser producing a MetaRegexAst rather than the current OnigurumaAst.
  • The OnigurumaAst format (returned by toOnigurumaAst) would require a new transformer to translate from the MetaRegexAst format. However, the main translation pipeline could skip that step, since the current OnigurumaAst-to-RegexAst transformer could be changed to take a MetaRegexAst as its input directly.
    • Perhaps toOnigurumaAst could be removed from this library (to avoid the need for an additional and otherwise-unnecessary transformer). For people relying on it, a MetaRegexAst-to-OnigurumaAst transformer could easily be added to the universal regex translator library.
  • A universal regex translator could help fill the significant current gap of JavaScript (and/or Regex+) to Oniguruma translation, rather than just going from Oniguruma to JavaScript.
    • Creating a Regex+ to MetaRegexAst parser would also provide a JS-RegExp-with-flag-v parser essentially for free, since Regex+ syntax is a strict superset of JS-RegExp-with-v.

If this is something you would use or want to collaborate on, let me know or add a thumbs up. Also, any thoughts on how to approach or improve on the basic ideas here would be very welcome.
