Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Postgres-compatible REGEXP_REPLACE function #5564

Open
philrz opened this issue Jan 9, 2025 · 0 comments
Open

Postgres-compatible REGEXP_REPLACE function #5564

philrz opened this issue Jan 9, 2025 · 0 comments

Comments

@philrz
Copy link
Contributor

philrz commented Jan 9, 2025

tl;dr

The REGEXP_REPLACE function is called the same way in ClickBench query 28 with Postgres, DuckDB, and ClickHouse. However, if we attempt the same call with SuperDB's REGEXP_REPLACE function it currently causes a parse error and significant adjustments need to be made to get it to return the same query result.

Strictly speaking, it appears REGEXP_REPLACE is not a formal part of the SQL spec. However, we've got a general goal to make SuperDB as Postgres-compatible as we can so users can see it as a viable drop-in replacement, such as for running their existing BI queries. I can see from the Postgres docs that their parameters cover even more ground than what's used by this query. Below I'll share some details about my experience as a user getting this one to work.

Details

Repro is with super commit 1d783cc.

To simplify, I'll use the attached referer.csv test data:

Referer
https://go.mail/folder-1/online/ru-en/#lingvo/#1О 50000&price_ashka/rav4/page=/check.xml
http://tambov.irr.ru/cgi-bin/news/?page.aspx
http://state=19945206/foto-4/login=Ivan_contazhigulevaolgdruzhy
http://www.circles/3973131509006588934

Here's DuckDB returning the expected result using the same REGEXP_REPLACE call as in the ClickBench query 28.

$ duckdb --version
v1.1.3 19864453f7

$ duckdb -c "SELECT REGEXP_REPLACE(Referer, '^https?://(?:www\.)?([^/]+)/.*$', '\1') FROM 'referer.csv'"
┌──────────────────────────────────────────────────────────────────┐
│ regexp_replace(Referer, '^https?://(?:www\.)?([^/]+)/.*$', '\1') │
│                             varchar                              │
├──────────────────────────────────────────────────────────────────┤
│ go.mail                                                          │
│ tambov.irr.ru                                                    │
│ state=19945206                                                   │
│ circles                                                          │
└──────────────────────────────────────────────────────────────────┘

And here's the parse error from SuperDB if I attempt to execute the same query.

$ super -version
Version: v1.18.0-215-g1d783cc2

$ super -c "SELECT REGEXP_REPLACE(Referer, '^https?://(?:www\.)?([^/]+)/.*$', '\1') FROM 'referer.csv' FORMAT csv"
parse error at line 1, column 50:
SELECT REGEXP_REPLACE(Referer, '^https?://(?:www\.)?([^/]+)/.*$', '\1') FROM 'referer.csv' FORMAT csv
                                             === ^ ===

One way I got it to work is by surrounding the regexp in /, but this required:

  1. Adding a \ before each attempt to match a literal / inside the regexp
  2. Changing the reference to the substring match of the the parenthesized subexpression from \1 to \$1
$ super -c "SELECT REGEXP_REPLACE(Referer, /^https?:\/\/(?:www\.)?([^\/]+)\/.*$/, '\$1') FROM 'referer.csv' FORMAT csv"
{regexp_replace:"go.mail"}
{regexp_replace:"tambov.irr.ru"}
{regexp_replace:"state=19945206"}
{regexp_replace:"circles"}

I also was able to get it working by keeping the regexp as a string, but this required adding a \\ escape at the location of the parse error in addition to still making the \1 to \$1 change.

$ super -c "SELECT REGEXP_REPLACE(Referer, '^https?://(?:www\\\.)?([^/]+)/.*$', '\$1') FROM 'referer.csv' FORMAT csv"
{regexp_replace:"go.mail"}
{regexp_replace:"tambov.irr.ru"}
{regexp_replace:"state=19945206"}
{regexp_replace:"circles"}

I don't claim perfect knowledge as to the history of the use of / as a regexp delimeter, but Wikipedia notes its common use with sed, Perl, and I'm also familiar with it from JavaScript. Meanwhile, it seems like all the major SQL implementations I can spot show just single-quoted strings. Likewise, it seems like the \1 syntax is consistently favored for referencing a match of a parenthesized subexpression. Therefore I expect we'd want to adopt these conventions also/instead in the pursuit of SQL compatibility.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant