Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix regex parser for parsing functions having SQL body with language sql (PG 15 feature) #2201

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

priyanshi-yb
Copy link
Contributor

Describe the changes in this pull request

SQL body syntax - https://www.postgresql.org/docs/15/sql-createfunction.html#:~:text=a%20new%20session.-,sql_body,-The%20body%20of, where with language SQL

BEGIN ATOMIC;
....
END;

or

language sql
RETURN ...;

Describe if there are any user-facing changes

N/A

How was this pull request tested?

unit tests added and end-to end tests added in other PR - #2165

Does your PR have changes that can cause upgrade issues?

Component Breaking changes?
MetaDB No
Name registry json No
Data File Descriptor Json No
Export Snapshot Status Json No
Import Data State No
Export Status Json No
Data .sql files of tables No
Export and import data queue No
Schema Dump No
AssessmentDB No
Sizing DB No
Migration Assessment Report Json No
Callhome Json No
YugabyteD Tables No
TargetDB Metadata Tables No

Comment on lines 924 to 926
} else if strings.Contains(currLine, "BEGIN ATOMIC") {
dollarQuoteFlag = 1 //denotes start of the sql body part https://www.postgresql.org/docs/15/sql-createfunction.html#:~:text=a%20new%20session.-,sql_body,-The%20body%20of
codeBlockDelimiter = "END"
Copy link
Collaborator

@sanyamsinghal sanyamsinghal Jan 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the current code will not be covering the below cases:

  1. There are chances of BEGIN ATOMIC or END being on same line, different line, something on same line and rest on the next line.
CREATE OR REPLACE FUNCTION subtract_and_log(a INT, b INT)
RETURNS INT
LANGUAGE SQL
BEGIN ATOMIC INSERT INTO test_log(action) VALUES ('Subtracting numbers'); RETURN a - b; END;
  1. Posibility of different casing(smallcase, mixed case) used for BEGIN ATOMIC ... END;

  2. Using regexp here would be right way to go here, given there are N number of posibilities.

Copy link
Contributor Author

@priyanshi-yb priyanshi-yb Jan 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are chances of BEGIN ATOMIC or END being on same line, different line, something on same line and rest on the next line.

This automatically works with the current code, added unit test case for that.

Posibility of different casing(smallcase, mixed case) used for BEGIN ATOMIC ... END;

Fixed the casing by using the BEGIN ATOMIC with regexp but END with string lower upper both

Not trying to cover a lot of cases given that we should use the pg-parser anyways but having this meanwhile to atleast able to detect the SQL body case for this PR #2165

Copy link
Collaborator

@sanyamsinghal sanyamsinghal Jan 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I doubt we are going to have pg-parser soon for breaking the sqlfile into sql statements.
Also for oracle cases where syntax is off, (not fully converted to PG), there the parser will fail, so we could either use current approach or mix of both.

I would say lets cover most of it if we can with small efforts.
Since this SQL body thing is a part of DDL, and our import schema can also break.

Copy link
Contributor Author

@priyanshi-yb priyanshi-yb Jan 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the SQL BODY, I have tried to cover most of the cases on the PG docs https://www.postgresql.org/docs/11/sql-createfunction.html (fixed one panic in IsEndOFSQL function). The only todo is that I am using the ToLOWER and END for codeDelimiter for SQL where we could use regex but should be okay I think.
do you have any other case in mind that I might be missing?

@priyanshi-yb priyanshi-yb force-pushed the priyanshi/fix-regex-parser branch from 5595c10 to f4e04a9 Compare January 21, 2025 09:34
@priyanshi-yb priyanshi-yb force-pushed the priyanshi/fix-regex-parser branch from a7abfab to ff7fb07 Compare January 21, 2025 13:23
@@ -134,6 +134,7 @@ var (
parserIssueDetector = queryissue.NewParserIssueDetector()
multiRegex = regexp.MustCompile(`([a-zA-Z0-9_\.]+[,|;])`)
dollarQuoteRegex = regexp.MustCompile(`(\$.*\$)`)
sqlBodyBeginRegex = re("BEGIN", "ATOMIC")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we put a ^ in the regex to ensure that BEGIN ATOMIC is at the start of line?
Do we need to ensure that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to ensure that as it could be anywhere not necessarily in starting of the line e.g.

CREATE FUNCTION public.asterisksdfsf(n integer) RETURNS SETOF text
    LANGUAGE sql IMMUTABLE STRICT PARALLEL SAFE BEGIN ATOMIC
 SELECT repeat('*'::text, g.g) AS repeat
    FROM generate_series(1, asterisksdfsf.n) g(g);
END;

Comment on lines 929 to 930
if strings.Contains(currLine, codeBlockDelimiter) ||
strings.Contains(currLine, strings.ToLower(codeBlockDelimiter)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concern here is that we don't know the variety of cases possible with plpgsql, since it can almost have anything inside the body. Here the condition is getting changed to match with either same or lowercase variant of it.

Should we just handle the sqlbody func case separately(by that i mean having different if conditions here for that) and not update the existing ones.
Just want to avoid any unknowns

Copy link
Contributor Author

@priyanshi-yb priyanshi-yb Jan 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here the condition is getting changed to match with either same or lowercase variant of it.

the same or lowercase variant is only of the codeBlockDelimiter variable which could be dollarQuoteRegex = regexp.MustCompile((\$.*\$)) match or the END string for sqlBody.

But okay, I get your point about any unknowns with lowercase variant of the dollarQuoteRegex case, I can do that in separate if condition for just END case

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines +931 to +932
if strings.Contains(currLine, codeBlockDelimiter) ||
strings.Contains(currLine, strings.ToLower(codeBlockDelimiter)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit but can we use strings.Equalfold() here to compare both and dont care about casing(covering mixed case scenario also).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants