
SNOW-1432019 Calculate subtree query complexity #1657

Merged
merged 41 commits into main from aalam-SNOW-1432019-calculate-candidacy-scores
Jun 14, 2024

Conversation

sfc-gh-aalam
Contributor

@sfc-gh-aalam sfc-gh-aalam commented May 22, 2024

  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes SNOW-1432019

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
  3. Please describe how your code solves the related issue.

see doc: https://docs.google.com/document/d/1IS8qyNmWecF_Lej723hlqXXKUVceATnfeBziyWbFlh0/edit

@sfc-gh-aalam sfc-gh-aalam added the NO-CHANGELOG-UPDATES This pull request does not need to update CHANGELOG.md label May 22, 2024
github-actions bot commented Jun 4, 2024

Seems like your changes contain some Local Testing changes, please request review from @snowflakedb/local-testing

@sfc-gh-aalam sfc-gh-aalam marked this pull request as ready for review June 4, 2024 23:24
@sfc-gh-aalam sfc-gh-aalam requested a review from a team as a code owner June 4, 2024 23:24
Collaborator

@sfc-gh-jdu sfc-gh-jdu left a comment
Neat!

from typing import AbstractSet, Optional

# collections.Counter does not pass type checker. Changes with appropriate type hints were made in 3.9+
if sys.version_info <= (3, 9):
Collaborator

Suggested change
if sys.version_info <= (3, 9):
if sys.version_info < (3, 9):

?
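For context, a minimal sketch of the version-gated Counter typing pattern being discussed here, assuming the names from the diff (illustrative only, not the PR's exact code):

import collections
import sys
import typing

KT = typing.TypeVar("KT")

if sys.version_info < (3, 9):
    # Before 3.9, collections.Counter is not subscriptable in annotations,
    # so bridge it with typing.Counter for type checking.
    class Counter(collections.Counter, typing.Counter[KT]):
        pass

else:
    # From 3.9 on, collections.Counter[str] works directly.
    Counter = collections.Counter

print(Counter({"column": 2}) + Counter({"column": 1}))  # Counter({'column': 3})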

Collaborator

Can we also put this type annotation in one util file and import it from there?

Contributor Author

I can move it to a util file - that's a good idea.

@@ -80,6 +96,18 @@ def sql(self) -> str:
)
return f"{self.pretty_name}({children_sql})"

@property
Collaborator

Can we add some comments here or in complexity_stat.py to briefly describe the main idea of the complexity score calculation (e.g., using the number of columns involved as a proxy)?

Contributor Author

added

@property
def individual_complexity_stat(self) -> Counter[str]:
# SELECT * FROM entity
return Counter({ComplexityStat.COLUMN.value: 1})
Collaborator

* is regarded as one column instead of all columns?

Contributor Author

Yes. In terms of compilation complexity, * is easier than col1, ..., col100, so we need to reflect that.

Collaborator

That is not exactly true: * will be easier for parsing, but once it has passed the star expansion stage, it is literally the same as putting all the columns there. However, when it is just a star expression, it might make some of the stages a little easier, like unnesting, etc. I think we can start with 1 for now; things can be adjusted eventually.

def individual_complexity_stat(self) -> Counter[str]:
# select $1, ..., $m FROM VALUES (r11, r12, ..., r1m), (rn1, ...., rnm)
# TODO: use ARRAY_BIND_THRESHOLD
return Counter(
Collaborator

Ah, then the complexity of create_dataframe will be very high compared with a select from a table. If the table is large, wouldn't at least the execution of selecting from the table be more "complex"?

Contributor Author

That's right. Even though data created from create_dataframe may be smaller than a table and thus less expensive in terms of execution, this measure is only concerned with compilation complexity, since we cannot estimate execution complexity without more data.

Collaborator

I think this is fine for solving compilation issues, but we should consider data size and other factors when choosing no action/cte/materialization for eliminating repeated subqueries, right?

Collaborator

@sfc-gh-aalam I think @sfc-gh-jdu has a good point here. I recall we typically just use variable binding for this, and if the data is coming from the client side, I assume it could be small. Maybe we can simply count the columns, have a VALUES category, and set 1 for the values clause.
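For reference, a minimal sketch of how a VALUES-based node's complexity might be counted under the scheme discussed in this thread (the function and category names are assumptions based on the diff, not the PR's final implementation):

from collections import Counter
from enum import Enum


class ComplexityStat(Enum):
    COLUMN = "column"
    LITERAL = "literal"


def values_node_complexity(num_rows: int, num_cols: int) -> Counter:
    # SELECT $1, ..., $m FROM VALUES (r11, ..., r1m), ..., (rn1, ..., rnm)
    # Every projected column and every literal in the VALUES clause adds to
    # the size of the generated SQL text, hence to compilation complexity.
    return Counter(
        {
            ComplexityStat.COLUMN.value: num_cols,
            ComplexityStat.LITERAL.value: num_rows * num_cols,
        }
    )


print(values_node_complexity(100, 5))  # Counter({'literal': 500, 'column': 5})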

estimate += Counter({ComplexityStat.LOW_IMPACT.value: 1})

get_complexity_stat = (
lambda expr: expr.cumulative_complexity_stat
Collaborator

Suggested change
lambda expr: expr.cumulative_complexity_stat
getattr(expr, "cumulative_complexity_stat", Counter({ComplexityStat.COLUMN.value: 1}))

and no function is needed?

Contributor Author

yep. updated
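A small illustration of the getattr-with-default fallback suggested above (class and attribute names follow the diff context; the exact usage in the PR may differ):

from collections import Counter

COLUMN = "column"


class ColumnExpr:
    cumulative_complexity_stat = Counter({COLUMN: 2})


class PlainExpr:
    pass  # exposes no complexity attribute


def get_complexity_stat(expr) -> Counter:
    # Fall back to a single-column estimate when the expression does not
    # carry a cumulative complexity attribute.
    return getattr(expr, "cumulative_complexity_stat", Counter({COLUMN: 1}))


print(get_complexity_stat(ColumnExpr()))  # Counter({'column': 2})
print(get_complexity_stat(PlainExpr()))   # Counter({'column': 1})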

@@ -656,7 +656,7 @@ def get_result_and_metadata(

def get_result_query_id(self, plan: SnowflakePlan, **kwargs) -> str:
# get the iterator such that the data is not fetched
result_set, _ = self.get_result_set(plan, to_iter=True, **kwargs)
result_set, _ = self.get_result_set(plan, ignore_results=True, **kwargs)
Collaborator

this might be in another PR

Contributor Author

yeah. this was a mistake. thanks for catching it

@@ -187,3 +194,23 @@ def __init__(
@property
def sql(self) -> str:
return self.join_type.sql

@property
def individual_complexity_stat(self) -> Counter[str]:
Collaborator

nit: I would prefer score instead of stat

Contributor Author

I am renaming this to individual_node_complexity and cumulative_node_complexity. Since the return value is a dictionary, adding "score" could also be misleading.

Collaborator

When I look at those, they are just plan_statistic or plan_info to me, but I am fine with node_complexity if preferred. "score" is definitely misleading since we are not really calculating any score here yet.

Contributor Author

I had earlier used stat as shorthand for statistic.

@@ -349,13 +351,21 @@ def test_async_batch_insert(session):
reason="TODO(SNOW-932722): Cancel query is not allowed in stored proc",
)
def test_async_is_running_and_cancel(session):
async_job = session.sql("select SYSTEM$WAIT(3)").collect_nowait()
# creating a sproc here because describe query on SYSTEM$WAIT()
Collaborator

Was this commit checked in accidentally?

Contributor Author

No. This is another one of the cases where a describe query is launched. In this case, the internal describe query on SYSTEM$WAIT actually blocks the query.

Collaborator

I see, but I mean should it be in another PR?

Collaborator

+1, this change seems irrelevant to this PR.

Contributor Author

I will revert this change and fix SelectSQL implementation so we do not trigger analyze attributes for it.

# SELECT * FROM (left) AS left_alias join_type_sql JOIN (right) AS right_alias match_cond, using_cond, join_cond
estimate = Counter({ComplexityStat.JOIN.value: 1})
if isinstance(self.join_type, UsingJoin) and self.join_type.using_columns:
estimate += Counter(
Collaborator

I think I am still confused here. Join seems to be a LogicalPlan node; if we do not plan to do it at the plan node level, where do we plan to calculate the accumulated complexity for the join?
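As background for this thread, a minimal sketch of how a USING-join node's individual complexity could be assembled in the counting scheme shown in the diff (names and values are assumptions, not the PR's exact code):

from collections import Counter
from enum import Enum


class ComplexityStat(Enum):
    JOIN = "join"
    COLUMN = "column"


def using_join_complexity(using_columns: list) -> Counter:
    # SELECT * FROM (left) AS l join_type JOIN (right) AS r USING (col1, ..., coln)
    # The join itself counts once; each USING column adds one column unit.
    estimate = Counter({ComplexityStat.JOIN.value: 1})
    if using_columns:
        estimate += Counter({ComplexityStat.COLUMN.value: len(using_columns)})
    return estimate


print(using_join_complexity(["id", "date"]))  # Counter({'column': 2, 'join': 1})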


KT = typing.TypeVar("KT")

class Counter(collections.Counter, typing.Counter[KT]):
Collaborator

Hmm, if map1 + map2 is the only difference, can we simply use a map then? Right now the returned type is Counter[str], which is not straightforward to understand with a str, for example, why is complexity represented as a string? To reduce unnecessary confusion in the code, let's use a more obvious data structure for the representation.

COLUMN = "column"
FUNCTION = "function"
IN = "in"
LOW_IMPACT = "low_impact"
Collaborator

I think we can have both; then we have a category for each node in the plan tree, and it would be easy for us to calculate the total number of nodes in the tree.

@sfc-gh-yzou
Collaborator

@sfc-gh-aalam thanks for accommodating the comments. The overall flow looks good to me, but I still have some major questions that I am confused about; in particular, individual_complexity does not seem so useful to me. Maybe let's go over the major questions quickly in Monday's meeting.

OTHERS = "others"


def sum_node_complexities(*node_complexities: Dict[str, int]) -> Dict[str, int]:
Collaborator

In fact, can the complexity just be Dict[PlanNodeCategory, int]?

OTHERS = "others"


def sum_node_complexities(*node_complexities: Dict[str, int]) -> Dict[str, int]:
Collaborator

Let's add a comment for this function. For example: this is a helper function for summing the complexities of all the given nodes; a node complexity is represented as a mapping from PlanNodeCategory to the total count of the corresponding category.
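A minimal sketch of what such a documented helper might look like (the docstring wording and the enum values are illustrative, not the PR's final text):

from enum import Enum
from typing import Dict


class PlanNodeCategory(Enum):
    COLUMN = "column"
    FUNCTION = "function"
    LOW_IMPACT = "low_impact"
    OTHERS = "others"


def sum_node_complexities(*node_complexities: Dict[str, int]) -> Dict[str, int]:
    """Sum the complexities of all given nodes.

    Each node complexity is a mapping from a PlanNodeCategory value to the
    total count of that category; the result aggregates the counts per category.
    """
    total: Dict[str, int] = {}
    for complexity in node_complexities:
        for category, count in complexity.items():
            total[category] = total.get(category, 0) + count
    return total


print(
    sum_node_complexities(
        {PlanNodeCategory.COLUMN.value: 2},
        {PlanNodeCategory.COLUMN.value: 1, PlanNodeCategory.FUNCTION.value: 1},
    )
)  # {'column': 3, 'function': 1}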

@@ -658,6 +688,61 @@ def schema_query(self) -> str:
def children_plan_nodes(self) -> List[Union["Selectable", SnowflakePlan]]:
return [self.from_]

@property
def individual_node_complexity(self) -> Dict[str, int]:
score = {}
Collaborator

Let's not call this score; it's kind of misleading.

PlanNodeCategory.ORDER_BY.value: 1,
PlanNodeCategory.LITERAL.value: 3, # step, start, count
PlanNodeCategory.COLUMN.value: 1, # id column
PlanNodeCategory.LOW_IMPACT.value: 2, # ROW_NUMBER, GENERATOR
Collaborator

Shouldn't ROW_NUMBER and GENERATOR be counted as FUNCTION, since those are Snowflake functions?

Contributor Author

yep, we should. I'll fix it
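For illustration, the presumed fix would move ROW_NUMBER and GENERATOR from the LOW_IMPACT bucket into FUNCTION, keeping the other counts unchanged (string keys stand in for PlanNodeCategory values; this is an assumption based on the thread, not the PR's final code):

# SELECT ... ORDER BY id, with literals for step/start/count and the id column
individual_node_complexity = {
    "order_by": 1,
    "literal": 3,   # step, start, count
    "column": 1,    # id column
    "function": 2,  # ROW_NUMBER, GENERATOR
}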

def individual_node_complexity(self) -> Dict[str, int]:
# SELECT * RENAME (before AS after, ...) FROM child
return {
PlanNodeCategory.COLUMN.value: 1 + 2 * len(self.column_map),
Collaborator

Why does it have a 2* here? Shouldn't before AS after be counted as 1 column?

Contributor Author

For each col1 AS col2, we are counting the complexity as col1.complexity + col2.complexity, which makes 2 * len(column_map) in this case.

return {
PlanNodeCategory.COLUMN.value: 1 + 2 * len(self.column_map),
PlanNodeCategory.LOW_IMPACT.value: 1 + len(self.column_map),
}
Collaborator

What is the 1 value for? Is that for the alias name?

Contributor Author

The 1 is for * because the SQL expression is

SELECT * RENAME (before_col AS after_col, ....) from child
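A quick worked example of this counting (string keys stand in for PlanNodeCategory values; the attribution of the low-impact units is an assumption, since only the * column unit is confirmed above):

# SELECT * RENAME (a AS b, c AS d) FROM child
column_map = {"a": "b", "c": "d"}

individual_node_complexity = {
    "column": 1 + 2 * len(column_map),  # the * plus both sides of each rename -> 5
    "low_impact": 1 + len(column_map),  # presumably the RENAME clause plus one AS per pair -> 3
}
print(individual_node_complexity)  # {'column': 5, 'low_impact': 3}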

{
PlanNodeCategory.COLUMN.value: 3,
PlanNodeCategory.LOW_IMPACT.value: 1,
PlanNodeCategory.FUNCTION.value: 1,
Collaborator

Oh, where does the function come from? Is that from an aliasing function?

Contributor Author

it comes from AVG

@sfc-gh-yzou sfc-gh-yzou self-requested a review June 13, 2024 21:38
@property
def cumulative_node_complexity(self) -> Dict[PlanNodeCategory, int]:
"""Returns the aggregate sum complexity statistic from the subtree rooted at this
expression node. Statistic of current node is included in the final aggregate.
Collaborator

Add a comment here to mention that the node complexity is the sum of its individual complexity and its children's complexity, and please make sure to override individual_node_complexity properly for new nodes to keep the cumulative complexity correct.

Contributor Author

added
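A minimal sketch of how such a property and its comment might read (the child traversal and the summing helper are assumptions, not the PR's exact code):

from typing import Dict, List, Optional


def _merge(*parts: Dict[str, int]) -> Dict[str, int]:
    # Same idea as the sum_node_complexities helper discussed earlier.
    total: Dict[str, int] = {}
    for part in parts:
        for category, count in part.items():
            total[category] = total.get(category, 0) + count
    return total


class Expression:
    def __init__(self, children: Optional[List["Expression"]] = None) -> None:
        self.children = children or []

    @property
    def individual_node_complexity(self) -> Dict[str, int]:
        # Overridden per node type; default to a single "others" unit.
        return {"others": 1}

    @property
    def cumulative_node_complexity(self) -> Dict[str, int]:
        """Aggregate complexity of the subtree rooted at this node.

        The value is the sum of this node's individual complexity and the
        cumulative complexity of all its children, so new node types should
        override individual_node_complexity to keep this value correct.
        """
        return _merge(
            self.individual_node_complexity,
            *(child.cumulative_node_complexity for child in self.children),
        )


root = Expression([Expression(), Expression()])
print(root.cumulative_node_complexity)  # {'others': 3}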

@sfc-gh-aalam sfc-gh-aalam merged commit 76dd3c5 into main Jun 14, 2024
34 checks passed
@sfc-gh-aalam sfc-gh-aalam deleted the aalam-SNOW-1432019-calculate-candidacy-scores branch June 14, 2024 17:55
@github-actions github-actions bot locked and limited conversation to collaborators Jun 14, 2024
@sfc-gh-aalam sfc-gh-aalam restored the aalam-SNOW-1432019-calculate-candidacy-scores branch June 14, 2024 17:56