SNOW-1432019 Calculate subtree query complexity #1657

Merged · Jun 14, 2024 · 41 commits

Changes from 13 commits

Commits
e0ff34d
Calculate query complexity
sfc-gh-aalam May 22, 2024
7f714e3
make subtree computation iterative; telemetry
sfc-gh-aalam May 22, 2024
b279769
compute complexity for expressions
sfc-gh-aalam May 29, 2024
4582bb2
add tests
sfc-gh-aalam May 29, 2024
7abd06c
tests passing
sfc-gh-aalam May 30, 2024
1e3a34c
fix test
sfc-gh-aalam May 30, 2024
aad17ee
Merge branch 'main' into aalam-SNOW-1432019-calculate-candidacy-scores
sfc-gh-aalam May 30, 2024
46d8b9f
fix typing issues
sfc-gh-aalam May 30, 2024
d38b3a4
Merge branch 'aalam-SNOW-1432019-calculate-candidacy-scores' of githu…
sfc-gh-aalam May 30, 2024
88c7e7c
use new approach
sfc-gh-aalam Jun 4, 2024
fb68aa0
fix type checks
sfc-gh-aalam Jun 4, 2024
8da7ffd
fix async job test
sfc-gh-aalam Jun 4, 2024
8474b5e
fix async job test
sfc-gh-aalam Jun 4, 2024
f72ce8c
remove change added in error
sfc-gh-aalam Jun 5, 2024
310fbd2
remove change added in error
sfc-gh-aalam Jun 5, 2024
e79f277
add description on async fix
sfc-gh-aalam Jun 5, 2024
d5451ce
move Counter typing into complexity stat
sfc-gh-aalam Jun 5, 2024
3b639bb
rename file
sfc-gh-aalam Jun 5, 2024
909ed67
address feedback
sfc-gh-aalam Jun 5, 2024
ed83869
fix bad imports
sfc-gh-aalam Jun 5, 2024
026e428
fix type hints
sfc-gh-aalam Jun 5, 2024
df9bfac
rename
sfc-gh-aalam Jun 5, 2024
ced5d78
rename file
sfc-gh-aalam Jun 5, 2024
995f107
refactor
sfc-gh-aalam Jun 5, 2024
2b2f8a5
fix lint and type hints
sfc-gh-aalam Jun 5, 2024
7f636e6
fix classification for Interval expression
sfc-gh-aalam Jun 5, 2024
e40129d
added some unit tests
sfc-gh-aalam Jun 6, 2024
8607aa4
add unit test
sfc-gh-aalam Jun 6, 2024
3830820
fix typing
sfc-gh-aalam Jun 6, 2024
68855ac
align with doc
sfc-gh-aalam Jun 6, 2024
40fc65d
Merge branch 'main' into aalam-SNOW-1432019-calculate-candidacy-scores
sfc-gh-aalam Jun 7, 2024
1299ee9
fix SelectSQL
sfc-gh-aalam Jun 12, 2024
4a901f4
rename to node_complexity; add setter for cumulative complexity expre…
sfc-gh-aalam Jun 12, 2024
a555e1d
use Dict type hint instead of Counter
sfc-gh-aalam Jun 12, 2024
3008878
rename dict add function
sfc-gh-aalam Jun 12, 2024
4bd8ce6
fix type hints using enums and do not count alias twice
sfc-gh-aalam Jun 13, 2024
9035a70
align complexity stat calculation
sfc-gh-aalam Jun 13, 2024
b12b7f6
fix telemetry test
sfc-gh-aalam Jun 13, 2024
3f6c997
update comment
sfc-gh-aalam Jun 13, 2024
0c1c9da
merge with main
sfc-gh-aalam Jun 13, 2024
8a03068
fix unit test
sfc-gh-aalam Jun 14, 2024
2 changes: 1 addition & 1 deletion src/snowflake/snowpark/_internal/analyzer/analyzer.py
@@ -927,7 +927,7 @@ def do_resolve_with_resolved_children(
         )

         if isinstance(logical_plan, UnresolvedRelation):
-            return self.plan_builder.table(logical_plan.name)
+            return self.plan_builder.table(logical_plan.name, logical_plan)

         if isinstance(logical_plan, SnowflakeCreateTable):
             return self.plan_builder.save_as_table(
19 changes: 19 additions & 0 deletions src/snowflake/snowpark/_internal/analyzer/binary_expression.py
@@ -2,8 +2,23 @@
# Copyright (c) 2012-2024 Snowflake Computing Inc. All rights reserved.
#

import sys
from typing import AbstractSet, Optional

# collections.Counter does not pass type checker. Changes with appropriate type hints were made in 3.9+
if sys.version_info <= (3, 9):
Collaborator:

Suggested change:
-if sys.version_info <= (3, 9):
+if sys.version_info < (3, 9):

?

Collaborator:

Can we also have this type annotation in one util file so we can import it from that place?

Contributor (author):

I can move it to a util file - that's a good idea.

    import collections
    import typing

    KT = typing.TypeVar("KT")

    class Counter(collections.Counter, typing.Counter[KT]):
Collaborator:

How does this compare with a simple map?

Contributor (author):

Same as a simple map, but it also lets you do map1 + map2. For example:

map1 = {a: 1, b: 2}
map2 = {b: 1, c: 3}
map1 + map2 = {a: 1, b: 3, c: 3}
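
(As a minimal runnable illustration of the merge behavior described above, using collections.Counter; this snippet is not part of the PR:)

from collections import Counter

# "+" on Counter merges keys and sums their counts, unlike a plain dict
map1 = Counter({"a": 1, "b": 2})
map2 = Counter({"b": 1, "c": 3})
print(map1 + map2)  # Counter({'b': 3, 'c': 3, 'a': 1})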

Collaborator:

Hmm, if map1 + map2 is the only difference, can we simply use a map then? Right now the returned type is Counter[str], which is not straightforward to understand: why is complexity represented as a string? To reduce unnecessary confusion in the code, let's use a more obvious data structure for representation.

        pass

else:
    from collections import Counter

from snowflake.snowpark._internal.analyzer.complexity_stat import ComplexityStat
from snowflake.snowpark._internal.analyzer.expression import (
    Expression,
    derive_dependent_columns,
@@ -26,6 +41,10 @@ def __str__(self):
    def dependent_column_names(self) -> Optional[AbstractSet[str]]:
        return derive_dependent_columns(self.left, self.right)

    @property
    def individual_complexity_stat(self) -> Counter[str]:
        return Counter({ComplexityStat.LOW_IMPACT.value: 1})


class BinaryArithmeticExpression(BinaryExpression):
    pass
40 changes: 39 additions & 1 deletion src/snowflake/snowpark/_internal/analyzer/binary_plan_node.py
@@ -2,8 +2,23 @@
# Copyright (c) 2012-2024 Snowflake Computing Inc. All rights reserved.
#

import sys
from typing import List, Optional

# collections.Counter does not pass type checker. Changes with appropriate type hints were made in 3.9+
if sys.version_info <= (3, 9):
    import collections
    import typing

    KT = typing.TypeVar("KT")

    class Counter(collections.Counter, typing.Counter[KT]):
        pass

else:
    from collections import Counter

from snowflake.snowpark._internal.analyzer.complexity_stat import ComplexityStat
from snowflake.snowpark._internal.analyzer.expression import Expression
from snowflake.snowpark._internal.analyzer.snowflake_plan_node import LogicalPlan
from snowflake.snowpark._internal.error_message import SnowparkClientExceptionMessages
@@ -69,7 +84,10 @@ def __init__(self, left: LogicalPlan, right: LogicalPlan) -> None:


 class SetOperation(BinaryNode):
-    pass
+    @property
+    def individual_complexity_stat(self) -> Counter[str]:
+        # (left) operator (right)
+        return Counter({ComplexityStat.SET_OPERATION.value: 1})
Collaborator:

I am looking at this individual_complexity_stat usage; it seems to be just about what the current plan node's category is, and there is no need to have a map there.

Contributor (author), Jun 5, 2024:

I added a single value in my initial implementation, but some expressions generate sql in multiple categories, like Interval, TableFunctionJoin, and Lateral. What I can change here is to do something like this:

# default case
class Expression:
    @property
    def plan_node_category(self) -> Enum:
        return PlanNodeCategory.OTHER

    @property
    def individual_query_plan_stat(self) -> Counter[str]:
        return Counter({self.plan_node_category.value: 1})

# single class case
class SetOperation:
    @property
    def plan_node_category(self) -> Enum:
        return PlanNodeCategory.SET_OPERATOR

# multiple case
class TableFunctionJoin:
    @property
    def individual_query_plan_stat(self) -> Counter[str]:
        return {...}
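
(A self-contained, runnable sketch of the proposed pattern; PlanNodeCategory, the example categories, and the class stubs below are illustrative stand-ins, not the PR's final code. In the PR, SetOperation subclasses BinaryNode; here everything subclasses Expression to keep the demo minimal:)

from collections import Counter
from enum import Enum
from typing import Dict

class PlanNodeCategory(Enum):
    # hypothetical category enum mirroring the proposal above
    SET_OPERATOR = "set_operation"
    JOIN = "join"
    COLUMN = "column"
    OTHER = "other"

class Expression:
    # default case: one node counted once under its single category
    @property
    def plan_node_category(self) -> PlanNodeCategory:
        return PlanNodeCategory.OTHER

    @property
    def individual_query_plan_stat(self) -> Dict[str, int]:
        return Counter({self.plan_node_category.value: 1})

class SetOperation(Expression):
    # single-category case: only the category is overridden
    @property
    def plan_node_category(self) -> PlanNodeCategory:
        return PlanNodeCategory.SET_OPERATOR

class TableFunctionJoin(Expression):
    # multi-category case: the stat itself is overridden
    @property
    def individual_query_plan_stat(self) -> Dict[str, int]:
        return Counter(
            {PlanNodeCategory.JOIN.value: 1, PlanNodeCategory.COLUMN.value: 1}
        )

print(SetOperation().individual_query_plan_stat)       # Counter({'set_operation': 1})
print(TableFunctionJoin().individual_query_plan_stat)  # Counter({'join': 1, 'column': 1})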



class Except(SetOperation):
@@ -187,3 +205,23 @@ def __init__(
    @property
    def sql(self) -> str:
        return self.join_type.sql

    @property
    def individual_complexity_stat(self) -> Counter[str]:
Collaborator:

nit: I would prefer score instead of stat.

Contributor (author):

I am renaming this to individual_node_complexity and cumulative_node_complexity. Since the return value is a dictionary, adding score could also be misleading.

Collaborator:

When I look at those, they are just plan_statistic or plan_info to me, but I am fine with node_complexity if preferred. Score is definitely misleading since we are not really calculating any score here yet.

Contributor (author):

I earlier had stat as a shorthand for statistic.

        # SELECT * FROM (left) AS left_alias join_type_sql JOIN (right) AS right_alias match_cond, using_cond, join_cond
        estimate = Counter({ComplexityStat.JOIN.value: 1})
Collaborator:

No, it is not an 'estimate'; it is really the stat calculation. Maybe just call it 'stat'.

        if isinstance(self.join_type, UsingJoin) and self.join_type.using_columns:
            estimate += Counter(
Collaborator:

I am a little bit confused here; this is for join, right? Shouldn't the complexity for join be individual_complexity_stat(left) + individual_complexity_stat(right) + self?

Contributor (author):

This is the self component. A join statement can be constructed in different ways; based on how the analyzer generates the final sql, we calculate the contribution. You can check out the analyzer code here:

if isinstance(logical_plan, Join):
    join_condition = (
        self.analyze(
            logical_plan.join_condition, df_aliased_col_name_to_real_col_name
        )
        if logical_plan.join_condition
        else ""
    )
    match_condition = (
        self.analyze(
            logical_plan.match_condition, df_aliased_col_name_to_real_col_name
        )
        if logical_plan.match_condition
        else ""
    )
    return self.plan_builder.join(
        resolved_children[logical_plan.left],
        resolved_children[logical_plan.right],
        logical_plan.join_type,
        join_condition,
        match_condition,
        logical_plan,
        self.session.conf.get("use_constant_subquery_alias", False),
    )

and here:

def join_statement(
    left: str,
    right: str,
    join_type: JoinType,
    join_condition: str,
    match_condition: str,
    use_constant_subquery_alias: bool,
) -> str:
    if isinstance(join_type, (LeftSemi, LeftAnti)):
        return left_semi_or_anti_join_statement(
            left, right, join_type, join_condition, use_constant_subquery_alias
        )
    if isinstance(join_type, AsOf):
        return asof_join_statement(
            left, right, join_condition, match_condition, use_constant_subquery_alias
        )
    if isinstance(join_type, UsingJoin) and isinstance(
        join_type.tpe, (LeftSemi, LeftAnti)
    ):
        raise ValueError(f"Unexpected using clause in {join_type.tpe} join")
    return snowflake_supported_join_statement(
        left,
        right,
        join_type,
        join_condition,
        match_condition,
        use_constant_subquery_alias,
    )

Collaborator:

I think I am still confused here. The join seems to be a LogicalPlan node; if we do not plan to do it at the plan node level, where do we plan to calculate the accumulated complexity for join then?

                {ComplexityStat.COLUMN.value: len(self.join_type.using_columns)}
            )
        estimate += (
            self.join_condition.cumulative_complexity_stat
            if self.join_condition
            else Counter()
        )
        estimate += (
            self.match_condition.cumulative_complexity_stat
            if self.match_condition
            else Counter()
        )
        return estimate
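
(For the accumulated-complexity question in the thread above, a minimal sketch of how a cumulative stat could combine a node's own stat with its children's; the function and variable names here are illustrative, not the PR's API:)

from collections import Counter
from typing import Iterable

def cumulative_stat(individual_stat: Counter, children_stats: Iterable[Counter]) -> Counter:
    # a node's accumulated complexity is its own contribution
    # plus the cumulative contribution of each child subtree
    total = Counter(individual_stat)
    for child_stat in children_stats:
        total += child_stat
    return total

# e.g. a join node whose children each contributed one column reference
join_self = Counter({"join": 1})
left_child = Counter({"column": 1})
right_child = Counter({"column": 1})
print(cumulative_stat(join_self, [left_child, right_child]))
# Counter({'column': 2, 'join': 1})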
24 changes: 24 additions & 0 deletions src/snowflake/snowpark/_internal/analyzer/complexity_stat.py
@@ -0,0 +1,24 @@
#
# Copyright (c) 2012-2024 Snowflake Computing Inc. All rights reserved.
#

from enum import Enum


class ComplexityStat(Enum):
FILTER = "filter"
ORDER_BY = "order_by"
JOIN = "join"
SET_OPERATION = "set_operation" # UNION, EXCEPT, INTERSECT, UNION ALL
SAMPLE = "sample"
PIVOT = "pivot"
UNPIVOT = "unpivot"
WINDOW = "window"
GROUP_BY = "group_by"
PARTITION_BY = "partition_by"
CASE_WHEN = "case_when"
LITERAL = "literal"
COLUMN = "column"
Collaborator:

Some of the categories are very clear, like filter and pivot, but some are not so clear, like column, low_impact, and function. Maybe add some comments there about what those categories include; for example, all snowflake functions and table functions are counted under the function category, etc.

Contributor (author):

Done.

FUNCTION = "function"
IN = "in"
LOW_IMPACT = "low_impact"
Collaborator:

Shall we just add one category NO_IMPACT and count how many nodes we categorize as no impact? Then the total number of plan nodes is essentially just the sum of all of those categories.

Collaborator:

In fact, if we just call it the query plan node category, we can probably just call it "others", etc.

Contributor (author):

So you want to have both "LOW_IMPACT" and "OTHERS"?

Collaborator:

I think we can have both. Then we have a category for each node in the plan tree, and it would be easy for us to calculate the total number of nodes in the tree.
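
(A small sketch of that idea: once every plan node falls into exactly one category, the total node count is simply the sum over all category counts; the numbers below are made-up data:)

from collections import Counter

# cumulative category counts for a hypothetical plan tree
stats = Counter({"filter": 2, "join": 1, "column": 5, "low_impact": 3, "others": 4})

total_nodes = sum(stats.values())
print(total_nodes)  # 15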
