Commit
Can track types for getting schema, but can't use the types to customize the generated sql. Comments in new_types_demo
Signed-off-by: sfc-gh-mvashishtha <[email protected]>
1 parent c5c8004 · commit f8491a1
Showing 10 changed files with 198 additions and 3 deletions.
@@ -0,0 +1,41 @@
import modin.pandas as pd
import snowflake.snowpark.modin.plugin
import numpy as np
import pandas as native_pd
from snowflake.snowpark.session import Session

session = Session.builder.create()

import logging

logging.getLogger("snowflake.snowpark").setLevel(logging.DEBUG)


### Section 1. Timedelta
df = pd.DataFrame(
    [
        [
            pd.Timestamp(year=2020, month=11, day=11, second=30),
            pd.Timestamp(year=2019, month=10, day=10, second=1),
        ]
    ]
)

# check we can print dataframe
print(df)

# The schema has the correct snowpark types.
print(df._query_compiler._modin_frame.ordered_dataframe._dataframe_ref.snowpark_dataframe.schema)

timedelta_result = df[0] - df[1]

# The timedelta type shows up as the last column in the schema!
print(timedelta_result._query_compiler._modin_frame.ordered_dataframe._dataframe_ref.snowpark_dataframe.schema)

# However, Snowflake still raises a type error because the expression types
# aren't available at the point where we generate the SQL, so we can't decide
# to use datediff instead of regular subtraction.
print(df[0] - df[1])

# adding timestamp to timedelta

# adding two timedelta

# timestamp + (timedelta + timedelta)

### Section 2. Interval

# Interval
# df = pd.DataFrame([pd.Interval(1, 3, closed='left'), pd.Interval(5, 7, closed='both')])
# print(df)
# dfp = df._to_pandas()
# print(dfp)
# print(list(dfp[0]))
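The three placeholder comments at the end of Section 1 are left unimplemented in this commit. A minimal native-pandas sketch of the operations they refer to; the Snowpark path is what the demo is probing, and it currently raises the type error noted above:

# Native-pandas sketch only; not the Snowpark code path this commit works on.
import pandas as native_pd

ts = native_pd.Timestamp(year=2020, month=11, day=11, second=30)
td = ts - native_pd.Timestamp(year=2019, month=10, day=10, second=1)

print(ts + td)         # adding timestamp to timedelta -> Timestamp
print(td + td)         # adding two timedelta -> Timedelta
print(ts + (td + td))  # timestamp + (timedelta + timedelta) -> Timestamp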
@@ -0,0 +1,52 @@
Problem: we build an expression col('a') - col('b') that has no type at any level.

It only becomes typed once we select it into a Selectable (like "SELECT [...] FROM [...]"), at which point it's not an expression but a Selectable.

Something like a Selectable could deduce the new types and also rewrite the expressions into the correct form (see the sketch below).
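A minimal sketch of the rewriting half of that idea, assuming operand types are known by the time we generate SQL. The function here is hypothetical, though DATEDIFF itself is a real Snowflake SQL function (the one the demo's comments say we can't currently choose):

# Hypothetical sketch, not Snowpark internals: once operand types are known,
# rewrite plain subtraction into DATEDIFF so timestamp - timestamp is valid SQL.
def rewrite_subtraction(left_sql: str, right_sql: str,
                        left_type: str, right_type: str) -> str:
    if left_type == right_type == "timestamp":
        # In Snowflake, DATEDIFF(<part>, <expr1>, <expr2>) computes expr2 - expr1;
        # 'second' as the unit is an illustrative choice.
        return f"DATEDIFF('second', {right_sql}, {left_sql})"
    return f"({left_sql} - {right_sql})"

print(rewrite_subtraction('"0"', '"1"', "timestamp", "timestamp"))
# DATEDIFF('second', "1", "0")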
2a. Could do it in SelectStatement, which deals with Expression, but then we cannot make any sense of the myriad snowpark function calls.

2b. Could do it in Snowpark DataFrame, which deals with Column.
- Problem: dataframe.select also gets expressions. How can the dataframe even identify what it's selecting, let alone tell us how to translate the types?
  Chicken and egg: I'm trying to select f(g(h("A"))), but I can't tell how to invoke h without knowing the type of "A" itself.
  How did we solve this problem in our prototype? We didn't. We never really successfully propagated the Snowpark type up to the pandas layer, and we were probably not tracking the Snowpark type correctly.
  We did the SQL translation in the pandas layer, and when converting to pandas we didn't count on our type tracking to tell us which columns consist of native_pd.Interval: we guessed from the JSON values themselves that we were dealing with timedelta / interval, rather than using a correctly tracked type.

Functions, which deal with Column objects, need to be able to tell snowpark what the result types are (sketched below).
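One way to picture that requirement: each function carries a rule mapping its argument types to a result type. A self-contained sketch; none of these names are real Snowpark API:

# Hypothetical sketch, not real Snowpark API: a function expression that
# carries a rule for deducing its result type from its argument types.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class TypedFunctionExpression:
    name: str
    # Maps the list of argument type names to the result type name.
    deduce_type: Callable[[List[str]], str]

# Example rule: timestamp - timestamp yields a timedelta; anything else is unknown.
minus = TypedFunctionExpression(
    name="minus",
    deduce_type=lambda arg_types: (
        "timedelta" if arg_types == ["timestamp", "timestamp"] else "unknown"
    ),
)

print(minus.deduce_type(["timestamp", "timestamp"]))  # timedelta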
What I prototyped for Option 2 wasn't really what I wrote about for Option 2 in the design doc.

pd.DataFrame([[pd.Timestamp(year=2020, month=11, day=11, second=30), pd.Timestamp(year=2019, month=10, day=10, second=1)]])

makes something like

Select(
    Column("__index__"),
    Column(alias to "0",
        FunctionExpression(
            'to_timestamp',
            UnresolvedAttribute("0"),
        )
    ),
    Column(alias to "1",
        FunctionExpression(
            'to_timestamp',
            UnresolvedAttribute("1"),
        )
    ),
)

Currently we get the schema for that by generating the SQL and asking Snowflake.

But if we knew the schema of what we're querying, shouldn't we be able to deduce the schema of the result without asking Snowflake again?

On SelectStatement:

self.input_schema = self.input_data.schema

Say I call SelectStatement.select(); then for each thing I'm selecting, I use its lazy type inference code to deduce its type based on the input types.

select_statement.select(h(f(g(A + B)), 3))
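A minimal sketch of that deduction, assuming every expression node can report its type given the input schema. All names here are hypothetical stand-ins, not Snowpark's actual SelectStatement:

# Hypothetical sketch: deduce the result schema of a select from the input
# schema, without a round trip to Snowflake.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Attribute:
    name: str

    def infer_type(self, input_schema: Dict[str, str]) -> str:
        # A bare column reference has whatever type the input schema says.
        return input_schema[self.name]

@dataclass
class Minus:
    left: Attribute
    right: Attribute

    def infer_type(self, input_schema: Dict[str, str]) -> str:
        # Recurse into operands, then apply the operator's typing rule.
        types = [self.left.infer_type(input_schema),
                 self.right.infer_type(input_schema)]
        return "timedelta" if types == ["timestamp", "timestamp"] else "unknown"

@dataclass
class SelectStatement:
    input_schema: Dict[str, str]

    def select(self, exprs: List) -> Dict[str, str]:
        # Deduce each selected expression's type lazily from the input types.
        return {f"col{i}": e.infer_type(self.input_schema)
                for i, e in enumerate(exprs)}

stmt = SelectStatement(input_schema={"0": "timestamp", "1": "timestamp"})
print(stmt.select([Minus(Attribute("0"), Attribute("1"))]))  # {'col0': 'timedelta'}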
I would probably go the route here of adding functions
register_compound_user_type(name="TimeDelta", members={"days": LongType(), "seconds": LongType(), ...})
and unregister_compound_user_type(name="TimeDelta")
here. Then, within a session (for which the types should be alive), I would keep these around and track them separately. This would allow us to simplify the logic.
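A minimal sketch of what that session-scoped registry could look like. The two function names and the TimeDelta example come from the comment above; the dict-based storage is an assumption, and only LongType is real Snowpark API (snowflake.snowpark.types.LongType):

# Hypothetical sketch of the suggested registry; these functions do not exist
# in Snowpark today.
from typing import Dict
from snowflake.snowpark.types import DataType, LongType

# Session-scoped store of compound user types, tracked separately from
# Snowflake's own types.
_compound_user_types: Dict[str, Dict[str, DataType]] = {}

def register_compound_user_type(name: str, members: Dict[str, DataType]) -> None:
    # Keep the definition alive for the lifetime of the session.
    _compound_user_types[name] = members

def unregister_compound_user_type(name: str) -> None:
    _compound_user_types.pop(name, None)

register_compound_user_type(
    name="TimeDelta",
    members={"days": LongType(), "seconds": LongType()},
)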