Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNOW-1526571: Mock_iff not working if there are null values in the columns #1887

Closed
frederiksteiner opened this issue Jul 9, 2024 · 7 comments
Assignees
Labels
bug Something isn't working local testing Local Testing issues/PRs status-triage_done Initial triage done, will be further handled by the driver team

Comments

@frederiksteiner
Copy link
Contributor

frederiksteiner commented Jul 9, 2024

Please answer these questions before submitting your issue. Thanks!

  1. What version of Python are you using?

    3.11.8

  2. What operating system and processor architecture are you using?

    Linux-5.10.102.1-microsoft-standard-WSL2-x86_64-with-glibc2.35

  3. What are the component versions in the environment (pip freeze)?

snowflake-connector-python==3.11.0
snowflake-snowpark-python==1.19.0

  1. What did you do?
    Start local testing session and create
from snowflake.snowpark import Session
import snowflake.snowpark.functions as spf
conn_params = {
    "schema": self.schema,
    "local_testing": True,
    "timezone": gc.timezone_local,
    **kwargs,
}

session = Session.builder.configs(conn_params).create()
import snowflake.snowpark.functions as spf
from snowflake.snowpark.window import Window
data = [
    (1, 1, 1, None),
    (1, 1, 2, None),
    (1, 1, 3, 1),
    (1, 1, 4, None),
    (1, 1, 5, None),
    (1, 1, 6, 2),
    (1, 1, 7, None),
    (1, 1, 8, None),
]
schema = ["COL1","COL2","COL3","COLUMN_TO_FILL"]

df = session.create_dataframe(
    data=data,
    schema=schema,
)
window = Window.partition_by(["COL1", "COL2"]).order_by("COL3")
lead = spf.lead(df.col("COLUMN_TO_FILL"), ignore_nulls=True).over(window)
lag = spf.lag(df.col("COLUMN_TO_FILL"), ignore_nulls=True).over(window)

max_lead_lag = spf.iff(lead > lag, lead, lag)
df = df.with_column("MAX_LEAD_LAG",max_lead_lag)
result = df.to_pandas()
expected = pd.DataFrame(
    [
        (1, 1, 1, None, None),
        (1, 1, 2, None, None),
        (1, 1, 3, 1, None),
        (1, 1, 4, None, 2.0),
        (1, 1, 5, None, 2.0),
        (1, 1, 6, 2, 1.0),
        (1, 1, 7, None, 2.0),
        (1, 1, 8, None, 2.0),
    ],
    columns=schema + ["MAX_LEAD_LAG"],
)
pd.testing.assert_frame_equal(
    result,
    expected,
    check_dtype=False,
)
  1. What did you expect to see?

    That this should assert to True and not false.

The column "MAX_LEAD_LAG" is all Nones. The reason for it is that the lead and lag column are NullType columns afterwards and hence the max_lead_lag is full of Nones. Hence it is not really a problem of mock_iff but of the lead and lag testing method returning NullType columns.

@patch("iff")
def mock_iff(condition: ColumnEmulator, expr1: ColumnEmulator, expr2: ColumnEmulator):
    assert isinstance(condition.sf_type.datatype, BooleanType)

    coerce_result = get_coerce_result_type(expr1.sf_type, expr2.sf_type) #### This is none, since both lead and lag are NullTypecolumns
    if all(condition) or all(~condition) or coerce_result is not None:
        res = ColumnEmulator(data=[None] * len(condition), dtype=object)
        expr1 = cast_column_to(expr1, coerce_result) ##### This returns empty
        expr2 = cast_column_to(expr2, coerce_result) ##### This returns empty
        res.where(condition, other=expr2, inplace=True)
        res.where([not x for x in condition], other=expr1, inplace=True)
        res.sf_type = coerce_result
        return res ##### Now this is full of Nones too
    else:
        raise SnowparkLocalTestingException(
            f"[Local Testing] expr1 and expr2 have conflicting datatypes that cannot be coerced: {expr1.sf_type} <-> {expr2.sf_type}"
        )

@frederiksteiner frederiksteiner added bug Something isn't working needs triage Initial RCA is required labels Jul 9, 2024
@github-actions github-actions bot changed the title Mock_iff not working if there are null values in the columns SNOW-1526571: Mock_iff not working if there are null values in the columns Jul 9, 2024
@sfc-gh-sghosh sfc-gh-sghosh self-assigned this Jul 10, 2024
@sfc-gh-sghosh sfc-gh-sghosh added the status-triage Issue is under initial triage label Jul 10, 2024
@sfc-gh-sghosh
Copy link

Hello @frederiksteiner ,

Thanks for raising the issue, we are checking, will update.

Regards,
Sujan

@sfc-gh-dszmolka sfc-gh-dszmolka removed the needs triage Initial RCA is required label Jul 16, 2024
@sfc-gh-sghosh
Copy link

Hello @frederiksteiner ,

I checked the code snippet, there is no issue with normal session, but I am not able to reproduce the exact issue with local testing.

Using snowpark 3.19.0 and Python 3.11
The issue coming when converting the snowflake df to pandas df.

FutureWarning: Mismatched null-like values None and nan found. In a future version, pandas equality-testing functions (e.g. assert_frame_equal) will consider these not-matching and raise.
pd.testing.assert_frame_equal(result, expected, check_dtype=False, check_like=True)

`Result
COL1 COL2 COL3 COLUMN_TO_FILL MAX_LEAD_LAG
0 1 1 1 NaN None
1 1 1 2 NaN None
2 1 1 3 1.0 None
3 1 1 4 NaN None
4 1 1 5 NaN None
5 1 1 6 2.0 None
6 1 1 7 NaN None
7 1 1 8 NaN None

Expected
COL1 COL2 COL3 COLUMN_TO_FILL MAX_LEAD_LAG
0 1 1 1 NaN NaN
1 1 1 2 NaN NaN
2 1 1 3 1.0 NaN
3 1 1 4 NaN 2.0
4 1 1 5 NaN 2.0
5 1 1 6 2.0 1.0
6 1 1 7 NaN 2.0
7 1 1 8 NaN 2.0`

AssertionError: DataFrame.iloc[:, 4] (column name="MAX_LEAD_LAG") are different

DataFrame.iloc[:, 4] (column name="MAX_LEAD_LAG") values are different (62.5 %)
[index]: [0, 1, 2, 3, 4, 5, 6, 7]
[left]: [None, None, None, None, None, None, None, None]
[right]: [nan, nan, nan, 2.0, 2.0, 1.0, 2.0, 2.0]
At positional index 3, first diff: None != 2.0

Could you check if any other configurations done with local_testing.

Regards,
Sujan

@sfc-gh-aling sfc-gh-aling added the local testing Local Testing issues/PRs label Jul 22, 2024
@sfc-gh-sghosh sfc-gh-sghosh added status-triage_done Initial triage done, will be further handled by the driver team and removed status-triage Issue is under initial triage labels Jul 23, 2024
@sfc-gh-sghosh
Copy link

Hello @frederiksteiner ,

Did you get a chance to check the previous update?

Regards,
Sujan

@frederiksteiner
Copy link
Contributor Author

Hey @sfc-gh-sghosh

i did not yet since I do not have a laptop on me this week, but I will check on Monday and reply again.

Sorry for the inconvenience.
Kind regards,
Frederik

@frederiksteiner
Copy link
Contributor Author

frederiksteiner commented Jul 29, 2024

Hey @sfc-gh-sghosh
You are right, the result and expected looks as you indicated and I didn't copy it correctly. It should look like yours.

`Result
COL1 COL2 COL3 COLUMN_TO_FILL MAX_LEAD_LAG
0 1 1 1 NaN None
1 1 1 2 NaN None
2 1 1 3 1.0 None
3 1 1 4 NaN None
4 1 1 5 NaN None
5 1 1 6 2.0 None
6 1 1 7 NaN None
7 1 1 8 NaN None

Expected
COL1 COL2 COL3 COLUMN_TO_FILL MAX_LEAD_LAG
0 1 1 1 NaN NaN
1 1 1 2 NaN NaN
2 1 1 3 1.0 NaN
3 1 1 4 NaN 2.0
4 1 1 5 NaN 2.0
5 1 1 6 2.0 1.0
6 1 1 7 NaN 2.0
7 1 1 8 NaN 2.0`

But the issue still stands

@sfc-gh-aling
Copy link
Contributor

sfc-gh-aling commented Jul 30, 2024

thanks for the PR @frederiksteiner , I have merged my PR #1959 which is a duplicate of yours with some additional updates.

sorry I couldn't merge yours due to our ci/cd limitations that external PR could not run tests.

@frederiksteiner
Copy link
Contributor Author

Alright, thanks for the new PR fixing this issue. I will close this issue then and also close my PR :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working local testing Local Testing issues/PRs status-triage_done Initial triage done, will be further handled by the driver team
Projects
None yet
Development

No branches or pull requests

4 participants