SNOW-1520022: Issues with VariantType Column Casting and Data Collection in Snowpark-Python #1881

Closed
hima-gopisetty opened this issue Jul 4, 2024 · 4 comments
Assignees
Labels
local testing Local Testing issues/PRs status-triage_done Initial triage done, will be further handled by the driver team

Comments

@hima-gopisetty

Please answer these questions before submitting your issue. Thanks!

  1. What version of Python are you using?

    Python 3.11.8
    
  2. What are the Snowpark Python and pandas versions in the environment?

    pandas==1.5.3
    snowflake-snowpark-python==1.14.0
    
  3. What did you do?

I attempted to create a DataFrame containing configuration metadata using Snowpark-Python with local testing enabled. The metadata includes a field, autoSuspendDuration, with a numeric value.
I encountered two issues:

  • Issue 1: When casting the autoSuspendDuration field to IntegerType, the resulting value was Decimal('59.0') instead of the expected 59.
  • Issue 2: Attempting to collect data from the DataFrame with the autoSuspendDuration field resulted in a TypeError: Object of type int64 is not JSON serializable.

Here's the complete runnable program:

from snowflake.snowpark.functions import lit, col, length, cast
from snowflake.snowpark.types import StringType, StructField, StructType, TimestampType, IntegerType, VariantType
from snowflake.snowpark import Session

# Create a session - ensure you set up your own configuration appropriately
session = Session.builder.config('local_testing', True).create()

def warehouse_metadata():
    return {
        "autoResume": "true",
        "autoSuspendDuration": 59,
        "clusterScalingPolicy": "STANDARD",
        "maxClusterCount": 1,
        "minClusterCount": 1,
        "queryTimeLimit": 7200,
    }

def create_warehouse_data(sess):
    return sess.create_dataframe(
        [
            [
                "WHS",
                warehouse_metadata(),
            ],
        ],
        schema=StructType(
            [
                StructField("WH_PREFIX", StringType()),
                StructField("CONFIG_METADATA", VariantType()),
            ]
        ),
    )

df = create_warehouse_data(session)
print(df.schema)
df.show()

suspend_duration_df = df.select(col("CONFIG_METADATA")["autoSuspendDuration"].alias("autoSuspendDuration"))

# Issue 1: Column type should be IntegerType but gives Decimal('59.0')
int_cast_df = suspend_duration_df.select(cast(suspend_duration_df.autoSuspendDuration, IntegerType()).alias("autoSuspendDuration"))
print(int_cast_df.collect())
print(suspend_duration_df.schema)
suspend_duration_df.show()

# Issue 2: Collecting data from the dataframe autoSuspendDuration VariantType() column throws TypeError
suspend_duration_df.collect()
  4. What did you expect to see?

    • For Issue 1, I expected the autoSuspendDuration column to be of type IntegerType with a value of 59, but the collected value is Decimal('59.0').
    • For Issue 2, I expected collecting the autoSuspendDuration column to succeed, but it raises TypeError: Object of type int64 is not JSON serializable.
      Error traceback:
    Traceback (most recent call last):
    ...
      suspend_duration_df.collect()
    ...
      result = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
    ...
      return self._internal_collect_with_tag_no_telemetry(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...
      return self._session._conn.execute(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...
      res.loc[idx, col] = json.dumps(
                          ^^^^^^^^^^^
    ...
      chunks = list(chunks)
               ^^^^^^^^^^^^
    ...
      o = _default(o)
          ^^^^^^^^^^^
    ...
      raise TypeError(f'Object of type {o.__class__.__name__} '
    TypeError: Object of type int64 is not JSON serializable
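
The final frames of the traceback can be reproduced without Snowpark. The following is a minimal sketch, assuming the value reaching `json.dumps` is a NumPy `int64` scalar (as the error message suggests); `to_builtin` is a hypothetical helper name and is not part of Snowpark or of the fix that shipped.

```python
import json

import numpy as np


def to_builtin(value):
    """Fallback for json.dumps: convert NumPy scalars to plain Python values."""
    if isinstance(value, np.generic):
        return value.item()
    raise TypeError(f"Object of type {type(value).__name__} is not JSON serializable")


# Assumption: the metadata value has become a NumPy scalar by the time it is serialized.
metadata = {"autoSuspendDuration": np.int64(59)}

error_message = None
try:
    json.dumps(metadata)  # the standard encoder does not know np.int64
except TypeError as exc:
    error_message = str(exc)

# Passing a default hook lets the encoder fall back to a builtin int.
encoded = json.dumps(metadata, default=to_builtin)

print(error_message)
print(encoded)
```

This only illustrates why the standard encoder rejects the value; it says nothing about where in Snowpark the conversion to `int64` happens.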
    

Issue Descriptions:

  • Issue 1: When casting the autoSuspendDuration value from a VariantType to IntegerType, the resulting value is Decimal('59.0') instead of 59.
  • Issue 2: Attempting to collect data from the dataframe containing the autoSuspendDuration column of type VariantType throws a TypeError: Object of type int64 is not JSON serializable.
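
For Issue 1, it may help to separate the numeric value from its Python type. The sketch below uses only the standard library and shows that `Decimal('59.0')` compares equal to 59 while still being a different type; `as_int` is a hypothetical client-side helper, not a Snowpark function, and it does not address the cast behaviour itself.

```python
from decimal import Decimal


def as_int(value):
    """Return a builtin int if the value is integral, otherwise raise ValueError."""
    d = Decimal(value)
    if d != d.to_integral_value():
        raise ValueError(f"{value!r} is not an integral value")
    return int(d)


collected = Decimal("59.0")  # what the cast returned in the report

print(collected == 59)       # numerically equal
print(type(collected))       # but still a Decimal, not an int
print(as_int(collected))     # converted on the client side
```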
@hima-gopisetty hima-gopisetty added bug Something isn't working local testing Local Testing issues/PRs needs triage Initial RCA is required labels Jul 4, 2024
@github-actions github-actions bot changed the title Issues with VariantType Column Casting and Data Collection in Snowpark-Python SNOW-1520022: Issues with VariantType Column Casting and Data Collection in Snowpark-Python Jul 4, 2024
@sfc-gh-sghosh sfc-gh-sghosh self-assigned this Jul 5, 2024
@sfc-gh-sghosh

Hello @hima-gopisetty ,

Thanks for raising the issue. We are looking into it and will update you.

Regards,
Sujan

@sfc-gh-sghosh sfc-gh-sghosh added the status-triage Issue is under initial triage label Jul 5, 2024
@trakmaker

@hima-gopisetty, as in my last comment, I suspect you are running the commands directly in the terminal. That is when I got the same error as you, but please correct me if I am wrong. Running the same code inside the Python environment works fine for me.

PS: I am using a very similar configuration to your environment setup

@sfc-gh-sghosh

Hello @hima-gopisetty ,

I ran the code snippet with the latest Snowpark Python version, 1.19.0, and there is no error. It works as expected with both a local testing session and a regular session object. The value is cast to IntegerType and the output is 59, not 59.0.

Please run it with the latest Snowpark Python 1.19.0 and let us know.
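
To confirm which release is actually installed before re-running, something like the following could be used. This is a sketch under assumptions: `parse_release` and `meets_minimum` are hypothetical helpers that only handle plain dotted numeric versions such as 1.14.0 and 1.19.0, the two versions mentioned in this thread.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(text):
    """Turn '1.19.0' into (1, 19, 0), stopping at the first non-numeric segment."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed, minimum="1.19.0"):
    """True if the installed release is at least the minimum release."""
    return parse_release(installed) >= parse_release(minimum)


try:
    installed = version("snowflake-snowpark-python")
except PackageNotFoundError:
    installed = None

if installed is None:
    print("snowflake-snowpark-python is not installed in this environment")
else:
    print(installed, "meets minimum:", meets_minimum(installed))
```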

Here is the output:

    StructType([StructField('WH_PREFIX', StringType(16777216), nullable=True), StructField('CONFIG_METADATA', VariantType(), nullable=True)])
    --------------------------------------------------------
    |"WH_PREFIX"  |"CONFIG_METADATA"                       |
    --------------------------------------------------------
    |WHS          |{                                       |
    |             |  "autoResume": "true",                 |
    |             |  "autoSuspendDuration": 59,            |
    |             |  "clusterScalingPolicy": "STANDARD",   |
    |             |  "maxClusterCount": 1,                 |
    |             |  "minClusterCount": 1,                 |
    |             |  "queryTimeLimit": 7200                |
    |             |}                                       |
    --------------------------------------------------------

    [Row(AUTOSUSPENDDURATION=59)]
    StructType([StructField('AUTOSUSPENDDURATION', VariantType(), nullable=True)])
    -------------------------
    |"AUTOSUSPENDDURATION"  |
    -------------------------
    |59                     |
    -------------------------

    [Row(AUTOSUSPENDDURATION='59')]
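
Note that the last line of the output shows the uncast VariantType value collected as the string '59', while the cast column yields the integer 59. A minimal sketch of turning such a collected string back into a Python value, assuming the string is JSON-formatted; `parse_variant` is a hypothetical helper, not a Snowpark API:

```python
import json


def parse_variant(raw):
    """Decode a collected VARIANT string into a Python value; pass None through."""
    if raw is None:
        return None
    return json.loads(raw)


value = parse_variant("59")
print(value, type(value))

nested = parse_variant('{"autoSuspendDuration": 59, "autoResume": "true"}')
print(nested["autoSuspendDuration"])
```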

Regards,
Sujan

@sfc-gh-sghosh sfc-gh-sghosh added status-triage_done Initial triage done, will be further handled by the driver team and removed bug Something isn't working needs triage Initial RCA is required status-triage Issue is under initial triage labels Jul 7, 2024
@sfc-gh-sghosh

Hello @hima-gopisetty ,

Closing the issue, as the code works with the latest Snowpark Python release (1.19.0). Please upgrade to the latest version.

Regards,
Sujan
