
Create cudf example that runs on Snowflake #494

Open
2 of 3 tasks
ncclementi opened this issue Jan 16, 2025 · 5 comments

ncclementi commented Jan 16, 2025

Tracking the showcasing of the #419 example in a separate issue.

Once the deployment PR #493 is ready, we can proceed to draft an example.

TODO:

ncclementi self-assigned this Jan 16, 2025

ncclementi commented Jan 17, 2025

Got an initial PoC working that reads data from a table in a specific database, from within the RAPIDS Snowpark container service.

  • Created a database with the accountadmin role, just to explore the case where the database is not created with the container role.
  • Created a simple table in there.
  • Had to grant permissions to the container role (CONTAINER_USER_ROLE) to have access to the database, as shown in the SQL below.
CREATE DATABASE naty_snowflake_test;

USE DATABASE naty_snowflake_test; 

CREATE OR REPLACE TABLE example_table (
    id INT,
    name STRING,
    age INT
);

INSERT INTO example_table (id, name, age) 
VALUES 
    (1, 'Alice', 30),
    (2, 'Bob', 25),
    (3, 'Charlie', 35);

SELECT * FROM example_table; 


-- Ensure the role has USAGE permissions on the database and schema
GRANT USAGE ON DATABASE NATY_SNOWFLAKE_TEST TO ROLE CONTAINER_USER_ROLE;
GRANT USAGE ON SCHEMA NATY_SNOWFLAKE_TEST.PUBLIC TO ROLE CONTAINER_USER_ROLE;

-- Ensure the role has SELECT permission on the table
GRANT SELECT ON TABLE NATY_SNOWFLAKE_TEST.PUBLIC.example_table TO ROLE CONTAINER_USER_ROLE;
  • Once that was granted, in a notebook in the container service I followed the Snowflake quickstart demo and tweaked it:
from snowflake.snowpark import Session
import os

def get_login_token():
    with open('/snowflake/session/token', 'r') as f:
        return f.read()

connection_parameters = {
    "account": os.getenv('SNOWFLAKE_ACCOUNT'),
    "host": os.getenv('SNOWFLAKE_HOST'),
    "token": get_login_token(),
    "authenticator": "oauth",
    "database": "NATY_SNOWFLAKE_TEST", # the created database
    "schema": "PUBLIC",
    "warehouse": "CONTAINER_HOL_WH",
}

session = Session.builder.configs(connection_parameters).create()
df = session.table("example_table") 
pd_df = df.to_pandas()

Got a pandas dataframe.
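
As a quick sanity check, a sketch that continues from the snippet above (the expected shape and column names are assumptions based on the example_table created earlier):

print(pd_df.shape)               # expected (3, 3) for the example_table above
print(pd_df.columns.tolist())    # Snowflake typically uppercases unquoted identifiers: ['ID', 'NAME', 'AGE']
print(pd_df.head())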

TODO:

  • Get parking tickets data in parquet format into a table in snowflake
  • Sort out what kind of permissions we need to grant to the container role to read and run the cudf-pandas-demo from the container service.
  • Create docs example.

@ncclementi

I made progress towards this: I was able to get the parking data into Snowflake, after a painful learning process. Leaving this here for documentation purposes. You need a permissive enough role that allows you to use a database and create tables in it, then:

USE DATABASE naty_snowflake_test; -- dummy DB I'm using for experimentation

CREATE OR REPLACE FILE FORMAT my_parquet_format
TYPE = 'PARQUET';

CREATE OR REPLACE STAGE my_s3_stage
URL = 's3://rapidsai-data/datasets/nyc_parking/'
FILE_FORMAT = my_parquet_format;

SELECT COLUMN_NAME, TYPE
FROM TABLE(
    INFER_SCHEMA(
        LOCATION => '@my_s3_stage',
        FILE_FORMAT => 'my_parquet_format',
        FILES => ('nyc_parking_violations_2022.parquet')
    )
);


CREATE OR REPLACE TABLE nyc_parking_violations
  USING TEMPLATE (
    SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
      FROM TABLE(
        INFER_SCHEMA(
        LOCATION => '@my_s3_stage',
        FILE_FORMAT => 'my_parquet_format',
        FILES => ('nyc_parking_violations_2022.parquet')
        )
      ));

      
COPY INTO nyc_parking_violations
FROM @my_s3_stage
FILES = ('nyc_parking_violations_2022.parquet')
FILE_FORMAT = (TYPE = 'PARQUET')
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
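
To confirm the load from the container-service notebook, a minimal sketch reusing the same Snowpark connection pattern as the earlier snippet (it assumes the same USAGE/SELECT grants have also been applied to CONTAINER_USER_ROLE for nyc_parking_violations):

import os
from snowflake.snowpark import Session

def get_login_token():
    with open("/snowflake/session/token") as f:
        return f.read()

session = Session.builder.configs({
    "account": os.getenv("SNOWFLAKE_ACCOUNT"),
    "host": os.getenv("SNOWFLAKE_HOST"),
    "token": get_login_token(),
    "authenticator": "oauth",
    "database": "NATY_SNOWFLAKE_TEST",
    "schema": "PUBLIC",
    "warehouse": "CONTAINER_HOL_WH",
}).create()

df = session.table("NYC_PARKING_VIOLATIONS")
print(df.count())                       # row count, computed server-side
pd_sample = df.limit(1000).to_pandas()  # pull a small sample locally first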


ncclementi commented Jan 21, 2025

So, for this part we are sort of blocked.

I'm at the point where I have data in a Snowflake table in a database, which I'm trying to read from a notebook inside the service container and convert to a pandas DataFrame.

Problem: When %load_ext cudf.pandas is activated, doing a to_pandas() (if using a Snowpark session) or .fetch_pandas_all() (if using the Snowflake connector) takes significantly longer.

Time to convert to pandas
Without extension activated: 5 s
With extension activated: 2 min

It was suggested that I try the following:

from cudf.pandas.module_accelerator import disable_module_accelerator

with disable_module_accelerator():
    df = cur.fetch_pandas_all()

But after 5 min I keyboard-interrupted it.

That being said, after the dataframe is converted things run as expected, but the long conversion time to pandas is a problem at the moment.

Note: I'm working on a reproducible example just reading from a public dataset, to be able to report this properly.
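
For reference, roughly the shape such a repro could take, assuming access to the shared SNOWFLAKE_SAMPLE_DATA database; the table choice and connection placeholders are assumptions, and the timings above are not from this snippet:

%load_ext cudf.pandas   # comment this line out for the baseline (plain pandas) run

import time
import snowflake.connector

# Placeholder credentials; in the container service these would come from the session token instead.
conn = snowflake.connector.connect(
    account="<account>",
    user="<user>",
    password="<password>",
    warehouse="CONTAINER_HOL_WH",
    database="SNOWFLAKE_SAMPLE_DATA",
    schema="TPCH_SF1",
)
cur = conn.cursor()
cur.execute("SELECT * FROM CUSTOMER")

start = time.perf_counter()
df = cur.fetch_pandas_all()
print(f"fetch_pandas_all: {time.perf_counter() - start:.1f} s, shape={df.shape}")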

@ncclementi

xref: rapidsai/cudf#17775

@ncclementi

Small update and documenting next steps:

There is a PR in progress to work around the slowdown in to_pandas() when the cudf extension is loaded; see rapidsai/cudf#17811.

This means that, for the example, we will have to explain the following:

from cudf.pandas.module_accelerator import disable_module_accelerator

with disable_module_accelerator():
    df = cur.fetch_pandas_all() 

In my opinion, asking users to disable cudf.pandas for the to_pandas() transfer is tough and not ideal, but maybe okay in the meantime.

For a bit more context, I quote @galipremsagar:

The reason to_pandas (of snowflake) is inefficient is they are creating multiple-threads and each thread iterates over multiple chunks of dataframes and then combines all those slices into one frame in the end. Iteration is where cudf is very inefficient and adding on the serialization and deserialization with multi-threads is another inefficiency for us. - This is the reason we will need to use disable_module_accelerator for this API call.
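
To make the quoted point concrete, a rough single-threaded illustration of that access pattern (not Snowflake's actual code): many small per-chunk frames get built and then concatenated, and under cudf.pandas each of those small constructions plus the final concat goes through the proxy layer.

import pandas as pd

# Emulate the chunked fetch: many small DataFrames built one at a time...
chunks = [pd.DataFrame({"id": range(i, i + 10_000)}) for i in range(0, 1_000_000, 10_000)]

# ...then combined into a single frame at the end. The real connector additionally
# does this across multiple threads, with serialization/deserialization in between.
df = pd.concat(chunks, ignore_index=True)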

So the next steps are:

Out of scope of this Issue:

  • Sort out with snowpark/snowflake how to fix this particular conversion.
