
Create cudf example that runs on Snowflake #494

Open
2 of 3 tasks
ncclementi opened this issue Jan 16, 2025 · 5 comments

ncclementi commented Jan 16, 2025

Tracking the showcasing of the #419 example in a separate issue.

Once the deployment PR #493 is ready, we can proceed to draft an example.

TODO:

ncclementi self-assigned this Jan 16, 2025

ncclementi commented Jan 17, 2025

Got an initial PoC working that reads data from a table in a specific database, from within the RAPIDS Snowpark container service.

  • Created a database with the accountadmin role, just to explore the case where the database is not created with the container role.
  • Created a simple table in there.
  • Had to grant permissions to the container role (CONTAINER_USER_ROLE) to have access to the database, as shown in the SQL below.
CREATE DATABASE naty_snowflake_test;

USE DATABASE naty_snowflake_test; 

CREATE OR REPLACE TABLE example_table (
    id INT,
    name STRING,
    age INT
);

INSERT INTO example_table (id, name, age) 
VALUES 
    (1, 'Alice', 30),
    (2, 'Bob', 25),
    (3, 'Charlie', 35);

SELECT * FROM example_table; 


-- Ensure the role has USAGE permissions on the database and schema
GRANT USAGE ON DATABASE NATY_SNOWFLAKE_TEST TO ROLE CONTAINER_USER_ROLE;
GRANT USAGE ON SCHEMA NATY_SNOWFLAKE_TEST.PUBLIC TO ROLE CONTAINER_USER_ROLE;

-- Ensure the role has SELECT permission on the table
GRANT SELECT ON TABLE NATY_SNOWFLAKE_TEST.PUBLIC.example_table TO ROLE CONTAINER_USER_ROLE;
  • Once that was granted, in a notebook in the container service I followed the Snowflake quickstart demo and tweaked it:
from snowflake.snowpark import Session
import os

def get_login_token():
    with open('/snowflake/session/token', 'r') as f:
        return f.read()

connection_parameters = {
    "account": os.getenv('SNOWFLAKE_ACCOUNT'),
    "host": os.getenv('SNOWFLAKE_HOST'),
    "token": get_login_token(),
    "authenticator": "oauth",
    "database": "NATY_SNOWFLAKE_TEST", # the created database
    "schema": "PUBLIC",
    "warehouse": "CONTAINER_HOL_WH",
}

session = Session.builder.configs(connection_parameters).create()
df = session.table("example_table") 
pd_df = df.to_pandas()

Got a pandas dataframe.
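
As a quick sanity check, a sketch that continues from the snippet above (the expected shape and column names are assumptions based on the example_table created earlier):

print(pd_df.shape)               # expected (3, 3) for the example_table above
print(pd_df.columns.tolist())    # Snowflake typically uppercases unquoted identifiers: ['ID', 'NAME', 'AGE']
print(pd_df.head())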

TODO:

  • Get parking tickets data in parquet format into a table in snowflake
  • Sort out what kind of permissions we need to grant to the container role to read and run the cudf-pandas-demo from the container service.
  • Create docs example.

@ncclementi

I made progress towards this: I was able to get the parking data into Snowflake, after a painful learning process. Leaving this here for documentation purposes. You need a permissive enough role that allows you to use a database and create tables in it, then:

USE DATABASE naty_snowflake_test; -- dummy DB I'm using for experimentation

CREATE OR REPLACE FILE FORMAT my_parquet_format
TYPE = 'PARQUET';

CREATE OR REPLACE STAGE my_s3_stage
URL = 's3://rapidsai-data/datasets/nyc_parking/'
FILE_FORMAT = my_parquet_format;

SELECT COLUMN_NAME, TYPE
FROM TABLE(
    INFER_SCHEMA(
        LOCATION => '@my_s3_stage',
        FILE_FORMAT => 'my_parquet_format',
        FILES => ('nyc_parking_violations_2022.parquet')
    )
);


CREATE OR REPLACE TABLE nyc_parking_violations
  USING TEMPLATE (
    SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
      FROM TABLE(
        INFER_SCHEMA(
        LOCATION => '@my_s3_stage',
        FILE_FORMAT => 'my_parquet_format',
        FILES => ('nyc_parking_violations_2022.parquet')
        )
      ));

      
COPY INTO nyc_parking_violations
FROM @my_s3_stage
FILES = ('nyc_parking_violations_2022.parquet')
FILE_FORMAT = (TYPE = 'PARQUET')
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
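
To confirm the load from the container-service notebook, a minimal sketch reusing the same Snowpark connection pattern as the earlier snippet (it assumes the same USAGE/SELECT grants have also been applied to CONTAINER_USER_ROLE for nyc_parking_violations):

import os
from snowflake.snowpark import Session

def get_login_token():
    with open("/snowflake/session/token") as f:
        return f.read()

session = Session.builder.configs({
    "account": os.getenv("SNOWFLAKE_ACCOUNT"),
    "host": os.getenv("SNOWFLAKE_HOST"),
    "token": get_login_token(),
    "authenticator": "oauth",
    "database": "NATY_SNOWFLAKE_TEST",
    "schema": "PUBLIC",
    "warehouse": "CONTAINER_HOL_WH",
}).create()

df = session.table("NYC_PARKING_VIOLATIONS")
print(df.count())                       # row count, computed server-side
pd_sample = df.limit(1000).to_pandas()  # pull a small sample locally first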


ncclementi commented Jan 21, 2025

So, for this part we are sort of blocked.

I'm at the point where I have data in a Snowflake table in a database, which I'm trying to read from a notebook inside the service container and convert to a pandas DataFrame.

Problem: When %load_ext cudf.pandas is activated, doing a to_pandas() (if using a Snowpark session) or .fetch_pandas_all() (if using the Snowflake connector) takes significantly longer.

Time to convert to pandas
Without extension activated: 5 s
With extension activated: 2 min

It was suggested that I try the following:

from cudf.pandas.module_accelerator import disable_module_accelerator

with disable_module_accelerator():
    df = cur.fetch_pandas_all()

But after 5 min I keyboard-interrupted it.

That being said, after the dataframe is converted things run as expected, but the long conversion time to pandas is a problem at the moment.

Note: I'm working on a reproducible example just reading from a public dataset, to be able to report this properly.
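
For reference, roughly the shape such a repro could take, assuming access to the shared SNOWFLAKE_SAMPLE_DATA database; the table choice and connection placeholders are assumptions, and the timings above are not from this snippet:

%load_ext cudf.pandas   # comment this line out for the baseline (plain pandas) run

import time
import snowflake.connector

# Placeholder credentials; in the container service these would come from the session token instead.
conn = snowflake.connector.connect(
    account="<account>",
    user="<user>",
    password="<password>",
    warehouse="CONTAINER_HOL_WH",
    database="SNOWFLAKE_SAMPLE_DATA",
    schema="TPCH_SF1",
)
cur = conn.cursor()
cur.execute("SELECT * FROM CUSTOMER")

start = time.perf_counter()
df = cur.fetch_pandas_all()
print(f"fetch_pandas_all: {time.perf_counter() - start:.1f} s, shape={df.shape}")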

@ncclementi

xref: rapidsai/cudf#17775

@ncclementi

Small update and documenting next steps:

There is a PR in progress to work around the slowdown in to_pandas() when the cudf extension is loaded; see rapidsai/cudf#17811.

This means that, for the example, we will have to explain the following:

from cudf.pandas.module_accelerator import disable_module_accelerator

with disable_module_accelerator():
    df = cur.fetch_pandas_all() 

In my opinion, asking users to disable cudf.pandas for the to_pandas() transfer is tough and not ideal, but maybe okay in the meantime.

For a bit more context, I quote @galipremsagar:

The reason to_pandas (of snowflake) is inefficient is they are creating multiple-threads and each thread iterates over multiple chunks of dataframes and then combines all those slices into one frame in the end. Iteration is where cudf is very inefficient and adding on the serialization and deserialization with multi-threads is another inefficiency for us. - This is the reason we will need to use disable_module_accelerator for this API call.
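
To make the quoted point concrete, a rough single-threaded illustration of that access pattern (not Snowflake's actual code): many small per-chunk frames get built and then concatenated, and under cudf.pandas each of those small constructions plus the final concat goes through the proxy layer.

import pandas as pd

# Emulate the chunked fetch: many small DataFrames built one at a time...
chunks = [pd.DataFrame({"id": range(i, i + 10_000)}) for i in range(0, 1_000_000, 10_000)]

# ...then combined into a single frame at the end. The real connector additionally
# does this across multiple threads, with serialization/deserialization in between.
df = pd.concat(chunks, ignore_index=True)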

So the next steps are:

Out of scope of this Issue:

  • Sort out with snowpark/snowflake how to fix this particular conversion.
