Create cudf example that runs on Snowflake #494
Comments
Got an initial PoC working that reads data from a table in a specific database, from within the RAPIDS Snowpark container service.
CREATE DATABASE naty_snowflake_test;
USE DATABASE naty_snowflake_test;
CREATE OR REPLACE TABLE example_table (
id INT,
name STRING,
age INT
);
INSERT INTO example_table (id, name, age)
VALUES
(1, 'Alice', 30),
(2, 'Bob', 25),
(3, 'Charlie', 35);
SELECT * FROM example_table;
-- Ensure the role has USAGE permissions on the database and schema
GRANT USAGE ON DATABASE NATY_SNOWFLAKE_TEST TO ROLE CONTAINER_USER_ROLE;
GRANT USAGE ON SCHEMA NATY_SNOWFLAKE_TEST.PUBLIC TO ROLE CONTAINER_USER_ROLE;
-- Ensure the role has SELECT permission on the table
GRANT SELECT ON TABLE NATY_SNOWFLAKE_TEST.PUBLIC.example_table TO ROLE CONTAINER_USER_ROLE;
from snowflake.snowpark import Session
import os

def get_login_token():
    with open('/snowflake/session/token', 'r') as f:
        return f.read()

connection_parameters = {
    "account": os.getenv('SNOWFLAKE_ACCOUNT'),
    "host": os.getenv('SNOWFLAKE_HOST'),
    "token": get_login_token(),
    "authenticator": "oauth",
    "database": "NATY_SNOWFLAKE_TEST",  # the created database
    "schema": "PUBLIC",
    "warehouse": "CONTAINER_HOL_WH",
}
session = Session.builder.configs(connection_parameters).create()
df = session.table("example_table")
pd_df = df.to_pandas()
Got a pandas dataframe. TODO:
|
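A minimal sketch of the connection setup from the snippet above, pulled into a helper that fails early when the environment is incomplete. The names `build_connection_parameters`, `token_path` and `env` are hypothetical and only here for illustration; the token path and dictionary keys are the ones used in the snippet above.

```python
import os
from pathlib import Path

# Token location used in the snippet above (inside the service container).
DEFAULT_TOKEN_PATH = "/snowflake/session/token"


def build_connection_parameters(database, schema, warehouse,
                                token_path=DEFAULT_TOKEN_PATH, env=None):
    """Return a dict that can be passed to Session.builder.configs(...).

    Hypothetical helper: `token_path` and `env` are parameters only so the
    function can be exercised outside the container.
    """
    env = os.environ if env is None else env

    # Fail early with a readable message instead of passing None values on.
    missing = [k for k in ("SNOWFLAKE_ACCOUNT", "SNOWFLAKE_HOST") if not env.get(k)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")

    token = Path(token_path).read_text()
    if not token.strip():
        raise RuntimeError(f"empty OAuth token file: {token_path}")

    return {
        "account": env["SNOWFLAKE_ACCOUNT"],
        "host": env["SNOWFLAKE_HOST"],
        "token": token,
        "authenticator": "oauth",
        "database": database,
        "schema": schema,
        "warehouse": warehouse,
    }
```

Inside the container this would presumably be used as `Session.builder.configs(build_connection_parameters("NATY_SNOWFLAKE_TEST", "PUBLIC", "CONTAINER_HOL_WH")).create()`.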
I made progress on this: after a painful learning process, I was able to get the parking data into Snowflake. Leaving this here for documentation purposes. You need a role permissive enough to use a database and create tables in it, then:
USE DATABASE naty_snowflake_test; -- dummy DB I'm using for experimentation
CREATE OR REPLACE FILE FORMAT my_parquet_format
TYPE = 'PARQUET';
CREATE OR REPLACE STAGE my_s3_stage
URL = 's3://rapidsai-data/datasets/nyc_parking/'
FILE_FORMAT = my_parquet_format;
SELECT COLUMN_NAME, TYPE
FROM TABLE(
INFER_SCHEMA(
LOCATION => '@my_s3_stage',
FILE_FORMAT => 'my_parquet_format',
FILES => ('nyc_parking_violations_2022.parquet')
)
);
CREATE OR REPLACE TABLE nyc_parking_violations
USING TEMPLATE (
SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
FROM TABLE(
INFER_SCHEMA(
LOCATION => '@my_s3_stage',
FILE_FORMAT => 'my_parquet_format',
FILES => ('nyc_parking_violations_2022.parquet')
)
));
COPY INTO nyc_parking_violations
FROM @my_s3_stage
FILES = ('nyc_parking_violations_2022.parquet')
FILE_FORMAT = (TYPE = 'PARQUET')
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE; |
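If these steps end up in a notebook, they could be generated and run from Python instead of a worksheet. This is only a sketch: `parquet_load_statements` is a hypothetical helper that rebuilds the same statements as above (minus the exploratory `SELECT`), and it assumes a Snowpark `session` like the one created earlier for executing them.

```python
def parquet_load_statements(table, stage, file_format, url, filename):
    """Build the SQL statements from the steps above, in order.

    Hypothetical helper. Values are interpolated directly into the SQL text,
    so it should only be called with trusted, hard-coded names.
    """
    infer = (
        "INFER_SCHEMA("
        f"LOCATION => '@{stage}', "
        f"FILE_FORMAT => '{file_format}', "
        f"FILES => ('{filename}'))"
    )
    return [
        # 1. File format describing the staged files.
        f"CREATE OR REPLACE FILE FORMAT {file_format} TYPE = 'PARQUET'",
        # 2. External stage pointing at the bucket.
        f"CREATE OR REPLACE STAGE {stage} URL = '{url}' FILE_FORMAT = {file_format}",
        # 3. Table whose columns are inferred from the parquet file.
        f"CREATE OR REPLACE TABLE {table} USING TEMPLATE "
        f"(SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*)) FROM TABLE({infer}))",
        # 4. Load the rows, matching parquet columns to table columns by name.
        f"COPY INTO {table} FROM @{stage} FILES = ('{filename}') "
        "FILE_FORMAT = (TYPE = 'PARQUET') MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE",
    ]


statements = parquet_load_statements(
    table="nyc_parking_violations",
    stage="my_s3_stage",
    file_format="my_parquet_format",
    url="s3://rapidsai-data/datasets/nyc_parking/",
    filename="nyc_parking_violations_2022.parquet",
)

# With a Snowpark session available, each statement could then be run with:
#     for stmt in statements:
#         session.sql(stmt).collect()
```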
So, for this part we are sort of blocked. I'm at the point where I have data in a Snowflake table in a database, which I'm trying to read in a notebook inside the service container and convert to a pandas dataframe. Problem: when it was time to convert to pandas, it was suggested that I try the following:
But after 5 minutes I interrupted it from the keyboard. That said, once the dataframe is converted, things run as expected. The long time spent converting to pandas is the problem at the moment. Note: I'm working on a reproducible example that just reads from a public dataset, to be able to report this properly. |
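For the reproducer, something like the following could be used to put a number on the conversion time. It is a rough sketch with a hypothetical `timed` helper; it only measures wall-clock time and does not interrupt a call that hangs.

```python
import time


def timed(label, func, *args, **kwargs):
    """Call func(*args, **kwargs), print the wall-clock time, return (result, seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f} s")
    return result, elapsed


# In the notebook this would wrap the slow call, for example:
#     pd_df, seconds = timed("to_pandas", df.to_pandas)
```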
xref: rapidsai/cudf#17775 |
Small update and documenting next steps: There is a PR in progress with a workaround for the slowdown. This means that for the example we will have to use and explain the following:
from cudf.pandas.module_accelerator import disable_module_accelerator
with disable_module_accelerator():
    df = cur.fetch_pandas_all()
In my opinion, asking users to disable cudf.pandas for the ...
For a bit more context, I quote @galipremsagar:
So the next steps are:
Out of scope of this Issue:
|
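For the example notebook, the `disable_module_accelerator` workaround above could be wrapped so the same cell also runs where cudf is not installed. This is a sketch under that assumption: `accelerator_bypass` and `fetch_without_acceleration` are hypothetical names, and only `disable_module_accelerator` comes from the snippet above.

```python
from contextlib import nullcontext


def accelerator_bypass():
    """Return a context manager that turns cudf.pandas acceleration off.

    Falls back to a no-op context when cudf cannot be imported. The broad
    except is deliberate in this sketch: any import failure is treated as
    "cudf is not usable here".
    """
    try:
        from cudf.pandas.module_accelerator import disable_module_accelerator
    except Exception:
        return nullcontext()
    return disable_module_accelerator()


def fetch_without_acceleration(fetch):
    """Run a zero-argument fetch callable with cudf.pandas acceleration disabled."""
    with accelerator_bypass():
        return fetch()


# With a Snowflake cursor `cur`, as in the snippet above:
#     df = fetch_without_acceleration(cur.fetch_pandas_all)
```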
Tracking the showcasing of the example of #419 in a separate issue.
Once the deployment PR #493 is ready, we can proceed to draft an example:
TODO:
to_pandas
cudf#17775 )