Performance Recommendations
===========================

This page contains recommendations to help improve performance when using the Snowpark pandas API.

Caching Intermediate Results
----------------------------
Snowpark pandas uses a lazy paradigm: when operations are called on a Snowpark pandas object,
a lazy operator graph is built up and executed only when an output operation is called (e.g. printing
the data, or persisting it to a table in Snowflake). This paradigm mirrors the Snowpark DataFrame paradigm,
and enables larger queries to be optimized using Snowflake's SQL query optimizer. Certain workloads, however,
can generate large operator graphs that include repeated, computationally expensive subgraphs.
Take the following code snippet as an example:
.. code-block:: python

    import modin.pandas as pd
    import numpy as np
    import snowflake.snowpark.modin.plugin
    from snowflake.snowpark import Session

    # Session.builder.create() will create a default Snowflake connection.
    Session.builder.create()

    df = pd.concat([pd.DataFrame([range(i, i + 5)]) for i in range(0, 150, 5)])
    print(df)
    df = df.reset_index(drop=True)
    print(df)
The above code snippet creates a 30x5 DataFrame by concatenating 30 smaller 1x5 DataFrames,
prints it, resets its index, and prints it again. The concatenation step can be expensive, and is
lazily recomputed every time the DataFrame is materialized (once per print). Instead, we recommend using
Snowpark pandas' ``cache_result`` API to materialize expensive computations that are reused,
in order to decrease the latency of longer pipelines:
.. code-block:: python

    import modin.pandas as pd
    import numpy as np
    import snowflake.snowpark.modin.plugin
    from snowflake.snowpark import Session

    # Session.builder.create() will create a default Snowflake connection.
    Session.builder.create()

    df = pd.concat([pd.DataFrame([range(i, i + 5)]) for i in range(0, 150, 5)])
    df = df.cache_result(inplace=False)
    print(df)
    df = df.reset_index(drop=True)
    print(df)
Consider using the ``cache_result`` API whenever a DataFrame or Series that is expensive to compute sees high reuse.
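The effect of materializing an intermediate result can be illustrated without Snowflake. The following is a toy sketch (hypothetical ``LazyFrame`` class and ``build`` function, not the real implementation): a lazily evaluated object re-runs its whole operator graph on every output operation, while ``cache_result`` runs it once and hands back a wrapper over the stored data.

```python
class LazyFrame:
    """Toy stand-in for a lazily evaluated frame (not the real API)."""

    def __init__(self, compute):
        self._compute = compute

    def materialize(self):
        # Runs the whole operator graph from scratch on every call.
        return self._compute()

    def cache_result(self):
        # Materialize once, then wrap the stored result so later
        # materializations are a cheap read instead of a recompute.
        data = self.materialize()
        return LazyFrame(lambda: data)


runs = {"n": 0}

def build():
    # Counts how often the expensive step actually runs.
    runs["n"] += 1
    return [list(range(i, i + 5)) for i in range(0, 150, 5)]


lazy = LazyFrame(build)
lazy.materialize()
lazy.materialize()
assert runs["n"] == 2  # recomputed on every output operation

cached = lazy.cache_result()  # third and final run of build()
cached.materialize()
cached.materialize()
assert runs["n"] == 3  # cached: no further recomputation
```

The same trade-off applies in Snowpark pandas: ``cache_result`` pays the cost of one materialization up front in exchange for cheap reads afterwards.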
Known Limitations
^^^^^^^^^^^^^^^^^
Using the ``cache_result`` API after an ``apply``, an ``applymap``, or a ``groupby.apply`` is unlikely to yield performance savings.
``apply(func, axis=1)`` when ``func`` has no return type annotation, and ``groupby.apply``, are implemented internally via UDTFs, and feature
intermediate result caching as part of their implementation. ``apply(func, axis=1)`` when ``func`` has a return type annotation, and ``applymap``,
internally use UDFs; any overhead observed when using these APIs is likely due to the set-up and definition of the UDF, and is unlikely to be
alleviated via the ``cache_result`` API.
#
# Copyright (c) 2012-2024 Snowflake Computing Inc. All rights reserved.
#

"""
File containing utilities for the extensions API.
"""
from snowflake.snowpark.modin.utils import Fn

cache_result_docstring = """
Persists the current Snowpark pandas {object_name} to a temporary table to improve the latency of subsequent operations.

Args:
    inplace: bool, default False
        Whether to perform the materialization inplace.

Returns:
    Snowpark pandas {object_name} or None
        Cached Snowpark pandas {object_name} or None if ``inplace=True``.

Note:
    - The temporary table produced by this method lasts for the duration of the session.

Examples:
{examples}
"""

cache_result_examples = """
Let's make a {object_name} using a computationally expensive operation, e.g.:

>>> {object_var_name} = {object_creation_call}

Due to Snowpark pandas' lazy evaluation paradigm, every time this {object_name} is used, it will be recomputed -
causing every subsequent operation on this {object_name} to re-perform the 30 unions required to produce it.
This makes subsequent operations more expensive. The `cache_result` API can be used to persist the
{object_name} to a temporary table for the duration of the session - replacing the nested 30 unions with a single
read from a table.

>>> new_{object_var_name} = {object_var_name}.cache_result()
>>> import numpy as np
>>> np.all((new_{object_var_name} == {object_var_name}).values)
True
>>> {object_var_name}.reset_index(drop=True, inplace=True)  # Slower
>>> new_{object_var_name}.reset_index(drop=True, inplace=True)  # Faster
"""


def add_cache_result_docstring(func: Fn) -> Fn:
    """
    Decorator to add the docstring to a ``cache_result`` method.
    """
    # In this case, we are adding the docstring to Series.cache_result.
    if "series" in func.__module__:
        object_name = "Series"
        examples_portion = cache_result_examples.format(
            object_name=object_name,
            object_var_name="series",
            object_creation_call="pd.concat([pd.Series([i]) for i in range(30)])",
        )
    else:
        object_name = "DataFrame"
        examples_portion = cache_result_examples.format(
            object_name=object_name,
            object_var_name="df",
            object_creation_call="pd.concat([pd.DataFrame([range(i, i+5)]) for i in range(0, 150, 5)])",
        )
    func.__doc__ = cache_result_docstring.format(
        object_name=object_name, examples=examples_portion
    )
    return func
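The pattern used by ``add_cache_result_docstring`` is a plain format-and-assign: fill a module-level template and write the result to ``func.__doc__``. A standalone sketch of that pattern (hypothetical template text and names, no Snowflake imports needed):

```python
# Hypothetical docstring template; the real module selects the object name
# from func.__module__ and splices in a larger examples template.
TEMPLATE = (
    "Persists the current Snowpark pandas {object_name} to a temporary table.\n"
    "\n"
    "Examples:\n"
    "{examples}\n"
)

def add_docstring(object_name, examples):
    # Decorator factory: captures the template parameters, then
    # assigns the rendered docstring to the decorated function.
    def decorator(func):
        func.__doc__ = TEMPLATE.format(object_name=object_name, examples=examples)
        return func
    return decorator

@add_docstring("DataFrame", ">>> df = df.cache_result()")
def cache_result(self, inplace=False):
    pass

assert "DataFrame" in cache_result.__doc__
```

Because the decorator only touches ``__doc__`` and returns the function unchanged, it composes freely with other decorators and leaves runtime behavior untouched.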