Investigate polars and numba in aggregations #370
Comments
This is an interesting idea. It would be good to do a quick test with the 40-yr retro that is in TEEHR-HUB and see how the performance compares to PySpark. I assume it would be a lot faster, but it would be good to understand by how much.
For some reason I thought Polars did not support these multi-column aggregates, but I guess I was wrong.
I thought the same thing and just kind of stumbled on this. The multiple-column approach is listed in the UDF section of the rust docs and not in the aggregation section, so it's easy to miss; I'm also not sure when it was added.

I did some query testing in teehr-hub (large instance), calculating KGE grouping by
- e1_camels_daily_streamflow
- e2_camels_hourly_streamflow
- 40-yr retrospective (pointing polars to playground/mdenno/40-yr-retrospective/dataset/joined/**/*.parquet)

Polars is generally faster on the smaller datasets and seems to rely more on memory; PySpark was more efficient on the 40-yr retrospective. I had to set

Also, using a numba function for KGE in pyspark did not improve performance (it was actually a bit slower); it could be that the KGE function is already vectorized through numpy.
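A rough sketch of this kind of polars query (the column names, group-by keys, and the expression-based KGE below are illustrative placeholders, not the exact code that was benchmarked):

```python
import polars as pl

# KGE as a single polars expression: r = correlation, beta = bias ratio,
# gamma = variability ratio. Column names are placeholders.
def kge_expr(obs: str, sim: str) -> pl.Expr:
    r = pl.corr(pl.col(sim), pl.col(obs))
    beta = pl.col(sim).mean() / pl.col(obs).mean()
    gamma = pl.col(sim).std() / pl.col(obs).std()
    return 1 - ((r - 1) ** 2 + (beta - 1) ** 2 + (gamma - 1) ** 2).sqrt()

# Lazily scan the joined parquet files and aggregate per group.
result = (
    pl.scan_parquet(
        "playground/mdenno/40-yr-retrospective/dataset/joined/**/*.parquet"
    )
    .group_by("configuration_name", "primary_location_id")
    .agg(kge_expr("primary_value", "secondary_value").alias("kge"))
    .collect()
)
```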
This is great. Have you given any thought to if/how we could support multiple backends? It would be really awesome to have a faster option. I'm assuming polars also supports grouping and filtering, so it seems like all the pieces are there. I think pandera supports polars too.
I've had good luck with polars in the past and found it to be far more effective than pandas. I did notice that it seemed to have a lot of 'gotchas' - not necessarily things that break functionality, but there are 'wrong' ways to do things. Breaking up operations into easily readable chunks, using your own functions, and mixing in other libraries (aside from numba) can end up slowing you down. An implementation of KGE I was last toying around with ended up looking like:

```python
beta_label: str = "beta"
beta_column = polars.col(beta_label)
gamma_label: str = "gamma"
gamma_column = polars.col(gamma_label)

kge: polars.Expr = (
    1 - (
        (CORRELATION_COLUMN - 1) ** 2
        + (beta_column - 1) ** 2
        + (gamma_column - 1) ** 2
    ).sqrt()
).alias(identifier)

intermediary_data: polars.DataFrame = data.with_columns(
    (MEAN_SIMULATION_COLUMN / MEAN_CONTROL_COLUMN).alias(beta_label),
    (SIMULATION_STANDARD_DEVIATION_COLUMN / CONTROL_STANDARD_DEVIATION_COLUMN).alias(gamma_label),
)
intermediary_data = intermediary_data.with_columns(
    kge
)
```
It very much wants you to use

I forgot what the gotchas were with numba, but I remember running into issues with data types not being compatible. That might have been due to me not being used to it. I performed some experiments on the retrospective, and the bottleneck, by far, was data retrieval.
Some research notes: We may have already explored this, but aggregations in polars can take UDFs that access multiple dataframe columns. It can perform operations lazily and supports method chaining. It seems like it could fit into our current framework fairly easily if we wanted an alternate backend to pyspark where horizontal scaling is not needed (API?).

We can also take advantage of numba decorators in the metric functions to potentially improve query performance (in our current framework and/or with polars).

I haven't done any benchmarking, but wanted to add some notes here.
Example:
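A minimal sketch of what a multi-column UDF in a polars aggregation can look like (the column names and parquet path are illustrative, and the exact UDF entry point has shifted across polars versions, e.g. from pl.apply to pl.map_groups):

```python
import numpy as np
import polars as pl

def kge(columns: list[pl.Series]) -> float:
    # The UDF receives one Series per requested column for each group.
    sim = columns[0].to_numpy()
    obs = columns[1].to_numpy()
    r = np.corrcoef(sim, obs)[0, 1]
    beta = sim.mean() / obs.mean()
    gamma = sim.std() / obs.std()
    return float(1 - np.sqrt((r - 1) ** 2 + (beta - 1) ** 2 + (gamma - 1) ** 2))

# Lazy query with method chaining; nothing executes until collect().
result = (
    pl.scan_parquet("dataset/joined/**/*.parquet")
    .group_by("configuration_name")
    .agg(
        pl.map_groups(
            exprs=["secondary_value", "primary_value"],
            function=kge,
            return_dtype=pl.Float64,
        ).alias("kge")
    )
    .collect()
)
```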
We can use a wrapper around the functions to implement numba in our current pyspark framework:
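Something along these lines (a sketch only; the UDF signature, column names, and grouped-aggregate usage are illustrative, and the correlation is written out by hand to stay within numba-friendly numpy):

```python
import numpy as np
import pandas as pd
from numba import njit
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

@njit(cache=True)
def _kge_numba(sim: np.ndarray, obs: np.ndarray) -> float:
    # Pearson correlation computed explicitly so the whole kernel compiles in nopython mode.
    sim_mean = sim.mean()
    obs_mean = obs.mean()
    cov = np.mean((sim - sim_mean) * (obs - obs_mean))
    r = cov / (sim.std() * obs.std())
    beta = sim_mean / obs_mean
    gamma = sim.std() / obs.std()
    return 1.0 - np.sqrt((r - 1.0) ** 2 + (beta - 1.0) ** 2 + (gamma - 1.0) ** 2)

@F.pandas_udf(DoubleType())
def kge_udf(sim: pd.Series, obs: pd.Series) -> float:
    # Grouped-aggregate pandas UDF: spark passes each group's columns as pandas
    # Series and the numba kernel operates on the underlying numpy arrays.
    return float(_kge_numba(sim.to_numpy(), obs.to_numpy()))

# Usage, e.g.:
# joined.groupBy("configuration_name").agg(
#     kge_udf("secondary_value", "primary_value").alias("kge")
# )
```

Note that the first call pays the JIT-compilation cost, and since the plain numpy version is already vectorized the gain may be small, consistent with the benchmarking note above.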