Looking through the source code, it looks like some datasets are converted to and from pandas when they are sent to and received from a backend -- for example in the pyspark `execute`, or in the polars `execute` (ibis/ibis/backends/polars/__init__.py, lines 417 to 425 in 2365e10). Isn't it incredibly inefficient to try to shove a potentially massive pyspark (or similar distributed) dataset into an in-memory pandas table? Am I missing something here?
Replies: 1 comment
Hey @jettdc! So, short answer is "No." Long answer:

`execute` is a bit of a holdover from previous versions, but we didn't want to remove it entirely since it would break just about every user's existing Ibis code. `execute` today is equivalent to the `to_pandas` method on an expression -- there's also a `to_pyarrow` option for most, if not all, backends.

It would be very inefficient to shove a massive pyspark dataset into an in-memory pandas dataframe. We envision calling `to_pandas` and `to_pyarrow` as the final step on an expression. Unless that expression is equivalent to `SELECT * FROM ...`, the various filters and predicates in the expression should reduce the size of the output considerably.

Since we execute all of the expressions directly on the given backend, those should be quite fast, and it's only displaying output to the user that incurs the pandas conversion penalty -- and that can also be very quick. For backends that expose native arrow conversion capabilities, we also have `to_pyarrow`.

Hopefully that clarifies things a bit, happy to explain more if I've missed something.