Looking through the source code, it looks like some datasets are converted to and from pandas when they are sent to and received from a backend -- for example in the pyspark `execute`, or in the polars `execute` (ibis/ibis/backends/polars/__init__.py, lines 417 to 425 in 2365e10). Isn't it incredibly inefficient to try to shove a potentially massive pyspark (or similar distributed) dataset into an in-memory pandas table? Am I missing something here?
Replies: 1 comment
Hey @jettdc! So, short answer is "No." Long answer:

`execute` is a bit of a holdover from previous versions, but we didn't want to remove it entirely since it would break just about every user's existing Ibis code. `execute` today is equivalent to the `to_pandas` method on an expression -- there's also a `to_pyarrow` option for most, if not all, backends.

It would be very inefficient to shove a massive pyspark dataset into an in-memory pandas dataframe. We envision calling `to_pandas` and `to_pyarrow` as the final step on an expression. Unless that expression is equivalent to `SELECT * FROM ...`, the various filters and predicates in the expression should reduce the size of the output considerably.

Since we execute all of the expressions directly on the given backend, those should be quite fast, and it's only displaying output to the user that incurs the pandas conversion penalty -- and that can also be very quick. For backends that expose native arrow conversion capabilities, we also have `to_pyarrow`.

Hopefully that clarifies things a bit, happy to explain more if I've missed something.