Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[python-package] Unexpected behaviour if pyarrow is installed, but cffi is not #6782

Open
mlondschien opened this issue Jan 9, 2025 · 3 comments
Labels

Comments

@mlondschien
Copy link
Contributor

I came across this when debugging #6780.

If pyarrow is installed, but cffi is not, this import

from pyarrow.cffi import ffi as arrow_cffi

raises, such that pa_Table is a dummy class and the comparison isinstance(data, pa_Table) returns False, and so does _is_pyarrow_table. If fitting on a lgbm.Dataset(data=pyarrow.Table), we thus don't evalute __init_from_pyarrow_table here
elif _is_pyarrow_table(data):
self.__init_from_pyarrow_table(data, params_str, ref_dataset)
feature_name = data.column_names

but rather jump into the try / except here:
try:
csr = scipy.sparse.csr_matrix(data)
self.__init_from_csr(csr, params_str, ref_dataset)
except BaseException as err:
raise TypeError(f"Cannot initialize Dataset from {type(data).__name__}") from err

This works, but I am not sure whether this is intended behaviour.

One approach to solve this would be to separate the pyarrow and cffi imports in compat.py. However, then, jumping into self.__init_from_pyarrow_table(data, params_str, ref_dataset) fails, as this requires cffi to be installed.

The underlying problem here is that both cffi and pyarrow need to be installed to "natively" fit from a pyarrow.Table. However, this is not communicated to the user. One approach would be to separate as below, add a and CFFI_INSTALLED to the elif _is_pyarrow_table(data): row and raise a warning if pyarrow is installed by cffi is not.

"""pyarrow"""
try:
    import pyarrow.compute as pa_compute
    from pyarrow import Array as pa_Array
    from pyarrow import ChunkedArray as pa_ChunkedArray
    from pyarrow import Table as pa_Table
    from pyarrow import chunked_array as pa_chunked_array
    from pyarrow.types import is_boolean as arrow_is_boolean
    from pyarrow.types import is_floating as arrow_is_floating
    from pyarrow.types import is_integer as arrow_is_integer

    PYARROW_INSTALLED = True
except ImportError:
    PYARROW_INSTALLED = False

    class pa_Array:  # type: ignore
        """Dummy class for pa.Array."""

        def __init__(self, *args: Any, **kwargs: Any):
            pass

    class pa_ChunkedArray:  # type: ignore
        """Dummy class for pa.ChunkedArray."""

        def __init__(self, *args: Any, **kwargs: Any):
            pass

    class pa_Table:  # type: ignore
        """Dummy class for pa.Table."""

        def __init__(self, *args: Any, **kwargs: Any):
            pass

    class pa_compute:  # type: ignore
        """Dummy class for pyarrow.compute."""

        all = None
        equal = None

    pa_chunked_array = None
    arrow_is_boolean = None
    arrow_is_integer = None
    arrow_is_floating = None

"""cffi"""
try:
    from pyarrow.cffi import ffi as arrow_cffi
    CFFI_INSTALLED = True
except ImportError:
    CFFI_INSTALLED = False
    class arrow_cffi:  # type: ignore
        """Dummy class for pyarrow.cffi.ffi."""

        CData = None
        addressof = None
        cast = None
        new = None

    def __init__(self, *args: Any, **kwargs: Any):
        pass
@jameslamb jameslamb changed the title Unexpected behaviour if pyarrow is installed, but cffi is not [pythong-package] Unexpected behaviour if pyarrow is installed, but cffi is not Jan 9, 2025
@jmoralez
Copy link
Collaborator

jmoralez commented Jan 9, 2025

Hey @mlondschien, thanks for using LightGBM.

How did you install the library? The arrow extra takes this into account

arrow = [
"cffi>=1.15.1",
"pyarrow>=6.0.1"
]

@mlondschien
Copy link
Contributor Author

mlondschien commented Jan 9, 2025

I installed pyarrow from conda-forge on an osx-arm64. It appears that cffi is not a runtime dependency of pyarrow (feedstock).

I installed pyarrow, lightgbm, and polars with conda from conda-forge.

conda create -n lgbm
mamba install -y lightgbm, polars, pyarrow

@jameslamb jameslamb changed the title [pythong-package] Unexpected behaviour if pyarrow is installed, but cffi is not [python-package] Unexpected behaviour if pyarrow is installed, but cffi is not Jan 9, 2025
@jameslamb
Copy link
Collaborator

jameslamb commented Jan 11, 2025

Thanks for this report and the thorough investigation!
I agree with you... it's hard to figure this out as a user of the library.

We've been relying on the [arrow] extra and corresponding documentation (as @jmoralez mentioned) to communicate the fact that you need both pyarrow and cffi, but as you're saying, that doesn't help for conda-forge::lightgbm, where pyarrow and cffi are (intentionally!) not required dependencies.

I'm supportive of your proposal to split up the imports in compat.py so that lightgbm can provide a more informative message.

However, instead of a warning, I think cases like these:

if not PYARROW_INSTALLED:
raise LightGBMError("Cannot predict from Arrow without `pyarrow` installed.")

if not PYARROW_INSTALLED:
raise LightGBMError("Cannot init dataframe from Arrow without `pyarrow` installed.")

Should still be exceptions, but like this:

 if not (CFFI_INSTALLED and PYARROW_INSTALLED): 
     raise LightGBMError("Cannot predict on Arrow data without 'cffi' and 'pyarrow' installed.") 

In the case you mentioned (Dataset construction where you have pyarrow installed but not cffi, and pass a pyarrow.Table), this condition would be True:

elif _is_pyarrow_table(data):

So you'd end up with this proposed more-informative error instead of this one:

except BaseException as err:
raise TypeError(f"Cannot initialize Dataset from {type(data).__name__}") from err

@jmoralez what do you think about this idea?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants