Create Dataset from Arrow format #3369
Comments
Does this mean that, right now, one can use MMLSpark (https://github.com/Azure/mmlspark) for Arrow + LightGBM (similarly to Parquet, #1286 (comment))?
Possibly! But that definitely does not satisfy this feature. Spark is a heavy dependency that many users are unlikely to have access to.
I think @shiyu1994 can help with this. He recently had some ideas for refining the dataset class.
I think it would be the other way round. If LightGBM implemented datasetFromArrow, it would probably be useful for speeding up / improving efficiency within MMLSpark.
Closed in favor of being in #2302. We decided to keep all feature requests in one place. Contributions of this feature are welcome! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing this feature.
For reference: Parquet data reader implementation in XGBoost with optional Arrow dependency at compile time.
Linking the eventual XGBoost implementation: dmlc/xgboost#7512
Sorry, this was locked accidentally. Just unlocked it.
Summary
Apache Arrow is an open source columnar in-memory storage format that's well-suited to tabular data. It offers efficient data loading from files or other input streams, and zero-copy data sharing between languages.
Motivation
I think that this feature could allow for faster data loading, especially from the Parquet and CSV file formats. It would also allow training directly on Arrow tables, so we might be able to avoid some data copying in language wrappers (e.g. converting to a `pandas` data frame or R `data.frame`).
`pyarrow` offers a fast, efficient Parquet reader. I believe that reading from Parquet files directly into Arrow, then being able to efficiently create a LightGBM Dataset from that `pyarrow` table, would allow for faster I/O and better memory efficiency by avoiding the need to ever create a `pandas` data frame: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html.
Description
I'm admittedly not very experienced with C++, so maybe others can expand this description. But basically, I think it would involve adding a `LGBM_DatasetCreateFromArrow` similar to `LGBM_DatasetCreateFromCSV`:
LightGBM/src/c_api.cpp, line 1245 in 82e2ff7
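To make the Python side of this request more concrete, here is a minimal sketch of the Parquet-to-Arrow read path described under Motivation. The helper names (`read_parquet_as_arrow`, `demo`), the `PYARROW_INSTALLED` flag, the file name, and the column names are all hypothetical and only for illustration. The sketch assumes `pyarrow` may not be installed, so it checks for the package first and imports it lazily.

```python
import importlib.util
import os
import tempfile

# Assumption for this sketch: pyarrow is an optional dependency, so we only
# check whether it is installed here and import it lazily where it is needed.
PYARROW_INSTALLED = importlib.util.find_spec("pyarrow") is not None


def read_parquet_as_arrow(path):
    """Read a Parquet file straight into an Arrow table (no pandas involved)."""
    if not PYARROW_INSTALLED:
        raise ImportError("pyarrow is required to read Parquet into Arrow")
    import pyarrow.parquet as pq  # lazy import keeps pyarrow optional

    return pq.read_table(path)


def demo():
    """Round-trip a tiny table through Parquet.

    Returns (num_rows, column_names), or None when pyarrow is unavailable.
    """
    if not PYARROW_INSTALLED:
        return None
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Hypothetical toy data: one feature column and one label column.
    table = pa.table({"feature_1": [0.1, 0.2, 0.3], "label": [0, 1, 0]})
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "train.parquet")
        pq.write_table(table, path)
        loaded = read_parquet_as_arrow(path)

    # Without the requested feature, `loaded` would still need converting
    # (e.g. loaded.to_pandas()) before a LightGBM Dataset could be created.
    # The request is for LightGBM to accept the Arrow table directly.
    return loaded.num_rows, loaded.column_names


if __name__ == "__main__":
    print(demo())
```

The last step, handing the Arrow table to LightGBM, is deliberately left as a comment, since that entry point is what this issue is asking for.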
Arrow is a fairly heavy dependency (and `pyarrow` in Python / `{arrow}` in R, by extension), so an implementation should also explore how to make these optional for users who do not need the Arrow features.
References
There is an in-progress PR to add this feature to XGBoost: dmlc/xgboost#5667
Spark added support for Arrow as a memory representation in pyspark 3 years ago: https://arrow.apache.org/blog/2017/07/26/spark-arrow/.