-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[python-package] do not copy column-major numpy arrays when creating Dataset from list of arrays #6773
base: master
Are you sure you want to change the base?
Conversation
…Dataset from list of arrays
This is a breaking change in the C API, so I can maybe split this into two PRs, one making the change to the C API only and mark it as a breaking change, and then implement the logic in the python package and mark it as efficiency. WDYT @StrikerRUS @jameslamb? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this. I think we should NOT break the ABI just for this purpose. What do you think about introducing a new function in the C API like LGBM_DatasetCreateFromMatsMixedLayouts()
?
Most of the implementation of LGBM_DatasetCreateFromMats()
could be moved into this function, and then LGBM_DatasetCreateFromMats()
could internally construct a vector of is_row_major
and call this new function.
You will probably have a better recommendation for the name, but what do you think about this general idea?
In my opinion it's worth growing the C API a little bit in exchange for not breaking other projects that link against lib_lightgbm
. I found a few of those on GitHub (with this search):
lightgbm-rs
(code link)feedzai-openml-java
(code link)lightgbm4j
(code link)LightGBM.NET
(code link)
And we can use the docs on the function to clearly describe what the difference is.
Even if we do decide to keep this as a breaking change, I'd be ok with it being one PR. The changes are so tightly related to each other and the total size of code changes is fairly small. I think it's ok to have a single one under the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much, LGTM!
I think that one breaking PR is OK. We already have some breaking changes for next release , so this tiny breakage shouldn't be the only problem for other packages maintainers. They will already have to check Breaking section of changelog carefully, and changing int
to int[]
shouldn't be a big headache.
Ok, let's leave it as an ABI-breaking change. We are not making ABI stability guarantees here yet anyway. I do think we should consider that at some point... maybe as part of #6774, we could start using the |
Co-authored-by: Nikita Titov <[email protected]>
Following the suggestion in #6751 (comment) this changes the signature of the C API
LGBM_DatasetCreateFromMats
function to take an array of integers asis_row_major
, indicating the layout of each matrix. Also updates thelightgbm.Dataset.__init_from_list_np2d
method to provide this setting and avoid copying the arrays in the list.