How to train over input that is >> larger than RAM? #138

Open
Moelf opened this issue Mar 9, 2022 · 2 comments

Moelf (Contributor) commented Mar 9, 2022

I wonder if there's a way to iteratively train over chunks of input data (or even row by row), manually. We deal with data much larger than RAM that also doesn't fit the table interface -- in short, each "row" can contain many variables, some of which are variable-length vectors, so we need to compute the input to EvoTrees on the fly.

Moelf changed the title from "How to train over input thatis >> RAM?" to "How to train over input that is >> larger than RAM?" on Mar 9, 2022
jeremiedb (Member) commented

Support for out-of-memory data is something I'd like to see added.

Do you have constraints with regard to the storage format of the data? Off the top of my head, I'd think of working out of a DTable: https://juliaparallel.github.io/Dagger.jl/stable/dtable/ and perhaps integrating with a DataLoader interface if needed. I understand your source data is in another format, yet I can hardly imagine a totally arbitrary data loader, as the boosted-trees algorithm assumes that all variables/features are consistently available for all data points.

Would it be reasonable to perform a preprocessing step on your data to bring it into a more structured form like a DTable?
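
For concreteness, a minimal sketch of what the DTable route could look like, assuming the constructors documented for Dagger.jl's DTable at the time (it has since moved to DTables.jl); the file paths, the CSV loader, and the column names are placeholders:

```julia
using Dagger   # provides DTable
using CSV      # only needed for the file-backed constructor below

# Partition an in-memory table into chunks of 10_000 rows.
dt = DTable((x1 = rand(100_000), x2 = rand(100_000)), 10_000)

# Or build the table lazily from files, one partition per file; the loader
# can be any function returning a Tables.jl-compatible object.
files = ["part1.csv", "part2.csv"]          # placeholder paths
dt_files = DTable(CSV.File, files)

# Row-wise transformations run partition by partition, so the full dataset
# never has to be materialized at once.
derived = map(r -> (y = r.x1 + r.x2,), dt)
fetch(derived)                              # materialize only when needed
```

A DataLoader-style wrapper could then iterate over the partitions of such a table and hand each chunk to the trainer, if/when out-of-memory training is supported.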

Moelf (Contributor, Author) commented Mar 11, 2022

> Do you have constraints with regard to the storage format of the data?

Yes, it's CERN ROOT, and we wrote https://github.com/tamasgal/UnROOT.jl from scratch to read it. Physically (on disk), it's a bit like Apache Parquet.

> a DataLoader interface

Yeah, I don't think I can just make a DTable, because the variables I'd like to use for the BDT are not available in the file, and it takes non-trivial selection/transformation to compute them on the fly. (But we still need to make them on the fly; staging intermediate files is just too cumbersome.)
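
As an illustration of that on-the-fly path, here is a rough sketch of streaming events with UnROOT.jl and writing the derived features out in fixed-size partitions that a DTable (or a plain loop) could later consume. The tree name, branch names, and feature definitions are made-up placeholders; the real selection/transformation would go in their place:

```julia
using UnROOT, Arrow

# Stream a ROOT file event by event, compute flat features on the fly,
# and write them out as Arrow partitions of `chunksize` rows each.
function write_feature_partitions(rootpath, outdir; chunksize = 100_000)
    f = ROOTFile(rootpath)
    tree = LazyTree(f, "Events", ["Jet_pt", "Jet_eta"])  # placeholder tree/branch names
    x1, x2 = Float32[], Float32[]
    part = 0
    for evt in tree
        # Example derived features from variable-length (jagged) branches.
        push!(x1, isempty(evt.Jet_pt) ? 0f0 : Float32(maximum(evt.Jet_pt)))
        push!(x2, Float32(length(evt.Jet_eta)))
        if length(x1) == chunksize
            part += 1
            Arrow.write(joinpath(outdir, "part$part.arrow"), (x1 = x1, x2 = x2))
            empty!(x1); empty!(x2)
        end
    end
    if !isempty(x1)  # flush the final partial chunk
        part += 1
        Arrow.write(joinpath(outdir, "part$part.arrow"), (x1 = x1, x2 = x2))
    end
    return part
end
```

The resulting partitions could then be pulled into a DTable with something like `DTable(Arrow.Table, readdir(outdir; join = true))`, or iterated directly by whatever chunk-wise interface an out-of-memory trainer ends up exposing.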
