How to train over input that is >> larger than RAM? #138

Open
Moelf opened this issue Mar 9, 2022 · 2 comments

Moelf (Contributor) commented Mar 9, 2022

I wonder if there's a way to iteratively train over chunks of input data (or even row by row), manually. We deal with data much larger than RAM that also doesn't fit the table interface -- in short, each "row" can contain many variables, some of which are variable-length vectors, so we need to compute the input to EvoTrees on the fly.

Moelf changed the title from "How to train over input thatis >> RAM?" to "How to train over input that is >> larger than RAM?" on Mar 9, 2022
jeremiedb (Member) commented

Support for out-of-memory data is something I'd like to see added.

Do you have constraints with regard to the storage format of the data? Off the top of my head, I'd think of working out of a DTable: https://juliaparallel.github.io/Dagger.jl/stable/dtable/ and perhaps integrating with a DataLoader interface if needed. I understand your source data is in another format, yet I can hardly imagine a totally arbitrary data loader, as the boosted-trees algorithm assumes that all variables/features are consistently available for all data points.

Would it be reasonable to perform a preprocessing step on your data to bring it into a more structured form like a DTable?
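
For concreteness, a minimal sketch of what the DTable route could look like, assuming the constructors documented for Dagger.jl's DTable at the time (it has since moved to DTables.jl); the file paths, the CSV loader, and the column names are placeholders:

```julia
using Dagger   # provides DTable
using CSV      # only needed for the file-backed constructor below

# Partition an in-memory table into chunks of 10_000 rows.
dt = DTable((x1 = rand(100_000), x2 = rand(100_000)), 10_000)

# Or build the table lazily from files, one partition per file; the loader
# can be any function returning a Tables.jl-compatible object.
files = ["part1.csv", "part2.csv"]          # placeholder paths
dt_files = DTable(CSV.File, files)

# Row-wise transformations run partition by partition, so the full dataset
# never has to be materialized at once.
derived = map(r -> (y = r.x1 + r.x2,), dt)
fetch(derived)                              # materialize only when needed
```

A DataLoader-style wrapper could then iterate over the partitions of such a table and hand each chunk to the trainer, if/when out-of-memory training is supported.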

Moelf (Contributor, Author) commented Mar 11, 2022

> Do you have constraints with regard to the storage format of the data?

Yes, it's CERN ROOT, and we wrote https://github.com/tamasgal/UnROOT.jl from scratch to read it. Physically (on disk), it's a bit like Apache Parquet.

> a DataLoader interface

Yeah, I don't think I can just make a DTable, because the variables I'd like to use for the BDT are not available in the file, and it takes non-trivial selection/transformation to compute them on the fly. (But we still need to make them on the fly; staging intermediate files is just too cumbersome.)
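
As an illustration of that on-the-fly path, here is a rough sketch of streaming events with UnROOT.jl and writing the derived features out in fixed-size partitions that a DTable (or a plain loop) could later consume. The tree name, branch names, and feature definitions are made-up placeholders; the real selection/transformation would go in their place:

```julia
using UnROOT, Arrow

# Stream a ROOT file event by event, compute flat features on the fly,
# and write them out as Arrow partitions of `chunksize` rows each.
function write_feature_partitions(rootpath, outdir; chunksize = 100_000)
    f = ROOTFile(rootpath)
    tree = LazyTree(f, "Events", ["Jet_pt", "Jet_eta"])  # placeholder tree/branch names
    x1, x2 = Float32[], Float32[]
    part = 0
    for evt in tree
        # Example derived features from variable-length (jagged) branches.
        push!(x1, isempty(evt.Jet_pt) ? 0f0 : Float32(maximum(evt.Jet_pt)))
        push!(x2, Float32(length(evt.Jet_eta)))
        if length(x1) == chunksize
            part += 1
            Arrow.write(joinpath(outdir, "part$part.arrow"), (x1 = x1, x2 = x2))
            empty!(x1); empty!(x2)
        end
    end
    if !isempty(x1)  # flush the final partial chunk
        part += 1
        Arrow.write(joinpath(outdir, "part$part.arrow"), (x1 = x1, x2 = x2))
    end
    return part
end
```

The resulting partitions could then be pulled into a DTable with something like `DTable(Arrow.Table, readdir(outdir; join = true))`, or iterated directly by whatever chunk-wise interface an out-of-memory trainer ends up exposing.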
