Replies: 2 comments
-
@cfkstat thanks for your suggestion. Currently we do not support Polars input for the sklearn estimators. Note, however, that you may use the core API. Our website is evolving, and soon we'll have a lot more resources on how to use the efficient core API.
-
Thank you, that’s an interesting suggestion @cfkstat, and we’ll definitely look into it! In the meantime, let me provide some context about Khiops and how it’s designed to handle large-scale data processing efficiently.

Khiops has been developed and refined for over 20 years, with a strong focus on natively scalable processing of large datasets, particularly multi-table ones. It was built to handle the massive multi-table use cases we face at Orange as a telecommunications company, where user logs generate immense volumes of data. So, in a nutshell: letting Khiops handle reading and writing (via the core API, for example) ensures efficient execution. As @folmos-at-orange mentioned, while the khiops-python sklearn wrapper is ideal for discovery, it doesn’t fully showcase the performance of khiops-core. You can read more about how Khiops handles massive datasets, even out-of-core (see hardware adaptation). We will soon document how to leverage the performance of the core API and how to move to production with Khiops, which we have been doing at Orange for years.

Khiops also supports direct data access via S3 and GCS drivers, delivering high performance for cloud-stored datasets. Additionally, it is optimized for Kubernetes, enabling distributed computation and leveraging native file I/O to maximize scalability.

Let me finish with a short example: in a project at Orange, we use Khiops in production to process billions of user records (hundreds of GB), spread across five tabular files (user info, calls in/out, SMS in/out), each with hundreds of columns. Training a model, including the time-consuming automated feature engineering that generates 10,000 aggregates, takes just one hour on a single machine (it would be even faster on a K8s cluster).

Benchmarking against Polars on such use cases could be an interesting exercise to assess whether investing in new developments is worthwhile, since adopting new data formats requires a significant time investment. With technologies evolving rapidly, and sometimes proving short-lived, it’s not always straightforward to commit resources to integrating them.
-
It would be worth considering support for Polars data: writing files, sorting, and generally more efficient execution than pandas. When performing calculations on millions of rows, too much time is spent on data reading and writing.