Replies: 2 comments
-
@cfkstat thanks for your suggestion. Currently we do not support Polars input for the sklearn estimators. Note, however, that you may use the core API. Our website is evolving, and soon we'll have a lot more resources on how to use the efficient core API.
-
Thank you, that’s an interesting suggestion @cfkstat, and we’ll definitely look into it! In the meantime, let me provide some context about Khiops and how it’s designed to handle large-scale data processing efficiently.

Khiops has been developed and refined for over 20 years, with a strong focus on natively scalable processing of large datasets, particularly multi-table ones. It was built to handle the massive multi-table use cases we face at Orange as a telecommunications company, where user logs generate immense volumes of data. So, in a nutshell: letting Khiops handle reading and writing (via the core API, for example) ensures efficient execution. As @folmos-at-orange mentioned, while the khiops-python sklearn wrapper is ideal for discovery, it doesn’t fully showcase the performance of khiops-core. You can read more about how Khiops handles massive datasets, even out-of-core (see hardware adaptation). We will soon document how to leverage the performance of the core API and how to move to production with Khiops, which we have been doing at Orange for years.

Khiops also supports direct data access via S3 and GCS drivers, delivering high performance for cloud-stored datasets. Additionally, it is optimized for Kubernetes, enabling distributed computation and leveraging native file I/O to maximize scalability.

Let me finish with a short example: in a project at Orange, we use Khiops in production to process billions of user records (hundreds of GB), spread across five tabular files (user info, calls in/out, SMS in/out), each with hundreds of columns. Training a model, including the time-consuming automated feature engineering that generates 10,000 aggregates, takes just one hour on a single machine (it would be even faster on a K8s cluster).

Benchmarking against Polars on such use cases could be an interesting exercise to assess whether investing in new developments is worthwhile, since adopting new data formats requires a significant time investment. With technologies evolving rapidly, and sometimes proving short-lived, it’s not always straightforward to commit resources to integrating them.
-
It would be worth considering support for Polars data: writing files, sorting, and generally more efficient execution than pandas. When performing calculations on millions of rows, too much time is spent on data reading and writing.