I recently hit a case where I had to process a Pandas dataframe with 70M rows and 5 simple columns, using window functions and GROUP BY operations.

After saving the data to CSV / Parquet and querying the file, chDB computed the results in 4-5 seconds; running the same queries over the in-memory Arrow data took close to 30 seconds.

Steps to reproduce are simple: create a dataframe with random data in 5 columns (id, time, val1, val2, val3) for 70M rows and run complex GROUP BY / WINDOW queries against it in memory, then save the dataframe to a file and run the same queries over the file. The file-based queries are significantly faster; see the sketch below.
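Here is a minimal sketch of the reproduction. The exact queries from my workload aren't shown above, so the GROUP BY + window query below is just a stand-in of similar shape, and it assumes a chdb version that supports the `Python()` table function:

```python
import time

import chdb
import numpy as np
import pandas as pd

# Test data: 70M rows, the five columns described above.
n = 70_000_000
df = pd.DataFrame({
    "id": np.random.randint(0, 1_000, size=n),
    "time": np.arange(n),
    "val1": np.random.rand(n),
    "val2": np.random.rand(n),
    "val3": np.random.rand(n),
})

# Stand-in GROUP BY + window query of similar shape to the workload.
query = """
SELECT id, mean_val1,
       rank() OVER (ORDER BY sum_val2 DESC) AS rnk
FROM (
    SELECT id, avg(val1) AS mean_val1, sum(val2) AS sum_val2
    FROM {src}
    GROUP BY id
)
ORDER BY rnk
"""

# In-memory path: chdb reads the dataframe directly via the Python()
# table function.
t0 = time.time()
chdb.query(query.format(src="Python(df)"))
print(f"in-memory dataframe: {time.time() - t0:.1f}s")

# File path: persist to Parquet, then run the same query over the file.
df.to_parquet("data.parquet")
t0 = time.time()
chdb.query(query.format(src="file('data.parquet', Parquet)"))
print(f"Parquet file:        {time.time() - t0:.1f}s")
```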
Shouldn't working with Arrow dataframes be faster, given that accessing memory is faster than accessing disk?