CHDB is significantly slower on Arrow tables (in-memory) than with CSV / Parquet #195

Closed
ilyanoskov opened this issue Feb 5, 2024 · 2 comments · Fixed by #211
Labels
Arrow Apache Arrow support

Comments

@ilyanoskov

I recently had to process a Pandas dataframe with 70M rows and 5 simple columns, using window functions and GROUP BY operations.

After saving this data to CSV / Parquet and querying the file, CHDB computed the results in 4-5 seconds. Running the same queries over Arrow took close to 30 seconds.

Steps to reproduce are simple: create a dataframe with random data over 5 columns (id, time, val1, val2, val3) for 70M rows and run complex GROUP BY / WINDOW operations on it. Then save the dataframe to a file and run the same queries over the file. You will see that querying the file is significantly faster.
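
The steps above could be sketched roughly as follows. This is not the original script: the value ranges, the seed, the exact query, the file name `data.parquet`, and the helper names `make_frame`, `build_query`, `timed` and `run_file_benchmark` are all assumptions chosen for illustration. Only the column names and the 70M row count come from the description.

```python
import time

import numpy as np
import pandas as pd


def make_frame(n_rows: int, seed: int = 0) -> pd.DataFrame:
    """Random data over the five columns named in the report (ranges are assumed)."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame(
        {
            "id": rng.integers(0, 1000, size=n_rows),
            "time": np.arange(n_rows, dtype="int64"),
            "val1": rng.random(n_rows),
            "val2": rng.random(n_rows),
            "val3": rng.random(n_rows),
        }
    )


def build_query(source: str) -> str:
    """A hypothetical window + GROUP BY query; `source` is the table expression."""
    return f"""
    SELECT id, avg(val1) AS avg_val1, max(running_val2) AS max_running_val2
    FROM (
        SELECT id, val1,
               sum(val2) OVER (PARTITION BY id ORDER BY time) AS running_val2
        FROM {source}
    )
    GROUP BY id
    ORDER BY id
    """


def timed(fn):
    """Return (result, elapsed seconds) for a zero-argument callable."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def run_file_benchmark(n_rows: int = 70_000_000, path: str = "data.parquet"):
    """File-based side of the comparison: save to Parquet, then query the file."""
    import chdb  # imported lazily so the helpers above work without chdb installed

    make_frame(n_rows).to_parquet(path)  # needs a Parquet engine such as pyarrow
    sql = build_query(f"file('{path}', Parquet)")
    return timed(lambda: chdb.query(sql, "CSV"))
```

Calling `run_file_benchmark()` gives the file-based timing; the in-memory side would run the same `build_query(...)` text against the Arrow table instead of the `file(...)` source. Note that 70M rows of this shape need several GB of RAM, so a smaller `n_rows` is a reasonable first check.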

I would imagine that working with Arrow dataframes would be faster, since accessing memory is faster than accessing disk?

@auxten
Member

auxten commented Feb 6, 2024

This is being discussed in #187. I'm working on it.

@auxten auxten added the Arrow Apache Arrow support label Mar 11, 2024
@auxten auxten added this to chDB Q2 Apr 17, 2024
@auxten auxten moved this to In Progress in chDB Q2 Apr 17, 2024
@auxten auxten moved this from In Progress to Todo in chDB Q2 Apr 17, 2024
@auxten
Member

auxten commented Jul 24, 2024

A faster path for queries on Arrow tables is available as of v2.0.0b1.
Example: https://github.com/chdb-io/chdb/blob/main/tests/test_query_py.py#L94
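
For readers who do not want to open the linked test, here is a minimal sketch of what querying an in-memory Arrow table might look like. It assumes the `Python(<variable name>)` table function and the `"DataFrame"` output format of chDB 2.x, and it assumes that a local variable is visible to the query. The helper name `query_in_memory`, the variable name `arrow_table` and the query itself are illustrative and not taken from the test file.

```python
import pandas as pd

# Hypothetical query; `arrow_table` must match the Python variable name used below.
ARROW_SQL = (
    "SELECT id, sum(val1) AS total "
    "FROM Python(arrow_table) "
    "GROUP BY id ORDER BY id"
)


def query_in_memory(df: pd.DataFrame):
    """Convert a dataframe to an Arrow table and query it without touching disk."""
    import chdb  # lazy imports: only needed when the query actually runs
    import pyarrow as pa

    # Assumption: chDB resolves `arrow_table` by name from the calling Python scope.
    arrow_table = pa.Table.from_pandas(df)
    return chdb.query(ARROW_SQL, "DataFrame")
```

If the name lookup behaves differently in a given chDB version, defining `arrow_table` at module level is the simpler variant to try.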

@auxten auxten moved this from Todo to In Progress in chDB Q2 Jul 24, 2024
@auxten auxten linked a pull request Aug 30, 2024 that will close this issue
@auxten auxten closed this as completed Aug 30, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in chDB Q2 Aug 30, 2024