Apache Arrow Support #4749

usbrandon · 2024-12-29T20:06:47Z

usbrandon
Dec 29, 2024
Collaborator

Greetings. For renewed consideration of #2556: could we evaluate using Apache Arrow as the bridge between the JVM and the Python process in the CPython transform?

More broadly, there are probably connectivity efficiency gains to be had in connecting to more modern data sources that support ADBC:
https://arrow.apache.org/docs/format/ADBC.html

As of Feb 2024, Snowflake, BigQuery, Postgres, SQLite, and Pandas were already transporting data using Arrow (think CPython, 30x to 80x faster):
https://voltrondata.com/blog/go-inside-the-arrow-database-connectivity-roadmap-background-and-community?utm_source=chatgpt.com

Perhaps changes at the boundaries alone would make it easy to take incremental steps towards Arrow. Being a pragmatist, I would prioritize the enhancement for the CPython step first, to get rid of the "Server" process it spins up, which can die or hang if any data in the dataframe or in the variables passed between Hop and the Python process running outside the JVM is not escaped. This is a matter of stability, not just data transport. A company that will not be named used py4j, which came out before Arrow. It uses sockets and ports to facilitate transfers between the JVM and Python processes, but I feel that Arrow will be more portable and is not such an outlier project as VFS etc. I sense greater longevity and progress with Apache Arrow.
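
To illustrate the stability point, here is a minimal sketch of why an unescaped, delimiter-based text exchange is fragile while a length-prefixed binary exchange is not. This is not Hop's actual protocol and not Arrow's wire format; `ROWS`, `naive_encode`, `framed_encode`, and the other names are hypothetical, chosen only for this example. Arrow's IPC format similarly carries explicit sizes and binary buffers rather than delimited text, which is why it should not be sensitive to the content of the data.

```python
import io
import struct

# Hypothetical row values; the second one contains the delimiter (a newline).
ROWS = ["plain value", "line one\nline two", "last"]


def naive_encode(rows):
    # Fragile approach: newline-delimited text with no escaping.
    return "\n".join(rows).encode("utf-8")


def naive_decode(payload):
    return payload.decode("utf-8").split("\n")


def framed_encode(rows):
    # Each value is a 4-byte big-endian length followed by the raw bytes,
    # so the content never needs escaping.
    out = io.BytesIO()
    for row in rows:
        data = row.encode("utf-8")
        out.write(struct.pack(">I", len(data)))
        out.write(data)
    return out.getvalue()


def framed_decode(payload):
    rows = []
    stream = io.BytesIO(payload)
    while True:
        header = stream.read(4)
        if not header:
            break  # end of payload
        (length,) = struct.unpack(">I", header)
        rows.append(stream.read(length).decode("utf-8"))
    return rows


# The naive round trip splits the embedded newline into an extra row;
# the framed round trip returns the rows unchanged.
naive_result = naive_decode(naive_encode(ROWS))
framed_result = framed_decode(framed_encode(ROWS))
```

With the naive exchange, the receiver sees a different number of rows than the sender wrote, which is the kind of mismatch that can leave a helper process confused or waiting. The framed exchange has no such dependence on the data.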
