Replies: 2 comments 2 replies
---
Hey @rjzamora! My guess here is that the root cause of this is:

Here's a fix I just put together: #2780

Feel free to try this instead; I'd be curious to see if this performs better for you, and it is essentially the approach that the fix I proposed above takes:

```python
import daft
import pyarrow as pa

df = daft.read_parquet("s3://...")
arrow_rb_iters = df.to_arrow_iter()  # stream Arrow record batches out of Daft
table = pa.Table.from_batches(arrow_rb_iters)
```
---
I also ran this through the rest of the team, and we think that it also might be the case that:

Some other notes for benchmarking that you might find helpful:

Generally speaking, if the goal here is to retrieve data and pipe it into cuDF, I'm guessing that:
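As a rough sketch of what that Daft → Arrow → cuDF handoff could look like (assuming `cudf.DataFrame.from_arrow` is an acceptable entry point on the GPU side; the S3 path is a placeholder):

```python
import daft
import pyarrow as pa
import cudf  # requires a CUDA GPU with the RAPIDS cuDF package installed

# Stream record batches out of Daft, assemble an Arrow Table on the host,
# then copy it into GPU memory as a cuDF DataFrame.
df = daft.read_parquet("s3://...")
table = pa.Table.from_batches(df.to_arrow_iter())
gdf = cudf.DataFrame.from_arrow(table)
```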
---
Hello Daft experts,
Congrats on all the great work you are doing with Daft - Cool stuff!
I have a pretty basic question about the expected performance of converting a Daft DataFrame to an Arrow Table. Hopefully this is the right place to ask.
Background: I was curious if I could use Daft to reduce the IO bottleneck of a cuDF workflow that needs to read multiple parquet files from S3 into GPU memory. Since both Daft and cuDF offer simple APIs to convert data to/from Arrow, I started by benchmarking the performance of `daft.read_parquet(paths).to_arrow()`. While running on a simple p3.2xlarge instance, I found the performance of `df.collect()` to be at least 3x faster than `df.to_arrow()` (~50MBps vs ~150MBps). When I scaled out using multiple processes, the performance delta was even larger (1-2Gbps vs 10+Gbps).

Question: Is `df.to_arrow()` expected to have a large overhead compared to `df.collect()`? What makes the `collect` operation so much faster (sometimes faster than the advertised network bandwidth)?

I realize Daft is not intended to be used the way I am using it, so it's perfectly fine if your answer reflects that reality :)
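For concreteness, the comparison I'm making is roughly the following (the paths and timing harness are just illustrative, not my exact benchmark):

```python
import time
import daft

paths = ["s3://my-bucket/part-0.parquet", "s3://my-bucket/part-1.parquet"]  # placeholder paths

t0 = time.perf_counter()
daft.read_parquet(paths).collect()  # execute and materialize in Daft's internal format
t1 = time.perf_counter()
daft.read_parquet(paths).to_arrow()  # execute and convert the result to a single pyarrow.Table
t2 = time.perf_counter()

print(f"collect():  {t1 - t0:.1f}s")
print(f"to_arrow(): {t2 - t1:.1f}s")
```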