Replies: 2 comments 2 replies
---
Hey @rjzamora! My guess here is that the root cause of this is:

Here's a fix I just put together: #2780

Feel free to try this instead; I'd be curious to see if this performs better for you, and it is essentially the approach that the fix I proposed above takes:

```python
import daft
import pyarrow as pa

df = daft.read_parquet("s3://...")
arrow_rb_iters = df.to_arrow_iter()  # stream Arrow record batches out of Daft
table = pa.Table.from_batches(arrow_rb_iters)
```
---
I also ran this through the rest of the team, and we think that it also might be the case that:

Some other notes for benchmarking that you might find helpful:

Generally speaking, if the goal here is to retrieve data and pipe it into cuDF, I'm guessing that:
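As a rough sketch of what that Daft → Arrow → cuDF handoff could look like (assuming `cudf.DataFrame.from_arrow` is an acceptable entry point on the GPU side; the S3 path is a placeholder):

```python
import daft
import pyarrow as pa
import cudf  # requires a CUDA GPU with the RAPIDS cuDF package installed

# Stream record batches out of Daft, assemble an Arrow Table on the host,
# then copy it into GPU memory as a cuDF DataFrame.
df = daft.read_parquet("s3://...")
table = pa.Table.from_batches(df.to_arrow_iter())
gdf = cudf.DataFrame.from_arrow(table)
```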
---
Hello Daft experts,
Congrats on all the great work you are doing with Daft - Cool stuff!
I have a pretty basic question about the expected performance of converting a Daft DataFrame to an Arrow Table. Hopefully this is the right place to ask.
Background: I was curious if I could use Daft to reduce the IO bottleneck of a cuDF workflow that needs to read multiple parquet files from S3 into GPU memory. Since both Daft and cuDF offer simple APIs to convert data to/from Arrow, I started by benchmarking the performance of `daft.read_parquet(paths).to_arrow()`. While running on a simple p3.2xlarge instance, I found the performance of `df.collect()` to be at least 3x faster than `df.to_arrow()` (~50MBps vs ~150MBps). When I scaled out using multiple processes, the performance delta was even larger (1-2Gbps vs 10+Gbps).

Question: Is `df.to_arrow()` expected to have a large overhead compared to `df.collect()`? What makes the `collect` operation so much faster (sometimes faster than the advertised network bandwidth)?

I realize Daft is not intended to be used the way I am using it, so it's perfectly fine if your answer reflects that reality :)
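For concreteness, the comparison I'm making is roughly the following (the paths and timing harness are just illustrative, not my exact benchmark):

```python
import time
import daft

paths = ["s3://my-bucket/part-0.parquet", "s3://my-bucket/part-1.parquet"]  # placeholder paths

t0 = time.perf_counter()
daft.read_parquet(paths).collect()  # execute and materialize in Daft's internal format
t1 = time.perf_counter()
daft.read_parquet(paths).to_arrow()  # execute and convert the result to a single pyarrow.Table
t2 = time.perf_counter()

print(f"collect():  {t1 - t0:.1f}s")
print(f"to_arrow(): {t2 - t1:.1f}s")
```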