Serious performance issues with read #73
I suspect the issue is that the RecordCursor loops through each row, which is highly inefficient, especially considering Parquet is a columnar format. I think the lowest level of cursor/iteration should be at the column chunk level, not at the row level. The row-level cursor should live in some utility library like ParquetFiles.jl to keep this package clean.
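To illustrate the point above about row-at-a-time cursors on columnar data (this is a hypothetical, language-agnostic sketch, not Parquet.jl's actual API; the names `row_cursor` and `column_reader` are made up):

```python
# A toy "columnar file": data is stored column-by-column.
columns = {
    "id": list(range(6)),
    "value": [x * 0.5 for x in range(6)],
}

# Row-level cursor: materializes one small object per row,
# repeatedly jumping across columns (what RecordCursor-style reads do).
def row_cursor(cols):
    n = len(next(iter(cols.values())))
    for i in range(n):
        yield {name: col[i] for name, col in cols.items()}

# Column-level access: hands back each chunk whole,
# with no per-row object allocation.
def column_reader(cols, name):
    return cols[name]

rows = list(row_cursor(columns))
assert rows[2] == {"id": 2, "value": 1.0}
assert column_reader(columns, "value")[:3] == [0.0, 0.5, 1.0]
```

The row cursor allocates one object per row, while the column reader just returns the chunk that is already contiguous on disk, which is why column-level iteration tends to be much faster on columnar formats.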
Data does get read in chunks internally. Could you post some benchmark results comparing times of Parquet.jl master to the last release?
I mean to say, it's dead slow on both master and the latest release.
I was able to trace some issues, PR here: #77 |
The issue seems to be much deeper, in the reader design. In the fork here, I can read the file in 3~4 seconds, but with Parquet.jl (#77) it takes 130s. See the fork here: https://github.com/xiaodaigh/Parquet.jl/blob/master/README.md
I tried the fork, following the readme pointed to above. Not sure if I got it right, but
Try df.tile_count and you will see that it returns the whole column, not just one row. Each name is attached to a vector, not a single value. As for speed, it's probably more due to PC differences? I consistently get 3s on my 6-core machine; I also have 2 SSDs in RAID0. Update BTW, 3s should be considered really slow, as R's arrow can read it and summarise it in about ~0.13s. So lots of room to improve.
Okay, I see the column now. But it looks like you are disregarding the nested schema? So this will work only for non-nested data?
And you seem to be using the same underlying page and column chunk read methods from Parquet.jl, right? So the current performance issues are in the layer above that.
That is correct. And that is the issue bothering me. I don't understand the nestedness, so I actually don't know how to deal with it. Ideally, there might be things in my approach that can be used to handle nested schemas? Honestly, the reason I am not motivated to do the nested schema work is that I have never needed it or used it. Would it be acceptable to have a set of functions just to read non-nested data?
Yes, I think so. Although there's definitely much room to improve for the
I guess. But I would try to figure out why we cannot handle nestedness and be performant as well.
I don't see a reason. The issue is that I don't have the knowledge to contribute nested and performant code. Also, I have no motivation there, since I never use nested columns. So I only see 3 options
I would prefer 1 to 3, but I can only contribute to 1 to prevent 3, not 2. If 2 can happen, it would be great! But if 2 takes a year, then I would rather see 1 than 3 being the eventuality. I can commit to doing 1.
This is basically what the definition levels and repetition levels allow. It is based on Dremel; there should be some material out there if you search for it.
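The Dremel-style encoding mentioned above can be shown with a tiny worked example (a sketch, not Parquet.jl code; `decode_repeated` is a hypothetical helper). For a single repeated field, the repetition level says whether a value starts a new record, and the definition level says how much of the nested path is actually present:

```python
def decode_repeated(values, def_levels, rep_levels, max_def=1):
    """Reassemble a flat column of a repeated field into per-record lists.

    rep level 0 -> this value starts a new record;
    def level == max_def -> a real value is present;
    def level <  max_def -> the list is empty/NULL at that record.
    """
    records = []
    for v, d, r in zip(values, def_levels, rep_levels):
        if r == 0:          # start a new record
            records.append([])
        if d == max_def:    # fully defined: an actual value
            records[-1].append(v)
    return records

# Records [["a", "b"], [], ["c"]] encode as:
vals = ["a", "b", None, "c"]
defs = [1, 1, 0, 1]   # 0 marks the empty list
reps = [0, 1, 0, 0]   # 1 means "same record as previous value"
assert decode_repeated(vals, defs, reps) == [["a", "b"], [], ["c"]]
```

With deeper nesting the levels just take more values (one per level of repetition/optionality), but the reassembly idea is the same.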
Yeah, I am not motivated, that's the issue, since I have never needed it. If you are too busy or also not motivated, do you think we can ask NumFOCUS for funding? I would be happy to read the Dremel paper and implement it for some monetary compensation. If we get funding, do you want to split it? Happy for it to be a solo effort on either side too. Happy either way.
I am open to this. Maybe we should keep it in a submodule to keep the interface clean. The submodule can implement the simpler non-nested read and write using the other lower-level methods.
OK. It will be quite a release, with the writer and column-based read. Or do you want to finish off the writer PR first? Update
Any ideas for the submodule name?
Just wanted to chime in real quick to say that it'd be great if we could improve the read performance! The difference between the Julia Parquet reader and fastparquet (via PyCall) was like night and day for me. But I totally understand if there's no funding etc.
See the repo and the code below.
https://github.com/xiaodaigh/parquet-data-collection/blob/master/test_files.jl
To replicate, just clone the repo and run the above file (included in the repo).
It takes a LONG time.