Block size issue with large files #2
Comments
FYI, I was getting the same error, and it turns out there's a pretty obvious reason when you look at the code. XML is tree-based, whereas Parquet is columnar. This code essentially serializes an XML file as a single row in a Parquet database, even if you restrict it to repeated elements using the XPath argument. That is not at all how I would expect a large XML file to be transferred to a database. It's pretty easy, though, to write a parser that processes individual elements into rows in a Parquet database (and skips the unnecessary step of JSON serialization in between).
The trick is to take the complex type in the schema object that defines the repeated element and use it to define the columns in your Parquet file. (look for it in |
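To illustrate the element-per-row idea, here is a minimal sketch using only the standard library. It is not this project's code: `iter_rows` is a hypothetical helper, it matches rows by tag name rather than by a full XPath, and it assumes each repeated element has flat child elements that map directly to columns.

```python
import io
import xml.etree.ElementTree as ET


def iter_rows(source, row_tag):
    """Yield one dict per <row_tag> element while keeping memory use flat.

    Hypothetical helper: matches on tag name only, not a full XPath.
    """
    # Ask for "start" events too, so the first event hands us the root element.
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == row_tag:
            # Assumption: direct children are flat columns (tag -> text).
            row = {child.tag: child.text for child in elem}
            yield row
            # Drop already-processed elements so the tree does not grow.
            root.clear()


# Example document; the <symbol> and <value> children are made up.
sample = b"""<prices>
  <price><symbol>A</symbol><value>1.5</value></price>
  <price><symbol>B</symbol><value>2.5</value></price>
</prices>"""

for row in iter_rows(io.BytesIO(sample), "price"):
    print(row)
```

Each yielded dict is one candidate Parquet row, and no JSON step is involved. Nested or repeated children would need a richer mapping than this sketch provides.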
It's been a while since I've looked at this. The normal use case is to pass in an XPath, typically one that points at a repeating element, and each match gets converted to a Parquet row. If no XPath is passed in, your entire XML is parsed into a single Parquet row, which takes up a ton of memory and would be a very odd way to store columnar data. The XML parser normally tosses XML elements out of memory when an XML end tag is reached.. if xpath is set to /prices/price
Someday I'll probably implement some sort of counter to dump x number of Python rows into a Parquet row group and append it to the Parquet file to free up Python memory.
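The counter idea could look roughly like the sketch below. It is an assumption about how such batching might be done, not the project's implementation: `batched` and `write_row_groups` are hypothetical names, the `rows_per_group` default is arbitrary, and the writer relies on `pyarrow.parquet.ParquetWriter` and `pyarrow.Table.from_pylist` from a reasonably recent pyarrow.

```python
from itertools import islice


def batched(rows, size):
    """Yield lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def write_row_groups(rows, path, schema, rows_per_group=10_000):
    """Write `rows` (an iterable of dicts) to `path`, one batch at a time."""
    # Imported here so the batching helper stays usable without pyarrow.
    import pyarrow as pa
    import pyarrow.parquet as pq

    with pq.ParquetWriter(path, schema) as writer:
        for batch in batched(rows, rows_per_group):
            # Each write_table call emits at least one row group; the batch
            # can be garbage collected once it has been written.
            writer.write_table(pa.Table.from_pylist(batch, schema=schema))
```

Only one batch of Python rows is held in memory at a time, so peak memory depends on `rows_per_group` rather than on the size of the XML file.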
I am trying the block size option to increase the block size, since I have a 50 MB file.
Can you please provide some input on what needs to be done?
I get the pyArrow error - straddle block size
Thanks!!