
Block size issue with large files #2

Open
varsha1288 opened this issue Aug 24, 2021 · 2 comments

Comments

@varsha1288

I am trying the block size option to increase the block size, since I have a 50 MB file.

Can you please provide some input on what needs to be done?

I get the PyArrow "straddle block size" error.

Thanks!!

@elibixby

elibixby commented Oct 1, 2021

FYI, I was getting the same error; it turns out there's a pretty obvious reason when you look at the code. XML is tree-based, whereas Parquet is columnar. The way this code serializes an XML file is essentially as a single row in a Parquet table, even if you restrict it to repeated elements using the XPath argument. This is not at all how I would expect a large XML file to be transferred to a database.

It's pretty easy to write a parser that processes individual elements into rows in a Parquet table, though (and skips the unnecessary intermediate step of JSON serialization).

import xml.etree.ElementTree as ET
import pyarrow
import pandas as pd

def parse(row_schema, xml_file):
    rows = []
    # iterparse yields each element once its end tag has been read,
    # so the document is walked incrementally
    for _, element in ET.iterparse(xml_file):
        if row_schema.is_valid(element):
            # decode the matching element into a dict, one row per element
            rows.append(row_schema.decode(element, validation='skip'))
    return pyarrow.Table.from_pandas(pd.DataFrame(rows))

The trick is just using the complex type from the schema object that defines the repeated element you want as the columns of your Parquet file (look for it in myschema.complex_types).
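
To make that concrete, here's a hedged sketch of how you might pull that complex type out with the xmlschema package and feed it to parse() above. The schema file name, type name, and data file below are placeholders, not anything from this project:

import xmlschema

# placeholder file and type names, for illustration only
schema = xmlschema.XMLSchema('prices.xsd')
row_schema = next(
    t for t in schema.complex_types
    if t.name and t.name.endswith('priceType')
)
table = parse(row_schema, 'prices.xml')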

@davlee1972
Collaborator

davlee1972 commented Jul 8, 2024

It's been a while since I've looked at this..

The normal use case is to pass in an XML path, which is typically a repeating element that gets converted into a Parquet row.
-p XPATHS, --xpaths XPATHS

If no XPath is passed in, then your entire XML document is parsed into a single Parquet row, which takes up a ton of memory; storing columnar data in a single row would be a very odd use case..

The XML parser normally tosses XML elements out of memory when an end tag is reached..

If xpath is set to /prices/price:

<prices>
  <date_sent>2024-06-01</date_sent>
  <price>123</price> **tossed out of XML memory when converted into a Python row**
  <price>456</price> **tossed out of XML memory when converted into a Python row**
  <price>789</price> **tossed out of XML memory when converted into a Python row**
</prices>

**Python rows are converted into a PyArrow table and then written to a columnar Parquet file**
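
That end-tag behavior is roughly what ElementTree's iterparse gives you. Here's a minimal sketch of the streaming pattern using the /prices/price example above; iter_prices is an illustrative name, not part of this project:

import xml.etree.ElementTree as ET

def iter_prices(xml_file):
    # an 'end' event fires when each closing tag is read, so every
    # <price> can be turned into a row and then discarded
    for _, elem in ET.iterparse(xml_file, events=('end',)):
        if elem.tag == 'price':
            yield elem.text
            elem.clear()  # drop the element's content to keep memory flat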

Someday I'll probably implement some sort of counter to dump x number of Python rows into a Parquet row group and append it to the Parquet file, to free up Python memory..
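
A minimal sketch of what that batching could look like with pyarrow's ParquetWriter, where each write_table() call appends a row group to the file; write_in_row_groups and batch_size are illustrative names, not part of this project:

import pyarrow as pa
import pyarrow.parquet as pq

def write_in_row_groups(rows, out_path, schema, batch_size=10_000):
    # each write_table() call appends a row group to the file,
    # so peak memory is bounded by batch_size rows
    with pq.ParquetWriter(out_path, schema) as writer:
        batch = []
        for row in rows:  # rows are plain dicts
            batch.append(row)
            if len(batch) >= batch_size:
                writer.write_table(pa.Table.from_pylist(batch, schema=schema))
                batch.clear()
        if batch:  # flush the final partial batch
            writer.write_table(pa.Table.from_pylist(batch, schema=schema))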

• If more than one XPath is passed into the script, the code will find the common parent XPath and treat that parent XPath as the new row trigger.
