
Block size issue with large files #2

Open
varsha1288 opened this issue Aug 24, 2021 · 2 comments

Comments

@varsha1288

I am trying the block size option to increase the block size, since I have a 50 MB file.

Can you please provide some input on what needs to be done?

I get the PyArrow "straddle block size" error.

Thanks!!

@elibixby

elibixby commented Oct 1, 2021

FYI, I was getting the same error; it turns out there's a pretty obvious reason when you look at the code. XML is tree-based, whereas Parquet is columnar. The way this code serializes an XML file is essentially as a single row in a Parquet table, even if you restrict it to repeated elements using the XPath argument. This is not at all how I would expect a large XML file to be transferred to a database.

It's pretty easy to write a parser that processes individual elements into rows in a Parquet table, though (and skips the unnecessary intermediate step of JSON serialization).

import xml.etree.ElementTree as ET
import pyarrow
import pandas as pd

def parse(row_schema, xml_file):
    rows = []
    # iterparse yields each element once its end tag has been read,
    # so the document is walked incrementally
    for _, element in ET.iterparse(xml_file):
        if row_schema.is_valid(element):
            # decode the matching element into a dict, one row per element
            rows.append(row_schema.decode(element, validation='skip'))
    return pyarrow.Table.from_pandas(pd.DataFrame(rows))

The trick is just using the complex type from the schema object that defines the repeated element you want as the columns of your Parquet file (look for it in myschema.complex_types).
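
To make that concrete, here's a hedged sketch of how you might pull that complex type out with the xmlschema package and feed it to parse() above. The schema file name, type name, and data file below are placeholders, not anything from this project:

import xmlschema

# placeholder file and type names, for illustration only
schema = xmlschema.XMLSchema('prices.xsd')
row_schema = next(
    t for t in schema.complex_types
    if t.name and t.name.endswith('priceType')
)
table = parse(row_schema, 'prices.xml')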

@davlee1972
Collaborator

davlee1972 commented Jul 8, 2024

It's been a while since I've looked at this..

The normal use case is to pass in an XML path, which is typically a repeating element that gets converted into a Parquet row.
-p XPATHS, --xpaths XPATHS

If no XPath is passed in, then your entire XML document is parsed into a single Parquet row, which takes up a ton of memory; storing columnar data in a single row would be a very odd use case..

The XML parser normally tosses XML elements out of memory when an end tag is reached..

If xpath is set to /prices/price:

<prices>
  <date_sent>2024-06-01</date_sent>
  <price>123</price> **tossed out of XML memory when converted into a Python row**
  <price>456</price> **tossed out of XML memory when converted into a Python row**
  <price>789</price> **tossed out of XML memory when converted into a Python row**
</prices>

**Python rows are converted into a PyArrow table and then written to a columnar Parquet file**
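
That end-tag behavior is roughly what ElementTree's iterparse gives you. Here's a minimal sketch of the streaming pattern using the /prices/price example above; iter_prices is an illustrative name, not part of this project:

import xml.etree.ElementTree as ET

def iter_prices(xml_file):
    # an 'end' event fires when each closing tag is read, so every
    # <price> can be turned into a row and then discarded
    for _, elem in ET.iterparse(xml_file, events=('end',)):
        if elem.tag == 'price':
            yield elem.text
            elem.clear()  # drop the element's content to keep memory flat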

Someday I'll probably implement some sort of counter to dump x number of Python rows into a Parquet row group and append it to the Parquet file, to free up Python memory..
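
A minimal sketch of what that batching could look like with pyarrow's ParquetWriter, where each write_table() call appends a row group to the file; write_in_row_groups and batch_size are illustrative names, not part of this project:

import pyarrow as pa
import pyarrow.parquet as pq

def write_in_row_groups(rows, out_path, schema, batch_size=10_000):
    # each write_table() call appends a row group to the file,
    # so peak memory is bounded by batch_size rows
    with pq.ParquetWriter(out_path, schema) as writer:
        batch = []
        for row in rows:  # rows are plain dicts
            batch.append(row)
            if len(batch) >= batch_size:
                writer.write_table(pa.Table.from_pylist(batch, schema=schema))
                batch.clear()
        if batch:  # flush the final partial batch
            writer.write_table(pa.Table.from_pylist(batch, schema=schema))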

• If more than one XPath is passed into the script, the code will find the common parent XPath and treat that parent XPath as the new row trigger.
