Skip to content

Latest commit

 

History

History
97 lines (65 loc) · 2.44 KB

README.md

File metadata and controls

97 lines (65 loc) · 2.44 KB

Python ORC Reader

What is it?

This is my attempt to write an ORC reader in python. The situation is that we have a lot of ORC files on local disk to consume by Python but there is no efficient way to access the file without converting it to CSV or compatible format.

My approach is to use orc-core java library to read ORC file, then use py4j to bridge between Python and Java.

This approach is not yet validated and or may suffer from performance issue due to overhead. The proper approach would be using C++ reader from orc-core library. I want to go through this as an exercise to know more about ORC and py4j.

Installation

Until this package is available on PIP, you will have to install the package as following:

Compile java gateway

cd java-gateway
mvn clean compile assembly:single

Run setup.py script to install the package to the system

cd python
python setup.py install

Usage

After you setup the python package, you can create a reader as following:

from orcreader import OrcReader
reader = OrcReader(abs_path_orc_file)
reader.open()

To access the schema and number of records

print reader.num_rows
print reader.schema

You can iterate through the record of the file by looping through the reader

for row in reader:
  print row

Or you can do batching with batch(size)

# loop through 100 records at a time
for batch in reader.batch(100):
  print batch

Make sure to close the reader after you are done

reader.close() 

Alternatively, you can also use a with statement and the context manager will manage the life cycle of the reader for you.

with OrcReader(abs_path_orc_file) as reader:
     print reader.schema

There are some limitation at the monent such as we need an absolute path to the orc file. I will fix these later when I have time.

You can also try the orc2csv script to convert from ORC to CSV.

orc2csv /path/to/orcfile

TODOs

  • Auto start / top java gateway
  • Auto build gateway compiling
  • Unit tests
  • Publish package
  • Column projection and filtering
  • Wildcard directory support

Known Issues

  • Type conversion may be incorrect. I didn't spend a lot of time checking the correct type conversion between Python and Java.
  • There may be performance overhead when processing big ORC file.