odata2avro

odata2avro is a Python command-line tool that automatically converts OData datasets to Avro. Combined with standard Hadoop tooling, it makes it simple to ingest OData data from Microsoft Azure DataMarket into Hadoop.

Usage:

$ odata2avro ODATA_XML AVRO_SCHEMA AVRO_FILE

This command reads data from ODATA_XML and creates two files: AVRO_SCHEMA, which holds the Avro schema in JSON format, and AVRO_FILE, which holds the converted data.
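To illustrate what a JSON Avro schema derived from OData XML can look like, here is a minimal sketch using only the Python standard library. It is not odata2avro's actual implementation: the function name `infer_avro_schema`, the `EDM_TO_AVRO` type mapping, and the sample properties (`Brand`, `Seats`) are assumptions made for illustration. Only the Atom and OData namespace URIs are standard.

```python
import json
import xml.etree.ElementTree as ET

# Standard namespaces used by OData Atom/XML feeds.
NS = {
    "d": "http://schemas.microsoft.com/ado/2007/08/dataservices",
    "m": "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata",
}
M_TYPE = "{%s}type" % NS["m"]  # the m:type attribute in ElementTree notation

# Assumed mapping for a few Edm types; anything else falls back to string.
EDM_TO_AVRO = {
    "Edm.Int32": "int",
    "Edm.Int64": "long",
    "Edm.Double": "double",
    "Edm.Boolean": "boolean",
    "Edm.String": "string",
}


def infer_avro_schema(odata_xml, record_name):
    """Build an Avro record schema (as a dict) from the first entry of a feed."""
    root = ET.fromstring(odata_xml)
    props = root.find(".//m:properties", NS)
    if props is None:
        raise ValueError("no m:properties element found")
    fields = []
    for prop in props:
        name = prop.tag.split("}", 1)[1]  # drop the "{namespace}" prefix
        edm_type = prop.get(M_TYPE, "Edm.String")  # untyped means string
        avro_type = EDM_TO_AVRO.get(edm_type, "string")
        # Declare every field nullable, since OData properties may be null.
        fields.append({"name": name, "type": ["null", avro_type]})
    return {"type": "record", "name": record_name, "fields": fields}


# Hypothetical feed with a single entry, for demonstration only.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
      xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata">
  <entry><content type="application/xml"><m:properties>
    <d:Brand>VOLVO</d:Brand>
    <d:Seats m:type="Edm.Int32">5</d:Seats>
  </m:properties></content></entry>
</feed>"""

if __name__ == "__main__":
    print(json.dumps(infer_avro_schema(SAMPLE, "cars"), indent=2))
```

The sketch reads only the first entry and treats every field as nullable; a real converter would likely need to inspect all entries and handle more Edm types.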

Example: Ingest data from Azure DataMarket to Hive/Impala

# Download OData data in XML format
$ curl 'https://api.datamarket.azure.com/opendata.rdw/VRTG.Open.Data/v1/KENT_VRTG_O_DAT?$top=100' > cars.xml

# Convert data to Avro
$ odata2avro cars.xml cars.avsc cars.avro

# Upload to HDFS
$ hdfs dfs -put cars.avro cars.avsc /tmp

# Create Avro-backed Hive table using Avro schema stored in /tmp/cars.avsc
$ hive -e "
  CREATE TABLE cars
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  TBLPROPERTIES ('avro.schema.url'='hdfs:///tmp/cars.avsc');"

# Load data from /tmp/cars.avro to the cars table
$ hive -e "LOAD DATA INPATH '/tmp/cars.avro' INTO TABLE cars"

# Query with Impala
$ impala-shell -i <impala-daemon-ip> -q "REFRESH cars; select count(*) from cars"
+----------+
| count(*) |
+----------+
|      100 |
+----------+
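
When several datasets are ingested this way, the CREATE TABLE statement from the example can be generated instead of typed by hand. The following is a small sketch, assuming a hypothetical helper named `hive_avro_ddl`; it only builds the statement text and does not call Hive.

```python
def hive_avro_ddl(table, schema_url):
    """Return a CREATE TABLE statement for an Avro-backed Hive table."""
    # Guard against accidental injection through the table name.
    if not table.isidentifier():
        raise ValueError("unexpected table name: %r" % table)
    if "'" in schema_url:
        raise ValueError("schema URL must not contain a single quote")
    return (
        "CREATE TABLE {table}\n"
        "ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'\n"
        "STORED AS INPUTFORMAT "
        "'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'\n"
        "OUTPUTFORMAT "
        "'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'\n"
        "TBLPROPERTIES ('avro.schema.url'='{schema_url}');"
    ).format(table=table, schema_url=schema_url)


if __name__ == "__main__":
    # Same table and schema location as in the example above.
    print(hive_avro_ddl("cars", "hdfs:///tmp/cars.avsc"))
```

The printed statement can be passed to `hive -e` in the same way as the hand-written one.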

Installation:

pip install odata2avro

Contributions:

Please create an issue if you spot any problem or bug. We'll try to get back to you as soon as possible.

Authors:

Created with passion by Marcel and Daan.