# Crunching Apache Parquet Files with Apache Flink

This repo includes sample code to set up Flink dataflows that process Parquet files. The CSV datasets under `resources/` are the Restaurant Scores datasets downloaded from SF OpenData. For more information, please see this post.

### Generating the Avro Model Classes

If you make any changes to the Avro schema files (`*.avsc`) under `resources/`, you should regenerate the model classes:

```sh
./compile_schemas.sh
```
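The generated classes are ordinary Avro `SpecificRecord` types, so downstream code constructs records through their builders. A minimal sketch, assuming the schemas define a `Business` record with a string `name` field (the package and field names here are assumptions; the actual ones come from the `.avsc` files):

```java
// Hypothetical usage of a generated Avro model class.
// "Business" and its "name" field are assumptions; adjust to the real schema.
import yigitbasi.nezih.Business; // assumed package of the generated class

public class ModelExample {
    public static void main(String[] args) {
        Business business = Business.newBuilder()
                .setName("Some Restaurant")
                .build();
        System.out.println(business);
    }
}
```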

### Step 1: Converting the CSV Data Files to the Parquet Format

The command below converts the CSV files under `resources/` and writes them in Parquet format to the `/tmp/business`, `/tmp/violations`, and `/tmp/inspections` directories:

```sh
mvn clean package exec:java -Dexec.mainClass="yigitbasi.nezih.ConvertToParquet"
```
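Under the hood, a conversion like this typically parses each CSV row into an Avro record and hands it to a Parquet writer. A minimal sketch of that pattern using `parquet-avro` (not the repo's actual `ConvertToParquet` code; the CSV filename, the `Business` class, and its fields are assumptions):

```java
// Sketch of the CSV-to-Parquet conversion pattern, under stated assumptions.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class CsvToParquetSketch {
    public static void main(String[] args) throws IOException {
        // Assumed input file name; the repo's actual CSV names may differ.
        List<String> lines = Files.readAllLines(Paths.get("resources/businesses.csv"));

        try (ParquetWriter<Business> writer = AvroParquetWriter
                .<Business>builder(new Path("/tmp/business/part-0.parquet"))
                .withSchema(Business.getClassSchema())
                .build()) {
            for (String line : lines.subList(1, lines.size())) { // skip header row
                // Naive split for illustration; real CSV parsing should
                // handle quoting and embedded commas.
                String[] cols = line.split(",");
                writer.write(Business.newBuilder()
                        .setBusinessId(Integer.parseInt(cols[0])) // assumed field
                        .setName(cols[1])                         // assumed field
                        .build());
            }
        }
    }
}
```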

### Step 2: Running the Flink Dataflow

```sh
mvn clean compile assembly:single
java -jar target/FlinkParquet-0.1-SNAPSHOT-jar-with-dependencies.jar
```
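For orientation, reading the Parquet output from Step 1 into a Flink batch dataflow generally goes through Flink's Hadoop compatibility layer with `AvroParquetInputFormat`. A minimal sketch of that pattern using the legacy DataSet API and a generated `Business` class (this illustrates the general approach, not necessarily the exact code in this repo):

```java
// Sketch: reading Parquet files into a Flink DataSet via the
// Hadoop compatibility layer. Assumes the generated Business class.
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.parquet.avro.AvroParquetInputFormat;

public class ReadParquetSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        Job job = Job.getInstance();
        // AvroParquetInputFormat produces (Void, record) pairs.
        HadoopInputFormat<Void, Business> input = new HadoopInputFormat<>(
                new AvroParquetInputFormat<Business>(), Void.class, Business.class, job);
        FileInputFormat.addInputPath(job, new Path("/tmp/business"));

        DataSet<Tuple2<Void, Business>> businesses = env.createInput(input);
        businesses.first(10).print();
    }
}
```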