Oxbow is a simple project that converts an existing storage location containing Apache Parquet files into a Delta Lake table. It is intended to run either as an AWS Lambda function or as a command line application.
The project is named after Oxbow lakes to keep with the lake theme.
Executing cargo build --release from a clone of this repository will build the command line binary oxbow, which can be used directly to convert a directory full of .parquet files into a Delta table.
This is an in place operation and will convert the specified table location into a Delta table!
% oxbow --table ./path/to/my/parquet-files
% export AWS_REGION=us-west-2
% export AWS_SECRET_ACCESS_KEY=xxxx
# Set other AWS environment variables
% oxbow --table s3://my-bucket/prefix/to/parquet
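Because the conversion happens in place, it can be handy to check whether a local directory already looks like a Delta table before or after running oxbow. The sketch below assumes the standard Delta Lake layout, where the transaction log lives in a _delta_log directory alongside the data files. The function name is_delta_table is hypothetical and is not part of oxbow.

```shell
#!/bin/sh
# Hypothetical helper (not part of oxbow): succeeds when the given local
# directory contains a Delta transaction log with at least one JSON commit.
is_delta_table() {
    dir="$1"
    # Delta Lake keeps its transaction log in <table>/_delta_log/
    [ -d "$dir/_delta_log" ] || return 1
    # Look for at least one commit file such as 00000000000000000000.json
    for f in "$dir"/_delta_log/*.json; do
        [ -f "$f" ] && return 0
    done
    return 1
}
```

For example, is_delta_table ./path/to/my/parquet-files && echo "already a Delta table" prints the message only when a commit file is present. This check only covers local paths; an s3:// location would need a different listing mechanism.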
The deployment.tf file contains the necessary Terraform to provision the function, a DynamoDB table for locking, an S3 bucket, and IAM permissions.
After configuring the necessary authentication for Terraform, the following steps can be used to provision:
cargo lambda build --release --output-format zip --bin oxbow-lambda
terraform init
terraform plan
terraform apply
ℹ️ Terraform configures the Lambda to run with the smallest amount of memory allowed. For bucket locations with massive
Building and testing can be done with cargo: cargo test.
In order to deploy this in AWS Lambda, it must first be built with the cargo lambda command line tool, e.g.:
cargo lambda build --features lambda --release --output-format zip
This will produce the file target/lambda/oxbow-lambda/bootstrap.zip, which can be uploaded directly in the web console or referenced in the Terraform (see deployment.tf).
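Before uploading the archive or running Terraform, a quick check that the build actually produced it can save a failed deployment. This is a minimal sketch that assumes the output path stated above as its default. The function name require_artifact is hypothetical and is not part of oxbow or cargo lambda.

```shell
#!/bin/sh
# Hypothetical pre-deploy check (not part of oxbow): fail early when the
# Lambda archive is missing or empty.
require_artifact() {
    # Default to the path the build is documented to produce
    artifact="${1:-target/lambda/oxbow-lambda/bootstrap.zip}"
    if [ ! -s "$artifact" ]; then
        echo "missing or empty artifact: $artifact" >&2
        return 1
    fi
    echo "found artifact: $artifact"
}
```

Calling require_artifact with no arguments from the repository root checks the default path; a different path can be passed as the first argument.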
When running oxbow via the command line, it is a one-time operation. It will take an existing directory or location full of .parquet files and create a Delta table out of it.
This repository is intentionally licensed under the AGPL 3.0. If your organization is interested in re-licensing this function for re-use, contact me via email for commercial licensing terms: [email protected]