This repo is the accompaniment to the blog post where we explore the basics of setting up Hive, JDBC, and REST Iceberg catalogs, and test writing to and reading from Iceberg using Spark and Trino.
You will need the following to get things working:
- Git
- Docker/Podman Compose
Clone this repo down, including submodules:
```bash
git clone --recurse-submodules git@github.com:binayakd/exploring-apache-iceberg.git
```
Inside the cloned folder, trigger the image builds:
```bash
docker compose build
```
This will build the following images:

- `jupyter-spark`: this is the Jupyter Lab based development environment with all the client dependencies installed
- `hive-metastore`: this will be used as the Iceberg Hive Catalog
- `iceberg-rest-catalog`: this is a Python Iceberg REST catalog by Kevin Liu, which I have forked and added to this repo as a submodule
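To give a rough idea of how these pieces fit together, a Spark session inside the `jupyter-spark` environment might be pointed at the Hive Metastore catalog along the following lines. This is only a sketch: the service names, port, warehouse bucket, and S3 endpoint are assumptions for illustration (the notebooks contain the actual configuration), and it assumes the Iceberg Spark runtime is already on the classpath.

```python
from pyspark.sql import SparkSession

# Sketch only: "hive-metastore", "minio", the port, and the warehouse bucket
# are assumed names for illustration; see the notebooks for the real values.
spark = (
    SparkSession.builder.appName("iceberg-hive-sketch")
    .config(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    )
    .config("spark.sql.catalog.hive_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.hive_catalog.type", "hive")
    .config("spark.sql.catalog.hive_catalog.uri", "thrift://hive-metastore:9083")
    .config("spark.sql.catalog.hive_catalog.warehouse", "s3a://warehouse/")
    .config("spark.sql.catalog.hive_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.hive_catalog.s3.endpoint", "http://minio:9000")
    .getOrCreate()
)
```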
Now start all the services:
```bash
docker compose up
```
On top of the 3 images mentioned above, this will also start the following images:
- `minio`: this will be our local S3 alternative, the object storage holding the data
- `mc`: this is the MinIO client image, which is started to automatically create the initial bucket in MinIO, then shuts down
- `postgres`: this is the Postgres instance that will be used by the catalogs. An init script in the `postgres-init` folder is used to create the required databases on first startup
- `trino`: this is the Trino server, running as a single-node cluster, with all the configs in the `trino-config` folder
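Once everything is up, Trino listens on its HTTP port and can be queried from the notebooks or any other client. As a hedged sketch using the `trino` Python client (the host, port, user, and catalog/schema names below are assumptions; the actual catalog names are defined in the `trino-config` folder):

```python
from trino.dbapi import connect

# Assumed connection details: Trino's default port 8080, an "iceberg" catalog,
# and a "default" schema; adjust to match the files in trino-config.
conn = connect(host="localhost", port=8080, user="admin", catalog="iceberg", schema="default")
cur = conn.cursor()
cur.execute("SHOW TABLES")
print(cur.fetchall())
```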
Once the setup is done, the Jupyter Lab instance can be accessed at http://localhost:8888. There you will see the list of Jupyter notebooks, which you can follow along in order:
- 00-setup.ipynb
- 01-iceberg-hive.ipynb
- 02-iceberg-jdbc.ipynb
- 03-iceberg-rest.ipynb
These are located in the `workspace` folder in this repo.
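For a flavour of the write/read round trip the notebooks walk through, a minimal PySpark example against a configured Iceberg catalog might look like the following (the catalog, namespace, and table names here are made up for illustration):

```python
# Assumes a SparkSession ("spark") with an Iceberg catalog registered as
# "hive_catalog", as in the earlier sketch; names here are illustrative.
spark.sql("CREATE NAMESPACE IF NOT EXISTS hive_catalog.demo")

df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.writeTo("hive_catalog.demo.users").createOrReplace()

spark.table("hive_catalog.demo.users").show()
```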
The data created when running the notebooks will be saved under the `local-data` folder.
If you run into permission issues in the `workspace` and `local-data` folders, you can run the `permissions-fix.sh` script to try to fix them.