This repo contains the configuration necessary to spin up a modern, open-source data lake powered by MinIO. It can be used for training, experimentation, and hands-on demonstration. It is not production-grade.
This data lake is pre-configured for ease of use with the guided exercises in the MinIO Modern Datalakes training series on YouTube. Follow along with the guided exercises there to learn more.
The stack is built from the following components, all of which run in containers via Docker Compose; the only tools you need installed locally are Docker (with the Compose plugin) and the MinIO Client (mc), which the steps below rely on. Refer to the links below for more detail on each component:
- MinIO - S3-compatible object storage layer for data
- Dremio - A lakehouse management service that offers a data catalog, SQL interface, and Iceberg-compatible compute engine.
- Apache Iceberg - The table format we use to store our data in the lake, giving us many benefits like ACID compliance, schema evolution, and time travel (see the sketch after this list).
- Project Nessie - Git-like version control for data.
- Apache Spark - Our compute engine for data ingestion and transformation.
- JupyterLab - An interactive Python environment for data science and data engineering.
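As a taste of what the Iceberg and Nessie bullets mean in practice, here is a minimal SQL sketch of schema evolution and time travel. The table name example_table is hypothetical, and exact syntax (particularly AT TIMESTAMP) varies by query engine and version:

-- Schema evolution: add a column without rewriting existing data files
-- (nessie.example_table is a hypothetical table, for illustration only)
ALTER TABLE nessie.example_table ADD COLUMNS (discount_pct DOUBLE);

-- Time travel: query the table as it existed at an earlier point in time
SELECT * FROM nessie.example_table AT TIMESTAMP '2024-03-25 18:00:00';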
- Copy the .env.example file to .env
$ cp .env.example .env
- We will spin up Dremio, Nessie, and MinIO via docker compose.
$ docker compose up -d
- Tail the docker compose logs and wait until you see the following, indicating all containers have started up:
$ docker compose logs -f
...output truncated...
dremio | 2024-03-25 18:08:25,886 [main] INFO com.dremio.dac.server.DremioServer - Started on http://localhost:9047
dremio | Dremio Daemon Started as master
- Using the mc command-line tool, create an alias for the MinIO server called minio1. Then create a bucket called warehouse where we will store our Iceberg tables and metadata.
$ mc alias set minio1 http://localhost:9050 minioadmin minioadmin
Added `minio1` successfully.
$ mc mb minio1/warehouse
Bucket created successfully `minio1/warehouse`.
- Execute the Dremio initialization script, which will create the first user and set up the connection between Dremio, Nessie, and MinIO.
$ sh init_dremio.sh
...output truncated...
-----------------------------------------------
Dremio first time lab initialization complete
-----------------------------------------------
- Log in to Dremio at http://localhost:9047/ using username=admin and password=bad4admins.
- You should now be able to run SQL commands against the Iceberg tables in the data lake using the Dremio SQL Runner. For example, try this:
-- Create a fact_orders table partitioned by day:
CREATE TABLE nessie.fact_orders (
    order_id BIGINT,
    customer_id BIGINT,
    order_amount DECIMAL(10, 2),
    order_ts TIMESTAMP
) PARTITION BY (DAY(order_ts));

-- Insert three rows into the fact_orders table:
INSERT INTO nessie.fact_orders VALUES
    (111, 456, 36.17, '2024-01-07 08:12:23'),
    (112, 789, 67.15, '2024-01-07 08:23:00'),
    (113, 789, 21.00, '2024-01-08 11:12:23');

-- Retrieve all the inserted rows to view them:
SELECT * FROM nessie.fact_orders;
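Since Nessie backs the catalog, you can also branch the lake like a Git repo. Here is a minimal sketch, assuming the source is named nessie as above; the branch name etl_sandbox is hypothetical, and exact branch syntax depends on your Dremio version:

-- Create an isolated branch; changes there leave main untouched
CREATE BRANCH etl_sandbox IN nessie;

-- Insert a hypothetical test row on the branch only
INSERT INTO nessie.fact_orders AT BRANCH etl_sandbox
VALUES (999, 123, 1.00, '2024-01-09 09:00:00');

-- main still shows three rows; the branch shows four
SELECT COUNT(*) FROM nessie.fact_orders AT BRANCH etl_sandbox;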
Congrats, you are now running a modern data lake stack powered by MinIO entirely on your machine :-).
If you would like to interact with Iceberg tables using Python and Spark instead of Dremio and SQL, simply do the following:
- Start up a JupyterLab notebook container that is configured to run Apache Spark as a single-node cluster.
$ docker compose --profile with_ipython_notebook up -d
- Navigate to JupyterLab in your browser at http://127.0.0.1:9070/lab
- Inside JupyterLab, run spark_table_create.ipynb to create an example Iceberg table and register it with Nessie.
You can switch back and forth between Dremio with SQL and Spark with Python at any time. Modern data lakes make it relatively easy to swap in and out components like compute engines, runtime environments, table formats, etc.
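For example, because Dremio and Spark share the same Nessie catalog, a table created on one side is visible to the other. Assuming the Spark session in the notebook also registers the catalog under the name nessie (check the notebook's Spark config; this is an assumption, not something the repo guarantees), the fact_orders table created earlier in Dremio can be read from a Spark SQL cell:

-- Run via spark.sql(...) in the notebook; the catalog name nessie must
-- match the catalog configured in the Spark session.
SELECT order_id, order_amount
FROM nessie.fact_orders
ORDER BY order_ts;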
- Spin down all containers and delete their volumes with this command:
$ docker compose --profile with_ipython_notebook down --volumes