This is a "small" data lake created for the NewYorkers assessment. It includes all the strictly mandatory components (Spark, Hadoop, Hive) as well as some useful extra tools (Airflow, Hue). All the services run in Docker containers.
We fetch the data from the yelp-dataset. Place the data under the folder `mnt/airflow/files/yelp-dataset` and it will be automatically copied into the container, as in the screenshot below.
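A minimal sketch of the manual placement, assuming the dataset archive was already downloaded from Kaggle (the archive path is an assumption):

```bash
# Create the target folder watched by the Airflow container (path from this README)
mkdir -p mnt/airflow/files/yelp-dataset

# Extract the downloaded archive into it (archive location is a placeholder)
unzip ~/Downloads/yelp-dataset.zip -d mnt/airflow/files/yelp-dataset
```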
To fetch the data via the Kaggle API instead, the Kaggle token needs to be placed in `docker/airflow/api/kaggle/kaggle.json` (NOT TESTED).
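The token file follows the standard Kaggle API format; a sketch that writes it with placeholder credentials:

```bash
# Write the Kaggle API token (standard kaggle.json format; values are placeholders)
mkdir -p docker/airflow/api/kaggle
cat > docker/airflow/api/kaggle/kaggle.json <<'EOF'
{"username": "your-kaggle-username", "key": "your-kaggle-api-key"}
EOF
```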
To start all the services:

```bash
bash start.sh
```

The process can take some time during the first run, since the Docker images have to be built.

To stop all the services:

```bash
bash stop.sh
```

To reset the environment (the created Docker images will be deleted):

```bash
bash reset.sh
```
| Service | URL |
|---|---|
| Airflow | http://localhost:8080 |
| Hue | http://localhost:32762 |
| HDFS | http://localhost:32763 |
| Spark | http://localhost:32766 |
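Once the stack is up, a quick reachability sketch for these UIs (it only verifies that each port answers):

```bash
# Print the HTTP status code for each service UI
for url in http://localhost:8080 http://localhost:32762 http://localhost:32763 http://localhost:32766; do
  # -s silences progress, -o /dev/null discards the body, -w prints the status code
  curl -s -o /dev/null -w "%{http_code} $url\n" "$url"
done
```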
The following connections need to be configured in Airflow. The Hive connection:

- Conn id: hive-conn
- Conn Type: Hive Server 2 Thrift
- Host: hive-server
- Login: hive (check docker/postgres/init-hive-db.sh)
- Password: hive (check docker/postgres/init-hive-db.sh)
- Port: 10000
The Spark connection:

- Conn id: spark-conn
- Conn Type: Spark
- Host: spark://spark-master
- Port: 7077
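If you prefer the CLI over the Airflow UI, the same connections can be created from inside the Airflow container; a sketch assuming Airflow 2.x (replace `<airflow_container>` with the real container ID or name):

```bash
# Create the Hive connection (credentials from docker/postgres/init-hive-db.sh)
docker exec -it <airflow_container> airflow connections add hive-conn \
  --conn-type hiveserver2 \
  --conn-host hive-server \
  --conn-login hive \
  --conn-password hive \
  --conn-port 10000

# Create the Spark connection
docker exec -it <airflow_container> airflow connections add spark-conn \
  --conn-type spark \
  --conn-host spark://spark-master \
  --conn-port 7077
```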
The developed pipeline is named `ingest_yelp_from_local`.
If you are not familiar with Airflow, you can check the Airflow documentation.
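To run the pipeline from the CLI instead of the UI, a sketch (assuming Airflow 2.x; the container name is a placeholder):

```bash
# Unpause and trigger the DAG from inside the Airflow container
docker exec -it <airflow_container> airflow dags unpause ingest_yelp_from_local
docker exec -it <airflow_container> airflow dags trigger ingest_yelp_from_local
```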
The result can be checked in Hue.
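Besides Hue, the loaded tables can be listed directly over HiveServer2; a sketch using beeline from inside the Hive container (the container name is a placeholder):

```bash
# List the Hive tables created by the pipeline (credentials from docker/postgres/init-hive-db.sh)
docker exec -it <hive_container> beeline \
  -u jdbc:hive2://localhost:10000 -n hive -p hive \
  -e "SHOW TABLES;"
```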
To debug the pipeline you can enter the container with the following commands:

```bash
# list all the container IDs
docker ps
# enter the container shell
docker exec -it <container_id> /bin/bash
```
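The service logs can also be inspected without opening a shell:

```bash
# Tail the last 100 log lines of a container; -f follows new output
docker logs --tail 100 -f <container_id>
```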
If any of the services is not working, I suggest restarting all of them with these commands:

```bash
bash stop.sh
bash start.sh
```
Requirements:

- Docker (allocate 8 GB of RAM and 50 GB of disk)