The following document describes the model used to build the songplays datamart table and the respective ETL process.
The songplays
datamart provide details about the musical taste of our customers and can help us to improve our recomendation system.
This document describes the model of songplays
table datamart on sparkify_app
schema inside a container sparkify_postgres
, and the Python code to load new data. The production directory and data must be simmilar to those in mnt/data/log_data
and mnt/data/song_data
paths in this repository.
First you need to have the right permissions to access the source files and write them into sparkify_app
tables that generates the songplays
datamart table. Contact the owners or your team leader for more information.
Source files and owners
File or table | Description | Directory | Owner |
---|---|---|---|
YYYY-MM-DD-events.json | User events. | mnt/data/log_data/YYYY/11 | Person 1 |
<random_tag>.json | Song data. | mnt/data/song_data/a | Person 2 |
songplays |
Datamart for recomendation system. | sparkify_app.songplays |
Person 3 |
artists |
Dimension table for artists. | sparkify_app.artists |
Person 1 |
songs |
Dimension table for songs. | sparkify_app.songs |
Person 1 |
time |
Dimension table for streaming start time for a given song. | sparkify_app.time |
Person 2 |
users |
Dimension table for users. | sparkify_app.users |
Person 3 |
To run this project first you need to install the Docker Engine for your operational system and Docker Compose.
After installing and configuring the Docker tools, download this repository and create a folder named postgres
that will store all sparkify_postgres
service data. To build the proper images and run the services, just execute the following command inside this repository:
docker-compose up
If the service runs successfully you should see something like this:
...
sparkify_python | 28/30 files processed.
sparkify_python | 29/30 files processed.
sparkify_python | 30/30 files processed.
sparkify_python exited with code 0
You can also check the job by following these steps:
-
Open your browser and access
localhost:16543
:- Enter with the following credentials to authenticate:
- e-mail:
[email protected]
- password:
sp4rk1fy
- e-mail:
- Enter with the following credentials to authenticate:
-
After you log in, click on the
Servers
option at the upper corner on the left:- You will be asked to enter with the PostgreSQL credentials:
- User:
sparkifypsql
- Password:
p4ssw0rd
- User:
- You will be asked to enter with the PostgreSQL credentials:
-
Under the
Query Editor
, run the following query:SELECT * FROM sparkify_app.songplays WHERE song_id is NOT NULL and artist_id is NOT NULL;
The following image represents the microservice architecture for this project:
Where:
sparkify_python
: runs all Python scripts and stores raw data.sparkify_postgres
: runs Postgre and stores the database.sparkify_pgadmin
: runs the pgAdmin tool to monitor thesparkify_postgres
service.
- Dbeaver - Database tool.
- Docker Compose - Tool to run multi-container applications.
- Docker Engine - Container engine.
- pandas - Data analysis and data wrangling tool.
- pgAdmin - PostgreSQL tool.
- psycopg2 - Database adapter for Python.
- PostgreSQL - Reletional database management system.
- @lkellermann - Idea & Initial work