This project contains details about creating and modeling the Marketing database on AWS Redshift.
Sparkify is becoming popular among the population of the Oregon region. Because of the large increase in users on our platform, we also had to grow our music library. This led to huge growth in the `songplays` data mart table, which in previous years was hosted on a localhost instance.
Our main goal with this project is to adapt the `songplays` data mart from our localhost instance to perform well on an AWS Redshift cluster. To carry out this migration we developed a simple package named `aws` to help our team create and access Redshift clusters on AWS.
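The `aws` package's API is not documented in this README, but provisioning a cluster through boto3 typically looks like the sketch below. This is only an illustration of the kind of call such a package would wrap; every identifier and credential here is a placeholder, not a value used by this project.

```python
import boto3

# Placeholder values for illustration only; the aws package in this
# repository may expose a different interface around calls like this.
redshift = boto3.client("redshift", region_name="us-west-2")

response = redshift.create_cluster(
    ClusterIdentifier="sparkify-redshift",
    ClusterType="multi-node",
    NodeType="dc2.large",
    NumberOfNodes=4,
    DBName="sparkifydw",
    MasterUsername="example_user",          # placeholder credential
    MasterUserPassword="Example_passw0rd",  # placeholder credential
    IamRoles=["arn:aws:iam::123456789012:role/AmazonS3ReadOnlyAccess"],
)
print(response["Cluster"]["ClusterStatus"])
```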
The `songplays` table is an analytical table that contains information about the users of our platform and the songs they listened to: when and where they listened, and what kind of device they used to access our platform. This table is useful for getting insights for marketing campaigns.
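As an example of the kind of marketing insight this table enables, the hedged sketch below ranks the locations with the most listening activity. It assumes `songplays` has a `location` column (a common schema for this kind of table, not confirmed by this README) and uses placeholder connection values.

```python
import psycopg2

# Assumes a `location` column on songplays; placeholder connection values.
conn = psycopg2.connect(
    host="<cluster-endpoint>",
    dbname="sparkifydw",
    user="<db-user>",
    password="<db-password>",
    port=5439,
)
cur = conn.cursor()
cur.execute("""
    SELECT location, COUNT(*) AS plays
    FROM songplays
    GROUP BY location
    ORDER BY plays DESC
    LIMIT 10;
""")
for location, plays in cur.fetchall():
    print(location, plays)
conn.close()
```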
Before running this project you need to fill in a `dwh.cfg` file containing information about the Redshift cluster that hosts the `sparkifydw` database, along with your database credentials. You will also need the ARN of an IAM role that can read the source files from an S3 bucket. Below is an example of what the contents of this file should look like:
```
[CLUSTER]
HOST=sparkify-redshift.************.us-west-2.redshift.amazonaws.com
DB_NAME=sparkifydw
DB_USER=s*************r
DB_PASSWORD=S*************r
DB_PORT=5439

[IAM_ROLE]
ARN=arn:aws:iam::************:role/AmazonS3ReadOnlyAccess
```
The `HOST` parameter is the endpoint of the cluster where the database `DB_NAME` is stored. The `DB_USER` and `DB_PASSWORD` parameters are the database username and its password, respectively.
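For reference, a script can read these values with Python's standard `configparser` module; the sketch below shows the idea (the exact code in `etl.py` may differ).

```python
import configparser

# Load the cluster endpoint, credentials, and IAM role from dwh.cfg.
config = configparser.ConfigParser()
config.read("dwh.cfg")

host = config["CLUSTER"]["HOST"]
db_name = config["CLUSTER"]["DB_NAME"]
db_user = config["CLUSTER"]["DB_USER"]
db_password = config["CLUSTER"]["DB_PASSWORD"]
db_port = config["CLUSTER"]["DB_PORT"]
role_arn = config["IAM_ROLE"]["ARN"]
```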
To run the scripts in this repository you will need at least Python 3.7 and the packages listed in the `requirements.txt` file, which you can install with the following command:
```
pip install -r requirements.txt
```
- We strongly recommend running the scripts in a separate Python virtual environment. Click here to learn more about virtual environments.
- Remember to run the command above with the terminal open in the same directory as the `requirements.txt` file.
To install this project you just need to download it.
To execute this project for the first time, run the `etl.py` script with the following command:
```
python etl.py
```
The `etl.py` script reads `dwh.cfg` to connect to the database on the provided Redshift cluster and runs the queries to create the needed tables, copy data into the staging tables, and insert the data into the analytical tables. These queries are provided in the `sql_queries.py` file.
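In outline, the flow of `etl.py` likely resembles the sketch below. The query-list names imported from `sql_queries.py` are assumptions based on the description above, not confirmed names.

```python
import configparser

import psycopg2

# Assumed names for the query lists exported by sql_queries.py.
from sql_queries import (create_table_queries, copy_table_queries,
                         insert_table_queries)

config = configparser.ConfigParser()
config.read("dwh.cfg")

conn = psycopg2.connect(
    host=config["CLUSTER"]["HOST"],
    dbname=config["CLUSTER"]["DB_NAME"],
    user=config["CLUSTER"]["DB_USER"],
    password=config["CLUSTER"]["DB_PASSWORD"],
    port=config["CLUSTER"]["DB_PORT"],
)
cur = conn.cursor()

# Create all tables, load the staging tables from S3 via COPY,
# then insert into the dimension and fact tables.
for query in create_table_queries + copy_table_queries + insert_table_queries:
    cur.execute(query)
    conn.commit()

conn.close()
```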
If the script runs successfully, you should see the row counts for each table at the end, as well as a file named with the pattern `etl_YYYYMMDDHHMMSS.results` containing the results.
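Producing that output could be as simple as the function below, which takes an open cursor like the one in the previous sketch; the table list and the internal format of the `.results` file are assumptions.

```python
from datetime import datetime

def write_results(cur, tables):
    """Print row counts per table and write them to a timestamped .results file."""
    filename = datetime.now().strftime("etl_%Y%m%d%H%M%S.results")
    with open(filename, "w") as out:
        for table in tables:
            cur.execute(f"SELECT COUNT(*) FROM {table};")
            count = cur.fetchone()[0]
            print(f"{table}: {count} rows")
            out.write(f"{table}: {count} rows\n")
```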
The `songplays` table is generated as illustrated by the ERD below:
The table below presents a brief description of the tables in this project, along with their respective sources and owners.
Source files and owners
Table | Description | Source | Owner |
---|---|---|---|
`staging_events` | Staging fact table for user events. | s3://udacity-dend/log_data | Person 1 |
`staging_songs` | Staging table for song data. | s3://udacity-dend/song_data | Person 2 |
`artists` | Dimension table of artists. | `staging_songs` | Person 3 |
`songs` | Dimension table for songs. | `staging_songs` | Person 1 |
`time` | Dimension table for time/calendar. | `staging_events` | Person 1 |
`users` | Dimension table for users. | `staging_events` | Person 3 |
`songplays` | Analytical table for recommendation systems. | All of the above. | Person 2 |
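For context on the staging loads listed above, a Redshift `COPY` from those S3 paths using the IAM role from `dwh.cfg` typically looks like the templates below, written in the style of `sql_queries.py`. The format options (in particular the JSONPaths file for the event logs) are assumptions; the actual statements live in `sql_queries.py`.

```python
# Hypothetical COPY templates; `{arn}` would be filled in from dwh.cfg.
staging_events_copy = """
    COPY staging_events
    FROM 's3://udacity-dend/log_data'
    IAM_ROLE '{arn}'
    REGION 'us-west-2'
    FORMAT AS JSON 's3://udacity-dend/log_json_path.json';
"""

staging_songs_copy = """
    COPY staging_songs
    FROM 's3://udacity-dend/song_data'
    IAM_ROLE '{arn}'
    REGION 'us-west-2'
    FORMAT AS JSON 'auto';
"""
```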
- @kellermann92 - Idea & Initial work.