
# Crime Analytics (New York City) - Data Engineering Project

This repo contains the final project for the Data Engineering Zoomcamp course.

## Introduction

Crime is a major contributor to social and economic unrest in metropolitan areas. This project develops a workflow to ingest and process urban crime data, specifically for New York City, for downstream analysis.

## Dataset

The project uses the NYPD Complaint Data Historic dataset, which can be obtained conveniently through the Socrata API and is updated quarterly.

Each row denotes a crime occurrence. Details include the time, location, and descriptive categorizations of the crime event.
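
For a quick look at the raw data, the dataset can be pulled directly with `sodapy`. The sketch below is illustrative, not project code: it assumes the dataset identifier `qgea-i56i` and the column `cmplnt_fr_dt` (complaint start date) from the public NYC Open Data schema.

```python
# Minimal example: fetch NYPD complaint records with the sodapy client.
# Assumes the dataset identifier "qgea-i56i" (NYPD Complaint Data Historic
# on NYC Open Data). An app token is optional but raises rate limits.
from sodapy import Socrata

client = Socrata("data.cityofnewyork.us", app_token=None)

# Pull a small page of records reported after 2015, matching the DAG's filter.
results = client.get(
    "qgea-i56i",
    where="cmplnt_fr_dt > '2015-01-01T00:00:00'",
    limit=1000,
)
print(len(results), "records fetched")
```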

## Tools

The following components were used to implement the solution:

- Data Ingestion: Socrata API (used via the `sodapy` Python client)
- Infrastructure as Code: Terraform
- Workflow Management: Airflow
- Data Lake: Google Cloud Storage
- Data Warehouse: Google BigQuery
- Data Transformation: Spark via Google Dataproc
- Reporting: Google Data Studio

## Architecture

## Steps to Reproduce

### Local setup

### Cloud setup

- In GCP, create a service account with the following roles:

  - BigQuery Admin
  - Storage Admin
  - Storage Object Admin
  - Dataproc Admin

- Download the service account key file and save it as `$HOME/.google/credentials/google_credentials_project.json`.

- Ensure that the following APIs are enabled:

  - Compute Engine API
  - Cloud Dataproc API
  - Cloud Dataproc Control API
  - BigQuery API
  - BigQuery Storage API
  - Identity and Access Management (IAM) API
  - IAM Service Account Credentials API
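
Before moving on, it can be worth verifying that the key file authenticates correctly. The snippet below is a minimal sanity check using the `google-cloud-storage` client; it is a convenience, not part of the project itself.

```python
# Quick sanity check that the service account key file is valid and can
# reach GCP. The path matches the credentials step above.
import os

from google.cloud import storage

key_path = os.path.expanduser(
    "~/.google/credentials/google_credentials_project.json"
)
client = storage.Client.from_service_account_json(key_path)

# Listing buckets exercises both authentication and the Storage API.
for bucket in client.list_buckets():
    print(bucket.name)
```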

### Initializing Infrastructure (Terraform)

- Perform the following to set up the required cloud infrastructure:

```bash
cd terraform
terraform init
terraform plan
terraform apply

cd ..
```

### Data Ingestion

- Set up Airflow to perform data ingestion:

```bash
cd airflow

docker-compose build
docker-compose up airflow-init
docker-compose up -d
```
- Go to the Airflow UI at `localhost:8080` and enable the `data_crime_ingestion_dag`.
- This DAG ingests the crime records reported after 2015, uploads them to the data lake, and loads them into the data warehouse; a hypothetical skeleton is sketched below.
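
For orientation only, an ingestion DAG of this shape might look like the following. The operator choice, task names, bucket, and table identifiers are placeholders, not the repo's actual code.

```python
# Hypothetical skeleton of the ingestion DAG: fetch from Socrata, stage in
# GCS, then load into BigQuery. All names and paths are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)


def fetch_and_upload_to_gcs():
    """Pull records via sodapy (see the Dataset section) and write them
    to the data-lake bucket with the google-cloud-storage client."""
    ...


with DAG(
    dag_id="data_crime_ingestion_dag",
    start_date=datetime(2015, 1, 1),
    schedule_interval="0 0 1 */3 *",  # roughly quarterly, matching the dataset refresh
    catchup=False,
) as dag:
    to_gcs = PythonOperator(
        task_id="fetch_and_upload_to_gcs",
        python_callable=fetch_and_upload_to_gcs,
    )
    to_bq = GCSToBigQueryOperator(
        task_id="load_to_bigquery",
        bucket="crime-data-lake",                 # placeholder bucket
        source_objects=["raw/crime_data.parquet"],
        destination_project_dataset_table="crime_dataset.crime_raw",
        source_format="PARQUET",
        write_disposition="WRITE_APPEND",
    )

    to_gcs >> to_bq
```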

### Data Transformation

- Enable and run the `data_crime_process_dag`.
- This creates a Dataproc cluster and submits a Spark job to perform the required transformations on the data; an illustrative job is sketched after this list.
- The transformed data is saved as a BigQuery table.
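
The sketch below shows the kind of PySpark job Dataproc could run here, reading the raw table from BigQuery with the spark-bigquery connector and writing an aggregate back. The table names, temporary bucket, and the aggregation itself are illustrative assumptions, not the project's actual job.

```python
# Illustrative PySpark job: read the raw table from BigQuery, aggregate,
# and write the result back. Table names and the temporary GCS bucket are
# placeholders; column names follow the public dataset schema.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("crime-transform").getOrCreate()

raw = (
    spark.read.format("bigquery")
    .option("table", "crime_dataset.crime_raw")  # placeholder table
    .load()
)

# Example transformation: complaints per borough per year.
by_borough = (
    raw.withColumn("year", F.year(F.to_date("cmplnt_fr_dt")))
    .groupBy("boro_nm", "year")
    .count()
)

(
    by_borough.write.format("bigquery")
    .option("table", "crime_dataset.crime_by_borough")  # placeholder table
    .option("temporaryGcsBucket", "crime-temp-bucket")  # required by the connector
    .mode("overwrite")
    .save()
)
```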

## Dashboard

The dashboard can also be viewed at this link.