AD suggested changes to ReadMe for senior role #4

Open · wants to merge 13 commits into `main`

61 changes to `README.md`: 39 additions and 22 deletions

Welcome to the Data Engineering code interview! This small data challenge is designed to test your skills in Python, SQL, Git, and geospatial data processing. The tasks go from easy to difficult; there's no pressure to finish them all, so try your best and get as far as you can!

To start this challenge, create a new **private** repo under your GitHub username. We would like you to include all the code, notes, visualizations, and data inside the repo. You will have **48 hours** to complete this data challenge. Once you are done, please provide read access to your repo by inviting `@SashaWeinstein`, `@mbh329`, `@td928`, and `@AmandaDoyle`.

> ⚠️ Note: **the repo has to be <ins>private</ins>, otherwise you will be automatically <ins>disqualified</ins>**. Also, we will check your commit timestamps and only count the first 48 hours of coding activity.

Your code interview will be evaluated based on your repo, so make sure all files are committed to it.

We love the NYC 311 service and the open data products that come with it. In this challenge, you will use **[311 Service Requests from 2010 to Present](https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9)** on NYC Open Data, write an ETL pipeline, and produce some data insight.

This challenge has two parts: a Python part with four steps and a Docker/SQL part with two steps. The first part downloads files and processes them; the second part uses those same files again. We expect the logic for each step in the first part to be written in Python.

### Python Task 1: Data Download

First, you need to download 311 service request records. Write a Python script that pulls data from the NYC Open Data API based on two filters. The first filter is the responding agency. The second filter is an integer corresponding to the number of days before the current date (i.e. passing "7" means getting records from the past week). The script should download the matching service requests into Python memory and cache them to a `.csv`.

For example, if a user wanted to get all service request records created in the past five days where DSNY is the responding agency, they would pass `DSNY` and `5` as the parameters.

For this task, we ask that you download all 311 service requests filed in the **last seven days** where **HPD** is the responding agency. Save the data as a csv named `raw.csv` in a folder called `data`.

*Bonus points* if you 1) allow different parameters to be passed to your script from the command line and/or 2) write a bash script that takes command-line args and calls the python code. If you do either bonus task, demonstrate that your code is dynamic by downloading a series of csv files, each with a different timeframe and responding agency, and programmatically saving them with a naming convention of your choice.
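
As a rough sketch of what this task could look like (not a prescribed implementation), the snippet below pulls from the Socrata endpoint for the 311 dataset using `requests` and `pandas`; the function name, defaults, and output path are illustrative assumptions:

```python
"""Illustrative sketch only: download 311 records filtered by agency and a day window."""
import argparse
from datetime import datetime, timedelta
from pathlib import Path

import pandas as pd
import requests

# Socrata endpoint for "311 Service Requests from 2010 to Present" (dataset erm2-nwe9)
API_URL = "https://data.cityofnewyork.us/resource/erm2-nwe9.json"


def download_311(agency: str, days: int, out_path: Path = Path("data/raw.csv")) -> pd.DataFrame:
    """Pull records created in the last `days` days where `agency` responded, and cache to csv."""
    since = (datetime.now() - timedelta(days=days)).strftime("%Y-%m-%dT%H:%M:%S")
    params = {
        "$where": f"created_date >= '{since}' AND agency = '{agency}'",
        "$limit": 500000,  # raise the default 1,000-row page size
    }
    resp = requests.get(API_URL, params=params, timeout=60)
    resp.raise_for_status()
    df = pd.DataFrame(resp.json())
    out_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(out_path, index=False)
    return df


if __name__ == "__main__":
    # Bonus: accept the agency and day window from the command line.
    parser = argparse.ArgumentParser()
    parser.add_argument("--agency", default="HPD")
    parser.add_argument("--days", type=int, default=7)
    args = parser.parse_args()
    download_311(args.agency, args.days)
```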

### Python Task 2: Data Aggregation

Write a process to produce a time series table based on the `data/raw.csv` file we created in **Python Task 1** that has the following fields:

- `created_date_time`/`created_date`: the timestamp of request creation by date and hour, OR just by date
- `complaint_type`: the type of the complaint


> **Review comment:** Having them pass the complaint type as an argument and having it be optional tests something that Task 1 doesn't test. Optional args require a different implementation on both the arg-parse side and the data-processing side.

- `count`: the count of service requests by `complaint_type` and by `created_date_time`/`created_date`

Store this table as a csv under the `data` folder with a file name of your choice.

*Bonus points* if you can:
- Control the choice of date+hour vs. date only, as well as the complaint type, from the command line
- Make the complaint-type breakdown optional and control this behavior from the command line as well
- Store multiple tables that pull from the different files you cached in Python Task 1
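
One possible shape for the aggregation step, assuming the column names in the raw extract and using `pandas` (the output file name and the optional hour flag are illustrative assumptions):

```python
"""Illustrative sketch only: aggregate data/raw.csv into a time series table."""
import pandas as pd


def aggregate_requests(raw_csv: str = "data/raw.csv",
                       out_csv: str = "data/requests_by_hour.csv",
                       by_hour: bool = True) -> pd.DataFrame:
    """Count requests per complaint_type per created date (optionally date + hour)."""
    df = pd.read_csv(raw_csv, parse_dates=["created_date"])
    # Truncate the timestamp to the hour, or to the day, depending on the flag.
    key = df["created_date"].dt.floor("h") if by_hour else df["created_date"].dt.date
    out = (
        df.assign(created_date_time=key)
          .groupby(["created_date_time", "complaint_type"])
          .size()
          .reset_index(name="count")
    )
    out.to_csv(out_csv, index=False)
    return out
```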

### Python Task 3: Data Visualization

Create a multi-line plot to show the total service request counts by `created_date_time` for each `complaint_type`. Make sure you store the image of the plot in the `data` folder as a `.png` file.


> **Review comment:** I think having them produce multiple plots with the multiple .csv's they cached is a good test of writing reusable data-viz code that sets axes/titles programmatically based on what it's passed.


> **Review comment:** I agree with Sasha, especially after the work we've been doing with the QAQC app. It's great having the ability to communicate effective data viz in succinct code, especially when it comes to the little formatting issues that inevitably come up.


*Bonus points* if your code is reusable across different input tables. Show us that it is by saving multiple plots corresponding to the different tables you cached in Python Task 2.
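
A minimal sketch of the plotting step, assuming the aggregated table produced in the sketch above (file names, figure size, and labels are illustrative):

```python
"""Illustrative sketch only: multi-line plot of request counts per complaint type."""
import matplotlib.pyplot as plt
import pandas as pd


def plot_counts(agg_csv: str = "data/requests_by_hour.csv",
                out_png: str = "data/requests_by_hour.png") -> None:
    df = pd.read_csv(agg_csv, parse_dates=["created_date_time"])
    # Pivot so each complaint_type becomes its own line.
    wide = df.pivot_table(index="created_date_time", columns="complaint_type",
                          values="count", fill_value=0)
    ax = wide.plot(figsize=(12, 6))
    ax.set_xlabel("Created date/hour")
    ax.set_ylabel("Service request count")
    ax.set_title("311 service requests by complaint type")
    plt.tight_layout()
    plt.savefig(out_png)
```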

### Python Task 4: Spatial data processing

At Data Engineering, we enhance datasets with geospatial attributes, such as point locations and administrative boundaries. To help us better understand the data from **Python Task 1**, we would like you to join the initial raw data to an NYC administrative boundary. Then create a choropleth map of the 7-day total count of complaints where `HPD` is the responding agency, for a specific `complaint_type` of your choice.

You need to find a **second geospatial dataset and do a geospatial operation** to assign each call to an area. You can't make a choropleth of calls by borough, zip code, community district, or city, as that information is already assigned to each record.

Depending on how you generate the map, you can store it as a `.png` or `.html` under the `data` folder. Make sure the map includes a legend and a title so that it is self-explanatory.
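
One way this could look with `geopandas` (the boundary file, its layout, and the example complaint type are assumptions; any administrative boundary not already present on the 311 records works):

```python
"""Illustrative sketch only: spatially join 311 calls to a boundary layer and map the counts."""
import geopandas as gpd
import matplotlib.pyplot as plt
import pandas as pd


def choropleth(raw_csv: str = "data/raw.csv",
               boundary_file: str = "data/nta_2020.shp",  # placeholder: any admin boundary layer
               complaint_type: str = "HEAT/HOT WATER",    # placeholder complaint type
               out_png: str = "data/choropleth.png") -> None:
    df = pd.read_csv(raw_csv)
    df = df[df["complaint_type"] == complaint_type].dropna(subset=["longitude", "latitude"])
    # Build point geometries from the lat/lon columns in the 311 extract.
    points = gpd.GeoDataFrame(
        df, geometry=gpd.points_from_xy(df["longitude"], df["latitude"]), crs="EPSG:4326"
    )
    areas = gpd.read_file(boundary_file).to_crs("EPSG:4326")
    # Spatial join: assign each call to the polygon that contains it.
    joined = gpd.sjoin(points, areas, how="inner", predicate="within")
    areas["count"] = joined.groupby("index_right").size().reindex(areas.index).fillna(0)
    ax = areas.plot(column="count", legend=True, figsize=(10, 10))
    ax.set_title(f"7-day {complaint_type} complaints by area")
    ax.set_axis_off()
    plt.savefig(out_png)
```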

### SQL/Docker Task 1: Build container and load data

We ❤️ SQL and Docker! At Data Engineering, we work with database containers a lot and we write a lot of fast, simple ETL pipelines in SQL. In this task, you will:

- Set up a PostGIS container from an image. [Here](https://registry.hub.docker.com/r/postgis/postgis/) is the one we use.
- Load `data/raw.csv` into the database and name the table `sample_311`. Make sure this process is captured in a script.
- Perform the same aggregation as in **Python Task 2** in SQL and store the results in a table (same name as the corresponding csv file).


> **Review comment:** Seems good to me; you've read my thoughts on the file name and on having the interviewee find the image themselves.
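
A minimal sketch of the load-and-aggregate step, assuming a local PostGIS container started from the image above and using `pandas` with SQLAlchemy (connection details, table names, and the SQL column cast are assumptions):

```python
"""Illustrative sketch only: load raw.csv into a PostGIS container and aggregate in SQL."""
# Assumes a container started roughly like:
#   docker run --name interview-db -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgis/postgis
import pandas as pd
from sqlalchemy import create_engine, text

ENGINE = create_engine("postgresql://postgres:postgres@localhost:5432/postgres")


def load_and_aggregate(raw_csv: str = "data/raw.csv") -> None:
    # Load the raw extract into a table named sample_311.
    pd.read_csv(raw_csv).to_sql("sample_311", ENGINE, if_exists="replace", index=False)
    # Reproduce the Python Task 2 aggregation in SQL.
    with ENGINE.begin() as conn:
        conn.execute(text("DROP TABLE IF EXISTS requests_by_hour;"))
        conn.execute(text("""
            CREATE TABLE requests_by_hour AS
            SELECT date_trunc('hour', created_date::timestamp) AS created_date_time,
                   complaint_type,
                   COUNT(*) AS count
            FROM sample_311
            GROUP BY 1, 2;
        """))
```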


### SQL/Docker Task 2: Spatial SQL

A lot of popular databases have geospatial extensions, which make spatial data processing in SQL much easier. In this task you will:

- Load the administrative boundary data you used in **Python Task 4** into the database as a spatial table
- Do a spatial join in SQL between `sample_311` and the administrative boundary table, and add the administrative boundary ID as a column to `sample_311`
- Perform the same aggregation as in **Python Task 4** and store the result in a table

*Bonus points* if you can:
- Export the table with the administrative boundary geometry and complaint count into a shapefile under the `data` folder
- Push an image with your setup and code to Docker Hub and give the Data Engineering team instructions to pull it down and run the code. This bonus will be graded on how easily we can access your image and make it work on our machines. If you don't have time to do that, we would still love to hear about your plan for how you would go about that task.
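
A sketch of the spatial pieces, assuming the same PostGIS container and `geopandas` for the load (the boundary file, its `ntacode` ID column, and the lat/lon casts are assumptions):

```python
"""Illustrative sketch only: load a boundary layer into PostGIS and spatially join it to sample_311."""
import geopandas as gpd
from sqlalchemy import create_engine, text

ENGINE = create_engine("postgresql://postgres:postgres@localhost:5432/postgres")


def spatial_join(boundary_file: str = "data/nta_2020.shp") -> None:
    # Write the boundary polygons as a spatial table (requires the PostGIS extension and geoalchemy2).
    gpd.read_file(boundary_file).to_crs("EPSG:4326").to_postgis("boundaries", ENGINE, if_exists="replace")
    with ENGINE.begin() as conn:
        # Tag each 311 call with the ID of the polygon that contains its point location.
        conn.execute(text("""
            ALTER TABLE sample_311 ADD COLUMN IF NOT EXISTS boundary_id text;
            UPDATE sample_311 s
            SET boundary_id = b.ntacode
            FROM boundaries b
            WHERE ST_Contains(
                b.geometry,
                ST_SetSRID(ST_MakePoint(s.longitude::float, s.latitude::float), 4326)
            );
        """))
```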

## Resources
