AD suggested changes to ReadMe for senior role #4
base: main
@@ -4,7 +4,7 @@
Welcome to the Data Engineering code interview! This small data challenge is designed to test your skills in python, sql, git, and geospatial data processing. The challenge goes from easy to difficult, and there's no pressure to finish all the tasks, so try your best and get as far as you can!
To start this challenge, create a new **private** repo under your github username. We would like you to include all the code, notes, visualizations, and data inside of the repo. You will have **48 hours** to complete this data challenge. Once you are done, please provide read access to your repo by inviting `@SashaWeinstein`, `@mbh329`, and `@AmandaDoyle`

To start this challenge, create a new **private** repo under your github username. We would like you to include all the code, notes, visualizations, and data inside of the repo. You will have **48 hours** to complete this data challenge. Once you are done, please provide read access to your repo by inviting `@SashaWeinstein`, `@mbh329`, `@td928`, and `@AmandaDoyle`
> ⚠️ Note: **the repo has to be <ins>private</ins>, otherwise you will be automatically <ins>disqualified</ins>**. We will also check your commit timestamps and only count the first 48 hours of coding activity.
@@ -44,49 +44,66 @@ Your code interview will be evaluated based on your repo, so make sure all files
We love the NYC 311 service and the open data products that come with it. In this challenge, you will use **[311 Service Requests from 2010 to Present](https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9)** on NYC Open Data, write an ETL pipeline, and produce some data insight.
### Task 1: Data Download
This challenge has two parts: a python part with 4 steps and a docker/SQL part with 2 steps. The first part will download files and process them, and the second part will use those same files again. We expect the logic to perform each step in the first half of this challenge to be written in python.
Write a python script/notebook to download all service request records created in the **last week** (7 days) that have **HPD** as the responding agency, and store the data in a csv named `raw.csv` in a folder called `data`.
### Python Task 1: Data Download
### Task 2: Data Aggregation
First you need to download 311 service request records. Write a python script that pulls data from the NYC Open Data API based on two filters. The first filter is the responding agency. The second filter is an integer corresponding to the number of days before the current date (i.e. passing "7" means getting records from the past week). The script should download the associated service requests into python memory and cache them to a .csv.
Create a time series table based on the `data/raw.csv` file we created from **Task 1** that has the following fields
For example, if a user wanted to get all service request records created in the past five days where DSNY is the responding agency, they would pass `DSNY` and `5` as the parameters.
- `created_date_hour`: the timestamp of request creation by date and hour
For this task, we ask that you download all 311 service requests filed in the **last seven days** where **HPD** is the responding agency. Save the data as a csv named `raw.csv` in a folder called `data`.
*Bonus points* if you 1) allow different parameters to be passed to your script from the command line and/or 2) write a bash script to take command line args and call the python code. If you do any bonus task, demonstrate that your code is dynamic by downloading a series of csv files, each with a different timeframe and responding agency, and programmatically saving them with a naming convention of your choice.
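As a rough sketch of what such a script might look like, here is one way to hit the dataset's Socrata (SODA) CSV endpoint with `requests` and `pandas`. The `created_date` and `agency` field names come from the `erm2-nwe9` dataset; the row limit, defaults, script name, and file paths are illustrative assumptions, not a prescribed implementation.

```python
"""Hypothetical sketch: download 311 requests filtered by agency and lookback window."""
import argparse
import io
from datetime import datetime, timedelta
from pathlib import Path

import pandas as pd
import requests

API_URL = "https://data.cityofnewyork.us/resource/erm2-nwe9.csv"


def download_311(agency: str, days: int, out_path: Path) -> pd.DataFrame:
    """Pull records created in the last `days` days for `agency` and cache them to csv."""
    since = (datetime.now() - timedelta(days=days)).strftime("%Y-%m-%dT%H:%M:%S")
    params = {
        "$where": f"created_date >= '{since}' AND agency = '{agency}'",
        "$limit": 500_000,  # assumed to be enough rows for a one-week pull
    }
    resp = requests.get(API_URL, params=params, timeout=120)
    resp.raise_for_status()
    df = pd.read_csv(io.StringIO(resp.text))
    out_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(out_path, index=False)
    return df


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Download 311 service requests")
    parser.add_argument("--agency", default="HPD")
    parser.add_argument("--days", type=int, default=7)
    parser.add_argument("--out", default="data/raw.csv")
    args = parser.parse_args()
    download_311(args.agency, args.days, Path(args.out))
```

With this shape, a call like `python download.py --agency DSNY --days 5` (script name assumed) would already cover the command-line part of the bonus.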
### Python Task 2: Data Aggregation
Write a process to produce a time series table based on the `data/raw.csv` file we created in **Python Task 1** that has the following fields:
- `created_date_time`/`created_date`: the timestamp of request creation by date and hour OR just date
- `complaint_type`: the type of the complaint
- `count`: the count of service requests by `complaint_type` by `created_date_hour`
Store this table in a csv under the `data` folder with a csv file name of your choice.
### Task 3: Data Visualization
*Bonus points* if you can
- Control the choice of date+hour or just date, and control the complaint type, from the command line
- Make the complaint type breakdown optional and control this behavior from the command line as well
- Store multiple tables that pull from the different files you cached in Python Task 1
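For illustration only, a pandas version of the basic aggregation might look like the following; the output file name and the choice of `created_date_time` as the hourly bucket are placeholders.

```python
# Hypothetical sketch: aggregate data/raw.csv into complaint counts per hour.
import pandas as pd

raw = pd.read_csv("data/raw.csv", parse_dates=["created_date"])

# Truncate the timestamp to the hour; use .dt.date instead for a date-only table.
raw["created_date_time"] = raw["created_date"].dt.floor("h")

counts = (
    raw.groupby(["created_date_time", "complaint_type"])
       .size()
       .reset_index(name="count")
)
counts.to_csv("data/hourly_complaint_counts.csv", index=False)  # placeholder name
```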
### Python Task 3: Data Visualization
Create a multi-line plot to show the total service request counts by `created_date_time` for each `complaint_type`. Make sure you store the image of the plot in the `data` folder as a `.png` file.
> **Reviewer comment:** I think having them produce multiple plots with the multiple .csv's they cached is a good test of writing reusable data viz code that sets axes/titles programmatically based on what it's passed.
>
> **Reviewer comment:** I agree with Sasha, especially after the work we've been doing with the QAQC app. It's great having the ability to communicate effective data viz in succinct code, especially when it comes to the little formatting issues that inevitably come up.
Create a multi-line plot to show the total service request counts by `created_date_hour` for each `complaint_type`. Make sure you store the image of the plot in the `data` folder as a `.png` file.
*Bonus points* if your code is reusable w/r/t different input tables. Show us that it is by saving multiple plots corresponding to the different tables you've cached in Python Task 2.
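One possible shape for a reusable plotting helper, using matplotlib through the pandas plotting API; the table path, column names, and output file name are assumptions carried over from the earlier sketches.

```python
# Hypothetical sketch: one reusable function that can plot any of the cached tables.
import matplotlib.pyplot as plt
import pandas as pd


def plot_counts(table_path: str, out_png: str, time_col: str = "created_date_time") -> None:
    df = pd.read_csv(table_path, parse_dates=[time_col])
    # Wide format: one column per complaint type, indexed by time bucket.
    wide = df.pivot_table(index=time_col, columns="complaint_type",
                          values="count", aggfunc="sum").fillna(0)
    ax = wide.plot(figsize=(12, 6))
    ax.set_xlabel(time_col)
    ax.set_ylabel("service request count")
    ax.set_title(f"311 service requests by complaint type ({table_path})")
    ax.legend(fontsize="small", ncol=2)
    plt.tight_layout()
    plt.savefig(out_png, dpi=150)
    plt.close()


plot_counts("data/hourly_complaint_counts.csv", "data/hourly_complaint_counts.png")
```

Because the axes and title are set from the arguments, the same function can be pointed at each table cached in the previous step.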
### Task 4: Spatial data processing
### Python Task 4: Spatial data processing
At Data Engineering, we enhance datasets with geospatial attributes, such as point locations and administrative boundaries. To help us better understand the data from **Task 1**, we would like you to join the initial raw data to the **[2020 NTA (Neighborhood Tabulation Area) boundaries](https://www1.nyc.gov/site/planning/data-maps/open-data/census-download-metadata.page)** and create a choropleth map of 7 day total count by NTA of a specific `complaint_type` of your choice.
At Data Engineering, we enhance datasets with geospatial attributes, such as point locations and administrative boundaries. To help us better understand the data from **Python Task 1**, we would like you to join the initial raw data to an NYC administrative boundary. Then create a choropleth map of the 7 day total count of complaints where `HPD` is the responding agency for a specific `complaint_type` of your choice.
Depending on how you generate the map, you can store the map as a `.png` or `.html` under the `data` folder.
You need to find a **second geospatial dataset and do a geospatial operation** to assign each call to an area. You can't make a choropleth of calls by borough, zipcode, community district or city, as that information is already assigned to each record.
### Task 5: SQL
Depending on how you generate the map, you can store the map as a `.png` or `.html` under the `data` folder. Make sure the map includes a legend and title so that it is self-explanatory.
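A geopandas-based sketch of the join and map, assuming the raw csv still has its `latitude`/`longitude` columns and that you have downloaded some boundary file yourself; the boundary path and the complaint type are placeholders.

```python
# Hypothetical sketch: point-in-polygon join + choropleth with geopandas.
import geopandas as gpd
import pandas as pd

raw = pd.read_csv("data/raw.csv").dropna(subset=["latitude", "longitude"])
calls = gpd.GeoDataFrame(
    raw,
    geometry=gpd.points_from_xy(raw["longitude"], raw["latitude"]),
    crs="EPSG:4326",
)

# Placeholder path: whatever admin boundary file (e.g. NTAs) you downloaded.
areas = gpd.read_file("data/nta_2020.shp").to_crs("EPSG:4326")

# Assign each call to an area, then count one complaint type per area.
joined = gpd.sjoin(calls, areas, how="inner", predicate="within")
one_type = joined[joined["complaint_type"] == "HEAT/HOT WATER"]  # example complaint type
counts = one_type.groupby("index_right").size()

areas["complaint_count"] = counts.reindex(areas.index).fillna(0)
ax = areas.plot(column="complaint_count", cmap="OrRd", legend=True, figsize=(10, 10))
ax.set_title("HEAT/HOT WATER complaints, last 7 days")
ax.set_axis_off()
ax.figure.savefig("data/choropleth.png", dpi=150, bbox_inches="tight")
```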
We ❤️ SQL! At Data Engineering, we deal with databases a lot and we write a lot of fast and simple ETL pipelines using SQL. In this task, you will:
### SQL/Docker Task 1: Build container and load data
- Load the `data/raw.csv` into a database of your choice and name the table `sample_311`. Make sure this process is captured in a script.
- Perform the same aggregation in **Task 2** in SQL and store the results in a table (same name as the corresponding csv file).
We ❤️ SQL and docker! At Data Engineering, we work with database containers a lot and we write a lot of fast and simple ETL pipelines using SQL. In this task, you will:
> Note: Depending on your preference, you can use [Postgres](https://www.postgresql.org/), which is preferred; however, if you are familiar with [SQLite](https://docs.python.org/3/library/sqlite3.html) (much easier to set up and use), you can use that too.
- Set up a POSTGIS container using an image. [Here](https://registry.hub.docker.com/r/postgis/postgis/) is the one we use.
- Load the `data/raw.csv` into a database and name the table `sample_311`. Make sure this process is captured in a script.
- Perform the same aggregation in **Python Task 2** in SQL and store the results in a table (same name as the corresponding csv file).
> **Reviewer comment:** Seems good to me, you've read my thoughts on the file name and having the interviewee find the image themselves.
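For orientation, a minimal sketch of how the container setup and load could look, assuming the `postgis/postgis` image linked above, default `postgres`/`postgres` credentials, and sqlalchemy + psycopg2 on the python side; the output table and file names are placeholders.

```python
# Hypothetical sketch: load data/raw.csv into a PostGIS container and aggregate in SQL.
# Start the database first, for example:
#   docker run --name interview-db -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgis/postgis
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://postgres:postgres@localhost:5432/postgres")

# Load the raw 311 extract into a table named sample_311.
raw = pd.read_csv("data/raw.csv", parse_dates=["created_date"])
raw.to_sql("sample_311", engine, if_exists="replace", index=False)

# Reproduce the Python Task 2 aggregation in SQL.
aggregate_sql = text("""
    DROP TABLE IF EXISTS hourly_complaint_counts;
    CREATE TABLE hourly_complaint_counts AS
    SELECT date_trunc('hour', created_date) AS created_date_time,
           complaint_type,
           count(*) AS count
    FROM sample_311
    GROUP BY 1, 2;
""")
with engine.begin() as conn:
    conn.execute(aggregate_sql)
```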
### Task 6: Spatial SQL
### SQL/Docker Task 2: Spatial SQL
A lot of popular databases have geospatial extensions, which makes spatial data processing in SQL super easy to use. In this task you will:
- Load the NTA data to the database as a spatial table
- Do a spatial join in SQL between `sample_311` and the NTA table and add a `nta` column to `sample_311`
- Perform the same aggregation in **Task 4** and store the result in a table.
- **Bonus**: export the table with NTA geometry and complaint count into a shapefile under the `data` folder.
- Load the administrative boundary data you used in **Python Task 4** into the database as a spatial table
- Do a spatial join in SQL between `sample_311` and the administrative boundary and add the administrative boundary ID as a column to `sample_311`
- Perform the same aggregation in **Python Task 4** and store the result in a table.
> Note: At this point you might notice that spatial software is not as straightforward as a simple `pip install`. If you are stuck with database installation or package installation, you might consider adopting **[docker](https://www.docker.com/)**. Docker has a steep learning curve, so don't waste too much time on it.
*Bonus points* if you can
- Export the table with the administrative boundary geometry and complaint count into a shapefile under the `data` folder.
- Push an image with your setup and code to Docker Hub and give the Data Engineering team instructions to pull it down and run the code. This bonus section will be graded on how easily we can access your image and make it work on our machines. If you don't have time to do that, we would still love to hear about your plan of how you would go about that task.
## Resources
> **Reviewer comment:** Having them pass the complaint type as an argument and having it be optional tests something that Task 1 doesn't test. Optional args require a different implementation on both the argparse side and the data processing side.
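To make that point concrete, an optional complaint-type argument could be wired up roughly like this (argument names, file paths, and column names are illustrative only):

```python
# Hypothetical sketch: an optional complaint-type argument, per the comment above.
import argparse

import pandas as pd

parser = argparse.ArgumentParser()
parser.add_argument("--granularity", choices=["hour", "date"], default="hour")
parser.add_argument("--complaint-type", default=None,
                    help="optional; omit to aggregate across all complaint types")
args = parser.parse_args()

raw = pd.read_csv("data/raw.csv", parse_dates=["created_date"])
raw["bucket"] = (raw["created_date"].dt.floor("h")
                 if args.granularity == "hour"
                 else raw["created_date"].dt.date)

# The optional argument changes both the filtering and the grouping keys.
keys = ["bucket"]
if args.complaint_type is not None:
    raw = raw[raw["complaint_type"] == args.complaint_type]
    keys.append("complaint_type")

counts = raw.groupby(keys).size().reset_index(name="count")
counts.to_csv("data/aggregated.csv", index=False)  # placeholder output name
```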