
This is an Exploratory Data Analysis I carried out on NYC's Taxi Data for a class assignment at UChicago. I made use of PySpark on an AWS EMR cluster to do so.


bhavyapan/EDA


Exploratory Data Analysis (EDA) Assignment for MACS30113 - Bhavya Pandey

The a7.ipynb Jupyter notebook carries out an EDA on the New York City Taxi and Limousine Commission (NYC TLC) Trip Record data, with special emphasis on the tip amounts paid by riders throughout 2019.

The complete dataset used for this analysis has close to 80 million rows of observations from 2019, which makes it a representative large-scale sample of trip trends in NYC. I run the analysis with PySpark on an AWS EMR cluster.

The notebook has the following features:

  • An introduction to the rationale behind this line of questioning for the EDA, and the scope for further modelling and prediction using the data.
  • 5 visualizations of the data, which can be run using a PySpark kernel on an EMR cluster, each with a corresponding description that explains the observations in question.
  • A concluding remark on scalability that addresses some of the questions posed in the assignment.
