The a7.ipynb Jupyter Notebook carries out an exploratory data analysis (EDA) of the New York City Taxi and Limousine Commission (NYC TLC) Trip Records data, with special emphasis on the tip amounts paid by riders during 2019.
The complete dataset used for this analysis has close to 80 million rows of observations from 2019, which makes it a representative large-scale sample of trip-record trends in NYC. I use PySpark on an AWS EMR cluster to carry out this analysis.
The Notebook has the following features:
- An introduction explaining the rationale behind this particular line of questioning for the EDA, and the scope for further modelling and prediction using the data.
- 5 visualizations of the data, which can be run using a PySpark kernel on an EMR cluster, each accompanied by a description that explains the observations in question.
- A concluding remark on scalability, which also addresses some of the questions posed in the assignment.