Since the project data is very large, I have done this project on Kaggle; please find it at the Kaggle link below.
The yellow taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. The data used in the attached datasets were collected and provided to the NYC Taxi and Limousine Commission (TLC) by technology providers authorized under the Taxicab & Livery Passenger Enhancement Programs (TPEP/LPEP).
Get the data from: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml (2015/2016 data)
The data is also available on Kaggle.
Kaggle Data Link : https://www.kaggle.com/vishnurapps/newyork-taxi-demand
These are the famous NYC yellow taxis that provide transportation exclusively through street-hails. The number of taxicabs is limited by a finite number of medallions issued by the TLC. You access this mode of transportation by standing in the street and hailing an available taxi with your hand. The pickups are not pre-arranged.
FHV transportation is accessed by a pre-arrangement with a dispatcher or limo company. These FHVs are not permitted to pick up passengers via street hails, as those rides are not considered pre-arranged.
The SHL program will allow livery vehicle owners to license and outfit their vehicles with green borough taxi branding, meters, credit card machines, and ultimately the right to accept street hails in addition to pre-arranged rides.
In this notebook I am considering only the yellow taxis for the periods Jan-Mar 2015 and Jan-Mar 2016.
- I have collected all yellow taxi trip data from Jan 2015 to Dec 2016 (only the 2015 data will be used).
- I have used the dask library to read the data due to its large size (see the loading sketch after the table below).
file name | file size | number of records | number of features |
---|---|---|---|
yellow_tripdata_2016-01 | 1.59 GB | 10906858 | 19 |
yellow_tripdata_2016-02 | 1.66 GB | 11382049 | 19 |
yellow_tripdata_2016-03 | 1.78 GB | 12210952 | 19 |
yellow_tripdata_2016-04 | 1.74 GB | 11934338 | 19 |
yellow_tripdata_2016-05 | 1.73 GB | 11836853 | 19 |
yellow_tripdata_2016-06 | 1.62 GB | 11135470 | 19 |
yellow_tripdata_2016-07 | 884 MB | 10294080 | 17 |
yellow_tripdata_2016-08 | 854 MB | 9942263 | 17 |
yellow_tripdata_2016-09 | 870 MB | 10116018 | 17 |
yellow_tripdata_2016-10 | 933 MB | 10854626 | 17 |
yellow_tripdata_2016-11 | 868 MB | 10102128 | 17 |
yellow_tripdata_2016-12 | 897 MB | 10449408 | 17 |
yellow_tripdata_2015-01 | 1.84 GB | 12748986 | 19 |
yellow_tripdata_2015-02 | 1.81 GB | 12450521 | 19 |
yellow_tripdata_2015-03 | 1.94 GB | 13351609 | 19 |
yellow_tripdata_2015-04 | 1.90 GB | 13071789 | 19 |
yellow_tripdata_2015-05 | 1.91 GB | 13158262 | 19 |
yellow_tripdata_2015-06 | 1.79 GB | 12324935 | 19 |
yellow_tripdata_2015-07 | 1.68 GB | 11562783 | 19 |
yellow_tripdata_2015-08 | 1.62 GB | 11130304 | 19 |
yellow_tripdata_2015-09 | 1.63 GB | 11225063 | 19 |
yellow_tripdata_2015-10 | 1.79 GB | 12315488 | 19 |
yellow_tripdata_2015-11 | 1.65 GB | 11312676 | 19 |
yellow_tripdata_2015-12 | 1.67 GB | 11460573 | 19 |
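A minimal sketch of loading one of these months with dask (the local file path is illustrative):

```python
import dask.dataframe as dd

# dask reads the CSV lazily in partitions, so a ~1.8 GB file never has
# to fit in memory at once
df = dd.read_csv("yellow_tripdata_2015-01.csv")

print(df.columns)        # the 19 features listed in the table
print(df.npartitions)    # number of lazy partitions
print(len(df))           # triggers a compute; should match the record count
```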
- Univariate analysis of the data was performed, and all erroneous records were removed.
- As part of the data cleaning phase, we check whether the pick-up and drop-off latitudes and longitudes fall within NYC and remove the trips that don't.
- I also removed outliers based on fare price, trip duration, speed, and distance travelled (a filter sketch follows this list).
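A sketch of the cleaning filters, assuming the 2015 TLC column names; the NYC bounding box and the outlier thresholds below are illustrative assumptions, not the exact values used:

```python
import dask.dataframe as dd

df = dd.read_csv("yellow_tripdata_2015-01.csv")

# keep only trips whose pickup AND dropoff fall inside (roughly) NYC;
# the bounds below are an assumed approximation of the city's extent
in_nyc = (
    df["pickup_longitude"].between(-74.15, -73.70) &
    df["pickup_latitude"].between(40.58, 40.92) &
    df["dropoff_longitude"].between(-74.15, -73.70) &
    df["dropoff_latitude"].between(40.58, 40.92)
)
df = df[in_nyc]

# illustrative outlier thresholds (assumed; tune from the univariate analysis)
df = df[(df["total_amount"] > 0) & (df["total_amount"] < 1000)]
df = df[(df["trip_distance"] > 0) & (df["trip_distance"] < 23)]
```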
As part of our solution is to predict the pickups in a region within some time interval, we need to derive regions and time intervals from the data. Let's begin by dividing NYC into K clusters using the latitude and longitude of pickups: with K-Means clustering, we group the pick-up points into regions. From the data, we observe that a taxi can cover up to 2 miles in 10 minutes. Therefore, we want the distance between neighbouring cluster centres to be less than 2 miles but not less than 0.5 miles, and the optimal K must meet this constraint. We tried cluster counts ranging from 30 to 70 and observed that the constraint is best satisfied when the number of clusters is 40.
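A sketch of this clustering step; MiniBatchKMeans is assumed here for speed on millions of points, and plain KMeans works the same way:

```python
from sklearn.cluster import MiniBatchKMeans

# coords: (n_trips, 2) array of [pickup_latitude, pickup_longitude]
coords = df[["pickup_latitude", "pickup_longitude"]].compute().values

kmeans = MiniBatchKMeans(n_clusters=40, batch_size=10_000, random_state=0)
region_labels = kmeans.fit_predict(coords)   # region id per trip
centers = kmeans.cluster_centers_            # one (lat, lon) centre per region
# to pick K, repeat for K in 30..70 and check that the distances between
# neighbouring centres satisfy the 0.5-2 mile constraint described above
```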
Each region's pickups are then split into 10-minute intervals, each corresponding to one time bin. However, the data has timestamps in the format "YYYY-MM-DD HH:MM:SS", which are converted to Unix time so that the bin index can be computed directly. For a 31-day month this gives 31 × 24 × 6 = 4464 possible 10-minute bins.
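A sketch of the binning, assuming the 2015 column name `tpep_pickup_datetime`:

```python
import pandas as pd

pickups = pd.to_datetime(df["tpep_pickup_datetime"].compute())

# seconds elapsed since the start of the month -> 10-minute bin index
start = pd.Timestamp("2015-01-01")
bin_idx = ((pickups - start).dt.total_seconds() // 600).astype(int)
# January 2015: 31 days * 24 hours * 6 bins/hour = 4464 possible bins
```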
Now that the data is divided into 10-minute bins, each bin should contain at least one pickup. A time bin with zero pick-ups would break the ratio features with a division by zero, so we need to smooth these values.
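One simple smoothing scheme, assumed here for illustration: distribute the pickups of each zero bin and its immediate neighbours evenly among them, so no bin stays at zero while the total count is preserved:

```python
def smooth(counts):
    """Fill zero bins by averaging each zero with its neighbours (assumed scheme)."""
    counts = list(counts)
    n = len(counts)
    for i, c in enumerate(counts):
        if c == 0:
            lo, hi = max(i - 1, 0), min(i + 1, n - 1)
            total = sum(counts[lo:hi + 1])
            share = total / (hi - lo + 1)
            for j in range(lo, hi + 1):
                counts[j] = share
    return counts

print(smooth([12, 0, 9, 10]))  # -> [7.0, 7.0, 7.0, 10]; sum 31 is preserved
```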
In theory, any waveform can be represented as a sum of sine waves, each with some amplitude and frequency. We can see that the number of pickups per month in every cluster forms a repeating pattern, but we do not know the frequency of that pattern, and since it is aperiodic it cannot be represented by a single frequency; it is composed of many sine waves, each with its own frequency. The Fourier transform lets us move our pattern from the time domain (number of pickups over time) to the frequency domain, where the dominant frequencies and their amplitudes can be read off and used as features.
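A sketch of these frequency-domain features using numpy's FFT; `cluster_counts` (one cluster's per-bin pickup series) is an assumed variable:

```python
import numpy as np

counts = np.asarray(cluster_counts, dtype=float)  # pickups per 10-min bin

amplitudes = np.abs(np.fft.fft(counts))
frequencies = np.fft.fftfreq(len(counts))

# the few frequencies with the largest amplitudes summarize the repeating
# pattern and can be fed to the regression models as features
top = np.argsort(amplitudes)[::-1][:5]
fft_features = list(zip(frequencies[top], amplitudes[top]))
```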
The first step in solving our business problem is a baseline model. We will walk through the process of discovering the baseline model that gave us the best result; a sketch of each candidate follows the list.
- Moving Averages
- Weighted Moving Averages
- Exponential Moving Averages
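A minimal sketch of the three baselines on one cluster's chronological series `y` of per-bin pickup counts (the window size `n` and `alpha` are illustrative):

```python
def moving_average(y, t, n=3):
    # plain average of the previous n bins (assumes t >= n)
    return sum(y[t - n:t]) / n

def weighted_moving_average(y, t, n=3):
    # linearly decaying weights: the most recent bin counts the most
    weights = range(n, 0, -1)                        # n, n-1, ..., 1
    num = sum(w * v for w, v in zip(weights, y[t - n:t][::-1]))
    return num / sum(weights)

def exponential_moving_average(y, alpha=0.5):
    # R'_t = alpha * R_{t-1} + (1 - alpha) * R'_{t-1}
    preds = [y[0]]
    for t in range(1, len(y)):
        preds.append(alpha * y[t - 1] + (1 - alpha) * preds[-1])
    return preds
```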
After all the data cleaning, preparation, and baseline modelling, we now have an arsenal of features to help build a good prediction model. From the baseline modelling we see that exponential weighted moving averages give the best forecast of the lot, so we use them as a feature in the regression models, alongside the features derived in the data preparation stage.
As this is a time-series problem, we have to split train and test on the basis of time. We take 3 months of 2016 pickup data and split it such that, for every region, the first 70% of the bins (chronologically) go to train and the remaining 30% to test.
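A sketch of this per-region chronological split (no shuffling, to respect time order); `region_series` is an assumed mapping from region id to that region's ordered per-bin data:

```python
def time_split(series, train_frac=0.7):
    # first 70% of the bins -> train, last 30% -> test
    cut = int(len(series) * train_frac)
    return series[:cut], series[cut:]

# e.g. for each of the 40 regions:
# train_part, test_part = time_split(region_series[r])
```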
- LINEAR REGRESSION
- RANDOM FOREST REGRESSION
- XGBOOST REGRESSOR
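A sketch of fitting the three regressors, assuming feature matrices `X_train`, `y_train`, `X_test`, `y_test` from the split above; the hyperparameters are illustrative, not the tuned values:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

def mape(y_true, y_pred):
    # safe here because smoothing guarantees no zero-pickup bins
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(np.abs(y_true - y_pred) / y_true)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, max_depth=10),
    "XGBoost": XGBRegressor(n_estimators=100, max_depth=3),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name:>20} | train MAPE: {mape(y_train, model.predict(X_train)):.4f}"
          f" | test MAPE: {mape(y_test, model.predict(X_test)):.4f}")
```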
Error Metric Matrix - MAPE

| Model | Train MAPE | Test MAPE |
|---|---|---|
| Baseline Model | 0.1487 | 0.1422 |
| Exponential Averages Forecasting | 0.1412 | 0.1349 |
| Linear Regression | 0.1421 | 0.1348 |
| Random Forest Regression | 0.0987 | 0.1333 |
| XGBoost Regression | 0.1385 | 0.1328 |
- I have built a total of 5 regression models using the pickup densities. Some models used the time-series property, while the others ignored it and treated the task as a plain regression problem.
- Based on test MAPE, XGBoost performed better than the other models, with Linear Regression achieving similar values.
- Random Forest Regression performs well on the train data but poorly on the test data, which means it overfits. Therefore, I chose XGBoost to avoid overfitting.
- The most important features used by these models are the previous pickup densities (P(t-1), P(t-2), ...), followed by the exponential moving averages computed from those (predicted) densities.
- Using these predictions, yellow taxi drivers can decide where to station themselves during their hours of least pickup.