feat: update data text.
muziejus committed Nov 21, 2024
1 parent 618ff68 commit 7e6eb0d
Showing 1 changed file with 17 additions and 19 deletions.
36 changes: 17 additions & 19 deletions data.qmd
@@ -4,53 +4,51 @@

We are basing our analysis on two datasets, a dataset of taxi use in New
York and historical weather data. For both datasets, we will limit our
analysis to 2019–August 2024, in the interest of trying to catch about a
analysis to 2019-01-01 00:00:00 – 2024-06-25 23:00:00, in the interest of trying to catch about a
year of pre-pandemic norms to help interpret pandemic and post-pandemic use
of taxis.
of taxis. The taxi data go through August 2024, but the weather data go
only through most of June 2024.

The taxi data are provided by New York’s [Taxi and Limousine
Commission](https://www.nyc.gov/site/tlc/index.page). They [provide taxi
data](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page), in
[parquet](https://parquet.apache.org/) format, going back to 2009. Since
February of 2019 they have also included data on trips serviced by
February of 2019, they have also included data on trips serviced by
companies like Lyft and Uber, which are classified as “high-volume for-hire
vehicle” (HVFHV) trips. The data are updated monthly, with a two-month lag;
however, at the time of submission, only data through August 2024 were
available. Our focus is on yellow taxi and HVFHV trips, because our focus
vehicle” (HVFHV) trips. The data are updated monthly, with a two-month lag.
Our focus is on yellow taxi and HVFHV trips, because our focus
is on intra-Manhattan trips. Only yellow cabs can pick up passengers in
most of Manhattan, so we are ignoring green cabs and regular for-hire
vehicles (town cars and limousines). Yellow cab data have [19
columns](https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf),
of which the pertinent columns for us are pickup/dropoff date and time
(`tpep_pickup_datetime`/`tpep_dropoff_datetime` in a datetime format), the
pickup/dropoff locations (`PULocationID`/`DOLocationID` in integer format,
corresponding to [NYC taxi
zones](https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv)) in
order to filter on Manhattan-only trips, passenger count (`Passenger_count`
in integer format), fare amount (`Fare_amount` in integer format), and tip
amount (`Tip_amount` in integer format).
and HVFHV data have [24 columns](https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_hvfhs.pdf).
With our Python scripts, we consolidate the data to create
aggregated hourly statistics on trip duration, trip distance, fare amount,
and tip amount.
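
The hourly consolidation described here could look roughly like the following pandas sketch. It is not the script from this commit: the function name `hourly_stats`, the output column names, and the lowercase `trip_distance`, `fare_amount`, and `tip_amount` input columns are assumptions, and the set of Manhattan zone IDs is assumed to have been built separately from the taxi zone lookup file.

```python
import pandas as pd

# Analysis window stated in the text; the last hourly bin starts at
# 2024-06-25 23:00:00, so pickups are kept up to (not including) midnight.
START = pd.Timestamp("2019-01-01 00:00:00")
END_EXCLUSIVE = pd.Timestamp("2024-06-26 00:00:00")


def hourly_stats(trips: pd.DataFrame, manhattan_ids: set) -> pd.DataFrame:
    """Aggregate yellow-taxi trips into hourly counts and means.

    Assumes `trips` has the datetime columns tpep_pickup_datetime and
    tpep_dropoff_datetime, the integer columns PULocationID and DOLocationID,
    and numeric columns trip_distance, fare_amount, and tip_amount
    (these three column names are assumptions).
    """
    pickup = trips["tpep_pickup_datetime"]

    # Keep only trips that both start and end in a Manhattan taxi zone.
    in_manhattan = (
        trips["PULocationID"].isin(manhattan_ids)
        & trips["DOLocationID"].isin(manhattan_ids)
    )
    # Drop trips whose pickup time falls outside the analysis window.
    in_window = (pickup >= START) & (pickup < END_EXCLUSIVE)

    kept = trips.loc[in_manhattan & in_window].copy()
    kept["duration_min"] = (
        kept["tpep_dropoff_datetime"] - kept["tpep_pickup_datetime"]
    ).dt.total_seconds() / 60
    # Bin each trip by the hour in which it was picked up.
    kept["hour"] = kept["tpep_pickup_datetime"].dt.floor("60min")

    return kept.groupby("hour").agg(
        trips=("duration_min", "size"),
        mean_duration_min=("duration_min", "mean"),
        mean_distance=("trip_distance", "mean"),
        mean_fare=("fare_amount", "mean"),
        mean_tip=("tip_amount", "mean"),
    )
```

Filtering on the window before aggregating also discards the out-of-range timestamps that occasionally appear in the monthly files. Means are only one choice of statistic; medians or totals would slot into the same `agg` call.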

The historical weather data are provided by the [Global Historical Climate
Network hourly
(GHCNh)](https://www.ncei.noaa.gov/products/global-historical-climatology-network-hourly),
which provides hourly weather data going back over two centuries for New
York City. We will be limiting ourselves to 2019–present, which comes in
York City. The data come in
[annual parquet files for download by
station](https://www.ncei.noaa.gov/oa/global-historical-climatology-network/index.html#hourly/access/by-year/).
Our station, `KNYC0`, is listed in the GHCNh as `USW00094728`, and it is
the weather station in Central Park. The data come in over 200 columns to
account for the variability that can occur in the terse
[METAR](https://en.wikipedia.org/wiki/METAR) report for airplanes, which is
also included under `remarks`. The government provides a
[codebook](https://www.ncei.noaa.gov/oa/global-historical-climatology-network/hourly/doc/ghcnh_DOCUMENTATION.pdf).
We may simply feed the reports to the
[python-metar](https://github.com/python-metar/python-metar) library, which
parses the reports for us. As we are interested in what conditions
[codebook](https://www.ncei.noaa.gov/oa/global-historical-climatology-network/hourly/doc/ghcnh_DOCUMENTATION.pdf) to describe the remaining data. As we are interested in what conditions
determine a “nice” day for not using a taxi, we want to keep as much data
from the weather report as possible, including temperature, precipitation,
sky cover, snow, humidity, and so on.

## Missing value analysis

The taxi data are notoriously (as in, persistently) messy, registering trips
outside the bounds of the asserted date and giving results that seem
extremely unlikely, like

Describe any patterns you discover in missing values. If no values are missing, graphs should still be included showing that.

(suggested: 2 graphs plus commentary)
