Skip to content

Commit

Permalink
analysis assignment
Browse files Browse the repository at this point in the history
  • Loading branch information
paladique committed Sep 28, 2021
1 parent cf4ada2 commit 2aa3d8b
Show file tree
Hide file tree
Showing 5 changed files with 163 additions and 15 deletions.
3 changes: 2 additions & 1 deletion 4-Data-Science-Lifecycle/14-Introduction/assignment.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,9 +12,10 @@ You can also open the taxi data file in text editor or spreadsheet software like
## Instructions

- Assess whether or not the data in this dataset can help answer the question.
- Explore the [NYC Open Data catalog](https://data.cityofnewyork.us/browse?sortBy=most_accessed&utf8=%E2%9C%93). Identify an additional dataset that could potentially be helpful in answering the client's question.
- Write 3 questions that you would ask the client for more clarification and better understanding of the problem.

Refer to the [dataset's dictionary](https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf) for more information about the
Refer to the [dataset's dictionary](https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf) and [user guide](https://www1.nyc.gov/assets/tlc/downloads/pdf/trip_record_user_guide.pdf) for more information about the data.

## Rubric

Expand Down
8 changes: 6 additions & 2 deletions 4-Data-Science-Lifecycle/14-Introduction/notebook.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@
{
"cell_type": "markdown",
"source": [
"# Exploring NYC Taxi data in Winter and Summer\r\n",
"# NYC Taxi data in Winter and Summer\r\n",
"\r\n",
"Refer to the [Data dictionary](https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf) to learn more about the columns that have been provided.\r\n"
],
Expand All @@ -13,6 +13,7 @@
"cell_type": "code",
"execution_count": null,
"source": [
"#Install the pandas library\r\n",
"!pip install pandas"
],
"outputs": [],
Expand All @@ -25,10 +26,13 @@
"execution_count": 7,
"source": [
"import pandas as pd\r\n",
"import glob\r\n",
"\r\n",
"path = '../../data/taxi.csv'\r\n",
"\r\n",
"#Load the csv file into a dataframe\r\n",
"df = pd.read_csv(path)\r\n",
"\r\n",
"#Print the dataframe\r\n",
"print(df)\r\n"
],
"outputs": [
Expand Down
8 changes: 2 additions & 6 deletions 4-Data-Science-Lifecycle/15-analyzing/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -28,10 +28,10 @@ In a few of the previous lessons, we have used Pandas to provide some descriptiv

## Sampling and Querying
Exploring everything in a large dataset can be very time consuming and a task that’s usually left up to a computer to do. However, sampling is a helpful tool in understanding of the data and allows us to have a better understanding of what’s in the dataset and what it represents. With a sample, you can apply probability and statistics to come to some general conclusions about your data. While there’s no defined rule on how much data you should sample it’s important to note that the more data you sample, the more precise of a generalization you can make of about data.
Pandas has the [`sample()` function in its library]( https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html) where you can pass an argument of how many random samples you’d like to receive and use.
Pandas has the [`sample()` function in its library](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html) where you can pass an argument of how many random samples you’d like to receive and use.

General querying of the data can help you answer some general questions and theories you may have. In contrast to sampling, queries allow you to have control and focus on specific parts of the data you have questions about.
The [`query() `function]( https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html) in the Pandas library allows you to select columns and receive simple answers about the data through the rows retrieved.
The [`query() `function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html) in the Pandas library allows you to select columns and receive simple answers about the data through the rows retrieved.

## Exploring with Visualizations
You don’t have to wait until the data is thoroughly cleaned and analyzed to start creating visualizations. In fact, having a visual representation while exploring can help identify patterns, relationships, and problems in the data. Furthermore, visualizations provide a means of communication with those who are not involved with managing the data and can be an opportunity to share and clarify additional questions that were not addressed in the capture stage. Refer to the [section on Visualizations](3-Data-Visualization) to learn more about some popular ways to explore visually.
Expand All @@ -44,10 +44,6 @@ All the topics in this lesson can help identify missing or inconsistent values,

## [Pre-Lecture Quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/27)


## Review & Self Study


## Assignment

[Assignment Title](assignment.md)
141 changes: 141 additions & 0 deletions 4-Data-Science-Lifecycle/15-analyzing/assignment.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,141 @@
{
"cells": [
{
"cell_type": "markdown",
"source": [
"# NYC Taxi data in Winter and Summer\r\n",
"\r\n",
"Refer to the [Data dictionary](https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf) to learn more about the columns that have been provided.\r\n"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"#Install the pandas library\r\n",
"!pip install pandas"
],
"outputs": [],
"metadata": {
"scrolled": true
}
},
{
"cell_type": "code",
"execution_count": 7,
"source": [
"import pandas as pd\r\n",
"\r\n",
"path = '../../data/taxi.csv'\r\n",
"\r\n",
"#Load the csv file into a dataframe\r\n",
"df = pd.read_csv(path)\r\n",
"\r\n",
"#Print the dataframe\r\n",
"print(df)\r\n"
],
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
" VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count \\\n",
"0 2.0 2019-07-15 16:27:53 2019-07-15 16:44:21 3.0 \n",
"1 2.0 2019-07-17 20:26:35 2019-07-17 20:40:09 6.0 \n",
"2 2.0 2019-07-06 16:01:08 2019-07-06 16:10:25 1.0 \n",
"3 1.0 2019-07-18 22:32:23 2019-07-18 22:35:08 1.0 \n",
"4 2.0 2019-07-19 14:54:29 2019-07-19 15:19:08 1.0 \n",
".. ... ... ... ... \n",
"195 2.0 2019-01-18 08:42:15 2019-01-18 08:56:57 1.0 \n",
"196 1.0 2019-01-19 04:34:45 2019-01-19 04:43:44 1.0 \n",
"197 2.0 2019-01-05 10:37:39 2019-01-05 10:42:03 1.0 \n",
"198 2.0 2019-01-23 10:36:29 2019-01-23 10:44:34 2.0 \n",
"199 2.0 2019-01-30 06:55:58 2019-01-30 07:07:02 5.0 \n",
"\n",
" trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID \\\n",
"0 2.02 1.0 N 186 233 \n",
"1 1.59 1.0 N 141 161 \n",
"2 1.69 1.0 N 246 249 \n",
"3 0.90 1.0 N 229 141 \n",
"4 4.79 1.0 N 237 107 \n",
".. ... ... ... ... ... \n",
"195 1.18 1.0 N 43 237 \n",
"196 2.30 1.0 N 148 234 \n",
"197 0.83 1.0 N 237 263 \n",
"198 1.12 1.0 N 144 113 \n",
"199 2.41 1.0 N 209 107 \n",
"\n",
" payment_type fare_amount extra mta_tax tip_amount tolls_amount \\\n",
"0 1.0 12.0 1.0 0.5 4.08 0.0 \n",
"1 2.0 10.0 0.5 0.5 0.00 0.0 \n",
"2 2.0 8.5 0.0 0.5 0.00 0.0 \n",
"3 1.0 4.5 3.0 0.5 1.65 0.0 \n",
"4 1.0 19.5 0.0 0.5 5.70 0.0 \n",
".. ... ... ... ... ... ... \n",
"195 1.0 10.0 0.0 0.5 2.16 0.0 \n",
"196 1.0 9.5 0.5 0.5 2.15 0.0 \n",
"197 1.0 5.0 0.0 0.5 1.16 0.0 \n",
"198 2.0 7.0 0.0 0.5 0.00 0.0 \n",
"199 1.0 10.5 0.0 0.5 1.00 0.0 \n",
"\n",
" improvement_surcharge total_amount congestion_surcharge \n",
"0 0.3 20.38 2.5 \n",
"1 0.3 13.80 2.5 \n",
"2 0.3 11.80 2.5 \n",
"3 0.3 9.95 2.5 \n",
"4 0.3 28.50 2.5 \n",
".. ... ... ... \n",
"195 0.3 12.96 0.0 \n",
"196 0.3 12.95 0.0 \n",
"197 0.3 6.96 0.0 \n",
"198 0.3 7.80 0.0 \n",
"199 0.3 12.30 0.0 \n",
"\n",
"[200 rows x 18 columns]\n"
]
}
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"# Use the cells below to do your own Exploratory Data Analysis"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [],
"outputs": [],
"metadata": {}
}
],
"metadata": {
"kernelspec": {
"name": "python3",
"display_name": "Python 3.9.7 64-bit ('venv': venv)"
},
"language_info": {
"mimetype": "text/x-python",
"name": "python",
"pygments_lexer": "ipython3",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"version": "3.9.7",
"nbconvert_exporter": "python",
"file_extension": ".py"
},
"name": "04-nyc-taxi-join-weather-in-pandas",
"notebookId": 1709144033725344,
"interpreter": {
"hash": "6b9b57232c4b57163d057191678da2030059e733b8becc68f245de5a75abe84e"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
18 changes: 12 additions & 6 deletions 4-Data-Science-Lifecycle/15-analyzing/assignment.md
Original file line number Diff line number Diff line change
@@ -1,16 +1,22 @@
# Analyzing for answers
# Exploring for answers

This continues the process of the lifecycle
This is a continuation of the previous lesson's [assignment](..\14-Introduction\assignment.md), where we briefly took a look at the data set. Now we will be taking a deeper look at the data.

They want to know: **Do yellow taxi passengers in New York City tip drivers more in the winter or summer?**
Again, the question the client wants to know: **Do yellow taxi passengers in New York City tip drivers more in the winter or summer?**

Your team is in the [Analyzing](Readme.md) stage of the Data Science Lifecycle.. You have been provided a notebook and data from Azure Open Datasets to explore. For summer you choose June, July, and August and for winter you choose January, February, and December.
Your team is in the [Analyzing](Readme.md) stage of the Data Science Lifecycle, where you are responsible for doing exploratory data analysis on the dataset. You have been provided a notebook and dataset that contains 200 taxi transactions from January and July 2019.

## Instructions

In this directory is a [notebook](notebook.ipynb) that uses Python to load 6 months of yellow taxi trip data from the [NYC Taxi & Limousine Commission](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-taxi-yellow?tabs=azureml-opendatasets) and Integrated Surface Data from NOAA. These datasets have been joined together in a Pandas dataframe.
In this directory is a [notebook](notebook.ipynb) and data from the [Taxi & Limousine Commission](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-taxi-yellow?tabs=azureml-opendatasets). Refer to the [dataset's dictionary](https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf) and [user guide](https://www1.nyc.gov/assets/tlc/downloads/pdf/trip_record_user_guide.pdf) for more information about the data.


Use some the techniques in this lesson to do your own EDA in the notebook (add cells if you'd like) and answer the following questions:

- What other influences in the data could affect the tip amount?
- What columns will most likely not be needed to answer the client's questions?
- Based on what has been provided so far, does the data seem to provide any evidence of seasonal tipping behavior?

Your task is to ___

## Rubric

Expand Down

0 comments on commit 2aa3d8b

Please sign in to comment.