From 2aa3d8bd2dfd1a2401aa0ad4a5a13257b08e4651 Mon Sep 17 00:00:00 2001 From: Jasmine Date: Mon, 27 Sep 2021 20:59:00 -0400 Subject: [PATCH] analysis assignment --- .../14-Introduction/assignment.md | 3 +- .../14-Introduction/notebook.ipynb | 8 +- .../15-analyzing/README.md | 8 +- .../15-analyzing/assignment.ipynb | 141 ++++++++++++++++++ .../15-analyzing/assignment.md | 18 ++- 5 files changed, 163 insertions(+), 15 deletions(-) create mode 100644 4-Data-Science-Lifecycle/15-analyzing/assignment.ipynb diff --git a/4-Data-Science-Lifecycle/14-Introduction/assignment.md b/4-Data-Science-Lifecycle/14-Introduction/assignment.md index 36b01d5d3..e0ff42448 100644 --- a/4-Data-Science-Lifecycle/14-Introduction/assignment.md +++ b/4-Data-Science-Lifecycle/14-Introduction/assignment.md @@ -12,9 +12,10 @@ You can also open the taxi data file in text editor or spreadsheet software like ## Instructions - Assess whether or not the data in this dataset can help answer the question. +- Explore the [NYC Open Data catalog](https://data.cityofnewyork.us/browse?sortBy=most_accessed&utf8=%E2%9C%93). Identify an additional dataset that could potentially be helpful in answering the client's question. - Write 3 questions that you would ask the client for more clarification and better understanding of the problem. -Refer to the [dataset's dictionary](https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf) for more information about the +Refer to the [dataset's dictionary](https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf) and [user guide](https://www1.nyc.gov/assets/tlc/downloads/pdf/trip_record_user_guide.pdf) for more information about the data. 
## Rubric diff --git a/4-Data-Science-Lifecycle/14-Introduction/notebook.ipynb b/4-Data-Science-Lifecycle/14-Introduction/notebook.ipynb index e4041090a..01cf96e4a 100644 --- a/4-Data-Science-Lifecycle/14-Introduction/notebook.ipynb +++ b/4-Data-Science-Lifecycle/14-Introduction/notebook.ipynb @@ -3,7 +3,7 @@ { "cell_type": "markdown", "source": [ - "# Exploring NYC Taxi data in Winter and Summer\r\n", + "# NYC Taxi data in Winter and Summer\r\n", "\r\n", "Refer to the [Data dictionary](https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf) to learn more about the columns that have been provided.\r\n" ], @@ -13,6 +13,7 @@ "cell_type": "code", "execution_count": null, "source": [ + "#Install the pandas library\r\n", "!pip install pandas" ], "outputs": [], @@ -25,10 +26,13 @@ "execution_count": 7, "source": [ "import pandas as pd\r\n", - "import glob\r\n", "\r\n", "path = '../../data/taxi.csv'\r\n", + "\r\n", + "#Load the csv file into a dataframe\r\n", "df = pd.read_csv(path)\r\n", + "\r\n", + "#Print the dataframe\r\n", "print(df)\r\n" ], "outputs": [ diff --git a/4-Data-Science-Lifecycle/15-analyzing/README.md b/4-Data-Science-Lifecycle/15-analyzing/README.md index 64999ed65..cf3810b4c 100644 --- a/4-Data-Science-Lifecycle/15-analyzing/README.md +++ b/4-Data-Science-Lifecycle/15-analyzing/README.md @@ -28,10 +28,10 @@ In a few of the previous lessons, we have used Pandas to provide some descriptiv ## Sampling and Querying Exploring everything in a large dataset can be very time consuming and a task that’s usually left up to a computer to do. However, sampling is a helpful tool in understanding of the data and allows us to have a better understanding of what’s in the dataset and what it represents. With a sample, you can apply probability and statistics to come to some general conclusions about your data. 
While there’s no defined rule on how much data you should sample it’s important to note that the more data you sample, the more precise of a generalization you can make of about data. -Pandas has the [`sample()` function in its library]( https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html) where you can pass an argument of how many random samples you’d like to receive and use. +Pandas has the [`sample()` function in its library](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html), where you can pass the number of random rows you’d like to receive. General querying of the data can help you answer some general questions and theories you may have. In contrast to sampling, queries allow you to have control and focus on specific parts of the data you have questions about. -The [`query() `function]( https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html) in the Pandas library allows you to select columns and receive simple answers about the data through the rows retrieved. +The [`query()` function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html) in the Pandas library allows you to filter rows with a condition on column values and receive simple answers about the data through the rows retrieved. ## Exploring with Visualizations You don’t have to wait until the data is thoroughly cleaned and analyzed to start creating visualizations. In fact, having a visual representation while exploring can help identify patterns, relationships, and problems in the data. Furthermore, visualizations provide a means of communication with those who are not involved with managing the data and can be an opportunity to share and clarify additional questions that were not addressed in the capture stage. Refer to the [section on Visualizations](3-Data-Visualization) to learn more about some popular ways to explore visually. 
@@ -44,10 +44,6 @@ All the topics in this lesson can help identify missing or inconsistent values, ## [Pre-Lecture Quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/27) - -## Review & Self Study - - ## Assignment [Assignment Title](assignment.md) diff --git a/4-Data-Science-Lifecycle/15-analyzing/assignment.ipynb b/4-Data-Science-Lifecycle/15-analyzing/assignment.ipynb new file mode 100644 index 000000000..d3a080df3 --- /dev/null +++ b/4-Data-Science-Lifecycle/15-analyzing/assignment.ipynb @@ -0,0 +1,141 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "source": [ + "# NYC Taxi data in Winter and Summer\r\n", + "\r\n", + "Refer to the [Data dictionary](https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf) to learn more about the columns that have been provided.\r\n" + ], + "metadata": {} + }, + { + "cell_type": "code", + "execution_count": null, + "source": [ + "#Install the pandas library\r\n", + "!pip install pandas" + ], + "outputs": [], + "metadata": { + "scrolled": true + } + }, + { + "cell_type": "code", + "execution_count": 7, + "source": [ + "import pandas as pd\r\n", + "\r\n", + "path = '../../data/taxi.csv'\r\n", + "\r\n", + "#Load the csv file into a dataframe\r\n", + "df = pd.read_csv(path)\r\n", + "\r\n", + "#Print the dataframe\r\n", + "print(df)\r\n" + ], + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + " VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count \\\n", + "0 2.0 2019-07-15 16:27:53 2019-07-15 16:44:21 3.0 \n", + "1 2.0 2019-07-17 20:26:35 2019-07-17 20:40:09 6.0 \n", + "2 2.0 2019-07-06 16:01:08 2019-07-06 16:10:25 1.0 \n", + "3 1.0 2019-07-18 22:32:23 2019-07-18 22:35:08 1.0 \n", + "4 2.0 2019-07-19 14:54:29 2019-07-19 15:19:08 1.0 \n", + ".. ... ... ... ... 
\n", + "195 2.0 2019-01-18 08:42:15 2019-01-18 08:56:57 1.0 \n", + "196 1.0 2019-01-19 04:34:45 2019-01-19 04:43:44 1.0 \n", + "197 2.0 2019-01-05 10:37:39 2019-01-05 10:42:03 1.0 \n", + "198 2.0 2019-01-23 10:36:29 2019-01-23 10:44:34 2.0 \n", + "199 2.0 2019-01-30 06:55:58 2019-01-30 07:07:02 5.0 \n", + "\n", + " trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID \\\n", + "0 2.02 1.0 N 186 233 \n", + "1 1.59 1.0 N 141 161 \n", + "2 1.69 1.0 N 246 249 \n", + "3 0.90 1.0 N 229 141 \n", + "4 4.79 1.0 N 237 107 \n", + ".. ... ... ... ... ... \n", + "195 1.18 1.0 N 43 237 \n", + "196 2.30 1.0 N 148 234 \n", + "197 0.83 1.0 N 237 263 \n", + "198 1.12 1.0 N 144 113 \n", + "199 2.41 1.0 N 209 107 \n", + "\n", + " payment_type fare_amount extra mta_tax tip_amount tolls_amount \\\n", + "0 1.0 12.0 1.0 0.5 4.08 0.0 \n", + "1 2.0 10.0 0.5 0.5 0.00 0.0 \n", + "2 2.0 8.5 0.0 0.5 0.00 0.0 \n", + "3 1.0 4.5 3.0 0.5 1.65 0.0 \n", + "4 1.0 19.5 0.0 0.5 5.70 0.0 \n", + ".. ... ... ... ... ... ... \n", + "195 1.0 10.0 0.0 0.5 2.16 0.0 \n", + "196 1.0 9.5 0.5 0.5 2.15 0.0 \n", + "197 1.0 5.0 0.0 0.5 1.16 0.0 \n", + "198 2.0 7.0 0.0 0.5 0.00 0.0 \n", + "199 1.0 10.5 0.0 0.5 1.00 0.0 \n", + "\n", + " improvement_surcharge total_amount congestion_surcharge \n", + "0 0.3 20.38 2.5 \n", + "1 0.3 13.80 2.5 \n", + "2 0.3 11.80 2.5 \n", + "3 0.3 9.95 2.5 \n", + "4 0.3 28.50 2.5 \n", + ".. ... ... ... 
\n", + "195 0.3 12.96 0.0 \n", + "196 0.3 12.95 0.0 \n", + "197 0.3 6.96 0.0 \n", + "198 0.3 7.80 0.0 \n", + "199 0.3 12.30 0.0 \n", + "\n", + "[200 rows x 18 columns]\n" + ] + } + ], + "metadata": {} + }, + { + "cell_type": "markdown", + "source": [ + "# Use the cells below to do your own Exploratory Data Analysis" + ], + "metadata": {} + }, + { + "cell_type": "code", + "execution_count": null, + "source": [], + "outputs": [], + "metadata": {} + } + ], + "metadata": { + "kernelspec": { + "name": "python3", + "display_name": "Python 3.9.7 64-bit ('venv': venv)" + }, + "language_info": { + "mimetype": "text/x-python", + "name": "python", + "pygments_lexer": "ipython3", + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "version": "3.9.7", + "nbconvert_exporter": "python", + "file_extension": ".py" + }, + "name": "04-nyc-taxi-join-weather-in-pandas", + "notebookId": 1709144033725344, + "interpreter": { + "hash": "6b9b57232c4b57163d057191678da2030059e733b8becc68f245de5a75abe84e" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} \ No newline at end of file diff --git a/4-Data-Science-Lifecycle/15-analyzing/assignment.md b/4-Data-Science-Lifecycle/15-analyzing/assignment.md index b23f76a1f..ef043a849 100644 --- a/4-Data-Science-Lifecycle/15-analyzing/assignment.md +++ b/4-Data-Science-Lifecycle/15-analyzing/assignment.md @@ -1,16 +1,22 @@ -# Analyzing for answers +# Exploring for answers -This continues the process of the lifecycle +This is a continuation of the previous lesson's [assignment](../14-Introduction/assignment.md), where we briefly took a look at the dataset. Now we will be taking a deeper look at the data. -They want to know: **Do yellow taxi passengers in New York City tip drivers more in the winter or summer?** +Again, the client wants to know: **Do yellow taxi passengers in New York City tip drivers more in the winter or summer?** -Your team is in the [Analyzing](Readme.md) stage of the Data Science Lifecycle.. 
You have been provided a notebook and data from Azure Open Datasets to explore. For summer you choose June, July, and August and for winter you choose January, February, and December. +Your team is in the [Analyzing](README.md) stage of the Data Science Lifecycle, where you are responsible for doing exploratory data analysis on the dataset. You have been provided a notebook and a dataset that contains 200 taxi transactions from January and July 2019. ## Instructions -In this directory is a [notebook](notebook.ipynb) that uses Python to load 6 months of yellow taxi trip data from the [NYC Taxi & Limousine Commission](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-taxi-yellow?tabs=azureml-opendatasets) and Integrated Surface Data from NOAA. These datasets have been joined together in a Pandas dataframe. +In this directory are a [notebook](notebook.ipynb) and data from the [Taxi & Limousine Commission](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-taxi-yellow?tabs=azureml-opendatasets). Refer to the [dataset's dictionary](https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf) and [user guide](https://www1.nyc.gov/assets/tlc/downloads/pdf/trip_record_user_guide.pdf) for more information about the data. + + +Use some of the techniques in this lesson to do your own EDA in the notebook (add cells if you'd like) and answer the following questions: + +- What other influences in the data could affect the tip amount? +- What columns will most likely not be needed to answer the client's question? +- Based on what has been provided so far, does the data seem to provide any evidence of seasonal tipping behavior? -Your task is to ___ ## Rubric