analysis assignment

arayff · Sep 28, 2021 · 2aa3d8b
1 parent cf4ada2
commit 2aa3d8b
Showing 5 changed files with 163 additions and 15 deletions.
diff --git a/4-Data-Science-Lifecycle/14-Introduction/assignment.md b/4-Data-Science-Lifecycle/14-Introduction/assignment.md
@@ -12,9 +12,10 @@ You can also open the taxi data file in text editor or spreadsheet software like
 ## Instructions
 
 - Assess whether or not the data in this dataset can help answer the question.
+- Explore the [NYC Open Data catalog](https://data.cityofnewyork.us/browse?sortBy=most_accessed&utf8=%E2%9C%93). Identify an additional dataset that could potentially be helpful in answering the client's question.
 - Write 3 questions that you would ask the client for more clarification and better understanding of the problem. 
 
-Refer to the [dataset's dictionary](https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf) for more information about the
+Refer to the [dataset's dictionary](https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf) and [user guide](https://www1.nyc.gov/assets/tlc/downloads/pdf/trip_record_user_guide.pdf) for more information about the data.
 
 ## Rubric
 

diff --git a/4-Data-Science-Lifecycle/14-Introduction/notebook.ipynb b/4-Data-Science-Lifecycle/14-Introduction/notebook.ipynb
@@ -3,7 +3,7 @@
     {
       "cell_type": "markdown",
       "source": [
-        "# Exploring NYC Taxi data in Winter and Summer\r\n",
+        "# NYC Taxi data in Winter and Summer\r\n",
         "\r\n",
         "Refer to the [Data dictionary](https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf) to learn more about the columns that have been provided.\r\n"
       ],
@@ -13,6 +13,7 @@
       "cell_type": "code",
       "execution_count": null,
       "source": [
+        "#Install the pandas library\r\n",
         "!pip install pandas"
       ],
       "outputs": [],
@@ -25,10 +26,13 @@
       "execution_count": 7,
       "source": [
         "import pandas as pd\r\n",
-        "import glob\r\n",
         "\r\n",
         "path = '../../data/taxi.csv'\r\n",
+        "\r\n",
+        "#Load the csv file into a dataframe\r\n",
         "df = pd.read_csv(path)\r\n",
+        "\r\n",
+        "#Print the dataframe\r\n",
         "print(df)\r\n"
       ],
       "outputs": [

diff --git a/4-Data-Science-Lifecycle/15-analyzing/README.md b/4-Data-Science-Lifecycle/15-analyzing/README.md
@@ -28,10 +28,10 @@ In a few of the previous lessons, we have used Pandas to provide some descriptiv
 
 ## Sampling and Querying
 Exploring everything in a large dataset can be very time consuming and is a task usually left to a computer. However, sampling is a helpful tool for understanding the data: it gives us a better sense of what’s in the dataset and what it represents. With a sample, you can apply probability and statistics to reach some general conclusions about your data. While there’s no defined rule for how much data you should sample, the more data you sample, the more precise a generalization you can make about the data. 
-Pandas has the [`sample()` function in its library]( https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html) where you can pass an argument of how many random samples you’d like to receive and use. 
+Pandas provides the [`sample()` function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html), to which you can pass the number of random rows you’d like to receive and use. 
 
 General querying of the data can help you answer some general questions and theories you may have. In contrast to sampling, queries allow you to have control and focus on specific parts of the data you have questions about. 
-The [`query() `function]( https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html) in the Pandas library allows you to select columns and receive simple answers about the data through the rows retrieved.
+The [`query()` function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html) in the Pandas library allows you to select the rows that match a condition on one or more columns and to get simple answers about the data from the rows retrieved.
 
 ## Exploring with Visualizations
 You don’t have to wait until the data is thoroughly cleaned and analyzed to start creating visualizations. In fact, having a visual representation while exploring can help identify patterns, relationships, and problems in the data. Furthermore, visualizations provide a means of communication with those who are not involved with managing the data and can be an opportunity to share and clarify additional questions that were not addressed in the capture stage. Refer to the [section on Visualizations](3-Data-Visualization) to learn more about some popular ways to explore visually.
@@ -44,10 +44,6 @@ All the topics in this lesson can help identify missing or inconsistent values,
 
 ## [Pre-Lecture Quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/27)
 
-
-## Review & Self Study
-
-
 ## Assignment
 
 [Assignment Title](assignment.md)
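
The sampling and querying described in the README hunk above can be sketched as follows. This is a minimal illustration, not part of the commit: the six inline rows are copied from the `print(df)` output shown in `assignment.ipynb` in this same commit, the variable names are hypothetical, and `random_state=0` is an arbitrary choice made only so the sample is repeatable.

```python
import pandas as pd

# Six rows copied from the printed taxi.csv output in this commit
# (an assumption made so the sketch runs without the data file).
df = pd.DataFrame({
    "passenger_count": [3.0, 6.0, 1.0, 1.0, 1.0, 1.0],
    "trip_distance": [2.02, 1.59, 1.69, 0.90, 4.79, 1.18],
    "tip_amount": [4.08, 0.00, 0.00, 1.65, 5.70, 2.16],
})

# Sampling: draw 3 random rows; random_state makes the draw repeatable.
sampled = df.sample(n=3, random_state=0)

# Querying: keep only the rows that satisfy a boolean expression.
tipped = df.query("tip_amount > 0")
long_tipped = df.query("tip_amount > 0 and trip_distance > 2")

print(sampled)
print(tipped)
print(long_tipped)
```

With the real file, the same calls would apply to the dataframe returned by `pd.read_csv(path)`. `sample()` also accepts `frac=` to draw a proportion of the rows instead of a fixed count.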
diff --git a/4-Data-Science-Lifecycle/15-analyzing/assignment.ipynb b/4-Data-Science-Lifecycle/15-analyzing/assignment.ipynb
@@ -0,0 +1,141 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# NYC Taxi data in Winter and Summer\r\n",
+        "\r\n",
+        "Refer to the [Data dictionary](https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf) to learn more about the columns that have been provided.\r\n"
+      ],
+      "metadata": {}
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "source": [
+        "#Install the pandas library\r\n",
+        "!pip install pandas"
+      ],
+      "outputs": [],
+      "metadata": {
+        "scrolled": true
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 7,
+      "source": [
+        "import pandas as pd\r\n",
+        "\r\n",
+        "path = '../../data/taxi.csv'\r\n",
+        "\r\n",
+        "#Load the csv file into a dataframe\r\n",
+        "df = pd.read_csv(path)\r\n",
+        "\r\n",
+        "#Print the dataframe\r\n",
+        "print(df)\r\n"
+      ],
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "     VendorID tpep_pickup_datetime tpep_dropoff_datetime  passenger_count  \\\n",
+            "0         2.0  2019-07-15 16:27:53   2019-07-15 16:44:21              3.0   \n",
+            "1         2.0  2019-07-17 20:26:35   2019-07-17 20:40:09              6.0   \n",
+            "2         2.0  2019-07-06 16:01:08   2019-07-06 16:10:25              1.0   \n",
+            "3         1.0  2019-07-18 22:32:23   2019-07-18 22:35:08              1.0   \n",
+            "4         2.0  2019-07-19 14:54:29   2019-07-19 15:19:08              1.0   \n",
+            "..        ...                  ...                   ...              ...   \n",
+            "195       2.0  2019-01-18 08:42:15   2019-01-18 08:56:57              1.0   \n",
+            "196       1.0  2019-01-19 04:34:45   2019-01-19 04:43:44              1.0   \n",
+            "197       2.0  2019-01-05 10:37:39   2019-01-05 10:42:03              1.0   \n",
+            "198       2.0  2019-01-23 10:36:29   2019-01-23 10:44:34              2.0   \n",
+            "199       2.0  2019-01-30 06:55:58   2019-01-30 07:07:02              5.0   \n",
+            "\n",
+            "     trip_distance  RatecodeID store_and_fwd_flag  PULocationID  DOLocationID  \\\n",
+            "0             2.02         1.0                  N           186           233   \n",
+            "1             1.59         1.0                  N           141           161   \n",
+            "2             1.69         1.0                  N           246           249   \n",
+            "3             0.90         1.0                  N           229           141   \n",
+            "4             4.79         1.0                  N           237           107   \n",
+            "..             ...         ...                ...           ...           ...   \n",
+            "195           1.18         1.0                  N            43           237   \n",
+            "196           2.30         1.0                  N           148           234   \n",
+            "197           0.83         1.0                  N           237           263   \n",
+            "198           1.12         1.0                  N           144           113   \n",
+            "199           2.41         1.0                  N           209           107   \n",
+            "\n",
+            "     payment_type  fare_amount  extra  mta_tax  tip_amount  tolls_amount  \\\n",
+            "0             1.0         12.0    1.0      0.5        4.08           0.0   \n",
+            "1             2.0         10.0    0.5      0.5        0.00           0.0   \n",
+            "2             2.0          8.5    0.0      0.5        0.00           0.0   \n",
+            "3             1.0          4.5    3.0      0.5        1.65           0.0   \n",
+            "4             1.0         19.5    0.0      0.5        5.70           0.0   \n",
+            "..            ...          ...    ...      ...         ...           ...   \n",
+            "195           1.0         10.0    0.0      0.5        2.16           0.0   \n",
+            "196           1.0          9.5    0.5      0.5        2.15           0.0   \n",
+            "197           1.0          5.0    0.0      0.5        1.16           0.0   \n",
+            "198           2.0          7.0    0.0      0.5        0.00           0.0   \n",
+            "199           1.0         10.5    0.0      0.5        1.00           0.0   \n",
+            "\n",
+            "     improvement_surcharge  total_amount  congestion_surcharge  \n",
+            "0                      0.3         20.38                   2.5  \n",
+            "1                      0.3         13.80                   2.5  \n",
+            "2                      0.3         11.80                   2.5  \n",
+            "3                      0.3          9.95                   2.5  \n",
+            "4                      0.3         28.50                   2.5  \n",
+            "..                     ...           ...                   ...  \n",
+            "195                    0.3         12.96                   0.0  \n",
+            "196                    0.3         12.95                   0.0  \n",
+            "197                    0.3          6.96                   0.0  \n",
+            "198                    0.3          7.80                   0.0  \n",
+            "199                    0.3         12.30                   0.0  \n",
+            "\n",
+            "[200 rows x 18 columns]\n"
+          ]
+        }
+      ],
+      "metadata": {}
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Use the cells below to do your own Exploratory Data Analysis"
+      ],
+      "metadata": {}
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "source": [],
+      "outputs": [],
+      "metadata": {}
+    }
+  ],
+  "metadata": {
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3.9.7 64-bit ('venv': venv)"
+    },
+    "language_info": {
+      "mimetype": "text/x-python",
+      "name": "python",
+      "pygments_lexer": "ipython3",
+      "codemirror_mode": {
+        "name": "ipython",
+        "version": 3
+      },
+      "version": "3.9.7",
+      "nbconvert_exporter": "python",
+      "file_extension": ".py"
+    },
+    "name": "04-nyc-taxi-join-weather-in-pandas",
+    "notebookId": 1709144033725344,
+    "interpreter": {
+      "hash": "6b9b57232c4b57163d057191678da2030059e733b8becc68f245de5a75abe84e"
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 2
+}
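
The empty cell at the end of the notebook above is left for the reader's own exploration. One possible starting point, sketched here under stated assumptions, is to compare the mean tip per month. The ten inline rows are copied from the printed output above (five from July, five from January), the variable names are hypothetical, and ten rows are far too few to support any conclusion about seasonal tipping. The card-only filter assumes, based on the TLC data dictionary as I understand it, that `payment_type` 1 means credit card and that cash tips are not recorded in `tip_amount`.

```python
import pandas as pd

# Ten rows copied from the printed taxi.csv output in this commit
# (an assumption made so the sketch runs without the data file).
df = pd.DataFrame({
    "tpep_pickup_datetime": [
        "2019-07-15 16:27:53", "2019-07-17 20:26:35", "2019-07-06 16:01:08",
        "2019-07-18 22:32:23", "2019-07-19 14:54:29", "2019-01-18 08:42:15",
        "2019-01-19 04:34:45", "2019-01-05 10:37:39", "2019-01-23 10:36:29",
        "2019-01-30 06:55:58",
    ],
    "payment_type": [1.0, 2.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0],
    "tip_amount": [4.08, 0.00, 0.00, 1.65, 5.70, 2.16, 2.15, 1.16, 0.00, 1.00],
})

# Parse the pickup timestamp and derive the calendar month (1 = January, 7 = July).
df["month"] = pd.to_datetime(df["tpep_pickup_datetime"]).dt.month

# Mean tip per month over all trips.
mean_tip_by_month = df.groupby("month")["tip_amount"].mean()

# Mean tip per month over card payments only (assumed: payment_type 1 = credit card).
card_only = df[df["payment_type"] == 1]
mean_card_tip_by_month = card_only.groupby("month")["tip_amount"].mean()

print(mean_tip_by_month)
print(mean_card_tip_by_month)
```

Comparing the two series shows how much the zero-tip rows, all of which have a `payment_type` other than 1 in these ten rows, pull the monthly mean down. That is one example of an influence other than season that the assignment asks about.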
diff --git a/4-Data-Science-Lifecycle/15-analyzing/assignment.md b/4-Data-Science-Lifecycle/15-analyzing/assignment.md
@@ -1,16 +1,22 @@
-# Analyzing for answers
+# Exploring for answers
 
-This continues the process of the lifecycle
+This is a continuation of the previous lesson's [assignment](../14-Introduction/assignment.md), where we briefly took a look at the dataset. Now we will be taking a deeper look at the data.
 
-They want to know: **Do yellow taxi passengers in New York City tip drivers more in the winter or summer?**
+Again, the client wants to know: **Do yellow taxi passengers in New York City tip drivers more in the winter or summer?**
 
-Your team is in the [Analyzing](Readme.md) stage of the Data Science Lifecycle.. You have been provided a notebook and data from Azure Open Datasets to explore.  For summer you choose June, July, and August and for winter you choose January, February, and December.
+Your team is in the [Analyzing](README.md) stage of the Data Science Lifecycle, where you are responsible for doing exploratory data analysis (EDA) on the dataset. You have been provided a notebook and a dataset that contains 200 taxi transactions from January and July 2019.
 
 ## Instructions
 
-In this directory is a [notebook](notebook.ipynb) that uses Python to load 6 months of yellow taxi trip data from the [NYC Taxi & Limousine Commission](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-taxi-yellow?tabs=azureml-opendatasets) and Integrated Surface Data from NOAA. These datasets have been joined together in a Pandas dataframe.
+In this directory are a [notebook](notebook.ipynb) and data from the [Taxi & Limousine Commission](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-taxi-yellow?tabs=azureml-opendatasets). Refer to the [dataset's dictionary](https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf) and [user guide](https://www1.nyc.gov/assets/tlc/downloads/pdf/trip_record_user_guide.pdf) for more information about the data.
+
+
+Use some of the techniques in this lesson to do your own EDA in the notebook (add cells if you'd like) and answer the following questions:
+
+- What other influences in the data could affect the tip amount?
+- What columns will most likely not be needed to answer the client's questions?
+- Based on what has been provided so far, does the data seem to provide any evidence of seasonal tipping behavior?
 
-Your task is to ___
 
 ## Rubric