diff --git a/CHANGELOG.md b/CHANGELOG.md
index bfefa57d2..a15174f7f 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -2,6 +2,8 @@
## 0.10.7dev
+* [Feature] Add Spark Connection as a dialect for JupySQL ([#965](https://github.com/ploomber/jupysql/issues/965)) (by [@gilandose](https://github.com/gilandose))
+
## 0.10.6 (2023-12-21)
* [Fix] Fix error when `%sql` includes a query with negative numbers ([#958](https://github.com/ploomber/jupysql/issues/958))
diff --git a/doc/_toc.yml b/doc/_toc.yml
index 2d9850b10..f667e807b 100644
--- a/doc/_toc.yml
+++ b/doc/_toc.yml
@@ -43,6 +43,7 @@ parts:
- file: integrations/duckdb-native
- file: integrations/compatibility
- file: integrations/chdb
+ - file: integrations/spark
- caption: API Reference
chapters:
diff --git a/doc/api/configuration.md b/doc/api/configuration.md
index e2bb114a5..254ea712a 100644
--- a/doc/api/configuration.md
+++ b/doc/api/configuration.md
@@ -234,6 +234,26 @@ value enables the ones from previous values plus new ones:
- `2`: All feedback
- Footer to distinguish pandas/polars data frames from JupySQL's result sets
+## `lazy_execution`
+
+```{versionadded} 0.10.7
+This option only works when connecting to Spark.
+```
+
+Default: `False`
+
+Return a lazy relation to the dataset (a Spark DataFrame) rather than executing the query through JupySQL.
+
+```{code-cell} ipython3
+%config SqlMagic.lazy_execution = True
+df = %sql SELECT * FROM languages
+```
+
+```{code-cell} ipython3
+%config SqlMagic.lazy_execution = False
+res = %sql SELECT * FROM languages
+```
+
## `named_parameters`
```{versionadded} 0.9
diff --git a/doc/conf.py b/doc/conf.py
index 5e1792154..39080d77d 100644
--- a/doc/conf.py
+++ b/doc/conf.py
@@ -27,6 +27,7 @@
"integrations/oracle.ipynb",
"integrations/snowflake.ipynb",
"integrations/redshift.ipynb",
+ "integrations/spark.ipynb",
]
nb_execution_in_temp = True
nb_execution_show_tb = True
diff --git a/doc/integrations/compatibility.md b/doc/integrations/compatibility.md
index 4e6b36432..d59760a98 100644
--- a/doc/integrations/compatibility.md
+++ b/doc/integrations/compatibility.md
@@ -114,4 +114,20 @@ These table reflects the compatibility status of JupySQL `>=0.7`
- Listing tables with `%sqlcmd tables` ✅
- Listing columns with `%sqlcmd columns` ✅
- Parametrized SQL queries via `{{parameter}}` ✅
-- Interactive SQL queries via `--interact` ✅
\ No newline at end of file
+- Interactive SQL queries via `--interact` ✅
+
+## Spark
+
+- Running queries with `%%sql` ✅
+- CTEs with `%%sql --save NAME` ✅
+- Plotting with `%%sqlplot boxplot` ❓
+- Plotting with `%%sqlplot bar` ✅
+- Plotting with `%%sqlplot pie` ✅
+- Plotting with `%%sqlplot histogram` ✅
+- Plotting with `ggplot` ✅
+- Profiling tables with `%sqlcmd profile` ✅
+- Listing tables with `%sqlcmd tables` ❌
+- Listing columns with `%sqlcmd columns` ❌
+- Parametrized SQL queries via `{{parameter}}` ✅
+- Interactive SQL queries via `--interact` ✅
+- Persisting DataFrames via `--persist` ✅
\ No newline at end of file
diff --git a/doc/integrations/spark.ipynb b/doc/integrations/spark.ipynb
new file mode 100644
index 000000000..4f150500d
--- /dev/null
+++ b/doc/integrations/spark.ipynb
@@ -0,0 +1,1399 @@
+{
+ "cells": [
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Spark\n",
+ "\n",
+ "This tutorial shows you how to get a Spark instance up and running locally and integrate it with JupySQL. You can run it in a Jupyter notebook. We'll use [Spark Connect](https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_connect.html), the new thin client for Spark."
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Pre-requisites\n",
+ "\n",
+ "To run this tutorial, you need to install the following Python packages:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Note: you may need to restart the kernel to use updated packages.\n"
+ ]
+ }
+ ],
+ "source": [
+ "%pip install jupysql pyspark==3.4.1 arrow pyarrow==12.0.1 pandas grpcio-status --quiet"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Start Spark instance\n",
+ "\n",
+ "We fetch the `wh1isper/sparglim-server` image and start a Spark Connect server in a container (this will take a few seconds)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "12f699ee8e8e35ab10186f3c39024a7e443691bb4213e56ca3c2e90cd80daf1b\n"
+ ]
+ }
+ ],
+ "source": [
+ "%%bash\n",
+ "docker run -p 15002:15002 -p 4040:4040 -d --name spark wh1isper/sparglim-server"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Our Spark instance is running, let's load some data!"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Load sample data\n",
+ "\n",
+ "Now, let's fetch some sample data. We'll be using the [NYC taxi dataset](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "from pyspark.sql.connect.session import SparkSession\n",
+ "\n",
+ "spark = SparkSession.builder.remote(\"sc://localhost\").getOrCreate()\n",
+ "\n",
+ "df = pd.read_parquet(\n",
+ " \"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet\"\n",
+ ")\n",
+ "sparkDf = spark.createDataFrame(df.head(10000))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Turn on [eagerEval](https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_df.html#Viewing-Data) so that Spark prints DataFrames eagerly in notebook environments, rather than relying on its default lazy execution, which requires `.show()` to see the data. In Spark 3.4.1 we need to override `_repr_pretty_`, as shown below; in 3.5.0, DataFrames print as HTML."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def __pretty_(self, p, cycle):\n",
+ " self.show(truncate=False)\n",
+ "\n",
+ "\n",
+ "from pyspark.sql.connect.dataframe import DataFrame\n",
+ "\n",
+ "DataFrame._repr_pretty_ = __pretty_\n",
+ "spark.conf.set(\"spark.sql.repl.eagerEval.enabled\", True)"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Add dataset to temporary view to allow querying:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "sparkDf.createOrReplaceTempView(\"taxi\")"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Query\n",
+ "\n",
+ "Now, let's start JupySQL, authenticate, and query the data!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "The sql extension is already loaded. To reload it, use:\n",
+ " %reload_ext sql\n"
+ ]
+ }
+ ],
+ "source": [
+ "%load_ext sql"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "%sql spark"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "```{important}\n",
+ "If the cell above fails, you might have some missing packages. Message us on [Slack](https://ploomber.io/community) and we'll help you!\n",
+ "```"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "List the views in the `default` database:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "Running query in 'SparkSession'"
+ ],
+ "text/plain": [
+ "Running query in 'SparkSession'"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/html": [
+ "<table>\n",
+ "    <thead>\n",
+ "        <tr>\n",
+ "            <th>namespace</th>\n",
+ "            <th>viewName</th>\n",
+ "            <th>isTemporary</th>\n",
+ "        </tr>\n",
+ "    </thead>\n",
+ "    <tbody>\n",
+ "        <tr>\n",
+ "            <td></td>\n",
+ "            <td>taxi</td>\n",
+ "            <td>True</td>\n",
+ "        </tr>\n",
+ "    </tbody>\n",
+ "</table>"
+ ],
+ "text/plain": [
+ "+-----------+----------+-------------+\n",
+ "| namespace | viewName | isTemporary |\n",
+ "+-----------+----------+-------------+\n",
+ "| | taxi | True |\n",
+ "+-----------+----------+-------------+"
+ ]
+ },
+ "execution_count": 22,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "%sql show views in default"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can turn on `lazy_execution` to avoid executing the Spark plan and return a Spark DataFrame instead:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 23,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%config SqlMagic.lazy_execution = True"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "Running query in 'SparkSession'"
+ ],
+ "text/plain": [
+ "Running query in 'SparkSession'"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "+---------+--------+-----------+\n",
+ "|namespace|viewName|isTemporary|\n",
+ "+---------+--------+-----------+\n",
+ "| |taxi |true |\n",
+ "+---------+--------+-----------+\n",
+ "\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": []
+ },
+ "execution_count": 8,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "%sql show views in default"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 24,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%config SqlMagic.lazy_execution = False"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "List columns in the taxi table:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 25,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "Running query in 'SparkSession'"
+ ],
+ "text/plain": [
+ "Running query in 'SparkSession'"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "root\n",
+ " |-- VendorID: long (nullable = true)\n",
+ " |-- tpep_pickup_datetime: timestamp (nullable = true)\n",
+ " |-- tpep_dropoff_datetime: timestamp (nullable = true)\n",
+ " |-- passenger_count: double (nullable = true)\n",
+ " |-- trip_distance: double (nullable = true)\n",
+ " |-- RatecodeID: double (nullable = true)\n",
+ " |-- store_and_fwd_flag: string (nullable = true)\n",
+ " |-- PULocationID: long (nullable = true)\n",
+ " |-- DOLocationID: long (nullable = true)\n",
+ " |-- payment_type: long (nullable = true)\n",
+ " |-- fare_amount: double (nullable = true)\n",
+ " |-- extra: double (nullable = true)\n",
+ " |-- mta_tax: double (nullable = true)\n",
+ " |-- tip_amount: double (nullable = true)\n",
+ " |-- tolls_amount: double (nullable = true)\n",
+ " |-- improvement_surcharge: double (nullable = true)\n",
+ " |-- total_amount: double (nullable = true)\n",
+ " |-- congestion_surcharge: double (nullable = true)\n",
+ " |-- airport_fee: double (nullable = true)\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "df = %sql select * from taxi\n",
+ "df.sqlaproxy.dataframe.printSchema()"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Query our data:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 26,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "Running query in 'SparkSession'"
+ ],
+ "text/plain": [
+ "Running query in 'SparkSession'"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/html": [
+ "<table>\n",
+ "    <thead>\n",
+ "        <tr>\n",
+ "            <th>count(1)</th>\n",
+ "        </tr>\n",
+ "    </thead>\n",
+ "    <tbody>\n",
+ "        <tr>\n",
+ "            <td>10000</td>\n",
+ "        </tr>\n",
+ "    </tbody>\n",
+ "</table>"
+ ],
+ "text/plain": [
+ "+----------+\n",
+ "| count(1) |\n",
+ "+----------+\n",
+ "| 10000 |\n",
+ "+----------+"
+ ]
+ },
+ "execution_count": 26,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "%%sql\n",
+ "SELECT COUNT(*) FROM taxi"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Parameterize queries"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 27,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "threshold = 10"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 28,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "Running query in 'SparkSession'"
+ ],
+ "text/plain": [
+ "Running query in 'SparkSession'"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/html": [
+ "<table>\n",
+ "    <thead>\n",
+ "        <tr>\n",
+ "            <th>count(1)</th>\n",
+ "        </tr>\n",
+ "    </thead>\n",
+ "    <tbody>\n",
+ "        <tr>\n",
+ "            <td>9476</td>\n",
+ "        </tr>\n",
+ "    </tbody>\n",
+ "</table>"
+ ],
+ "text/plain": [
+ "+----------+\n",
+ "| count(1) |\n",
+ "+----------+\n",
+ "| 9476 |\n",
+ "+----------+"
+ ]
+ },
+ "execution_count": 28,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "%%sql\n",
+ "SELECT COUNT(*) FROM taxi\n",
+ "WHERE trip_distance < {{threshold}}"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 29,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "threshold = 0.5"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 30,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "Running query in 'SparkSession'"
+ ],
+ "text/plain": [
+ "Running query in 'SparkSession'"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/html": [
+ "<table>\n",
+ "    <thead>\n",
+ "        <tr>\n",
+ "            <th>count(1)</th>\n",
+ "        </tr>\n",
+ "    </thead>\n",
+ "    <tbody>\n",
+ "        <tr>\n",
+ "            <td>642</td>\n",
+ "        </tr>\n",
+ "    </tbody>\n",
+ "</table>"
+ ],
+ "text/plain": [
+ "+----------+\n",
+ "| count(1) |\n",
+ "+----------+\n",
+ "| 642 |\n",
+ "+----------+"
+ ]
+ },
+ "execution_count": 30,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "%%sql\n",
+ "SELECT COUNT(*) FROM taxi\n",
+ "WHERE trip_distance < {{threshold}}"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## CTEs"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 31,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "Running query in 'SparkSession'"
+ ],
+ "text/plain": [
+ "Running query in 'SparkSession'"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/html": [
+ "Skipping execution..."
+ ],
+ "text/plain": [
+ "Skipping execution..."
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "%%sql --save many_passengers --no-execute\n",
+ "SELECT *\n",
+ "FROM taxi\n",
+ "WHERE passenger_count > 3\n",
+ "-- remove top 1% outliers for better visualization\n",
+ "AND trip_distance < 18.93"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 32,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "Running query in 'SparkSession'"
+ ],
+ "text/plain": [
+ "Running query in 'SparkSession'"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/html": [
+ "<table>\n",
+ "    <thead>\n",
+ "        <tr>\n",
+ "            <th>min(trip_distance)</th>\n",
+ "            <th>avg(trip_distance)</th>\n",
+ "            <th>max(trip_distance)</th>\n",
+ "        </tr>\n",
+ "    </thead>\n",
+ "    <tbody>\n",
+ "        <tr>\n",
+ "            <td>0.0</td>\n",
+ "            <td>3.1091381872213963</td>\n",
+ "            <td>18.46</td>\n",
+ "        </tr>\n",
+ "    </tbody>\n",
+ "</table>"
+ ],
+ "text/plain": [
+ "+--------------------+--------------------+--------------------+\n",
+ "| min(trip_distance) | avg(trip_distance) | max(trip_distance) |\n",
+ "+--------------------+--------------------+--------------------+\n",
+ "| 0.0 | 3.1091381872213963 | 18.46 |\n",
+ "+--------------------+--------------------+--------------------+"
+ ]
+ },
+ "execution_count": 32,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "%%sql --save trip_stats --with many_passengers\n",
+ "SELECT MIN(trip_distance), AVG(trip_distance), MAX(trip_distance)\n",
+ "FROM many_passengers"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "This is what JupySQL executes:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 33,
+ "metadata": {
+ "scrolled": true,
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "WITH `many_passengers` AS (\n",
+ "SELECT *\n",
+ "FROM taxi\n",
+ "WHERE passenger_count > 3\n",
+ "\n",
+ "AND trip_distance < 18.93)\n",
+ "SELECT MIN(trip_distance), AVG(trip_distance), MAX(trip_distance)\n",
+ "FROM many_passengers\n"
+ ]
+ }
+ ],
+ "source": [
+ "query = %sqlcmd snippets trip_stats\n",
+ "print(query)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Profiling"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 34,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ " Following statistics are not available in\n",
+ " SparkSession: STD, 25%, 50%, 75%<table>\n",
+ "    <thead>\n",
+ "        <tr>\n",
+ "            <th> </th>\n",
+ "            <th>VendorID</th>\n",
+ "            <th>tpep_pickup_datetime</th>\n",
+ "            <th>tpep_dropoff_datetime</th>\n",
+ "            <th>passenger_count</th>\n",
+ "            <th>trip_distance</th>\n",
+ "            <th>RatecodeID</th>\n",
+ "            <th>store_and_fwd_flag</th>\n",
+ "            <th>PULocationID</th>\n",
+ "            <th>DOLocationID</th>\n",
+ "            <th>payment_type</th>\n",
+ "            <th>fare_amount</th>\n",
+ "            <th>extra</th>\n",
+ "            <th>mta_tax</th>\n",
+ "            <th>tip_amount</th>\n",
+ "            <th>tolls_amount</th>\n",
+ "            <th>improvement_surcharge</th>\n",
+ "            <th>total_amount</th>\n",
+ "            <th>congestion_surcharge</th>\n",
+ "            <th>airport_fee</th>\n",
+ "        </tr>\n",
+ "    </thead>\n",
+ "    <tbody>\n",
+ "        <tr>\n",
+ "            <td>count</td>\n",
+ "            <td>10000</td>\n",
+ "            <td>10000</td>\n",
+ "            <td>10000</td>\n",
+ "            <td>10000</td>\n",
+ "            <td>10000</td>\n",
+ "            <td>10000</td>\n",
+ "            <td>10000</td>\n",
+ "            <td>10000</td>\n",
+ "            <td>10000</td>\n",
+ "            <td>10000</td>\n",
+ "            <td>10000</td>\n",
+ "            <td>10000</td>\n",
+ "            <td>10000</td>\n",
+ "            <td>10000</td>\n",
+ "            <td>10000</td>\n",
+ "            <td>10000</td>\n",
+ "            <td>10000</td>\n",
+ "            <td>10000</td>\n",
+ "            <td>0</td>\n",
+ "        </tr>\n",
+ "        <tr>\n",
+ "            <td>unique</td>\n",
+ "            <td>2</td>\n",
+ "            <td>8766</td>\n",
+ "            <td>8745</td>\n",
+ "            <td>7</td>\n",
+ "            <td>1243</td>\n",
+ "            <td>6</td>\n",
+ "            <td>2</td>\n",
+ "            <td>173</td>\n",
+ "            <td>230</td>\n",
+ "            <td>4</td>\n",
+ "            <td>228</td>\n",
+ "            <td>8</td>\n",
+ "            <td>3</td>\n",
+ "            <td>504</td>\n",
+ "            <td>18</td>\n",
+ "            <td>3</td>\n",
+ "            <td>959</td>\n",
+ "            <td>3</td>\n",
+ "            <td>0</td>\n",
+ "        </tr>\n",
+ "        <tr>\n",
+ "            <td>top</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>2021-01-01 00:41:19</td>\n",
+ "            <td>2021-01-02 00:00:00</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>N</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>None</td>\n",
+ "        </tr>\n",
+ "        <tr>\n",
+ "            <td>freq</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>4</td>\n",
+ "            <td>7</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>9808</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>0</td>\n",
+ "        </tr>\n",
+ "        <tr>\n",
+ "            <td>mean</td>\n",
+ "            <td>1.6901</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>1.5080</td>\n",
+ "            <td>3.1002</td>\n",
+ "            <td>1.0712</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>158.5551</td>\n",
+ "            <td>154.7296</td>\n",
+ "            <td>1.3819</td>\n",
+ "            <td>11.8822</td>\n",
+ "            <td>0.8259</td>\n",
+ "            <td>0.4864</td>\n",
+ "            <td>1.7846</td>\n",
+ "            <td>0.2246</td>\n",
+ "            <td>0.2945</td>\n",
+ "            <td>16.9696</td>\n",
+ "            <td>2.1063</td>\n",
+ "            <td>nan</td>\n",
+ "        </tr>\n",
+ "        <tr>\n",
+ "            <td>std</td>\n",
+ "            <td>0.4625</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>1.1354</td>\n",
+ "            <td>3.5970</td>\n",
+ "            <td>1.0755</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>70.9288</td>\n",
+ "            <td>75.2504</td>\n",
+ "            <td>0.5552</td>\n",
+ "            <td>10.8420</td>\n",
+ "            <td>1.1167</td>\n",
+ "            <td>0.1041</td>\n",
+ "            <td>2.4351</td>\n",
+ "            <td>1.2730</td>\n",
+ "            <td>0.0570</td>\n",
+ "            <td>12.5023</td>\n",
+ "            <td>0.9562</td>\n",
+ "            <td>nan</td>\n",
+ "        </tr>\n",
+ "        <tr>\n",
+ "            <td>min</td>\n",
+ "            <td>1</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>0.0</td>\n",
+ "            <td>0.0</td>\n",
+ "            <td>1.0</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>1</td>\n",
+ "            <td>1</td>\n",
+ "            <td>1</td>\n",
+ "            <td>-100.0</td>\n",
+ "            <td>-0.5</td>\n",
+ "            <td>-0.5</td>\n",
+ "            <td>-1.07</td>\n",
+ "            <td>-6.12</td>\n",
+ "            <td>-0.3</td>\n",
+ "            <td>-100.3</td>\n",
+ "            <td>-2.5</td>\n",
+ "            <td>nan</td>\n",
+ "        </tr>\n",
+ "        <tr>\n",
+ "            <td>25%</td>\n",
+ "            <td>1.0000</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>1.0000</td>\n",
+ "            <td>1.0400</td>\n",
+ "            <td>1.0000</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>100.0000</td>\n",
+ "            <td>83.0000</td>\n",
+ "            <td>1.0000</td>\n",
+ "            <td>6.0000</td>\n",
+ "            <td>0.0000</td>\n",
+ "            <td>0.5000</td>\n",
+ "            <td>0.0000</td>\n",
+ "            <td>0.0000</td>\n",
+ "            <td>0.3000</td>\n",
+ "            <td>10.3000</td>\n",
+ "            <td>2.5000</td>\n",
+ "            <td>nan</td>\n",
+ "        </tr>\n",
+ "        <tr>\n",
+ "            <td>50%</td>\n",
+ "            <td>2.0000</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>1.0000</td>\n",
+ "            <td>1.9300</td>\n",
+ "            <td>1.0000</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>152.0000</td>\n",
+ "            <td>151.0000</td>\n",
+ "            <td>1.0000</td>\n",
+ "            <td>8.5000</td>\n",
+ "            <td>0.5000</td>\n",
+ "            <td>0.5000</td>\n",
+ "            <td>1.5400</td>\n",
+ "            <td>0.0000</td>\n",
+ "            <td>0.3000</td>\n",
+ "            <td>13.5500</td>\n",
+ "            <td>2.5000</td>\n",
+ "            <td>nan</td>\n",
+ "        </tr>\n",
+ "        <tr>\n",
+ "            <td>75%</td>\n",
+ "            <td>2.0000</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>2.0000</td>\n",
+ "            <td>3.6000</td>\n",
+ "            <td>1.0000</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>234.0000</td>\n",
+ "            <td>234.0000</td>\n",
+ "            <td>2.0000</td>\n",
+ "            <td>13.5000</td>\n",
+ "            <td>2.5000</td>\n",
+ "            <td>0.5000</td>\n",
+ "            <td>2.6500</td>\n",
+ "            <td>0.0000</td>\n",
+ "            <td>0.3000</td>\n",
+ "            <td>19.3000</td>\n",
+ "            <td>2.5000</td>\n",
+ "            <td>nan</td>\n",
+ "        </tr>\n",
+ "        <tr>\n",
+ "            <td>max</td>\n",
+ "            <td>2</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>6.0</td>\n",
+ "            <td>45.92</td>\n",
+ "            <td>99.0</td>\n",
+ "            <td>nan</td>\n",
+ "            <td>265</td>\n",
+ "            <td>265</td>\n",
+ "            <td>4</td>\n",
+ "            <td>121.0</td>\n",
+ "            <td>3.5</td>\n",
+ "            <td>0.5</td>\n",
+ "            <td>80.0</td>\n",
+ "            <td>25.5</td>\n",
+ "            <td>0.3</td>\n",
+ "            <td>137.76</td>\n",
+ "            <td>2.5</td>\n",
+ "            <td>nan</td>\n",
+ "        </tr>\n",
+ "    </tbody>\n",
+ "</table>"
+ ],
+ "text/plain": [
+ "+--------+----------+----------------------+-----------------------+-----------------+---------------+------------+--------------------+--------------+--------------+--------------+-------------+--------+---------+------------+--------------+-----------------------+--------------+----------------------+-------------+\n",
+ "| | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | airport_fee |\n",
+ "+--------+----------+----------------------+-----------------------+-----------------+---------------+------------+--------------------+--------------+--------------+--------------+-------------+--------+---------+------------+--------------+-----------------------+--------------+----------------------+-------------+\n",
+ "| count | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 0 |\n",
+ "| unique | 2 | 8766 | 8745 | 7 | 1243 | 6 | 2 | 173 | 230 | 4 | 228 | 8 | 3 | 504 | 18 | 3 | 959 | 3 | 0 |\n",
+ "| top | nan | 2021-01-01 00:41:19 | 2021-01-02 00:00:00 | nan | nan | nan | N | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | None |\n",
+ "| freq | nan | 4 | 7 | nan | nan | nan | 9808 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 |\n",
+ "| mean | 1.6901 | nan | nan | 1.5080 | 3.1002 | 1.0712 | nan | 158.5551 | 154.7296 | 1.3819 | 11.8822 | 0.8259 | 0.4864 | 1.7846 | 0.2246 | 0.2945 | 16.9696 | 2.1063 | nan |\n",
+ "| std | 0.4625 | nan | nan | 1.1354 | 3.5970 | 1.0755 | nan | 70.9288 | 75.2504 | 0.5552 | 10.8420 | 1.1167 | 0.1041 | 2.4351 | 1.2730 | 0.0570 | 12.5023 | 0.9562 | nan |\n",
+ "| min | 1 | nan | nan | 0.0 | 0.0 | 1.0 | nan | 1 | 1 | 1 | -100.0 | -0.5 | -0.5 | -1.07 | -6.12 | -0.3 | -100.3 | -2.5 | nan |\n",
+ "| 25% | 1.0000 | nan | nan | 1.0000 | 1.0400 | 1.0000 | nan | 100.0000 | 83.0000 | 1.0000 | 6.0000 | 0.0000 | 0.5000 | 0.0000 | 0.0000 | 0.3000 | 10.3000 | 2.5000 | nan |\n",
+ "| 50% | 2.0000 | nan | nan | 1.0000 | 1.9300 | 1.0000 | nan | 152.0000 | 151.0000 | 1.0000 | 8.5000 | 0.5000 | 0.5000 | 1.5400 | 0.0000 | 0.3000 | 13.5500 | 2.5000 | nan |\n",
+ "| 75% | 2.0000 | nan | nan | 2.0000 | 3.6000 | 1.0000 | nan | 234.0000 | 234.0000 | 2.0000 | 13.5000 | 2.5000 | 0.5000 | 2.6500 | 0.0000 | 0.3000 | 19.3000 | 2.5000 | nan |\n",
+ "| max | 2 | nan | nan | 6.0 | 45.92 | 99.0 | nan | 265 | 265 | 4 | 121.0 | 3.5 | 0.5 | 80.0 | 25.5 | 0.3 | 137.76 | 2.5 | nan |\n",
+ "+--------+----------+----------------------+-----------------------+-----------------+---------------+------------+--------------------+--------------+--------------+--------------+-------------+--------+---------+------------+--------------+-----------------------+--------------+----------------------+-------------+"
+ ]
+ },
+ "execution_count": 34,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "%sqlcmd profile -t taxi"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Plotting"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 35,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "execution_count": 35,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "",
+ "text/plain": [
+ "