Deploying to gh-pages from @ 297b0ef 🚀
kdpsingh committed Jan 31, 2024
0 parents commit b68503e
Showing 28 changed files with 1,179 additions and 0 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -0,0 +1,2 @@
/.vscode/
/data/
1 change: 1 addition & 0 deletions LICENSE
@@ -0,0 +1 @@
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc-nd/4.0/.
16 changes: 16 additions & 0 deletions Project.toml
@@ -0,0 +1,16 @@
authors = ["Karandeep Singh", "Christoph Scheuch"]

[deps]
CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b"
FileIO = "5789e2e9-d7fb-5bc7-8068-2c6fae9b9549"
PlutoUI = "7f904dfe-b85e-4ff6-b463-dae2292396a8"
Tidier = "f0413319-3358-4bb0-8e7c-0c83523a93bd"
ZipFile = "a5390f91-8eb1-5f08-bee0-b1d1ffed6cea"

[compat]
CSV = "0.10.12"
FileIO = "1.16.2"
PlutoUI = "0.7.55"
Tidier = "1.2.1"
ZipFile = "0.10.1"
julia = "1.9, 1.10"
9 changes: 9 additions & 0 deletions README.md
@@ -0,0 +1,9 @@
# Tidier Course

<img src="https://raw.githubusercontent.com/TidierOrg/.github/main/profile/TidierOrg_logo.png" align="left" style="padding-right:10px;" width="150"/>

Welcome to the **Tidier Course**, an interactive course designed to introduce you to Julia and the Tidier.jl ecosystem for data analysis. The course consists of a series of Jupyter Notebooks so that you can both learn and practice how to write Julia code through real data science examples.

This course assumes a basic level of familiarity with programming but does not assume any prior knowledge of Julia. This course emphasizes the parts of Julia required to read in, explore, and analyze data. Because this course is primarily oriented around data science, many important aspects of Julia will *not* be covered in this course.

This course is currently under construction. Check back for updated content.
16 changes: 16 additions & 0 deletions data-pipelines.html
@@ -0,0 +1,16 @@
<!DOCTYPE html><html lang="en"><head><meta name="viewport" content="width=device-width"><meta charset="utf-8"><meta property='og:type' content='article'>

<meta name="pluto-insertion-spot-meta">
<meta name="theme-color" media="(prefers-color-scheme: light)" content="white"><meta name="theme-color" media="(prefers-color-scheme: dark)" content="#2a2928"><meta name="color-scheme" content="light dark"><link rel="icon" type="image/png" sizes="16x16" href="https://cdn.jsdelivr.net/gh/fonsp/[email protected]/frontend-dist/favicon-16x16.347d2855.png" integrity="sha384-3qsGeVLdddzV9oIkj3PhXXQX2CZCjOD/CiyrPQOX6InOWw3HAHClrsQhPfX9uRAj" crossorigin="anonymous"><link rel="icon" type="image/png" sizes="32x32" href="https://cdn.jsdelivr.net/gh/fonsp/[email protected]/frontend-dist/favicon-32x32.8789add4.png" integrity="sha384-cOe5vSoBIgKNgkUL27p9RpsGVY0uBg9PejLccDy+fR8ZD1Iv5dF1MGHjIZAIZwm6" crossorigin="anonymous"><link rel="icon" type="image/png" sizes="96x96" href="https://cdn.jsdelivr.net/gh/fonsp/[email protected]/frontend-dist/favicon-96x96.48689391.png" integrity="sha384-TN49cYb8GyNmrZT14bsYXXo4l1x1NJeJ/EHuVAauAKsNPopPHLojijs9jFT4Vs8c" crossorigin="anonymous"><link rel="pluto-logo-big" href="https://cdn.jsdelivr.net/gh/fonsp/[email protected]/frontend-dist/logo.004c1d7c.svg" integrity="sha384-GkQkODcGxsrSRJCkeakBXihum0GUM44cwBgKyutDimectXCbCgj6Vu3jlrueqEcN" crossorigin="anonymous"><link rel="pluto-logo-small" href="https://cdn.jsdelivr.net/gh/fonsp/[email protected]/frontend-dist/favicon_unsaturated.d1387b25.svg" integrity="sha384-omwjH+Qy3hpAVf5FYd/pkaDBuVAfsEDRN7eBxEA8Ek00OAWP+aiV+GpEYk3I7lyo" crossorigin="anonymous"><script type="module" src="https://cdn.jsdelivr.net/gh/fonsp/[email protected]/frontend-dist/editor.0c13a924.js" integrity="sha384-q8cO+lFO46JfgnUPjKIx0Wq2bi7VUEL3IC7rlX48twUcwXr23VYfIwsIa/V7Ok+N" crossorigin="anonymous"></script><link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/fonsp/[email protected]/frontend-dist/juliamono.c6034ab4.css" integrity="sha384-n0za6lUXlyf4XC+nGkZWj3TLDnRbNpAcoi4PZGSlQMPoyqGa9kGY+ZXkUgZGIhQt" crossorigin="anonymous"><link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/fonsp/[email 
protected]/frontend-dist/editor.13c75024.css" integrity="sha384-+8lBWsI9ovzoUK6PJMu3Yzbi00lePOHmVdwcP81oC6/dFbzDUo0UccEZdK82demH" crossorigin="anonymous"><link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/fonsp/[email protected]/frontend-dist/vollkorn.089565a8.css" integrity="sha384-jnV/84VtSgBLF70H+s2rxJcOUZIMDR+X/ElFZA83i9ZtZSWiIMFAgPyrWkOJV08q" crossorigin="anonymous"><script defer="">console.log("Pluto.jl, by Fons van der Plas (https://github.com/fonsp), Mikołaj Bochenski (https://github.com/malyvsen), Michiel Dral (https://github.com/dralletje) and friends 🌈");</script><script src="https://cdn.jsdelivr.net/gh/fonsp/[email protected]/frontend-dist/editor.b8733d72.js" defer="" integrity="sha384-84yPd6AGZ/1IUiaBlssipmMKMFz9WGFQ+u8vYZ9cWicH6bZm7ZOej+kLDXnIIAQJ" crossorigin="anonymous"></script><script src="https://cdn.jsdelivr.net/gh/fonsp/[email protected]/frontend-dist/editor.9f9dc874.js" defer="" integrity="sha384-tkFo1EK72I9JvoTmHFa199dfRzW8mkXPUkHb/N7UhYI+bxKzX3Kh8LNCZz1ltsFF" crossorigin="anonymous"></script><script src="https://cdn.jsdelivr.net/gh/fonsp/[email protected]/frontend-dist/editor.90ede145.js" defer="" integrity="sha384-CuNU9gQg6fa/yynNqNWjHWzPm4nj+d7O6+HXsNGSqClhs/bYQIbBC3Lw/kh8Ukui" crossorigin="anonymous"></script><script src="https://cdn.jsdelivr.net/gh/fonsp/[email protected]/frontend-dist/editor.dbeed08a.js" defer="" integrity="sha384-1BEdQwXfZi4ZpsNV8w1X8pQcVK1/DS/+/M8OTo3gol7mdEspSN7nT6llX57NQCSt" crossorigin="anonymous"></script><script id="iframe-resizer-content-window-script" src="https://cdn.jsdelivr.net/gh/fonsp/[email protected]/frontend-dist/editor.6386bd9d.js" crossorigin="anonymous" defer="" integrity="sha384-tgN2a0VDi/lCYwZuDqT7L+A/Y/9kpxf3HV7zv2BJ5Fu7zW0EClq0nM4crfK3TRPs"></script><link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/fonsp/[email protected]/frontend-dist/editor.a3d93540.css" type="text/css" integrity="sha384-B0KWNr8jxl8wBW7XZTO8KORIjwjXk75K5O5WsVTqbgCkx3q+9ihdMnGU8DzNtCRG" crossorigin="anonymous"><link 
rel="stylesheet" href="https://cdn.jsdelivr.net/gh/fonsp/[email protected]/frontend-dist/editor.52bd66ba.css" type="text/css" media="all" data-pluto-file="hide-ui" integrity="sha384-mZn6RuXF1UXCTqkld9/QJshMPUFGT/EBEcr0lZfUV7TULrxk0fZqe+YHXMk+6Qb0" crossorigin="anonymous"><link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/fonsp/[email protected]/frontend-dist/editor.ec3a6a5b.css" type="text/css" integrity="sha384-SuGFZkuBuG+lmfz6RbnvjtcyIh8W1xDYi1sebwn7bl9VMQnhmr6EniSmIdcHJ55l" crossorigin="anonymous"><link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/fonsp/[email protected]/frontend-dist/editor.1f4cf2ca.css" type="text/css" integrity="sha384-lBSBsn8FT1UzGOsNVudfV8RSHQEuNWqrCb6xQnF10uvF9AiCzYsCRXvKlhtQvV3c" crossorigin="anonymous"><link rel="preload" href="https://cdn.jsdelivr.net/gh/fonsp/[email protected]/frontend-dist/juliamono.c6034ab4.css" as="style" integrity="sha384-n0za6lUXlyf4XC+nGkZWj3TLDnRbNpAcoi4PZGSlQMPoyqGa9kGY+ZXkUgZGIhQt" crossorigin="anonymous"><link rel="preload" href="https://cdn.jsdelivr.net/gh/fonsp/[email protected]/frontend-dist/vollkorn.089565a8.css" as="style" integrity="sha384-jnV/84VtSgBLF70H+s2rxJcOUZIMDR+X/ElFZA83i9ZtZSWiIMFAgPyrWkOJV08q" crossorigin="anonymous"><link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/fonsp/[email protected]/frontend-dist/editor.e82e08bd.css" type="text/css" integrity="sha384-7YN+h8b6N4N65qk8TG/J2KPF95D8z3sGNd06rokz4CX9oWu0KnRAF5cVWu3BkkaN" crossorigin="anonymous"><script data-pluto-file="launch-parameters">
window.pluto_notebook_id = undefined;
window.pluto_isolated_cell_ids = undefined;
window.pluto_notebookfile = "data-pipelines.jl";
window.pluto_disable_ui = true;
window.pluto_slider_server_url = undefined;
window.pluto_binder_url = "https://mybinder.org/v2/gh/fonsp/pluto-on-binder/v0.19.37";
window.pluto_statefile = "data-pipelines.plutostate";
window.pluto_preamble_html = undefined;
</script>

<meta name="pluto-insertion-spot-parameters">
<script src="https://cdn.jsdelivr.net/gh/fonsp/[email protected]/frontend-dist/editor.0a83627a.js" type="module" defer="" integrity="sha384-mIjQ/cXCjt6jrtkr2oANlRjS55e9oHhJbJuPAHD9ETGhIkkantDLBKdp2Lfi4vDv" crossorigin="anonymous"></script><script src="https://cdn.jsdelivr.net/gh/fonsp/[email protected]/frontend-dist/editor.8a3292da.js" integrity="sha384-itp4oE2PRbSrrTHVpWh8sqAuVUsz7ja6L2Dgp/JRfMCD2AwVdTk56K96POF3oLmu" crossorigin="anonymous"></script><script type="text/javascript" id="MathJax-script" integrity="sha384-4kE/rQ11E8xT9QgrCBTyvenkuPfQo8rXYQvJZuMgxyPOoUfpatjQPlgdv6V5yhUK" crossorigin="" not-the-src-yet="https://cdn.jsdelivr.net/npm/[email protected]/es5/tex-svg-full.js" async=""></script></head><body class="loading no-MαθJax"> <div style="display:flex;min-height:100vh;"> <pluto-editor class="fullscreen"></pluto-editor> </div> </body></html>
142 changes: 142 additions & 0 deletions data-pipelines.jl
@@ -0,0 +1,142 @@
### A Pluto.jl notebook ###
# v0.19.36

using Markdown
using InteractiveUtils

# ╔═╡ d6823989-bb85-400d-87ec-2a365260f5fb
# ╠═╡ show_logs = false
using Pkg; Pkg.activate(".."); Pkg.instantiate()

# ╔═╡ 51e24e5e-cfc7-4b02-978c-505e21e6df43
using PlutoUI: TableOfContents

# ╔═╡ 2eec5998-bb36-11ee-2283-67ea47c4f5ed
md"""
# Tidier Course: Data Pipelines
"""

# ╔═╡ a4baabcd-d425-449e-b7bb-f8b776582330
html"""<img src="https://raw.githubusercontent.com/TidierOrg/.github/main/profile/TidierOrg_logo.png" align="left" style="padding-right:10px;" width="150"/>"""

# ╔═╡ c38f82c2-def3-4d1c-bda0-54e779e2583a
md"""
## The Structured Query Language (SQL)
Let's rewind to our benchmarks for data aggregation tasks: [https://duckdblabs.github.io/db-benchmark/](https://duckdblabs.github.io/db-benchmark/).
"""

# ╔═╡ 99aa11d7-09f8-4ebb-9166-e248fc5af44f
html"""<img src="https://raw.githubusercontent.com/TidierOrg/TidierCourse/main/why-julia/duckdb_benchmark.jpeg" style="width:50%"/>"""

# ╔═╡ 78d16051-d5d9-4f9c-9316-ce4ddee39dce
md"""
DuckDB and ClickHouse were two of the fastest tools, and while both are implemented in C++, their primary user interface is SQL. SQL is the *lingua franca* of databases, so as a data scientist it is important to understand its syntax, which is both the source of its popularity and its primary limitation.
Let's say we have a dataset called `patients`, which has columns `diagnosis`, `takes_medications`, and `age`. Each row represents a unique patient, `diagnosis` is the primary diagnosis, `takes_medications` is a string indicating whether a patient takes any medications ("yes") or not ("no"), and `age` is their current age.
To compare the mean age among patients with diabetes who take medications versus those who do not take medications, we would write the following in SQL:
```sql
SELECT takes_medications, AVG(age) AS mean_age
FROM patients
WHERE diagnosis = 'diabetes'
GROUP BY takes_medications;
```
The SQL syntax is fairly intuitive in that each verb (e.g., `SELECT`) has a clear purpose, and the full query itself reads a bit like a sentence that you could read aloud. However, hidden within this apparent simplicity is the fact that SQL queries don't actually run in the order in which they are written.
The *actual* order in which this query runs is:
1. `FROM patients`
2. `WHERE diagnosis = 'diabetes'`
3. `GROUP BY takes_medications`
4. `SELECT takes_medications, AVG(age) AS mean_age`
If you think about this, this makes sense. You first need to start with the dataset (`FROM patients`), then you need to limit the dataset to only those rows where the primary diagnosis is diabetes (`WHERE diagnosis = 'diabetes'`). Then, after grouping by whether or not a patient takes medications, we need to calculate the mean age for each group.
The key lesson with SQL is:
> The order in which you write the verbs in SQL is different from the order in which the verbs are processed by SQL.
Much has been written about this issue (see: [https://jvns.ca/blog/2019/10/03/sql-queries-don-t-start-with-select/](https://jvns.ca/blog/2019/10/03/sql-queries-don-t-start-with-select/) and [https://www.flerlagetwins.com/2018/10/sql-part4.html](https://www.flerlagetwins.com/2018/10/sql-part4.html)).
In case you're curious, this is a more complete comparison of how SQL queries are written vs. how they are processed by SQL.
| What You Write in SQL | Order In Which It Runs |
| ----------------------|------------------------|
| SELECT | FROM |
| DISTINCT | JOIN |
| TOP | WHERE |
| [AGGREGATION] | GROUP BY |
| FROM | [AGGREGATION] |
| JOIN | HAVING |
| WHERE | SELECT |
| GROUP BY | DISTINCT |
| HAVING | ORDER BY |
| ORDER BY | TOP / LIMIT |
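
The four-step processing order listed earlier for the `patients` query can be sketched in plain Julia. This is only an illustration of the logical order, not how a database engine is implemented; the toy rows and the variable names (`all_rows`, `diabetic_rows`, `ages_by_group`, `result`) are made up for this example.

```julia
using Statistics: mean

# Hypothetical toy rows standing in for the `patients` table.
patients = [
    (diagnosis = "diabetes", takes_medications = "yes", age = 60),
    (diagnosis = "diabetes", takes_medications = "yes", age = 70),
    (diagnosis = "diabetes", takes_medications = "no",  age = 50),
    (diagnosis = "asthma",   takes_medications = "no",  age = 30),
]

# 1. FROM patients: start from the full table.
all_rows = patients

# 2. WHERE diagnosis = 'diabetes': keep only the matching rows.
diabetic_rows = filter(p -> p.diagnosis == "diabetes", all_rows)

# 3. GROUP BY takes_medications: collect the ages under each group key.
ages_by_group = Dict{String, Vector{Int}}()
for p in diabetic_rows
    push!(get!(ages_by_group, p.takes_medications, Int[]), p.age)
end

# 4. SELECT takes_medications, AVG(age) AS mean_age: aggregate each group.
result = [(takes_medications = k, mean_age = mean(v)) for (k, v) in ages_by_group]
```

With these toy rows, the "yes" group has a mean age of 65.0 and the "no" group has a mean age of 50.0. Note that the `SELECT` step comes last, even though it is the first thing written in the SQL query.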
"""

# ╔═╡ 2686a0fb-15e1-44d8-9565-1abdee13ec5b
md"""
## Why not run SQL queries in the same order they are written?
While the fact that SQL queries form sentences that can be read aloud is convenient, this convenience comes at a cost. When queries get more complicated, they can no longer be read aloud, and the order of operations becomes much harder to keep track of. For more complex queries, it actually becomes cognitively less demanding to keep track of queries that are run in the same order that they are written.
This is the idea behind `PRQL` ([https://github.com/PRQL/prql](https://github.com/PRQL/prql)), which calls itself a "simple, powerful, pipelined, SQL replacement."
This same query in PRQL would be written as:
```
from patients
filter diagnosis == "diabetes"
group {takes_medications} (
  aggregate {mean_age = average age}
)
```
The fact that the analytic steps are written in the same order as they are performed seems trivial, but this is the big idea behind data pipelines. A data pipeline starts with a dataset, and each function transforms the data in a specific way until the end result answers an analytical question.
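
To make the idea of "each function transforms the data" concrete, here is a minimal sketch of a pipeline using Julia's built-in `|>` operator. The helper functions (`keep_diabetes`, `group_ages`, `mean_age_by_group`) and the toy rows are hypothetical names invented for this illustration.

```julia
using Statistics: mean

# Hypothetical toy rows standing in for the `patients` dataset.
patients = [
    (diagnosis = "diabetes", takes_medications = "yes", age = 60),
    (diagnosis = "diabetes", takes_medications = "yes", age = 70),
    (diagnosis = "diabetes", takes_medications = "no",  age = 50),
    (diagnosis = "asthma",   takes_medications = "no",  age = 30),
]

# Each step takes the data from the previous step and returns transformed data.
keep_diabetes(rows) = filter(p -> p.diagnosis == "diabetes", rows)

function group_ages(rows)
    groups = Dict{String, Vector{Int}}()
    for p in rows
        push!(get!(groups, p.takes_medications, Int[]), p.age)
    end
    return groups
end

mean_age_by_group(groups) = Dict(k => mean(v) for (k, v) in groups)

# The pipeline is written in the same order in which it runs.
result = patients |>
    keep_diabetes |>
    group_ages |>
    mean_age_by_group
```

Reading the last expression from top to bottom gives the same sequence as the PRQL query: start with the dataset, filter, group, then aggregate.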
"""

# ╔═╡ 4fde78bb-3dc5-4849-ad24-29804a49740c
md"""
## Modern data pipelines
Data pipelines were popularized by the `dplyr` and `ggplot2` R packages, which are two of the core packages that make up the `tidyverse` ecosystem in R. In fact, the `dplyr` R package was a key inspiration behind `PRQL` (see [https://prql-lang.org/faq/](https://prql-lang.org/faq/)). While `PRQL` brings the idea of data pipelines to a `SQL`-like syntax, modern data pipelines are much more expansive in their capabilities.
While all data pipelines *start* with a dataset, they don't need to *end* with a dataset. Modern data pipelines often end with plots (as in `ggplot2` in R), statistical analyses, machine learning models, and more. These more advanced types of data pipelines are where SQL-like languages (like PRQL) show their limitations. While great for transforming data, SQL-like languages do not have facilities for plotting and machine learning.
Data pipelines implemented in a programming language like Python, R, or Julia are thus much more capable than in PRQL.
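
As a rough sketch of what this looks like in Julia, the example below writes the same analysis with Tidier.jl. It assumes that `using Tidier` makes `DataFrame`, `@chain`, `@filter`, `@group_by`, `@summarize`, and `@pull` available, and it uses a made-up toy dataset. The first pipeline ends with a dataset, while the second ends with a single number rather than a dataset.

```julia
using Tidier              # assumed to provide DataFrame, @chain, and the macros below
using Statistics: mean

# Hypothetical toy data standing in for the `patients` dataset.
patients = DataFrame(
    diagnosis = ["diabetes", "diabetes", "diabetes", "asthma"],
    takes_medications = ["yes", "yes", "no", "no"],
    age = [60, 70, 50, 30],
)

# A pipeline that ends with a dataset: one row per medication group.
mean_age_table = @chain patients begin
    @filter(diagnosis == "diabetes")
    @group_by(takes_medications)
    @summarize(mean_age = mean(age))
end

# A pipeline that ends with a single number instead of a dataset.
overall_mean_age = @chain patients begin
    @filter(diagnosis == "diabetes")
    @pull(age)    # extract the `age` column as a plain vector
    mean(_)       # `_` stands for the result of the previous step
end
```

Because the last step of a pipeline can be any Julia function, the same pattern extends to steps that produce plots or fit models, which is not possible in a query language alone.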
"""

# ╔═╡ 6a08598c-69bf-498c-9ac2-4e0a4b749598
md"""
## Summary
- The Structured Query Language (SQL) is a popular way of working with datasets
- SQL's simple-to-read syntax introduces complexity because the order in which SQL queries are written is different from the order in which SQL queries are run
- PRQL is a SQL-like language that implements data pipelines
- Data pipelines refer to data analysis pathways that start with a dataset and then sequentially transform the dataset
- While data pipelines start with a dataset, modern data pipelines can end with plots, statistical analyses, and machine learning models
"""

# ╔═╡ 831bad3f-0e43-4226-a75c-7a7c4c569e53
md"""
# Appendix
"""

# ╔═╡ 0ddc3de7-c4a8-44c7-8cd3-4d63de3334c7
TableOfContents()

# ╔═╡ Cell order:
# ╟─2eec5998-bb36-11ee-2283-67ea47c4f5ed
# ╟─a4baabcd-d425-449e-b7bb-f8b776582330
# ╟─c38f82c2-def3-4d1c-bda0-54e779e2583a
# ╟─99aa11d7-09f8-4ebb-9166-e248fc5af44f
# ╟─78d16051-d5d9-4f9c-9316-ce4ddee39dce
# ╟─2686a0fb-15e1-44d8-9565-1abdee13ec5b
# ╟─4fde78bb-3dc5-4849-ad24-29804a49740c
# ╟─6a08598c-69bf-498c-9ac2-4e0a4b749598
# ╟─831bad3f-0e43-4226-a75c-7a7c4c569e53
# ╠═d6823989-bb85-400d-87ec-2a365260f5fb
# ╠═51e24e5e-cfc7-4b02-978c-505e21e6df43
# ╠═0ddc3de7-c4a8-44c7-8cd3-4d63de3334c7
Binary file added data-pipelines.plutostate
Binary file not shown.
18 changes: 18 additions & 0 deletions header.html
@@ -0,0 +1,18 @@
<h1>Tidier Course</h1>
<img src="https://raw.githubusercontent.com/TidierOrg/.github/main/profile/TidierOrg_logo.png" align="left" style="padding-right:10px;" width="150"/>
<button onclick="window.location='index.html'">Home</button>
<button onclick="window.location='why-julia.html'">Why Julia</button>
<button onclick="window.location='what-is-tidier.html'">What is Tidier.jl</button>

<style>
pluto-cell {
display: flex;
flex-direction: column;
}
pluto-cell pluto-output {
order: 1;
}
pluto-cell pluto-runarea {
bottom: -17px;
}
</style>
Binary file added images/duckdb_benchmark.jpeg