Deploying to gh-pages from @ 297b0ef 🚀

TidierOrg · Jan 31, 2024 · b68503e · b68503e
commit b68503e

Showing 28 changed files with 1,179 additions and 0 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,2 @@
+/.vscode/
+/data/
diff --git a/LICENSE b/LICENSE
@@ -0,0 +1 @@
+This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc-nd/4.0/.
diff --git a/Project.toml b/Project.toml
@@ -0,0 +1,16 @@
+authors = ["Karandeep Singh", "Christoph Scheuch"]
+
+[deps]
+CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b"
+FileIO = "5789e2e9-d7fb-5bc7-8068-2c6fae9b9549"
+PlutoUI = "7f904dfe-b85e-4ff6-b463-dae2292396a8"
+Tidier = "f0413319-3358-4bb0-8e7c-0c83523a93bd"
+ZipFile = "a5390f91-8eb1-5f08-bee0-b1d1ffed6cea"
+
+[compat]
+CSV = "0.10.12"
+FileIO = "1.16.2"
+PlutoUI = "0.7.55"
+Tidier = "1.2.1"
+ZipFile = "0.10.1"
+julia = "1.9, 1.10"
diff --git a/README.md b/README.md
@@ -0,0 +1,9 @@
+# Tidier Course
+
+<img src="https://raw.githubusercontent.com/TidierOrg/.github/main/profile/TidierOrg_logo.png" align="left" style="padding-right:10px;" width="150"/>
+
+Welcome to the **Tidier Course**, an interactive course designed to introduce you to Julia and the Tidier.jl ecosystem for data analysis. The course consists of a series of Jupyter Notebooks so that you can both learn and practice how to write Julia code through real data science examples.
+
+This course assumes a basic level of familiarity with programming but does not assume any prior knowledge of Julia. This course emphasizes the parts of Julia required to read in, explore, and analyze data. Because this course is primarily oriented around data science, many important aspects of Julia will *not* be covered in this course.
+
+This course is currently under construction. Check back for updated content.
diff --git a/data-pipelines.html b/data-pipelines.html
@@ -0,0 +1,16 @@
+<!DOCTYPE html><html lang="en"><head><meta name="viewport" content="width=device-width"><meta charset="utf-8"><meta property='og:type' content='article'>
+
+<meta name="pluto-insertion-spot-meta">
+<meta name="theme-color" media="(prefers-color-scheme: light)" content="white"><meta name="theme-color" media="(prefers-color-scheme: dark)" content="#2a2928"><meta name="color-scheme" content="light dark"><link rel="icon" type="image/png" sizes="16x16" href="https://cdn.jsdelivr.net/gh/fonsp/[email protected]/frontend-dist/favicon-16x16.347d2855.png" integrity="sha384-3qsGeVLdddzV9oIkj3PhXXQX2CZCjOD/CiyrPQOX6InOWw3HAHClrsQhPfX9uRAj" crossorigin="anonymous"><link rel="icon" type="image/png" sizes="32x32" href="https://cdn.jsdelivr.net/gh/fonsp/[email protected]/frontend-dist/favicon-32x32.8789add4.png" integrity="sha384-cOe5vSoBIgKNgkUL27p9RpsGVY0uBg9PejLccDy+fR8ZD1Iv5dF1MGHjIZAIZwm6" crossorigin="anonymous"><link rel="icon" type="image/png" sizes="96x96" href="https://cdn.jsdelivr.net/gh/fonsp/[email protected]/frontend-dist/favicon-96x96.48689391.png" integrity="sha384-TN49cYb8GyNmrZT14bsYXXo4l1x1NJeJ/EHuVAauAKsNPopPHLojijs9jFT4Vs8c" crossorigin="anonymous"><link rel="pluto-logo-big" href="https://cdn.jsdelivr.net/gh/fonsp/[email protected]/frontend-dist/logo.004c1d7c.svg" integrity="sha384-GkQkODcGxsrSRJCkeakBXihum0GUM44cwBgKyutDimectXCbCgj6Vu3jlrueqEcN" crossorigin="anonymous"><link rel="pluto-logo-small" href="https://cdn.jsdelivr.net/gh/fonsp/[email protected]/frontend-dist/favicon_unsaturated.d1387b25.svg" integrity="sha384-omwjH+Qy3hpAVf5FYd/pkaDBuVAfsEDRN7eBxEA8Ek00OAWP+aiV+GpEYk3I7lyo" crossorigin="anonymous"><script type="module" src="https://cdn.jsdelivr.net/gh/fonsp/[email protected]/frontend-dist/editor.0c13a924.js" integrity="sha384-q8cO+lFO46JfgnUPjKIx0Wq2bi7VUEL3IC7rlX48twUcwXr23VYfIwsIa/V7Ok+N" crossorigin="anonymous"></script><link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/fonsp/[email protected]/frontend-dist/juliamono.c6034ab4.css" integrity="sha384-n0za6lUXlyf4XC+nGkZWj3TLDnRbNpAcoi4PZGSlQMPoyqGa9kGY+ZXkUgZGIhQt" crossorigin="anonymous"><link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/fonsp/[email 
protected]/frontend-dist/editor.13c75024.css" integrity="sha384-+8lBWsI9ovzoUK6PJMu3Yzbi00lePOHmVdwcP81oC6/dFbzDUo0UccEZdK82demH" crossorigin="anonymous"><link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/fonsp/[email protected]/frontend-dist/vollkorn.089565a8.css" integrity="sha384-jnV/84VtSgBLF70H+s2rxJcOUZIMDR+X/ElFZA83i9ZtZSWiIMFAgPyrWkOJV08q" crossorigin="anonymous"><script defer="">console.log("Pluto.jl, by Fons van der Plas (https://github.com/fonsp), Mikołaj Bochenski (https://github.com/malyvsen), Michiel Dral (https://github.com/dralletje) and friends 🌈");</script><script src="https://cdn.jsdelivr.net/gh/fonsp/[email protected]/frontend-dist/editor.b8733d72.js" defer="" integrity="sha384-84yPd6AGZ/1IUiaBlssipmMKMFz9WGFQ+u8vYZ9cWicH6bZm7ZOej+kLDXnIIAQJ" crossorigin="anonymous"></script><script src="https://cdn.jsdelivr.net/gh/fonsp/[email protected]/frontend-dist/editor.9f9dc874.js" defer="" integrity="sha384-tkFo1EK72I9JvoTmHFa199dfRzW8mkXPUkHb/N7UhYI+bxKzX3Kh8LNCZz1ltsFF" crossorigin="anonymous"></script><script src="https://cdn.jsdelivr.net/gh/fonsp/[email protected]/frontend-dist/editor.90ede145.js" defer="" integrity="sha384-CuNU9gQg6fa/yynNqNWjHWzPm4nj+d7O6+HXsNGSqClhs/bYQIbBC3Lw/kh8Ukui" crossorigin="anonymous"></script><script src="https://cdn.jsdelivr.net/gh/fonsp/[email protected]/frontend-dist/editor.dbeed08a.js" defer="" integrity="sha384-1BEdQwXfZi4ZpsNV8w1X8pQcVK1/DS/+/M8OTo3gol7mdEspSN7nT6llX57NQCSt" crossorigin="anonymous"></script><script id="iframe-resizer-content-window-script" src="https://cdn.jsdelivr.net/gh/fonsp/[email protected]/frontend-dist/editor.6386bd9d.js" crossorigin="anonymous" defer="" integrity="sha384-tgN2a0VDi/lCYwZuDqT7L+A/Y/9kpxf3HV7zv2BJ5Fu7zW0EClq0nM4crfK3TRPs"></script><link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/fonsp/[email protected]/frontend-dist/editor.a3d93540.css" type="text/css" integrity="sha384-B0KWNr8jxl8wBW7XZTO8KORIjwjXk75K5O5WsVTqbgCkx3q+9ihdMnGU8DzNtCRG" crossorigin="anonymous"><link 
rel="stylesheet" href="https://cdn.jsdelivr.net/gh/fonsp/[email protected]/frontend-dist/editor.52bd66ba.css" type="text/css" media="all" data-pluto-file="hide-ui" integrity="sha384-mZn6RuXF1UXCTqkld9/QJshMPUFGT/EBEcr0lZfUV7TULrxk0fZqe+YHXMk+6Qb0" crossorigin="anonymous"><link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/fonsp/[email protected]/frontend-dist/editor.ec3a6a5b.css" type="text/css" integrity="sha384-SuGFZkuBuG+lmfz6RbnvjtcyIh8W1xDYi1sebwn7bl9VMQnhmr6EniSmIdcHJ55l" crossorigin="anonymous"><link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/fonsp/[email protected]/frontend-dist/editor.1f4cf2ca.css" type="text/css" integrity="sha384-lBSBsn8FT1UzGOsNVudfV8RSHQEuNWqrCb6xQnF10uvF9AiCzYsCRXvKlhtQvV3c" crossorigin="anonymous"><link rel="preload" href="https://cdn.jsdelivr.net/gh/fonsp/[email protected]/frontend-dist/juliamono.c6034ab4.css" as="style" integrity="sha384-n0za6lUXlyf4XC+nGkZWj3TLDnRbNpAcoi4PZGSlQMPoyqGa9kGY+ZXkUgZGIhQt" crossorigin="anonymous"><link rel="preload" href="https://cdn.jsdelivr.net/gh/fonsp/[email protected]/frontend-dist/vollkorn.089565a8.css" as="style" integrity="sha384-jnV/84VtSgBLF70H+s2rxJcOUZIMDR+X/ElFZA83i9ZtZSWiIMFAgPyrWkOJV08q" crossorigin="anonymous"><link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/fonsp/[email protected]/frontend-dist/editor.e82e08bd.css" type="text/css" integrity="sha384-7YN+h8b6N4N65qk8TG/J2KPF95D8z3sGNd06rokz4CX9oWu0KnRAF5cVWu3BkkaN" crossorigin="anonymous"><script data-pluto-file="launch-parameters">
+window.pluto_notebook_id = undefined;
+window.pluto_isolated_cell_ids = undefined;
+window.pluto_notebookfile = "data-pipelines.jl";
+window.pluto_disable_ui = true;
+window.pluto_slider_server_url = undefined;
+window.pluto_binder_url = "https://mybinder.org/v2/gh/fonsp/pluto-on-binder/v0.19.37";
+window.pluto_statefile = "data-pipelines.plutostate";
+window.pluto_preamble_html = undefined;
+</script>
+
+<meta name="pluto-insertion-spot-parameters">
+<script src="https://cdn.jsdelivr.net/gh/fonsp/[email protected]/frontend-dist/editor.0a83627a.js" type="module" defer="" integrity="sha384-mIjQ/cXCjt6jrtkr2oANlRjS55e9oHhJbJuPAHD9ETGhIkkantDLBKdp2Lfi4vDv" crossorigin="anonymous"></script><script src="https://cdn.jsdelivr.net/gh/fonsp/[email protected]/frontend-dist/editor.8a3292da.js" integrity="sha384-itp4oE2PRbSrrTHVpWh8sqAuVUsz7ja6L2Dgp/JRfMCD2AwVdTk56K96POF3oLmu" crossorigin="anonymous"></script><script type="text/javascript" id="MathJax-script" integrity="sha384-4kE/rQ11E8xT9QgrCBTyvenkuPfQo8rXYQvJZuMgxyPOoUfpatjQPlgdv6V5yhUK" crossorigin="" not-the-src-yet="https://cdn.jsdelivr.net/npm/[email protected]/es5/tex-svg-full.js" async=""></script></head><body class="loading no-MαθJax"> <div style="display:flex;min-height:100vh;"> <pluto-editor class="fullscreen"></pluto-editor> </div> </body></html>
diff --git a/data-pipelines.jl b/data-pipelines.jl
@@ -0,0 +1,142 @@
+### A Pluto.jl notebook ###
+# v0.19.36
+
+using Markdown
+using InteractiveUtils
+
+# ╔═╡ d6823989-bb85-400d-87ec-2a365260f5fb
+# ╠═╡ show_logs = false
+using Pkg; Pkg.activate(".."); Pkg.instantiate()
+
+# ╔═╡ 51e24e5e-cfc7-4b02-978c-505e21e6df43
+using PlutoUI: TableOfContents
+
+# ╔═╡ 2eec5998-bb36-11ee-2283-67ea47c4f5ed
+md"""
+# Tidier Course: Data Pipelines
+"""
+
+# ╔═╡ a4baabcd-d425-449e-b7bb-f8b776582330
+html"""<img src="https://raw.githubusercontent.com/TidierOrg/.github/main/profile/TidierOrg_logo.png" align="left" style="padding-right:10px;" width="150"/>"""
+
+# ╔═╡ c38f82c2-def3-4d1c-bda0-54e779e2583a
+md"""
+## The Structured Query Language (SQL)
+
+Let's rewind to our benchmarks for data aggregation tasks: [https://duckdblabs.github.io/db-benchmark/](https://duckdblabs.github.io/db-benchmark/).
+"""
+
+# ╔═╡ 99aa11d7-09f8-4ebb-9166-e248fc5af44f
+html"""<img src="https://raw.githubusercontent.com/TidierOrg/TidierCourse/main/why-julia/duckdb_benchmark.jpeg" style="width:50%"/>"""
+
+# ╔═╡ 78d16051-d5d9-4f9c-9316-ce4ddee39dce
+md"""
+DuckDB and ClickHouse were two of the fastest tools, and while both are implemented in C++, their primary interface to users is SQL. SQL is the *lingua franca* of databases, and as a data scientist it is important to understand its syntax, which is the source of both its popularity and its primary limitation.
+
+Let's say we have a dataset called `patients`, which has columns `diagnosis`, `takes_medications`, and `age`. Each row represents a unique patient, `diagnosis` is the primary diagnosis, `takes_medications` is a string indicating whether a patient takes any medications ("yes") or not ("no"), and `age` is their current age.
+
+To compare the mean age among patients with diabetes who take medications versus those who do not take medications, we would write the following in SQL:
+
+```sql
+SELECT takes_medications, AVG(age) AS mean_age
+FROM patients
+WHERE diagnosis = 'diabetes'
+GROUP BY takes_medications;
+```
+
+The SQL syntax is fairly intuitive in that each verb (e.g., `SELECT`) has a clear purpose, and the full query itself reads a bit like a sentence that you could read aloud. However, hidden within this apparent simplicity is the fact that SQL queries don't actually run in the order in which they are written.
+
+The *actual* order in which this query runs is:
+
+1. `FROM patients`
+2. `WHERE diagnosis = 'diabetes'`
+3. `GROUP BY takes_medications`
+4. `SELECT takes_medications, AVG(age) AS mean_age`
+
+If you think about this, this makes sense. You first need to start with the dataset (`FROM patients`), then you need to limit the dataset to only those rows where the primary diagnosis is diabetes (`WHERE diagnosis = 'diabetes'`). Then, after grouping by whether or not a patient takes medications, we need to calculate the mean age for each group.
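+
+To make this ordering concrete, here is a minimal sketch in plain Julia (no packages) that carries out the four steps one at a time. The toy `patients` data and the variable names `from_rows`, `where_rows`, `groups`, and `result` are made up for illustration. A real database engine is free to optimize how it executes a query, so treat this as a model of the logical order rather than a description of what any particular engine does.
+
+```julia
+# Hypothetical toy data: one NamedTuple per patient.
+patients = [
+    (diagnosis = "diabetes", takes_medications = "yes", age = 60),
+    (diagnosis = "diabetes", takes_medications = "no",  age = 50),
+    (diagnosis = "asthma",   takes_medications = "yes", age = 30),
+    (diagnosis = "diabetes", takes_medications = "yes", age = 70),
+]
+
+# 1. FROM patients
+from_rows = patients
+
+# 2. WHERE diagnosis = 'diabetes'
+where_rows = filter(r -> r.diagnosis == "diabetes", from_rows)
+
+# 3. GROUP BY takes_medications
+#    Collect the ages of each group under its "yes"/"no" key.
+groups = Dict{String, Vector{Int}}()
+for r in where_rows
+    push!(get!(groups, r.takes_medications, Int[]), r.age)
+end
+
+# 4. SELECT takes_medications, AVG(age) AS mean_age
+result = [
+    (takes_medications = key, mean_age = sum(ages) / length(ages))
+    for (key, ages) in groups
+]
+```
+
+With this toy data, the "yes" group has a mean age of 65.0 and the "no" group has a mean age of 50.0. The `SELECT` step can only run last because it needs the groups produced by the step before it.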
+
+The key lesson with SQL is:
+
+> The order in which you write the verbs in SQL is different from the order in which the verbs are processed by SQL.
+
+Much has been written about this issue (see: [https://jvns.ca/blog/2019/10/03/sql-queries-don-t-start-with-select/](https://jvns.ca/blog/2019/10/03/sql-queries-don-t-start-with-select/) and [https://www.flerlagetwins.com/2018/10/sql-part4.html](https://www.flerlagetwins.com/2018/10/sql-part4.html)).
+
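+In case you're curious, here is a more complete comparison of the order in which SQL clauses are written versus the order in which they are processed. Read each column from top to bottom; the rows do not pair up clause by clause.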
+
+| What You Write in SQL | Order In Which It Runs |
+| ----------------------|------------------------|
+| SELECT                | FROM                   |
+| DISTINCT              | JOIN                   |
+| TOP                   | WHERE                  |
+| [AGGREGATION]         | GROUP BY               |
+| FROM                  | [AGGREGATION]          |
+| JOIN                  | HAVING                 |
+| WHERE                 | SELECT                 |
+| GROUP BY              | DISTINCT               |
+| HAVING                | ORDER BY               |
+| ORDER BY              | TOP / LIMIT            | 
+"""
+
+# ╔═╡ 2686a0fb-15e1-44d8-9565-1abdee13ec5b
+md"""
+## Why not run SQL queries in the same order they are written?
+
+While the fact that SQL queries form sentences that can be read aloud is convenient, this convenience comes at a cost. When queries get more complicated, they can no longer be read aloud, and the order of operations becomes much harder to keep track of. For more complex queries, it actually becomes cognitively less demanding to keep track of queries that are run in the same order that they are written.
+
+This is the idea behind `PRQL` ([https://github.com/PRQL/prql](https://github.com/PRQL/prql)), which calls itself a "simple, powerful, pipelined SQL replacement."
+
+This same query in PRQL would be written as:
+
+```
+# Steps are written in the same order in which they run
+from patients
+filter diagnosis == "diabetes"
+group {takes_medications} (
+  aggregate {mean_age = average age}
+)
+```
+
+The fact that the analytic steps are written in the same order as they are performed seems trivial, but this is the big idea behind data pipelines. A data pipeline starts with a dataset, and each function transforms the data in a specific way until the end result answers an analytical question.
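+
+As a rough illustration of this idea in plain Julia, the sketch below chains three small functions with the pipe operator `|>`. The helper names (`keep_diabetes`, `group_ages`, `mean_ages`) and the toy data are hypothetical and exist only for this example; they are not part of Tidier.jl or any other package.
+
+```julia
+# Hypothetical toy data: one NamedTuple per patient.
+patients = [
+    (diagnosis = "diabetes", takes_medications = "yes", age = 60),
+    (diagnosis = "diabetes", takes_medications = "no",  age = 50),
+    (diagnosis = "asthma",   takes_medications = "yes", age = 30),
+    (diagnosis = "diabetes", takes_medications = "yes", age = 70),
+]
+
+# Step 1: keep only the rows with a diabetes diagnosis.
+keep_diabetes(rows) = filter(r -> r.diagnosis == "diabetes", rows)
+
+# Step 2: collect the ages of each medication group.
+function group_ages(rows)
+    groups = Dict{String, Vector{Int}}()
+    for r in rows
+        push!(get!(groups, r.takes_medications, Int[]), r.age)
+    end
+    return groups
+end
+
+# Step 3: reduce each group to its mean age.
+mean_ages(groups) = Dict(key => sum(ages) / length(ages) for (key, ages) in groups)
+
+# The pipeline: start with the dataset, then apply each step in order.
+result = patients |>
+    keep_diabetes |>
+    group_ages |>
+    mean_ages
+```
+
+Each function receives the output of the previous step, so the code reads from top to bottom in the same order that it runs.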
+"""
+
+# ╔═╡ 4fde78bb-3dc5-4849-ad24-29804a49740c
+md"""
+## Modern data pipelines
+
+Data pipelines were popularized by the `dplyr` and `ggplot2` R packages, which are two of the core packages that make up the `tidyverse` ecosystem in R. In fact, the `dplyr` R package was a key inspiration behind `PRQL` (see [https://prql-lang.org/faq/](https://prql-lang.org/faq/)). While `PRQL` brings the idea of data pipelines to a `SQL`-like syntax, modern data pipelines are much more expansive in their capabilities.
+
+While all data pipelines *start* with a dataset, they don't need to *end* with a dataset. Modern data pipelines often end with plots (as in `ggplot2` in R), statistical analyses, machine learning models, and more. These more advanced types of data pipelines are where SQL-like languages (like PRQL) show their limitations. While great for transforming data, SQL-like languages do not have facilities for plotting and machine learning.
+
+Data pipelines implemented in a programming language like Python, R, or Julia are thus much more capable than in PRQL.
+"""
+
+# ╔═╡ 6a08598c-69bf-498c-9ac2-4e0a4b749598
+md"""
+## Summary
+
+- The Structured Query Language (SQL) is a popular way of working with datasets
+- SQL's simple-to-read syntax introduces complexity because the order in which SQL queries are written is different from the order in which SQL queries are run
+- PRQL is a SQL-like language that implements data pipelines
+- Data pipelines refer to data analysis pathways that start with a dataset and then sequentially transform the dataset
+- While data pipelines start with a dataset, modern data pipelines often end with plots, statistical analyses, or machine learning models
+"""
+
+# ╔═╡ 831bad3f-0e43-4226-a75c-7a7c4c569e53
+md"""
+# Appendix
+"""
+
+# ╔═╡ 0ddc3de7-c4a8-44c7-8cd3-4d63de3334c7
+TableOfContents()
+
+# ╔═╡ Cell order:
+# ╟─2eec5998-bb36-11ee-2283-67ea47c4f5ed
+# ╟─a4baabcd-d425-449e-b7bb-f8b776582330
+# ╟─c38f82c2-def3-4d1c-bda0-54e779e2583a
+# ╟─99aa11d7-09f8-4ebb-9166-e248fc5af44f
+# ╟─78d16051-d5d9-4f9c-9316-ce4ddee39dce
+# ╟─2686a0fb-15e1-44d8-9565-1abdee13ec5b
+# ╟─4fde78bb-3dc5-4849-ad24-29804a49740c
+# ╟─6a08598c-69bf-498c-9ac2-4e0a4b749598
+# ╟─831bad3f-0e43-4226-a75c-7a7c4c569e53
+# ╠═d6823989-bb85-400d-87ec-2a365260f5fb
+# ╠═51e24e5e-cfc7-4b02-978c-505e21e6df43
+# ╠═0ddc3de7-c4a8-44c7-8cd3-4d63de3334c7
diff --git a/data-pipelines.plutostate b/data-pipelines.plutostate
diff --git a/header.html b/header.html
@@ -0,0 +1,18 @@
+<h1>Tidier Course</h1>
+<img src="https://raw.githubusercontent.com/TidierOrg/.github/main/profile/TidierOrg_logo.png" align="left" style="padding-right:10px;" width="150"/>
+<button onclick="window.location='index.html'">Home</button>
+<button onclick="window.location='why-julia.html'">Why Julia</button>
+<button onclick="window.location='what-is-tidier.html'">What is Tidier.jl</button>
+
+<style>
+pluto-cell {
+  display: flex;
+  flex-direction: column;
+}
+pluto-cell pluto-output {
+  order: 1;
+}
+pluto-cell pluto-runarea {
+  bottom: -17px;
+}
+</style>
diff --git a/images/duckdb_benchmark.jpeg b/images/duckdb_benchmark.jpeg