chore: 🎨 add tutorial (#35)
davidgasquez authored Nov 10, 2023
1 parent b706031 commit 58376d4
Showing 8 changed files with 407 additions and 43 deletions.
4 changes: 4 additions & 0 deletions _quarto.yml
@@ -7,6 +7,7 @@ project:
    - README.md
    - reports
    - dashboard.qmd
    - docs

format:
  html:
@@ -32,6 +33,9 @@ website:
      - text: Knowledge Base
        icon: book
        href: reports
      - text: Tutorial
        icon: bi-journal
        href: docs/tutorial.html
  tools:
    - icon: twitter
      href: https://twitter.com/davidgasquez
2 changes: 1 addition & 1 deletion dashboard.qmd
@@ -12,7 +12,7 @@ This Dashboard was made by [Bob Rudis](https://dailyfinds.hrbrmstr.dev/p/drop-36

```{ojs}
//| output: false
jsonURL = "https://raw.githubusercontent.com/davidgasquez/datadex/gh-pages/country-data.json"
jsonURL = "https://bafybeihossdpesleq77dzptgtu23hfoayl4g73lvwjqxq65ngvzypz6rp4.ipfs.w3s.link/ipfs/bafybeihossdpesleq77dzptgtu23hfoayl4g73lvwjqxq65ngvzypz6rp4/country-data.json"
countryData = await fetch(jsonURL).then(response => response.json())
```

8 changes: 8 additions & 0 deletions datadex/assets.py
@@ -9,3 +9,11 @@ def raw_threatened_animal_species() -> pd.DataFrame:
        "https://raw.githubusercontent.com/datonic/threatened-animal-species/main/datapackage.yaml"
    )
    return p.get_resource("threatened-species").to_pandas()  # type: ignore


@asset
def raw_owid_co2_data() -> pd.DataFrame:
    co2_owid_url = (
        "https://raw.githubusercontent.com/owid/co2-data/master/owid-co2-data.csv"
    )
    return pd.read_csv(co2_owid_url)
12 changes: 12 additions & 0 deletions datadex/utils.py
@@ -1,2 +1,14 @@
import os

import duckdb

db_dir = os.path.dirname(os.path.abspath(__file__)) + "/../data/"


def custom_f():
    return 42


def query(sql):
    with duckdb.connect(database=f"{db_dir}/local.duckdb") as con:
        return con.sql(sql).df()
1 change: 1 addition & 0 deletions dbt/models/climate/climate_owid_co2_by_country.sql
@@ -0,0 +1 @@
select country, iso_code, year, co2 from {{ source("public", "raw_owid_co2_data") }}
Original file line number Diff line number Diff line change
@@ -20,3 +20,7 @@ sources:
        meta:
          dagster:
            asset_key: ["raw_threatened_animal_species"]
      - name: raw_owid_co2_data
        meta:
          dagster:
            asset_key: ["raw_owid_co2_data"]
63 changes: 63 additions & 0 deletions docs/tutorial.md
@@ -0,0 +1,63 @@
# Datadex Tutorial

Let's ingest and model some open data. We'll cover all the basics to get you started with Datadex. If you're not familiar with [dbt](https://docs.getdbt.com/) or [Dagster](https://docs.dagster.io/), I recommend checking their tutorials to get a sense of how these tools work.

## 📦 Adding Data Sources

First, add your desired dataset to Datadex by creating a new Dagster asset in `assets.py`: a Python function that returns a DataFrame. You can read from anywhere and do anything in between, as long as the function returns a DataFrame.

```python
import pandas as pd
from dagster import asset


@asset
def raw_owid_co2_data() -> pd.DataFrame:
    co2_owid_url = "https://raw.githubusercontent.com/owid/co2-data/master/owid-co2-data.csv"
    return pd.read_csv(co2_owid_url)
```
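Since the asset body is just a function returning a DataFrame, you can sanity-check the pattern without Dagster or network access. A tiny self-contained sketch using an inline CSV payload instead of the remote URL (the rows below are made up, not real OWID values):

```python
import io

import pandas as pd

# Stand-in for the remote OWID CSV: same parsing logic, inline payload
csv_payload = "country,year,co2\nWorld,2021,36702.5\nSpain,2021,217.3\n"
df = pd.read_csv(io.StringIO(csv_payload))

print(df.shape)  # (rows, columns) of the parsed frame
```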

This will make a new asset appear in the Dagster UI (available at [localhost:3000](http://127.0.0.1:3000/) after running `make dev`). You can now select it and click "Materialize selected" to run the function and save the resulting DataFrame to our local DuckDB database.

Once the asset is materialized, you can start querying it.


```python
>>> from datadex.utils import query
>>> query("select count(*) from public.raw_owid_co2_data")
count_star()
0 50598
```

## 📊 Modeling Data

Once the data is available in the local DuckDB database, you can start modeling it. You can continue using Dagster or switch to dbt. Let's explore the dbt side now.

We want dbt to be able to read the dataset Dagster materialized. To do that, we declare the new table under the existing `public` source in `sources.yml`:

```yaml
version: 2

sources:
  - name: public
    tables:
      - name: raw_owid_co2_data
        meta:
          dagster:
            asset_key: ["raw_owid_co2_data"]
```

Now we can create SQL models referencing the source we just declared. Here is a simple model, `climate_owid_co2_by_country.sql`:

```sql
select country, iso_code, year, co2 from {{ source("public", "raw_owid_co2_data") }}
```

To run this model, reload the Dagster definitions (the `Reload definitions` button in the UI) and materialize the new `dbt` node. That kicks off a dbt run and materializes the resulting table as Parquet files, thanks to the `external` materialization configured in `dbt_project.yml`.
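For reference, dbt-duckdb's `external` materialization is typically enabled in `dbt_project.yml` with a config along these lines; a minimal sketch, assuming the project is named `datadex` and the setting applies to all models (not copied from this repo):

```yaml
models:
  datadex:
    +materialized: external
```

With that in place, dbt-duckdb writes each model in scope out as an external file (Parquet by default) instead of a table inside the DuckDB database.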

## 📈 Using Data

Finally, we can use the data in a notebook. Let's say we want to plot the CO2 emissions for a given country. We can use the `climate_owid_co2_by_country` table we just created:

```python
from datadex.utils import query
df = query("select * from climate_owid_co2_by_country where country = 'World'")
_ = df.plot(x="year", y="co2", kind="line")
```

That will plot the CO2 emissions for the whole world!
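Since `query` returns a regular pandas DataFrame, any further slicing works the same way. A sketch on synthetic rows shaped like `climate_owid_co2_by_country` (the numbers are made up, not real OWID values):

```python
import pandas as pd

# Synthetic stand-in for query("select * from climate_owid_co2_by_country")
df = pd.DataFrame({
    "country": ["World", "World", "Spain", "Spain"],
    "iso_code": ["OWID_WRL", "OWID_WRL", "ESP", "ESP"],
    "year": [2020, 2021, 2020, 2021],
    "co2": [34807.2, 36702.5, 202.6, 217.3],
})

# Keep the latest year per country, then rank by emissions
latest = df.sort_values("year").groupby("country").tail(1)
top = latest.sort_values("co2", ascending=False)[["country", "year", "co2"]]
print(top.to_string(index=False))
```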
356 changes: 314 additions & 42 deletions reports/2023-01-01-Datadex.ipynb

Large diffs are not rendered by default.
