chore: 🎨 add tutorial (#35)
davidgasquez authored Nov 10, 2023
1 parent b706031 commit 58376d4
Showing 8 changed files with 407 additions and 43 deletions.
4 changes: 4 additions & 0 deletions _quarto.yml
@@ -7,6 +7,7 @@ project:
    - README.md
    - reports
    - dashboard.qmd
    - docs

format:
  html:
@@ -32,6 +33,9 @@ website:
      - text: Knowledge Base
        icon: book
        href: reports
      - text: Tutorial
        icon: bi-journal
        href: docs/tutorial.html
  tools:
    - icon: twitter
      href: https://twitter.com/davidgasquez
2 changes: 1 addition & 1 deletion dashboard.qmd
@@ -12,7 +12,7 @@ This Dashboard was made by [Bob Rudis](https://dailyfinds.hrbrmstr.dev/p/drop-36

```{ojs}
//| output: false
jsonURL = "https://raw.githubusercontent.com/davidgasquez/datadex/gh-pages/country-data.json"
jsonURL = "https://bafybeihossdpesleq77dzptgtu23hfoayl4g73lvwjqxq65ngvzypz6rp4.ipfs.w3s.link/ipfs/bafybeihossdpesleq77dzptgtu23hfoayl4g73lvwjqxq65ngvzypz6rp4/country-data.json"
countryData = await fetch(jsonURL).then(response => response.json())
```

8 changes: 8 additions & 0 deletions datadex/assets.py
@@ -9,3 +9,11 @@ def raw_threatened_animal_species() -> pd.DataFrame:
        "https://raw.githubusercontent.com/datonic/threatened-animal-species/main/datapackage.yaml"
    )
    return p.get_resource("threatened-species").to_pandas()  # type: ignore


@asset
def raw_owid_co2_data() -> pd.DataFrame:
    co2_owid_url = (
        "https://raw.githubusercontent.com/owid/co2-data/master/owid-co2-data.csv"
    )
    return pd.read_csv(co2_owid_url)
12 changes: 12 additions & 0 deletions datadex/utils.py
@@ -1,2 +1,14 @@
import os

import duckdb

db_dir = os.path.dirname(os.path.abspath(__file__)) + "/../data/"


def custom_f():
    return 42


def query(sql):
    with duckdb.connect(database=f"{db_dir}/local.duckdb") as con:
        return con.sql(sql).df()
1 change: 1 addition & 0 deletions dbt/models/climate/climate_owid_co2_by_country.sql
@@ -0,0 +1 @@
select country, iso_code, year, co2 from {{ source("public", "raw_owid_co2_data") }}
Original file line number Diff line number Diff line change
@@ -20,3 +20,7 @@ sources:
        meta:
          dagster:
            asset_key: ["raw_threatened_animal_species"]
      - name: raw_owid_co2_data
        meta:
          dagster:
            asset_key: ["raw_owid_co2_data"]
63 changes: 63 additions & 0 deletions docs/tutorial.md
@@ -0,0 +1,63 @@
# Datadex Tutorial

Let's ingest and model some open data. We'll cover all the basics to get you started with Datadex. If you're not familiar with [dbt](https://docs.getdbt.com/) or [Dagster](https://docs.dagster.io/), I recommend checking their tutorials to get a sense of how these tools work.

## 📦 Adding Data Sources

First, add your desired dataset to Datadex by creating a new Dagster asset in `assets.py`: a Python function that returns a DataFrame. You can read from anywhere and do anything in between, as long as the function returns a DataFrame.

```python
import pandas as pd
from dagster import asset


@asset
def raw_owid_co2_data() -> pd.DataFrame:
    co2_owid_url = "https://raw.githubusercontent.com/owid/co2-data/master/owid-co2-data.csv"
    return pd.read_csv(co2_owid_url)
```
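Since the asset body is just a function returning a DataFrame, you can sanity-check the pattern without Dagster or network access. A tiny self-contained sketch using an inline CSV payload instead of the remote URL (the rows below are made up, not real OWID values):

```python
import io

import pandas as pd

# Stand-in for the remote OWID CSV: same parsing logic, inline payload
csv_payload = "country,year,co2\nWorld,2021,36702.5\nSpain,2021,217.3\n"
df = pd.read_csv(io.StringIO(csv_payload))

print(df.shape)  # (rows, columns) of the parsed frame
```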

This will make a new asset appear in the Dagster UI (available at [localhost:3000](http://127.0.0.1:3000/) after running `make dev`). You can now select it and click "Materialize selected" to run the function and save the resulting DataFrame to our local DuckDB database.

Once the asset is materialized, you can start querying it.


```python
>>> from datadex.utils import query
>>> query("select count(*) from public.raw_owid_co2_data")
count_star()
0 50598
```

## 📊 Modeling Data

Once the data is available in the local DuckDB database, you can start modeling it. You can continue using Dagster or switch to dbt. Let's explore the dbt side now.

We want dbt to be able to read the dataset Dagster materialized. To do that, we declare the new table under the existing `public` source in `sources.yml`:

```yaml
version: 2

sources:
  - name: public
    tables:
      - name: raw_owid_co2_data
        meta:
          dagster:
            asset_key: ["raw_owid_co2_data"]
```

Now we can create SQL models referencing the source we just declared. Here is a simple model, `climate_owid_co2_by_country.sql`:

```sql
select country, iso_code, year, co2 from {{ source("public", "raw_owid_co2_data") }}
```

To run this model, reload the Dagster definitions (the `Reload definitions` button in the UI) and materialize the new `dbt` node. That kicks off a dbt run and materializes the resulting table as Parquet files, thanks to the `external` materialization configured in `dbt_project.yml`.
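For reference, dbt-duckdb's `external` materialization is typically enabled in `dbt_project.yml` with a config along these lines; a minimal sketch, assuming the project is named `datadex` and the setting applies to all models (not copied from this repo):

```yaml
models:
  datadex:
    +materialized: external
```

With that in place, dbt-duckdb writes each model in scope out as an external file (Parquet by default) instead of a table inside the DuckDB database.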

## 📈 Using Data

Finally, we can use the data in a notebook. Let's say we want to plot the CO2 emissions for a given country. We can use the `climate_owid_co2_by_country` table we just created:

```python
from datadex.utils import query
df = query("select * from climate_owid_co2_by_country where country = 'World'")
_ = df.plot(x="year", y="co2", kind="line")
```

That will plot the CO2 emissions for the whole world!
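Since `query` returns a regular pandas DataFrame, any further slicing works the same way. A sketch on synthetic rows shaped like `climate_owid_co2_by_country` (the numbers are made up, not real OWID values):

```python
import pandas as pd

# Synthetic stand-in for query("select * from climate_owid_co2_by_country")
df = pd.DataFrame({
    "country": ["World", "World", "Spain", "Spain"],
    "iso_code": ["OWID_WRL", "OWID_WRL", "ESP", "ESP"],
    "year": [2020, 2021, 2020, 2021],
    "co2": [34807.2, 36702.5, 202.6, 217.3],
})

# Keep the latest year per country, then rank by emissions
latest = df.sort_values("year").groupby("country").tail(1)
top = latest.sort_values("co2", ascending=False)[["country", "year", "co2"]]
print(top.to_string(index=False))
```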
356 changes: 314 additions & 42 deletions reports/2023-01-01-Datadex.ipynb

Large diffs are not rendered by default.
