A straightforward project that consumes OSM data and uses DuckDB and dbt for data transformation to analyze metadata. Results are published to public S3 storage for use in dashboards/warehouses. The data management approach is not fully consistent, since the main purpose is testing tools.
- Download the full changesets history file from planet OSM (see the sketch after this list)
- Download internal(!) osm pbf file(s) for country of interest (Poland, for example)
- Split the changesets history into smaller osm.bz2 files
- Convert the changeset files to parquet
- Load parquet into local DuckDB and apply transformations via dbt models
- Perform analysis
- Load regular OSM data for further analysis
- Put results to public S3 storage
- Make some visualizations
- ....
- PROFIT!!
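A minimal sketch of the download and split steps, assuming aria2 and osmium-tool are already installed; the output paths and year bounds are illustrative, and the exact `--after`/`--before` semantics should be checked against the osmium changeset-filter docs:

```sh
# Download the full changesets history dump from planet OSM
# (multi-connection download via aria2).
aria2c -x 8 -d data/in https://planet.openstreetmap.org/planet/changesets-latest.osm.bz2

# Cut one yearly slice out of the history with osmium changeset-filter.
osmium changeset-filter \
    --after=2023-01-01T00:00:00Z \
    --before=2024-01-01T00:00:00Z \
    data/in/changesets-latest.osm.bz2 \
    -o data/out/changesets_2023.osm.bz2
```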
The project framework is based on geocint-runner, a geodata ETL/CI/CD pipeline:
- GNU Make is used as the job server
- make-profiler is used as a linter and preprocessor for Make
- GNU Parallel is used for parallelizing tasks
- osmium-tool and other classical GIS tools are used
Core differences:
- DuckDB + quackosm are used instead of PostgreSQL + PostGIS
- Database transformations (at least the final ones) are managed by dbt (see the sketch after this list)
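Roughly how these two differences fit together in practice, as a sketch with illustrative paths (quackosm converts the pbf into GeoParquet; the dbt models are then run against a local DuckDB database via the dbt-duckdb adapter):

```sh
# Convert the country pbf into GeoParquet with the quackosm CLI
# (quackosm picks the output file name unless told otherwise).
quackosm data/in/poland.osm.pbf

# Build the dbt models on top of the local DuckDB database
# configured in profiles.yml for the dbt-duckdb adapter.
dbt run --project-dir dbt --profiles-dir dbt
```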
Make sure you have around 75 GB of free disk space for its current state. Some estimates can be found in the target comments.
To be updated!
- sudo apt-get install parallel
- pip install aria2
Please follow the official documentation; an optional sanity check is sketched after the list below.
- pip install duckdb --upgrade
- pip install quackosm ## (with spatial, json and shellfs extensions)
- pip install quackosm[cli]
- pip install dbt-core dbt-duckdb
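An optional sanity check that the Python tooling is in place (the extension names follow the comment above; shellfs is skipped here):

```sh
# DuckDB imports and the spatial/json extensions can be installed and loaded.
python -c "import duckdb; [duckdb.sql(s) for s in ('INSTALL spatial', 'LOAD spatial', 'INSTALL json', 'LOAD json')]; print('duckdb', duckdb.__version__)"

# quackosm is importable and the dbt CLI is on PATH.
python -c "import quackosm; print('quackosm ok')"
dbt --version
```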
Steps 1-8 from the Steps Overview are done (in some way). Parquet file preparation is ready:
D select filename, count(*) cnt from read_parquet('data/out/parquet/changesets_*.parquet', filename = true) group by 1 order by 1;
┌───────────────────────────────────────────────┬──────────┐
│                   filename                    │   cnt    │
│                    varchar                    │  int64   │
├───────────────────────────────────────────────┼──────────┤
│ data/out/parquet/changesets_2005_2012.parquet │ 13777528 │
│ data/out/parquet/changesets_2013_2015.parquet │ 20731429 │
│ data/out/parquet/changesets_2016.parquet      │  8420095 │
│ data/out/parquet/changesets_2017.parquet      │ 10091179 │
│ data/out/parquet/changesets_2018.parquet      │ 10756788 │
│ data/out/parquet/changesets_2019.parquet      │ 13068229 │
│ data/out/parquet/changesets_2020.parquet      │ 17588235 │
│ data/out/parquet/changesets_2021.parquet      │ 18808719 │
│ data/out/parquet/changesets_2022.parquet      │ 15036354 │
│ data/out/parquet/changesets_2023.parquet      │ 14889321 │
│ data/out/parquet/changesets_2024.parquet      │ 12775670 │
│ data/out/parquet/changesets_latest.parquet    │   659191 │
├───────────────────────────────────────────────┴──────────┤
│ 12 rows                                         2 columns │
└───────────────────────────────────────────────────────────┘
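For reference, loading these parquet files into a local DuckDB database for the dbt models to build on can look roughly like this; the database path and table name are hypothetical:

```sh
python - <<'EOF'
import duckdb

# Hypothetical database path and raw table name; adjust to the project layout.
con = duckdb.connect('data/osm.duckdb')
con.sql("""
    CREATE OR REPLACE TABLE raw_changesets AS
    SELECT * FROM read_parquet('data/out/parquet/changesets_*.parquet', filename = true)
""")
print(con.sql('SELECT count(*) AS cnt FROM raw_changesets').fetchone())
EOF
```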
A test dashboard is even available in a public notebook!