Docs: Simpler Examples, generate examples pages from actual examples code (#1134)

* start migrating examples

* restore chess dbt example

* bring examples into desired shape

* generate example pages from existing examples using docstrings

* fix one md link

* post merge file delete

* add some notes for test vars

* move chess example back into examples folder

* skip examples without proper header

* separate examples testing into own make command

* prepare tests for examples and run them

* fix examples test setup

* add postgres dependency to snippets tests

* ignore some folders

* add argparse plus clear flag to example test preparation
make examples raise in case of failed loads

* simplify example folder skipping

* add a template for a new example

* fix bug in deployment

* update contributing
sh-rp authored Apr 3, 2024
1 parent ee33548 commit 6bf1940
Showing 80 changed files with 699 additions and 1,723 deletions.
16 changes: 12 additions & 4 deletions .github/workflows/test_doc_snippets.yml
Original file line number Diff line number Diff line change
@@ -58,11 +58,19 @@ jobs:

- name: Install dependencies
# if: steps.cached-poetry-dependencies.outputs.cache-hit != 'true'
run: poetry install --no-interaction -E duckdb -E weaviate -E parquet -E qdrant -E bigquery --with docs,sentry-sdk --without airflow
run: poetry install --no-interaction -E duckdb -E weaviate -E parquet -E qdrant -E bigquery -E postgres --with docs,sentry-sdk --without airflow

- name: create secrets.toml
- name: create secrets.toml for examples
run: pwd && echo "$DLT_SECRETS_TOML" > docs/examples/.dlt/secrets.toml

- name: create secrets.toml for snippets
run: pwd && echo "$DLT_SECRETS_TOML" > docs/website/docs/.dlt/secrets.toml

- name: Run linter and tests
run: make test-and-lint-snippets
- name: Run linter and tests on examples
run: make lint-and-test-examples

- name: Run linter and tests on snippets
run: make lint-and-test-snippets



6 changes: 4 additions & 2 deletions .gitignore
@@ -12,7 +12,6 @@ experiments/*
# !experiments/pipeline/
# !experiments/pipeline/*
secrets.toml
!docs/**/secrets.toml
*.session.sql
*.duckdb
*.wal
@@ -141,4 +140,7 @@ tmp
**/tmp

# Qdrant embedding models cache
local_cache/
local_cache/

# test files for examples are generated and should not be committed
docs/examples/**/test*.py
14 changes: 12 additions & 2 deletions Makefile
@@ -27,7 +27,7 @@ help:
@echo " tests all components using local destinations: duckdb and postgres"
@echo " test-common"
@echo " tests common components"
@echo " test-and-lint-snippets"
@echo " lint-and-test-snippets"
@echo " tests and lints snippets and examples in docs"
@echo " build-library"
@echo " makes dev and then builds dlt package for distribution"
@@ -60,12 +60,22 @@ format:
poetry run black dlt docs tests --exclude=".*syntax_error.py|\.venv.*|_storage/.*"
# poetry run isort ./

test-and-lint-snippets:
lint-and-test-snippets:
cd docs/tools && poetry run python check_embedded_snippets.py full
poetry run mypy --config-file mypy.ini docs/website docs/examples docs/tools --exclude docs/tools/lint_setup
poetry run flake8 --max-line-length=200 docs/website docs/examples docs/tools
cd docs/website/docs && poetry run pytest --ignore=node_modules

lint-and-test-examples:
poetry run mypy --config-file mypy.ini docs/examples
poetry run flake8 --max-line-length=200 docs/examples
cd docs/tools && poetry run python prepare_examples_tests.py
cd docs/examples && poetry run pytest


test-examples:
cd docs/examples && poetry run pytest

lint-security:
poetry run bandit -r dlt/ -n 3 -l

File renamed without changes.
51 changes: 14 additions & 37 deletions docs/examples/CONTRIBUTING.md
@@ -4,50 +4,27 @@ Note: All paths in this guide are relative to the `dlt` repository directory.

## Add snippet

- Go to `docs/website/docs/examples/`.
- Copy one of the examples, rename scripts.
- Modify the script in `<example-name>/code/<snippet-name>-snippets.py`:
- The whole example code should be inside of `def <snippet-name>_snippet()` function.
- Use tags `# @@@DLT_SNIPPET_START example` and `# @@@DLT_SNIPPET_END example` to indicate which part of the code will be auto-generated in the final script `docs/examples/<example-name>/<snippet-name>.py`.
- Use additional tags as `# @@@DLT_SNIPPET_START smal_part_of_code` to indicate which part of the code will be auto-inserted into a text document `docs/website/docs/examples/<example-name>/index.md` in the form of a code snippet.
- Modify `.dlt/secrets.toml` and `config.toml` if needed.
- Modify `<example-name>/index.md`:
- In the section `<Header info=` add the tl;dr for your example; it should be short but informative.
- Set `slug="<example-name>" run_file="<snippet-name>" />`.
- List what users will learn from this example. Use bullet points and link corresponding documentation pages.
- Use tags `<!--@@@DLT_SNIPPET ./code/<snippet-name>-snippets.py::smal_part_of_code-->` to insert example code snippets. Do not write them manually!

## Add tests

- Do not forget to add tests to `<example-name>/code/<snippet-name>-snippets.py`.
- They can be short asserts; the code should work.
- Use `# @@@DLT_REMOVE` to remove test code from final code example.
- Test your snippets locally first with command:
- `cd docs/website/docs/examples/<example-name>/code && pytest --ignore=node_modules -s -v`.
- Add `@skipifgithubfork` decorator to your main snippet function, look [example](https://github.com/dlt-hub/dlt/blob/master/docs/website/docs/examples/chess_production/code/chess-snippets.py#L1-L4).

## Run npm start
- Go to `docs/examples/`.
- Copy the template in `./_template/..`.
- Make sure the folder and your example script have the same name.
- Update the docstring, which will make up the generated markdown file; check the other examples to see how it is done.
- If your example requires any secrets, add the variables to `example.secrets.toml` but do not enter the values.
- Add your example code. Make sure you have an `if __name__ == "__main__"` clause in which you run the example script; it will be used for testing.
- Add one or two assertions after running your example, and ideally also call `load_info.raise_on_failed_jobs()`; this helps greatly with testing.

## Testing
- You can test your example simply by running the example script from its folder. On CI, a test will be generated automatically.
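The generated test (produced by `prepare_examples_tests.py`; the generator itself is not part of this diff) essentially executes the example's `__main__` block, so the assertions you added fire during the test run. A rough, hypothetical sketch of that idea:

```python
import pathlib
import runpy
import tempfile
import textwrap

# hypothetical stand-in for an example script; a real one would run a dlt pipeline
example = textwrap.dedent(
    """
    data = [1, 2, 3]
    if __name__ == "__main__":
        total = sum(data)
        # short assertion, as recommended above
        assert total == 6
    """
)
path = pathlib.Path(tempfile.mkdtemp()) / "example_script.py"
path.write_text(example)

def test_example() -> None:
    # run_path executes the file as if it were run directly,
    # so the __main__ block (and its assertions) run
    runpy.run_path(str(path), run_name="__main__")

test_example()
```

This is only a sketch of the mechanism, not the actual generated test code.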

## Checking your generated markdown

The command `npm start` starts a local development server and opens up a browser window.

- To install npm read [README](../website/README.md).
- This command will generate a clean example script in the `docs/examples/<example-name>` folder based on `docs/website/docs/examples/<example-name>/code/<snippet-name>-snippets.py`.
- Also, this command automatically inserts code snippets to `docs/website/docs/examples/<example-name>/index.md`.
- Your example should be automatically added to the examples section in the local version of the docs. Check the rendered output and see whether it looks the way you intended.

## Add ENV variables

If you use any secrets for the code snippets (e.g. Zendesk requires credentials), you need to add them to GitHub Actions in ENV style:

- First, add the variables to `.github/workflows/test_doc_snippets.yml`:

Example:

```yaml
# zendesk vars for example
SOURCES__ZENDESK__CREDENTIALS: ${{ secrets.ZENDESK__CREDENTIALS }}
```
- Ask dlt team to add them to the GitHub Secrets.
If you use any secrets for the code snippets (e.g. Zendesk requires credentials), please talk to us. We will add them to our Google secrets vault.

## Add dependencies

File renamed without changes.
File renamed without changes.
File renamed without changes.
30 changes: 30 additions & 0 deletions docs/examples/_template/_template.py
@@ -0,0 +1,30 @@
"""
---
title: Example Template
description: Add description here
keywords: [example]
---
This is a template for a new example. This text will show up in the docs.
With this example you will learn to:
* One
* Two
* Three
"""

import dlt

if __name__ == "__main__":
# run a pipeline
pipeline = dlt.pipeline(
pipeline_name="example_pipeline", destination="duckdb", dataset_name="example_data"
)
# Extract, normalize, and load the data
load_info = pipeline.run([1, 2, 3], table_name="player")
print(load_info)

# make sure nothing failed
load_info.raise_on_failed_jobs()
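The docs generator builds each example's page from this docstring and, per the commit notes above, skips examples without a proper header. The generator itself is not shown in this diff; a minimal sketch, assuming a `---` fenced front-matter format like the template's:

```python
import re
from typing import Dict, Tuple

DOCSTRING = """---
title: Example Template
description: Add description here
keywords: [example]
---
This is a template for a new example. This text will show up in the docs.
"""

def split_header(docstring: str) -> Tuple[Dict[str, str], str]:
    """Split a '---' fenced front-matter header from the markdown body."""
    match = re.match(r"\s*---\n(.*?)\n---\n(.*)", docstring, re.DOTALL)
    if not match:
        # examples without a proper header are skipped by the docs generator
        raise ValueError("no front-matter header found")
    header = {}
    for line in match.group(1).splitlines():
        key, _, value = line.partition(":")
        header[key.strip()] = value.strip()
    return header, match.group(2)

header, body = split_header(DOCSTRING)
print(header["title"])  # Example Template
```

The function name and exact parsing rules here are hypothetical; only the header format is taken from the template above.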
9 changes: 5 additions & 4 deletions docs/examples/chess/chess.py
@@ -1,4 +1,3 @@
import os
import threading
from typing import Any, Iterator

@@ -49,12 +48,14 @@ def players_games(username: Any) -> Iterator[TDataItems]:

if __name__ == "__main__":
print("You must run this from the docs/examples/chess folder")
assert os.getcwd().endswith("chess")
# chess_url in config.toml, credentials for postgres in secrets.toml, credentials always under credentials key
# look for parallel run configuration in `config.toml`!
# mind the full_refresh flag: it makes the pipeline load to a distinct dataset each time it is run, always resetting the schema and state
info = dlt.pipeline(
load_info = dlt.pipeline(
pipeline_name="chess_games", destination="postgres", dataset_name="chess", full_refresh=True
).run(chess(max_players=5, month=9))
# display where the data went
print(info)
print(load_info)

# make sure nothing failed
load_info.raise_on_failed_jobs()
1 change: 1 addition & 0 deletions docs/examples/chess_production/.dlt/config.toml
@@ -0,0 +1 @@
chess_url="https://api.chess.com/pub/"
@@ -1,10 +1,38 @@
"""
---
title: Run chess pipeline in production
description: Learn how to run the chess pipeline in production
keywords: [incremental loading, example]
---
In this example, you'll find a Python script that interacts with the Chess API to extract players and game data.
We'll learn how to:
- Inspect packages after they have been loaded.
- Load back load information, schema updates, and traces.
- Trigger notifications in case of schema evolution.
- Use context managers to independently retry pipeline stages.
- Run basic tests utilizing `sql_client` and `normalize_info`.
"""

import threading
from typing import Any, Iterator

from tenacity import (
Retrying,
retry_if_exception,
stop_after_attempt,
wait_exponential,
)

import dlt
from dlt.common import sleep
from dlt.common import sleep, logger
from dlt.common.typing import StrAny, TDataItems
from dlt.sources.helpers.requests import client
from dlt.pipeline.helpers import retry_load
from dlt.common.runtime.slack import send_slack_message


@dlt.source
@@ -44,17 +72,6 @@ def players_games(username: Any) -> Iterator[TDataItems]:
return players(), players_profiles, players_games


from tenacity import (
Retrying,
retry_if_exception,
stop_after_attempt,
wait_exponential,
)

from dlt.common import logger
from dlt.common.runtime.slack import send_slack_message
from dlt.pipeline.helpers import retry_load

MAX_PLAYERS = 5


@@ -107,6 +124,7 @@ def load_data_with_retry(pipeline, data):
logger.info("Warning: No data in players table")
else:
logger.info(f"Players table contains {count} rows")
assert count == MAX_PLAYERS

# To run simple tests with `normalize_info`, such as checking table counts and
# warning if there is no data, you can use the `row_counts` attribute.
@@ -116,13 +134,16 @@ def load_data_with_retry(pipeline, data):
logger.info("Warning: No data in players table")
else:
logger.info(f"Players table contains {count} rows")
assert count == MAX_PLAYERS

# we reuse the pipeline instance below and load to the same dataset as data
logger.info("Saving the load info in the destination")
pipeline.run([load_info], table_name="_load_info")
assert "_load_info" in pipeline.last_trace.last_normalize_info.row_counts
# save trace to destination, sensitive data will be removed
logger.info("Saving the trace in the destination")
pipeline.run([pipeline.last_trace], table_name="_trace")
assert "_trace" in pipeline.last_trace.last_normalize_info.row_counts

# print all the new tables/columns in
for package in load_info.load_packages:
@@ -134,6 +155,7 @@ def load_data_with_retry(pipeline, data):
# save the new tables and column schemas to the destination:
table_updates = [p.asdict()["tables"] for p in load_info.load_packages]
pipeline.run(table_updates, table_name="_new_tables")
assert "_new_tables" in pipeline.last_trace.last_normalize_info.row_counts

return load_info

@@ -146,5 +168,8 @@ def load_data_with_retry(pipeline, data):
dataset_name="chess_data",
)
# get data for a few famous players
data = chess(chess_url="https://api.chess.com/pub/", max_players=MAX_PLAYERS)
load_data_with_retry(pipeline, data)
data = chess(max_players=MAX_PLAYERS)
load_info = load_data_with_retry(pipeline, data)

# make sure nothing failed
load_info.raise_on_failed_jobs()
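The example above retries each pipeline stage independently using `tenacity` and dlt's `retry_load` helper. Stripped of the dlt and tenacity specifics, the core idea is a per-stage retry loop with exponential backoff; the names in this pure-Python sketch are hypothetical:

```python
import time

class FlakyStage:
    """Hypothetical stage that fails twice before succeeding."""
    def __init__(self) -> None:
        self.calls = 0

    def run(self) -> str:
        self.calls += 1
        if self.calls < 3:
            raise RuntimeError("transient load failure")
        return "loaded"

def run_with_retry(stage: FlakyStage, attempts: int = 5, backoff: float = 0.0) -> str:
    # retry only this stage; earlier stages are not re-run
    for attempt in range(1, attempts + 1):
        try:
            return stage.run()
        except RuntimeError:
            if attempt == attempts:
                raise
            # exponential backoff, analogous to tenacity's wait_exponential
            time.sleep(backoff * 2 ** attempt)
    raise AssertionError("unreachable")

stage = FlakyStage()
print(run_with_retry(stage))  # loaded
```

The real example additionally decides which exceptions are worth retrying (via `retry_if_exception` and `retry_load`), which this sketch omits.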
57 changes: 57 additions & 0 deletions docs/examples/conftest.py
@@ -0,0 +1,57 @@
import os
import pytest
from unittest.mock import patch

from dlt.common.configuration.container import Container
from dlt.common.configuration.providers import (
ConfigTomlProvider,
EnvironProvider,
SecretsTomlProvider,
StringTomlProvider,
)
from dlt.common.configuration.specs.config_providers_context import (
ConfigProvidersContext,
)
from dlt.common.utils import set_working_dir

from tests.utils import (
patch_home_dir,
autouse_test_storage,
preserve_environ,
duckdb_pipeline_location,
wipe_pipeline,
)


@pytest.fixture(autouse=True)
def setup_secret_providers(request):
"""Creates set of config providers where tomls are loaded from tests/.dlt"""
secret_dir = "./.dlt"
dname = os.path.dirname(request.module.__file__)
config_dir = dname + "/.dlt"

# inject provider context so the original providers are restored at the end
def _initial_providers():
return [
EnvironProvider(),
SecretsTomlProvider(project_dir=secret_dir, add_global_config=False),
ConfigTomlProvider(project_dir=config_dir, add_global_config=False),
]

glob_ctx = ConfigProvidersContext()
glob_ctx.providers = _initial_providers()

with set_working_dir(dname), Container().injectable_context(glob_ctx), patch(
"dlt.common.configuration.specs.config_providers_context.ConfigProvidersContext.initial_providers",
_initial_providers,
):
# extras work when container updated
glob_ctx.add_extras()
yield


def pytest_configure(config):
# push sentry to ci
os.environ["RUNTIME__SENTRY_DSN"] = (
"https://[email protected]/4504819859914752"
)