Docs: Simpler Examples, generate examples pages from actual examples code (#1134)

* start migrating examples

* restore chess dbt example

* bring examples into desired shape

* generate example pages from existing examples using docstrings

* fix one md link

* post merge file delete

* add some notes for test vars

* move chess example back into examples folder

* skip examples without proper header

* separate examples testing into own make command

* prepare tests for examples and run them

* fix examples test setup

* add postgres dependency to snippets tests

* ignore some folders

* add argparse plus clear flag to example test preparation
make examples raise in case of failed loads

* simplify example folder skipping

* add a template for a new example

* fix bug in deployment

* update contributing
sh-rp authored Apr 3, 2024
1 parent ee33548 commit 6bf1940
Showing 80 changed files with 699 additions and 1,723 deletions.
16 changes: 12 additions & 4 deletions .github/workflows/test_doc_snippets.yml
Original file line number Diff line number Diff line change
@@ -58,11 +58,19 @@ jobs:

- name: Install dependencies
# if: steps.cached-poetry-dependencies.outputs.cache-hit != 'true'
run: poetry install --no-interaction -E duckdb -E weaviate -E parquet -E qdrant -E bigquery --with docs,sentry-sdk --without airflow
run: poetry install --no-interaction -E duckdb -E weaviate -E parquet -E qdrant -E bigquery -E postgres --with docs,sentry-sdk --without airflow

- name: create secrets.toml
- name: create secrets.toml for examples
run: pwd && echo "$DLT_SECRETS_TOML" > docs/examples/.dlt/secrets.toml

- name: create secrets.toml for snippets
run: pwd && echo "$DLT_SECRETS_TOML" > docs/website/docs/.dlt/secrets.toml

- name: Run linter and tests
run: make test-and-lint-snippets
- name: Run linter and tests on examples
run: make lint-and-test-examples

- name: Run linter and tests on snippets
run: make lint-and-test-snippets



6 changes: 4 additions & 2 deletions .gitignore
@@ -12,7 +12,6 @@ experiments/*
# !experiments/pipeline/
# !experiments/pipeline/*
secrets.toml
!docs/**/secrets.toml
*.session.sql
*.duckdb
*.wal
@@ -141,4 +140,7 @@ tmp
**/tmp

# Qdrant embedding models cache
local_cache/
local_cache/

# test files for examples are generated and should not be committed
docs/examples/**/test*.py
14 changes: 12 additions & 2 deletions Makefile
@@ -27,7 +27,7 @@ help:
@echo " tests all components using local destinations: duckdb and postgres"
@echo " test-common"
@echo " tests common components"
@echo " test-and-lint-snippets"
@echo " lint-and-test-snippets"
@echo " tests and lints snippets and examples in docs"
@echo " build-library"
@echo " makes dev and then builds dlt package for distribution"
@@ -60,12 +60,22 @@ format:
poetry run black dlt docs tests --exclude=".*syntax_error.py|\.venv.*|_storage/.*"
# poetry run isort ./

test-and-lint-snippets:
lint-and-test-snippets:
cd docs/tools && poetry run python check_embedded_snippets.py full
poetry run mypy --config-file mypy.ini docs/website docs/examples docs/tools --exclude docs/tools/lint_setup
poetry run flake8 --max-line-length=200 docs/website docs/examples docs/tools
cd docs/website/docs && poetry run pytest --ignore=node_modules

lint-and-test-examples:
poetry run mypy --config-file mypy.ini docs/examples
poetry run flake8 --max-line-length=200 docs/examples
cd docs/tools && poetry run python prepare_examples_tests.py
cd docs/examples && poetry run pytest


test-examples:
cd docs/examples && poetry run pytest

lint-security:
poetry run bandit -r dlt/ -n 3 -l

File renamed without changes.
51 changes: 14 additions & 37 deletions docs/examples/CONTRIBUTING.md
@@ -4,50 +4,27 @@ Note: All paths in this guide are relative to the `dlt` repository directory.

## Add snippet

- Go to `docs/website/docs/examples/`.
- Copy one of the examples, rename scripts.
- Modify the script in `<example-name>/code/<snippet-name>-snippets.py`:
- The whole example code should be inside of `def <snippet-name>_snippet()` function.
- Use tags `# @@@DLT_SNIPPET_START example` and `# @@@DLT_SNIPPET_END example` to indicate which part of the code will be auto-generated in the final script `docs/examples/<example-name>/<snippet-name>.py`.
- Use additional tags as `# @@@DLT_SNIPPET_START smal_part_of_code` to indicate which part of the code will be auto-inserted into a text document `docs/website/docs/examples/<example-name>/index.md` in the form of a code snippet.
- Modify `.dlt/secrets.toml` and `config.toml` if needed.
- Modify `<example-name>/index.md`:
- In the section `<Header info=` add the tl;dr for your example; it should be short but informative.
- Set `slug="<example-name>" run_file="<snippet-name>" />`.
- List what users will learn from this example. Use bullet points and link corresponding documentation pages.
- Use tags `<!--@@@DLT_SNIPPET ./code/<snippet-name>-snippets.py::smal_part_of_code-->` to insert example code snippets. Do not write them manually!

## Add tests

- Do not forget to add tests to `<example-name>/code/<snippet-name>-snippets.py`.
- They can be short asserts; the code should work.
- Use `# @@@DLT_REMOVE` to remove test code from final code example.
- Test your snippets locally first with command:
- `cd docs/website/docs/examples/<example-name>/code && pytest --ignore=node_modules -s -v`.
- Add `@skipifgithubfork` decorator to your main snippet function, look [example](https://github.com/dlt-hub/dlt/blob/master/docs/website/docs/examples/chess_production/code/chess-snippets.py#L1-L4).

## Run npm start
- Go to `docs/examples/`.
- Copy the template in `./_template/..`.
- Make sure the folder and your example script have the same name.
- Update the docstring, which will make up the generated markdown file; check the other examples to see how it is done.
- If your example requires any secrets, add the variables to `example.secrets.toml` but do not enter the values.
- Add your example code. Make sure you have an `if __name__ == "__main__"` clause in which you run the example script; it will be used for testing.
- Add one or two assertions after running your example, and ideally also call `load_info.raise_on_failed_jobs()`; this helps greatly with testing.

## Testing
- You can test your example simply by running the example script from its folder. On CI, a test will be generated automatically.
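The generated test (produced by `prepare_examples_tests.py`; the generator itself is not part of this diff) essentially executes the example's `__main__` block, so the assertions you added fire during the test run. A rough, hypothetical sketch of that idea:

```python
import pathlib
import runpy
import tempfile
import textwrap

# hypothetical stand-in for an example script; a real one would run a dlt pipeline
example = textwrap.dedent(
    """
    data = [1, 2, 3]
    if __name__ == "__main__":
        total = sum(data)
        # short assertion, as recommended above
        assert total == 6
    """
)
path = pathlib.Path(tempfile.mkdtemp()) / "example_script.py"
path.write_text(example)

def test_example() -> None:
    # run_path executes the file as if it were run directly,
    # so the __main__ block (and its assertions) run
    runpy.run_path(str(path), run_name="__main__")

test_example()
```

This is only a sketch of the mechanism, not the actual generated test code.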

## Checking your generated markdown

The command `npm start` starts a local development server and opens up a browser window.

- To install npm read [README](../website/README.md).
- This command will generate a clean example script in the `docs/examples/<example-name>` folder based on `docs/website/docs/examples/<example-name>/code/<snippet-name>-snippets.py`.
- Also, this command automatically inserts code snippets to `docs/website/docs/examples/<example-name>/index.md`.
- Your example should be automatically added to the examples section in the local version of the docs. Check the rendered output and see whether it looks the way you intended.

## Add ENV variables

If you use any secrets for the code snippets (e.g. Zendesk requires credentials), you need to add them to GitHub Actions in ENV style:

- First, add the variables to `.github/workflows/test_doc_snippets.yml`:

Example:

```yaml
# zendesk vars for example
SOURCES__ZENDESK__CREDENTIALS: ${{ secrets.ZENDESK__CREDENTIALS }}
```
- Ask dlt team to add them to the GitHub Secrets.
If you use any secrets for the code snippets (e.g. Zendesk requires credentials), please talk to us. We will add them to our Google secrets vault.

## Add dependencies

File renamed without changes.
File renamed without changes.
File renamed without changes.
30 changes: 30 additions & 0 deletions docs/examples/_template/_template.py
@@ -0,0 +1,30 @@
"""
---
title: Example Template
description: Add description here
keywords: [example]
---
This is a template for a new example. This text will show up in the docs.
With this example you will learn to:
* One
* Two
* Three
"""

import dlt

if __name__ == "__main__":
# run a pipeline
pipeline = dlt.pipeline(
pipeline_name="example_pipeline", destination="duckdb", dataset_name="example_data"
)
# Extract, normalize, and load the data
load_info = pipeline.run([1, 2, 3], table_name="player")
print(load_info)

# make sure nothing failed
load_info.raise_on_failed_jobs()
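The docs generator builds each example's page from this docstring and, per the commit notes above, skips examples without a proper header. The generator itself is not shown in this diff; a minimal sketch, assuming a `---` fenced front-matter format like the template's:

```python
import re
from typing import Dict, Tuple

DOCSTRING = """---
title: Example Template
description: Add description here
keywords: [example]
---
This is a template for a new example. This text will show up in the docs.
"""

def split_header(docstring: str) -> Tuple[Dict[str, str], str]:
    """Split a '---' fenced front-matter header from the markdown body."""
    match = re.match(r"\s*---\n(.*?)\n---\n(.*)", docstring, re.DOTALL)
    if not match:
        # examples without a proper header are skipped by the docs generator
        raise ValueError("no front-matter header found")
    header = {}
    for line in match.group(1).splitlines():
        key, _, value = line.partition(":")
        header[key.strip()] = value.strip()
    return header, match.group(2)

header, body = split_header(DOCSTRING)
print(header["title"])  # Example Template
```

The function name and exact parsing rules here are hypothetical; only the header format is taken from the template above.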
9 changes: 5 additions & 4 deletions docs/examples/chess/chess.py
@@ -1,4 +1,3 @@
import os
import threading
from typing import Any, Iterator

@@ -49,12 +48,14 @@ def players_games(username: Any) -> Iterator[TDataItems]:

if __name__ == "__main__":
print("You must run this from the docs/examples/chess folder")
assert os.getcwd().endswith("chess")
# chess_url in config.toml, credentials for postgres in secrets.toml, credentials always under credentials key
# look for parallel run configuration in `config.toml`!
# mind the full_refresh flag: it makes the pipeline load to a distinct dataset each time it is run, always resetting the schema and state
info = dlt.pipeline(
load_info = dlt.pipeline(
pipeline_name="chess_games", destination="postgres", dataset_name="chess", full_refresh=True
).run(chess(max_players=5, month=9))
# display where the data went
print(info)
print(load_info)

# make sure nothing failed
load_info.raise_on_failed_jobs()
1 change: 1 addition & 0 deletions docs/examples/chess_production/.dlt/config.toml
@@ -0,0 +1 @@
chess_url="https://api.chess.com/pub/"
@@ -1,10 +1,38 @@
"""
---
title: Run chess pipeline in production
description: Learn how to run the chess pipeline in production
keywords: [incremental loading, example]
---
In this example, you'll find a Python script that interacts with the Chess API to extract players and game data.
We'll learn how to:
- Inspect packages after they have been loaded.
- Load back load information, schema updates, and traces.
- Trigger notifications in case of schema evolution.
- Use context managers to independently retry pipeline stages.
- Run basic tests utilizing `sql_client` and `normalize_info`.
"""

import threading
from typing import Any, Iterator

from tenacity import (
Retrying,
retry_if_exception,
stop_after_attempt,
wait_exponential,
)

import dlt
from dlt.common import sleep
from dlt.common import sleep, logger
from dlt.common.typing import StrAny, TDataItems
from dlt.sources.helpers.requests import client
from dlt.pipeline.helpers import retry_load
from dlt.common.runtime.slack import send_slack_message


@dlt.source
@@ -44,17 +72,6 @@ def players_games(username: Any) -> Iterator[TDataItems]:
return players(), players_profiles, players_games


from tenacity import (
Retrying,
retry_if_exception,
stop_after_attempt,
wait_exponential,
)

from dlt.common import logger
from dlt.common.runtime.slack import send_slack_message
from dlt.pipeline.helpers import retry_load

MAX_PLAYERS = 5


@@ -107,6 +124,7 @@ def load_data_with_retry(pipeline, data):
logger.info("Warning: No data in players table")
else:
logger.info(f"Players table contains {count} rows")
assert count == MAX_PLAYERS

# To run simple tests with `normalize_info`, such as checking table counts and
# warning if there is no data, you can use the `row_counts` attribute.
@@ -116,13 +134,16 @@ def load_data_with_retry(pipeline, data):
logger.info("Warning: No data in players table")
else:
logger.info(f"Players table contains {count} rows")
assert count == MAX_PLAYERS

# we reuse the pipeline instance below and load to the same dataset as data
logger.info("Saving the load info in the destination")
pipeline.run([load_info], table_name="_load_info")
assert "_load_info" in pipeline.last_trace.last_normalize_info.row_counts
# save trace to destination, sensitive data will be removed
logger.info("Saving the trace in the destination")
pipeline.run([pipeline.last_trace], table_name="_trace")
assert "_trace" in pipeline.last_trace.last_normalize_info.row_counts

# print all the new tables/columns in
for package in load_info.load_packages:
@@ -134,6 +155,7 @@ def load_data_with_retry(pipeline, data):
# save the new tables and column schemas to the destination:
table_updates = [p.asdict()["tables"] for p in load_info.load_packages]
pipeline.run(table_updates, table_name="_new_tables")
assert "_new_tables" in pipeline.last_trace.last_normalize_info.row_counts

return load_info

@@ -146,5 +168,8 @@ def load_data_with_retry(pipeline, data):
dataset_name="chess_data",
)
# get data for a few famous players
data = chess(chess_url="https://api.chess.com/pub/", max_players=MAX_PLAYERS)
load_data_with_retry(pipeline, data)
data = chess(max_players=MAX_PLAYERS)
load_info = load_data_with_retry(pipeline, data)

# make sure nothing failed
load_info.raise_on_failed_jobs()
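The example above retries each pipeline stage independently using `tenacity` and dlt's `retry_load` helper. Stripped of the dlt and tenacity specifics, the core idea is a per-stage retry loop with exponential backoff; the names in this pure-Python sketch are hypothetical:

```python
import time

class FlakyStage:
    """Hypothetical stage that fails twice before succeeding."""
    def __init__(self) -> None:
        self.calls = 0

    def run(self) -> str:
        self.calls += 1
        if self.calls < 3:
            raise RuntimeError("transient load failure")
        return "loaded"

def run_with_retry(stage: FlakyStage, attempts: int = 5, backoff: float = 0.0) -> str:
    # retry only this stage; earlier stages are not re-run
    for attempt in range(1, attempts + 1):
        try:
            return stage.run()
        except RuntimeError:
            if attempt == attempts:
                raise
            # exponential backoff, analogous to tenacity's wait_exponential
            time.sleep(backoff * 2 ** attempt)
    raise AssertionError("unreachable")

stage = FlakyStage()
print(run_with_retry(stage))  # loaded
```

The real example additionally decides which exceptions are worth retrying (via `retry_if_exception` and `retry_load`), which this sketch omits.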
57 changes: 57 additions & 0 deletions docs/examples/conftest.py
@@ -0,0 +1,57 @@
import os
import pytest
from unittest.mock import patch

from dlt.common.configuration.container import Container
from dlt.common.configuration.providers import (
ConfigTomlProvider,
EnvironProvider,
SecretsTomlProvider,
StringTomlProvider,
)
from dlt.common.configuration.specs.config_providers_context import (
ConfigProvidersContext,
)
from dlt.common.utils import set_working_dir

from tests.utils import (
patch_home_dir,
autouse_test_storage,
preserve_environ,
duckdb_pipeline_location,
wipe_pipeline,
)


@pytest.fixture(autouse=True)
def setup_secret_providers(request):
"""Creates set of config providers where tomls are loaded from tests/.dlt"""
secret_dir = "./.dlt"
dname = os.path.dirname(request.module.__file__)
config_dir = dname + "/.dlt"

# inject provider context so the original providers are restored at the end
def _initial_providers():
return [
EnvironProvider(),
SecretsTomlProvider(project_dir=secret_dir, add_global_config=False),
ConfigTomlProvider(project_dir=config_dir, add_global_config=False),
]

glob_ctx = ConfigProvidersContext()
glob_ctx.providers = _initial_providers()

with set_working_dir(dname), Container().injectable_context(glob_ctx), patch(
"dlt.common.configuration.specs.config_providers_context.ConfigProvidersContext.initial_providers",
_initial_providers,
):
# extras work when container updated
glob_ctx.add_extras()
yield


def pytest_configure(config):
# push sentry to ci
os.environ["RUNTIME__SENTRY_DSN"] = (
"https://[email protected]/4504819859914752"
)