From ce701b549999135e32f2f593594fd48e734560ec Mon Sep 17 00:00:00 2001 From: David Scharf Date: Mon, 18 Mar 2024 13:19:06 +0100 Subject: [PATCH] check embedded code blocks (#1093) * first version of embedded snippets check * add missing code block types where needed * small change to snippets script * fix all parser problems in code blocks * add better error messages and add check to ci * add linting of embedded snippets * small improvement for snippets linting * remove one ignored error code * add ruff dep * add mypy (comment out for now) * fix bug in script * ignore lint setup for embedded snippets * fix linting and small mypy adjustments * switches from shell to sh as shell block type * make snippet checker code nicer * small script changes and readme * add lint and type check count --- Makefile | 5 +- docs/tools/README.md | 37 + docs/tools/__init__.py | 0 docs/tools/check_embedded_snippets.py | 332 + docs/tools/lint_setup/.gitignore | 1 + docs/tools/lint_setup/__init__.py | 0 docs/tools/lint_setup/template.py | 35 + docs/tools/mypy.ini | 4 + docs/tools/ruff.toml | 2 + .../blog/2023-10-09-dlt-ops-startups.md | 2 +- ...01-15-dlt-dbt-runner-on-cloud-functions.md | 4 +- .../website/docs/build-a-pipeline-tutorial.md | 20 +- .../docs/dlt-ecosystem/destinations/athena.md | 8 +- .../dlt-ecosystem/destinations/bigquery.md | 14 +- .../dlt-ecosystem/destinations/databricks.md | 10 +- .../dlt-ecosystem/destinations/destination.md | 12 +- .../docs/dlt-ecosystem/destinations/duckdb.md | 16 +- .../dlt-ecosystem/destinations/filesystem.md | 8 +- .../dlt-ecosystem/destinations/motherduck.md | 12 +- .../docs/dlt-ecosystem/destinations/mssql.md | 10 +- .../dlt-ecosystem/destinations/postgres.md | 14 +- .../docs/dlt-ecosystem/destinations/qdrant.md | 20 +- .../dlt-ecosystem/destinations/redshift.md | 10 +- .../dlt-ecosystem/destinations/snowflake.md | 18 +- .../dlt-ecosystem/destinations/synapse.md | 12 +- .../dlt-ecosystem/destinations/weaviate.md | 22 +- .../file-formats/insert-format.md | 2 +- .../docs/dlt-ecosystem/file-formats/jsonl.md | 2 +- .../dlt-ecosystem/file-formats/parquet.md | 4 +- docs/website/docs/dlt-ecosystem/staging.md | 4 +- .../dlt-ecosystem/transformations/dbt/dbt.md | 2 +- .../transformations/dbt/dbt_cloud.md | 8 +- .../dlt-ecosystem/transformations/pandas.md | 2 +- .../docs/dlt-ecosystem/transformations/sql.md | 10 +- .../verified-sources/airtable.md | 26 +- .../verified-sources/amazon_kinesis.md | 21 +- .../verified-sources/arrow-pandas.md | 8 +- .../dlt-ecosystem/verified-sources/asana.md | 24 +- .../dlt-ecosystem/verified-sources/chess.md | 24 +- .../verified-sources/facebook_ads.md | 39 +- .../verified-sources/filesystem.md | 52 +- .../dlt-ecosystem/verified-sources/github.md | 27 +- .../verified-sources/google_analytics.md | 26 +- .../verified-sources/google_sheets.md | 48 +- .../dlt-ecosystem/verified-sources/hubspot.md | 30 +- .../dlt-ecosystem/verified-sources/inbox.md | 26 +- .../dlt-ecosystem/verified-sources/jira.md | 24 +- .../dlt-ecosystem/verified-sources/kafka.md | 19 +- .../dlt-ecosystem/verified-sources/matomo.md | 30 +- .../dlt-ecosystem/verified-sources/mongodb.md | 38 +- .../dlt-ecosystem/verified-sources/mux.md | 28 +- .../dlt-ecosystem/verified-sources/notion.md | 19 +- .../verified-sources/personio.md | 25 +- .../verified-sources/pipedrive.md | 28 +- .../verified-sources/salesforce.md | 23 +- .../dlt-ecosystem/verified-sources/shopify.md | 25 +- .../dlt-ecosystem/verified-sources/slack.md | 31 +- .../verified-sources/sql_database.md | 38 +- 
.../dlt-ecosystem/verified-sources/strapi.md | 17 +- .../dlt-ecosystem/verified-sources/stripe.md | 29 +- .../verified-sources/workable.md | 28 +- .../dlt-ecosystem/verified-sources/zendesk.md | 50 +- .../visualizations/exploring-the-data.md | 10 +- .../docs/examples/chess_production/index.md | 2 +- .../docs/examples/google_sheets/index.md | 2 +- .../docs/examples/nested_data/index.md | 2 +- .../docs/examples/pdf_to_weaviate/index.md | 2 +- .../docs/examples/qdrant_zendesk/index.md | 6 +- .../credentials/config_providers.md | 4 +- .../general-usage/credentials/config_specs.md | 34 +- .../credentials/configuration.md | 104 +- .../pseudonymizing_columns.md | 2 +- .../customising-pipelines/removing_columns.md | 10 +- .../customising-pipelines/renaming_columns.md | 2 +- .../currency_conversion_data_enrichment.md | 16 +- .../url-parser-data-enrichment.md | 68 +- .../user_agent_device_data_enrichment.md | 88 +- .../website/docs/general-usage/destination.md | 2 +- .../docs/general-usage/full-loading.md | 2 +- .../docs/general-usage/incremental-loading.md | 64 +- docs/website/docs/general-usage/pipeline.md | 10 +- docs/website/docs/general-usage/resource.md | 50 +- .../docs/general-usage/schema-contracts.md | 10 +- docs/website/docs/general-usage/schema.md | 4 +- docs/website/docs/general-usage/source.md | 26 +- docs/website/docs/general-usage/state.md | 2 +- docs/website/docs/getting-started.md | 12 +- .../docs/reference/command-line-interface.md | 36 +- docs/website/docs/reference/installation.md | 18 +- docs/website/docs/reference/performance.md | 12 +- docs/website/docs/reference/telemetry.md | 6 +- .../docs/running-in-production/alerting.md | 2 +- .../docs/running-in-production/monitoring.md | 10 +- .../docs/running-in-production/running.md | 32 +- .../docs/running-in-production/tracing.md | 2 +- .../docs/tutorial/grouping-resources.md | 22 +- .../docs/tutorial/load-data-from-an-api.md | 8 +- .../walkthroughs/add-a-verified-source.md | 20 +- .../docs/walkthroughs/add_credentials.md | 2 +- .../docs/walkthroughs/adjust-a-schema.md | 6 +- .../docs/walkthroughs/create-a-pipeline.md | 20 +- .../walkthroughs/create-new-destination.md | 6 +- .../deploy-gcp-cloud-function-as-webhook.md | 6 +- .../deploy-with-airflow-composer.md | 48 +- .../deploy-with-github-actions.md | 10 +- .../deploy-with-google-cloud-functions.md | 6 +- .../dispatch-to-multiple-tables.md | 8 +- .../docs/walkthroughs/run-a-pipeline.md | 22 +- .../docs/walkthroughs/share-a-dataset.md | 18 +- .../docs/walkthroughs/zendesk-weaviate.md | 20 +- poetry.lock | 8310 +++++++++-------- pyproject.toml | 6 +- 112 files changed, 5660 insertions(+), 4995 deletions(-) create mode 100644 docs/tools/README.md create mode 100644 docs/tools/__init__.py create mode 100644 docs/tools/check_embedded_snippets.py create mode 100644 docs/tools/lint_setup/.gitignore create mode 100644 docs/tools/lint_setup/__init__.py create mode 100644 docs/tools/lint_setup/template.py create mode 100644 docs/tools/mypy.ini create mode 100644 docs/tools/ruff.toml diff --git a/Makefile b/Makefile index 5aa2b2786c..4cc19f1ae5 100644 --- a/Makefile +++ b/Makefile @@ -60,8 +60,9 @@ format: # poetry run isort ./ test-and-lint-snippets: - poetry run mypy --config-file mypy.ini docs/website docs/examples - poetry run flake8 --max-line-length=200 docs/website docs/examples + cd docs/tools && poetry run python check_embedded_snippets.py full + poetry run mypy --config-file mypy.ini docs/website docs/examples docs/tools --exclude docs/tools/lint_setup + poetry run flake8 
--max-line-length=200 docs/website docs/examples docs/tools
 	cd docs/website/docs && poetry run pytest --ignore=node_modules
 
 lint-security:
diff --git a/docs/tools/README.md b/docs/tools/README.md
new file mode 100644
index 0000000000..78fd0aff43
--- /dev/null
+++ b/docs/tools/README.md
@@ -0,0 +1,37 @@
+# DLT docs tools
+
+## `check_embedded_snippets.py`
+This script finds all embedded snippets in our docs, extracts them, and performs the following checks:
+
+* Snippet must have a valid language set, e.g. ```py
+* Snippet must be parseable (works for py, toml, yaml and json snippets)
+* Snippet must pass linting (works for py)
+* Coming soon: snippet must pass type checking
+
+This script is run on CI to ensure code quality in our docs.
+
+### Usage
+
+```sh
+# Run a full check on all snippets
+python check_embedded_snippets.py full
+
+# Show all available commands and arguments for this script
+python check_embedded_snippets.py --help
+
+# Only run the linting stage
+python check_embedded_snippets.py lint
+
+# Run all stages, but only for snippets in files that have the string "walkthrough" in the filepath.
+# You will probably be using this a lot when working on one doc page.
+python check_embedded_snippets.py full -f walkthrough
+
+# Run the parsing stage, but only on snippets 49, 345 and 789
+python check_embedded_snippets.py parse -s 49,345,789
+
+# Run all checks, but with a bit more output to the terminal
+python check_embedded_snippets.py full -v
+```
+
+### Snippet numbers
+Each snippet is assigned an index in the order it is encountered. This is useful when creating new snippets in the docs, as it lets you selectively run only a few of them. Note that these numbers change as snippets are inserted into the docs.
diff --git a/docs/tools/__init__.py b/docs/tools/__init__.py
new file mode 100644
index 0000000000..e69de29bb2
diff --git a/docs/tools/check_embedded_snippets.py b/docs/tools/check_embedded_snippets.py
new file mode 100644
index 0000000000..663166d0c0
--- /dev/null
+++ b/docs/tools/check_embedded_snippets.py
@@ -0,0 +1,332 @@
+"""
+Walks through all markdown files, finds all code snippets, and checks whether they are parseable.
+"""
+from typing import List, Dict, Optional
+
+import os, ast, json, yaml, tomlkit, subprocess, argparse  # noqa: I251
+from dataclasses import dataclass
+from textwrap import dedent
+
+import dlt.cli.echo as fmt
+
+DOCS_DIR = "../website/docs"
+
+SNIPPET_MARKER = "```"
+ALLOWED_LANGUAGES = ["py", "toml", "json", "yaml", "text", "sh", "bat", "sql"]
+
+LINT_TEMPLATE = "./lint_setup/template.py"
+LINT_FILE = "./lint_setup/lint_me.py"
+
+ENABLE_MYPY = False
+
+
+@dataclass
+class Snippet:
+    index: int
+    language: str
+    code: str
+    file: str
+    line: int
+
+    def __str__(self) -> str:
+        return (
+            f"Snippet No. {self.index} in {self.file} at line {self.line} with language"
+            f" {self.language}"
+        )
+
+
+def collect_markdown_files(verbose: bool) -> List[str]:
+    """
+    Discovers all docs markdown files
+    """
+    markdown_files: List[str] = []
+    for path, _, files in os.walk(DOCS_DIR):
+        if "api_reference" in path:
+            continue
+        if "jaffle_shop" in path:
+            continue
+        for file in files:
+            if file.endswith(".md"):
+                markdown_files.append(os.path.join(path, file))
+                if verbose:
+                    fmt.echo(f"Discovered {os.path.join(path, file)}")
+
+    if len(markdown_files) < 50:  # sanity check
+        fmt.error("Found too few files. 
Something went wrong.") + exit(1) + + fmt.note(f"Discovered {len(markdown_files)} markdown files") + + return markdown_files + + +def collect_snippets(markdown_files: List[str], verbose: bool) -> List[Snippet]: + """ + Extract all snippets from markdown files + """ + snippets: List[Snippet] = [] + index = 0 + for file in markdown_files: + # go line by line and find all code blocks + with open(file, "r", encoding="utf-8") as f: + current_snippet: Snippet = None + lint_count = 0 + for line in f.readlines(): + lint_count += 1 + if line.strip().startswith(SNIPPET_MARKER): + if current_snippet: + # process snippet + snippets.append(current_snippet) + current_snippet.code = dedent(current_snippet.code) + current_snippet = None + else: + # start new snippet + index += 1 + current_snippet = Snippet( + index=index, + language=line.strip().split(SNIPPET_MARKER)[1] or "unknown", + code="", + file=file, + line=lint_count, + ) + elif current_snippet: + current_snippet.code += line + assert not current_snippet, ( + "It seems that the last snippet in the file was not closed. Please check the file " + + file + ) + + fmt.note(f"Discovered {len(snippets)} snippets") + if verbose: + for lang in ALLOWED_LANGUAGES: + lang_count = len([s for s in snippets if s.language == lang]) + fmt.echo(f"Found {lang_count} snippets marked as {lang}") + if len(snippets) < 100: # sanity check + fmt.error("Found too few snippets. Something went wrong.") + exit(1) + return snippets + + +def filter_snippets(snippets: List[Snippet], files: str, snippet_numbers: str) -> List[Snippet]: + """ + Filter out snippets based on file or snippet number + """ + fmt.secho(fmt.bold("Filtering Snippets")) + filtered_snippets: List[Snippet] = [] + filtered_count = 0 + for snippet in snippets: + if files and (files not in snippet.file): + filtered_count += 1 + continue + elif snippet_numbers and (str(snippet.index) not in snippet_numbers): + filtered_count += 1 + continue + filtered_snippets.append(snippet) + if filtered_count: + fmt.note( + f"{filtered_count} Snippets skipped based on file and snippet number settings." + f" {len(filtered_snippets)} snippets remaining." + ) + else: + fmt.note("0 Snippets skipped based on file and snippet number settings") + + if len(filtered_snippets) == 0: # sanity check + fmt.error("No snippets remaining after filter, nothing to do.") + exit(1) + return filtered_snippets + + +def check_language(snippets: List[Snippet]) -> None: + """ + Check if the language is allowed + """ + fmt.secho(fmt.bold("Checking snippets language settings")) + failed_count = 0 + for snippet in snippets: + if snippet.language not in ALLOWED_LANGUAGES: + fmt.warning(f"{str(snippet)} has an invalid language {snippet.language} setting.") + failed_count += 1 + + if failed_count: + fmt.error(f"""\ +Found {failed_count} snippets with invalid language settings. +* Please choose the correct language for your snippets: {ALLOWED_LANGUAGES}" +* All sh commands, except for windows (bat), should be marked as sh. 
+* All code blocks that are not a specific (markup-) language should be marked as text.\ +""") + exit(1) + else: + fmt.note("All snippets have valid language settings") + + +def clear(): + fmt.echo("\r" + " " * 200 + "\r", nl=False) + + +def parse_snippets(snippets: List[Snippet], verbose: bool) -> None: + """ + Parse all snippets with the respective parser library + """ + fmt.secho(fmt.bold("Parsing snippets")) + failed_count = 0 + for snippet in snippets: + # parse snippet by type + clear() + fmt.echo(f"\rParsing {snippet}", nl=False) + try: + if snippet.language == "py": + ast.parse(snippet.code) + elif snippet.language == "toml": + tomlkit.loads(snippet.code) + elif snippet.language == "json": + json.loads(snippet.code) + elif snippet.language == "yaml": + yaml.safe_load(snippet.code) + # ignore text and sh scripts + elif snippet.language in ["text", "sh", "bat", "sql"]: + pass + else: + raise ValueError(f"Unknown language {snippet.language}") + except Exception as exc: + clear() + fmt.warning(f"Failed to parse {str(snippet)}") + fmt.echo(exc) + failed_count += 1 + + clear() + if failed_count: + fmt.error(f"Failed to parse {failed_count} snippets") + exit(1) + else: + fmt.note("All snippets could be parsed") + + +def prepare_for_linting(snippet: Snippet) -> None: + """ + Prepare the lintme file with the snippet code and the template header + """ + with open(LINT_TEMPLATE, "r", encoding="utf-8") as f: + lint_template = f.read() + with open(LINT_FILE, "w", encoding="utf-8") as f: + f.write(lint_template) + f.write("# Snippet start\n\n") + f.write(snippet.code) + + +def lint_snippets(snippets: List[Snippet], verbose: bool) -> None: + """ + Lint all python snippets with ruff + """ + fmt.secho(fmt.bold("Linting Python snippets")) + failed_count = 0 + count = 0 + for snippet in snippets: + count += 1 + prepare_for_linting(snippet) + result = subprocess.run(["ruff", "check", LINT_FILE], capture_output=True, text=True) + clear() + fmt.echo(f"\rLinting {snippet} ({count} of {len(snippets)})", nl=False) + if "error" in result.stdout.lower(): + failed_count += 1 + clear() + fmt.warning(f"Failed to lint {str(snippet)}") + fmt.echo(result.stdout.strip()) + + clear() + if failed_count: + fmt.error(f"Failed to lint {failed_count} snippets") + exit(1) + else: + fmt.note("All snippets could be linted") + + +def typecheck_snippets(snippets: List[Snippet], verbose: bool) -> None: + """ + TODO: Type check all python snippets with mypy + """ + fmt.secho(fmt.bold("Type checking Python snippets")) + failed_count = 0 + count = 0 + for snippet in snippets: + count += 1 + clear() + fmt.echo(f"\rType checking {snippet} ({count} of {len(snippets)})", nl=False) + prepare_for_linting(snippet) + result = subprocess.run(["mypy", LINT_FILE], capture_output=True, text=True) + if "no issues found" not in result.stdout.lower(): + failed_count += 1 + clear() + fmt.warning(f"Failed to type check {str(snippet)}") + fmt.echo(result.stdout.strip()) + + clear() + if failed_count: + fmt.error(f"Failed to type check {failed_count} snippets") + exit(1) + else: + fmt.note("All snippets passed type checking") + + +if __name__ == "__main__": + fmt.note( + "Welcome to Snippet Checker 3000, run 'python check_embedded_snippets.py --help' for help." + ) + + # setup cli + parser = argparse.ArgumentParser( + description=( + "Check embedded snippets. Discover, parse, lint, and type check all code snippets in" + " the docs." 
+ ), + formatter_class=argparse.ArgumentDefaultsHelpFormatter, + ) + parser.add_argument( + "command", + help=( + 'Which checks to run. "full" will run all checks, parse, lint or typecheck will only' + " run that specific step" + ), + choices=["full", "parse", "lint", "typecheck"], + default="full", + ) + parser.add_argument("-v", "--verbose", help="Increase output verbosity", action="store_true") + parser.add_argument( + "-f", + "--files", + help="Filter .md files to files containing this string in filename", + type=str, + ) + parser.add_argument( + "-s", + "--snippetnumbers", + help=( + "Filter checked snippets to snippetnumbers contained in this string, example:" + ' "13,412,345"' + ), + type=lambda i: i.split(","), + default=None, + ) + + args = parser.parse_args() + + fmt.secho(fmt.bold("Discovering snippets")) + + # find all markdown files and collect all snippets + markdown_files = collect_markdown_files(args.verbose) + snippets = collect_snippets(markdown_files, args.verbose) + + # check language settings + check_language(snippets) + + # filter snippets + filtered_snippets = filter_snippets(snippets, args.files, args.snippetnumbers) + + if args.command in ["parse", "full"]: + parse_snippets(filtered_snippets, args.verbose) + + # these stages are python only + python_snippets = [s for s in filtered_snippets if s.language == "py"] + if args.command in ["lint", "full"]: + lint_snippets(python_snippets, args.verbose) + if ENABLE_MYPY and args.command in ["typecheck", "full"]: + typecheck_snippets(python_snippets, args.verbose) diff --git a/docs/tools/lint_setup/.gitignore b/docs/tools/lint_setup/.gitignore new file mode 100644 index 0000000000..27479bdb04 --- /dev/null +++ b/docs/tools/lint_setup/.gitignore @@ -0,0 +1 @@ +lint_me.py \ No newline at end of file diff --git a/docs/tools/lint_setup/__init__.py b/docs/tools/lint_setup/__init__.py new file mode 100644 index 0000000000..e69de29bb2 diff --git a/docs/tools/lint_setup/template.py b/docs/tools/lint_setup/template.py new file mode 100644 index 0000000000..dcfada63f6 --- /dev/null +++ b/docs/tools/lint_setup/template.py @@ -0,0 +1,35 @@ +# This section is imported before linting + +# mypy: disable-error-code="name-defined,import-not-found,import-untyped,empty-body,no-redef" + +# some universal imports +from typing import Optional, Dict, List, Any, Iterable, Iterator, Tuple, Sequence, Callable + +import os + +import pendulum +from pendulum import DateTime +from datetime import datetime # noqa: I251 + +import dlt +from dlt.common import json +from dlt.common.typing import TimedeltaSeconds, TAnyDateTime, TDataItem, TDataItems +from dlt.common.schema.typing import TTableSchema, TTableSchemaColumns + +from dlt.common.pipeline import LoadInfo +from dlt.sources.helpers import requests +from dlt.extract import DltResource, DltSource +from dlt.common.configuration.specs import ( + GcpServiceAccountCredentials, + ConnectionStringCredentials, + OAuth2Credentials, + BaseConfiguration, +) +from dlt.common.storages.configuration import FileSystemCredentials + +# some universal variables +pipeline: dlt.Pipeline = None # type: ignore[assignment] +p: dlt.Pipeline = None # type: ignore[assignment] +ex: Exception = None # type: ignore[assignment] +load_info: LoadInfo = None # type: ignore[assignment] +url: str = None # type: ignore[assignment] diff --git a/docs/tools/mypy.ini b/docs/tools/mypy.ini new file mode 100644 index 0000000000..167ad5b30e --- /dev/null +++ b/docs/tools/mypy.ini @@ -0,0 +1,4 @@ +[mypy] +ignore_missing_imports = True 
+no_implicit_optional = False +strict_optional = False \ No newline at end of file diff --git a/docs/tools/ruff.toml b/docs/tools/ruff.toml new file mode 100644 index 0000000000..96f9432ecc --- /dev/null +++ b/docs/tools/ruff.toml @@ -0,0 +1,2 @@ +[lint] +ignore = ["F811", "F821", "F401", "F841", "E402"] diff --git a/docs/website/blog/2023-10-09-dlt-ops-startups.md b/docs/website/blog/2023-10-09-dlt-ops-startups.md index c48fd9ed95..94c1ff662b 100644 --- a/docs/website/blog/2023-10-09-dlt-ops-startups.md +++ b/docs/website/blog/2023-10-09-dlt-ops-startups.md @@ -112,7 +112,7 @@ Customize the INVOICE_QUERIES dictionary in the `unstructured_data/settings.py` And now the magic happens. Use the following command to run the pipeline: -```shell +```sh python unstructured_data_pipeline.py ``` diff --git a/docs/website/blog/2024-01-15-dlt-dbt-runner-on-cloud-functions.md b/docs/website/blog/2024-01-15-dlt-dbt-runner-on-cloud-functions.md index 227c466d37..b36748aed9 100644 --- a/docs/website/blog/2024-01-15-dlt-dbt-runner-on-cloud-functions.md +++ b/docs/website/blog/2024-01-15-dlt-dbt-runner-on-cloud-functions.md @@ -132,7 +132,7 @@ We recommend setting up and testing dbt-core locally before using it in cloud fu 1. Finally, you can deploy the function using gcloud CLI as: - ```shell + ```sh gcloud functions deploy YOUR_FUNCTION_NAME \ --gen2 \ --region=YOUR_REGION \ @@ -313,7 +313,7 @@ To integrate dlt and dbt in cloud functions, use the dlt-dbt runner; here’s ho 1. Finally, you can deploy the function using gcloud CLI as: - ```shell + ```sh gcloud functions deploy YOUR_FUNCTION_NAME \ --gen2 \ --region=YOUR_REGION \ diff --git a/docs/website/docs/build-a-pipeline-tutorial.md b/docs/website/docs/build-a-pipeline-tutorial.md index 90a175777f..1522761609 100644 --- a/docs/website/docs/build-a-pipeline-tutorial.md +++ b/docs/website/docs/build-a-pipeline-tutorial.md @@ -36,7 +36,7 @@ scalable extraction via micro-batching and parallelism. ## The simplest pipeline: 1 liner to load data with schema evolution -```python +```py import dlt dlt.pipeline(destination='duckdb', dataset_name='mydata').run([{'id': 1, 'name': 'John'}], table_name="users") @@ -52,7 +52,7 @@ named "three". With `dlt`, you can create a pipeline and run it with just a few 1. [Create a pipeline](walkthroughs/create-a-pipeline.md) to the [destination](dlt-ecosystem/destinations). 1. Give this pipeline data and [run it](walkthroughs/run-a-pipeline.md). -```python +```py import dlt pipeline = dlt.pipeline(destination="duckdb", dataset_name="country_data") @@ -84,7 +84,7 @@ In this example, we also run a dbt package and then load the outcomes of the loa This will enable us to log when schema changes occurred and match them to the loaded data for lineage, granting us both column and row level lineage. We also alert the schema change to a Slack channel where hopefully the producer and consumer are subscribed. -```python +```py import dlt # have data? dlt likes data @@ -105,7 +105,7 @@ load_info = pipeline.run( ) ``` Add dbt runner, optionally with venv: -```python +```py venv = dlt.dbt.get_venv(pipeline) dbt = dlt.dbt.package( pipeline, @@ -122,7 +122,7 @@ pipeline.run([models_info], table_name="transform_status", write_disposition='ap ``` Let's alert any schema changes: -```python +```py from dlt.common.runtime.slack import send_slack_message slack_hook = "https://hooks.slack.com/services/xxx/xxx/xxx" @@ -211,7 +211,7 @@ that only one instance of each event is present. 
You can use the merge write disposition as follows: -```python +```py @dlt.resource(primary_key="id", write_disposition="merge") def github_repo_events(): yield from _get_event_pages() @@ -260,7 +260,7 @@ into DAGs, providing cross-database compatibility and various features such as t backfills, testing, and troubleshooting. You can use the dbt runner in `dlt` to seamlessly integrate dbt into your pipeline. Here's an example of running a dbt package after loading the data: -```python +```py import dlt from pipedrive import pipedrive_source @@ -275,7 +275,7 @@ load_info = pipeline.run(pipedrive_source()) print(load_info) ``` Now transform from loaded data to dbt dataset: -```python +```py pipeline = dlt.pipeline( pipeline_name='pipedrive', destination='bigquery', @@ -306,7 +306,7 @@ transformations using SQL statements. You can execute SQL statements that change or manipulate data within tables. Here's an example of inserting a row into the `customers` table using the `dlt` SQL client: -```python +```py pipeline = dlt.pipeline(destination="bigquery", dataset_name="crm") with pipeline.sql_client() as client: @@ -324,7 +324,7 @@ You can fetch query results as Pandas data frames and perform transformations us functionalities. Here's an example of reading data from the `issues` table in DuckDB and counting reaction types using Pandas: -```python +```py pipeline = dlt.pipeline( pipeline_name="github_pipeline", destination="duckdb", diff --git a/docs/website/docs/dlt-ecosystem/destinations/athena.md b/docs/website/docs/dlt-ecosystem/destinations/athena.md index b376337e77..26be75869b 100644 --- a/docs/website/docs/dlt-ecosystem/destinations/athena.md +++ b/docs/website/docs/dlt-ecosystem/destinations/athena.md @@ -10,7 +10,7 @@ The Athena destination stores data as Parquet files in S3 buckets and creates [e ## Install dlt with Athena **To install the DLT library with Athena dependencies:** -``` +```sh pip install dlt[athena] ``` @@ -18,7 +18,7 @@ pip install dlt[athena] ### 1. Initialize the dlt project Let's start by initializing a new `dlt` project as follows: - ```bash + ```sh dlt init chess athena ``` > 💡 This command will initialize your pipeline with chess as the source and AWS Athena as the destination using the filesystem staging destination. @@ -27,7 +27,7 @@ Let's start by initializing a new `dlt` project as follows: ### 2. Setup bucket storage and Athena credentials First, install dependencies by running: -``` +```sh pip install -r requirements.txt ``` or with `pip install dlt[athena]`, which will install `s3fs`, `pyarrow`, `pyathena`, and `botocore` packages. @@ -122,7 +122,7 @@ If you decide to change the [filename layout](./filesystem#data-loading) from th ### Iceberg data tables You can save your tables as Iceberg tables to Athena. This will enable you, for example, to delete data from them later if you need to. To switch a resource to the iceberg table format, supply the table_format argument like this: -```python +```py @dlt.resource(table_format="iceberg") def data() -> Iterable[TDataItem]: ... 
diff --git a/docs/website/docs/dlt-ecosystem/destinations/bigquery.md b/docs/website/docs/dlt-ecosystem/destinations/bigquery.md index e852bfa9e5..4144707b03 100644 --- a/docs/website/docs/dlt-ecosystem/destinations/bigquery.md +++ b/docs/website/docs/dlt-ecosystem/destinations/bigquery.md @@ -10,7 +10,7 @@ keywords: [bigquery, destination, data warehouse] **To install the DLT library with BigQuery dependencies:** -``` +```sh pip install dlt[bigquery] ``` @@ -18,13 +18,13 @@ pip install dlt[bigquery] **1. Initialize a project with a pipeline that loads to BigQuery by running:** -``` +```sh dlt init chess bigquery ``` **2. Install the necessary dependencies for BigQuery by running:** -``` +```sh pip install -r requirements.txt ``` @@ -67,7 +67,7 @@ A `JSON` file that includes your service account private key will then be downlo Open your `dlt` credentials file: -``` +```sh open .dlt/secrets.toml ``` @@ -166,7 +166,7 @@ Alternatively to parquet files, you can specify jsonl as the staging file format ### BigQuery/GCS Staging Example -```python +```py # Create a dlt pipeline that will load # chess player data to the BigQuery destination # via a GCS bucket. @@ -217,7 +217,7 @@ The adapter updates the DltResource with metadata about the destination column a Here is an example of how to use the `bigquery_adapter` method to apply hints to a resource on both the column level and table level: -```python +```py from datetime import date, timedelta import dlt @@ -258,7 +258,7 @@ Some things to note with the adapter's behavior: Note that `bigquery_adapter` updates the resource *inplace*, but returns the resource for convenience, i.e. both the following are valid: -```python +```py bigquery_adapter(my_resource, partition="partition_column_name") my_resource = bigquery_adapter(my_resource, partition="partition_column_name") ``` diff --git a/docs/website/docs/dlt-ecosystem/destinations/databricks.md b/docs/website/docs/dlt-ecosystem/destinations/databricks.md index d00c603c14..8078d2c64d 100644 --- a/docs/website/docs/dlt-ecosystem/destinations/databricks.md +++ b/docs/website/docs/dlt-ecosystem/destinations/databricks.md @@ -11,7 +11,7 @@ keywords: [Databricks, destination, data warehouse] ## Install dlt with Databricks **To install the DLT library with Databricks dependencies:** -``` +```sh pip install dlt[databricks] ``` @@ -91,12 +91,12 @@ If you already have your Databricks workspace set up, you can skip to the [Loade ## Loader setup Guide **1. Initialize a project with a pipeline that loads to Databricks by running** -``` +```sh dlt init chess databricks ``` **2. Install the necessary dependencies for Databricks by running** -``` +```sh pip install -r requirements.txt ``` This will install dlt with **databricks** extra which contains Databricks Python dbapi client. 
@@ -148,7 +148,7 @@ Please refer to the [S3 documentation](./filesystem.md#aws-s3) for details on co Example to set up Databricks with S3 as a staging destination: -```python +```py import dlt # Create a dlt pipeline that will load @@ -168,7 +168,7 @@ Refer to the [Azure Blob Storage filesystem documentation](./filesystem.md#azure Example to set up Databricks with Azure as a staging destination: -```python +```py # Create a dlt pipeline that will load # chess player data to the Databricks destination # via staging on Azure Blob Storage diff --git a/docs/website/docs/dlt-ecosystem/destinations/destination.md b/docs/website/docs/dlt-ecosystem/destinations/destination.md index e00bbdfc38..174eaa7837 100644 --- a/docs/website/docs/dlt-ecosystem/destinations/destination.md +++ b/docs/website/docs/dlt-ecosystem/destinations/destination.md @@ -19,7 +19,7 @@ you can do this here too. ## Install dlt for Sink / reverse ETL ** To install the DLT without additional dependencies ** -``` +```sh pip install dlt ``` @@ -28,7 +28,7 @@ pip install dlt Let's start by initializing a new dlt project as follows: -```bash +```sh dlt init chess sink ``` > 💡 This command will initialize your pipeline with chess as the source and sink as the destination. @@ -42,7 +42,7 @@ With the `@dlt.destination` decorator you can convert A very simple dlt pipeline that pushes a list of items into a sink function might look like this: -```python +```py from dlt.common.typing import TDataItems from dlt.common.schema import TTableSchema @@ -68,7 +68,7 @@ the sink from your pipeline constructor. Now you can run your pipeline and see t The full signature of the destination decorator plus its function is the following: -```python +```py @dlt.destination(batch_size=10, loader_file_format="jsonl", name="my_sink", naming="direct") def sink(items: TDataItems, table: TTableSchema) -> None: ... @@ -93,7 +93,7 @@ how table and column names are normalized. The default is `direct` which will ke ## Adding config variables and secrets The destination decorator supports settings and secrets variables. If you, for example, plan to connect to a service that requires an api secret or a login, you can do the following: -```python +```py @dlt.destination(batch_size=10, loader_file_format="jsonl", name="my_sink") def my_sink(items: TDataItems, table: TTableSchema, api_key: dlt.secrets.value) -> None: ... @@ -124,7 +124,7 @@ reasons we recommend to keep the multithreaded approach and make sure that you, ## Referencing the sink function There are multiple ways to reference the sink function you want to use. These are: -```python +```py # file my_pipeline.py @dlt.destination(batch_size=10) diff --git a/docs/website/docs/dlt-ecosystem/destinations/duckdb.md b/docs/website/docs/dlt-ecosystem/destinations/duckdb.md index 9452a80c50..63b4aecd80 100644 --- a/docs/website/docs/dlt-ecosystem/destinations/duckdb.md +++ b/docs/website/docs/dlt-ecosystem/destinations/duckdb.md @@ -8,24 +8,24 @@ keywords: [duckdb, destination, data warehouse] ## Install dlt with DuckDB **To install the DLT library with DuckDB dependencies, run:** -``` +```sh pip install dlt[duckdb] ``` ## Setup Guide **1. Initialize a project with a pipeline that loads to DuckDB by running:** -``` +```sh dlt init chess duckdb ``` **2. Install the necessary dependencies for DuckDB by running:** -``` +```sh pip install -r requirements.txt ``` **3. 
Run the pipeline:** -``` +```sh python3 chess_pipeline.py ``` @@ -47,7 +47,7 @@ naming="duck_case" ``` or via the env variable `SCHEMA__NAMING` or directly in the code: -```python +```py dlt.config["schema.naming"] = "duck_case" ``` :::caution @@ -73,7 +73,7 @@ You can configure the following file formats to load data to duckdb: By default, a DuckDB database will be created in the current working directory with a name `.duckdb` (`chess.duckdb` in the example above). After loading, it is available in `read/write` mode via `with pipeline.sql_client() as con:`, which is a wrapper over `DuckDBPyConnection`. See [duckdb docs](https://duckdb.org/docs/api/python/overview#persistent-storage) for details. The `duckdb` credentials do not require any secret values. You are free to pass the configuration explicitly via the `credentials` parameter to `dlt.pipeline` or `pipeline.run` methods. For example: -```python +```py # will load data to files/data.db database file p = dlt.pipeline(pipeline_name='chess', destination='duckdb', dataset_name='chess_data', full_refresh=False, credentials="files/data.db") @@ -82,7 +82,7 @@ p = dlt.pipeline(pipeline_name='chess', destination='duckdb', dataset_name='ches ``` The destination accepts a `duckdb` connection instance via `credentials`, so you can also open a database connection yourself and pass it to `dlt` to use. `:memory:` databases are supported. -```python +```py import duckdb db = duckdb.connect() p = dlt.pipeline(pipeline_name='chess', destination='duckdb', dataset_name='chess_data', full_refresh=False, credentials=db) @@ -92,7 +92,7 @@ This destination accepts database connection strings in the format used by [duck You can configure a DuckDB destination with [secret / config values](../../general-usage/credentials) (e.g., using a `secrets.toml` file) ```toml -destination.duckdb.credentials=duckdb:///_storage/test_quack.duckdb +destination.duckdb.credentials="duckdb:///_storage/test_quack.duckdb" ``` The **duckdb://** URL above creates a **relative** path to `_storage/test_quack.duckdb`. To define an **absolute** path, you need to specify four slashes, i.e., `duckdb:////_storage/test_quack.duckdb`. diff --git a/docs/website/docs/dlt-ecosystem/destinations/filesystem.md b/docs/website/docs/dlt-ecosystem/destinations/filesystem.md index ba323b3d7f..dbd54253b3 100644 --- a/docs/website/docs/dlt-ecosystem/destinations/filesystem.md +++ b/docs/website/docs/dlt-ecosystem/destinations/filesystem.md @@ -7,7 +7,7 @@ Its primary role is to be used as a staging for other destinations, but you can ## Install dlt with filesystem **To install the DLT library with filesystem dependencies:** -``` +```sh pip install dlt[filesystem] ``` @@ -29,7 +29,7 @@ so pip does not fail on backtracking. ### 1. Initialise the dlt project Let's start by initialising a new dlt project as follows: - ```bash + ```sh dlt init chess filesystem ``` > 💡 This command will initialise your pipeline with chess as the source and the AWS S3 filesystem as the destination. @@ -38,7 +38,7 @@ Let's start by initialising a new dlt project as follows: #### AWS S3 The command above creates sample `secrets.toml` and requirements file for AWS S3 bucket. You can install those dependencies by running: -``` +```sh pip install -r requirements.txt ``` @@ -71,7 +71,7 @@ You need to create a S3 bucket and a user who can access that bucket. `dlt` is n 1. You can create the S3 bucket in the AWS console by clicking on "Create Bucket" in S3 and assigning the appropriate name and permissions to the bucket. 2. 
Once the bucket is created, you'll have the bucket URL. For example, If the bucket name is `dlt-ci-test-bucket`, then the bucket URL will be: - ``` + ```text s3://dlt-ci-test-bucket ``` diff --git a/docs/website/docs/dlt-ecosystem/destinations/motherduck.md b/docs/website/docs/dlt-ecosystem/destinations/motherduck.md index 1288b9caac..de11ed5772 100644 --- a/docs/website/docs/dlt-ecosystem/destinations/motherduck.md +++ b/docs/website/docs/dlt-ecosystem/destinations/motherduck.md @@ -9,7 +9,7 @@ keywords: [MotherDuck, duckdb, destination, data warehouse] ## Install dlt with MotherDuck **To install the DLT library with MotherDuck dependencies:** -``` +```sh pip install dlt[motherduck] ``` @@ -25,12 +25,12 @@ or export the **LOAD__WORKERS=3** env variable. See more in [performance](../../ ## Setup Guide **1. Initialize a project with a pipeline that loads to MotherDuck by running** -``` +```sh dlt init chess motherduck ``` **2. Install the necessary dependencies for MotherDuck by running** -``` +```sh pip install -r requirements.txt ``` @@ -51,7 +51,7 @@ motherduck.credentials="md:///dlt_data_3?token=" ``` **4. Run the pipeline** -``` +```sh python3 chess_pipeline.py ``` @@ -83,14 +83,14 @@ If your connection is of poor quality and you get a timeout when executing a DML ### I see some exception with home_dir missing when opening `md:` connection. Some internal component (HTTPS) requires the **HOME** env variable to be present. Export such a variable to the command line. Here is what we do in our tests: -```python +```py os.environ["HOME"] = "/tmp" ``` before opening the connection. ### I see some watchdog timeouts. We also see them. -``` +```text 'ATTACH_DATABASE': keepalive watchdog timeout ``` Our observation is that if you write a lot of data into the database, then close the connection and then open it again to write, there's a chance of such a timeout. A possible **WAL** file is being written to the remote duckdb database. diff --git a/docs/website/docs/dlt-ecosystem/destinations/mssql.md b/docs/website/docs/dlt-ecosystem/destinations/mssql.md index 5ed4b69707..fc3eede075 100644 --- a/docs/website/docs/dlt-ecosystem/destinations/mssql.md +++ b/docs/website/docs/dlt-ecosystem/destinations/mssql.md @@ -8,7 +8,7 @@ keywords: [mssql, sqlserver, destination, data warehouse] ## Install dlt with MS SQL **To install the DLT library with MS SQL dependencies, use:** -``` +```sh pip install dlt[mssql] ``` @@ -28,16 +28,16 @@ You can also [configure the driver name](#additional-destination-options) explic ### Create a pipeline **1. Initialize a project with a pipeline that loads to MS SQL by running:** -``` +```sh dlt init chess mssql ``` **2. Install the necessary dependencies for MS SQL by running:** -``` +```sh pip install -r requirements.txt ``` or run: -``` +```sh pip install dlt[mssql] ``` This will install `dlt` with the `mssql` extra, which contains all the dependencies required by the SQL server client. @@ -62,7 +62,7 @@ destination.mssql.credentials="mssql://loader:@loader.database.windows ``` To pass credentials directly, you can use the `credentials` argument passed to `dlt.pipeline` or `pipeline.run` methods. 
-```python +```py pipeline = dlt.pipeline(pipeline_name='chess', destination='postgres', dataset_name='chess_data', credentials="mssql://loader:@loader.database.windows.net/dlt_data?connect_timeout=15") ``` diff --git a/docs/website/docs/dlt-ecosystem/destinations/postgres.md b/docs/website/docs/dlt-ecosystem/destinations/postgres.md index 10b935c083..ddf4aae9f8 100644 --- a/docs/website/docs/dlt-ecosystem/destinations/postgres.md +++ b/docs/website/docs/dlt-ecosystem/destinations/postgres.md @@ -8,39 +8,39 @@ keywords: [postgres, destination, data warehouse] ## Install dlt with PostgreSQL **To install the DLT library with PostgreSQL dependencies, run:** -``` +```sh pip install dlt[postgres] ``` ## Setup Guide **1. Initialize a project with a pipeline that loads to Postgres by running:** -``` +```sh dlt init chess postgres ``` **2. Install the necessary dependencies for Postgres by running:** -``` +```sh pip install -r requirements.txt ``` This will install dlt with the `postgres` extra, which contains the `psycopg2` client. **3. After setting up a Postgres instance and `psql` / query editor, create a new database by running:** -``` +```sql CREATE DATABASE dlt_data; ``` Add the `dlt_data` database to `.dlt/secrets.toml`. **4. Create a new user by running:** -``` +```sql CREATE USER loader WITH PASSWORD ''; ``` Add the `loader` user and `` password to `.dlt/secrets.toml`. **5. Give the `loader` user owner permissions by running:** -``` +```sql ALTER DATABASE dlt_data OWNER TO loader; ``` @@ -66,7 +66,7 @@ destination.postgres.credentials="postgresql://loader:@localhost/dlt_d ``` To pass credentials directly, you can use the `credentials` argument passed to the `dlt.pipeline` or `pipeline.run` methods. -```python +```py pipeline = dlt.pipeline(pipeline_name='chess', destination='postgres', dataset_name='chess_data', credentials="postgresql://loader:@localhost/dlt_data") ``` diff --git a/docs/website/docs/dlt-ecosystem/destinations/qdrant.md b/docs/website/docs/dlt-ecosystem/destinations/qdrant.md index ff37252852..40d85a43a5 100644 --- a/docs/website/docs/dlt-ecosystem/destinations/qdrant.md +++ b/docs/website/docs/dlt-ecosystem/destinations/qdrant.md @@ -13,7 +13,7 @@ This destination helps you load data into Qdrant from [dlt resources](../../gene 1. To use Qdrant as a destination, make sure `dlt` is installed with the `qdrant` extra: -```bash +```sh pip install dlt[qdrant] ``` @@ -31,7 +31,7 @@ If no configuration options are provided, the default fallback will be `http://l 3. Define the source of the data. For starters, let's load some data from a simple data structure: -```python +```py import dlt from dlt.destinations.adapters import qdrant_adapter @@ -53,7 +53,7 @@ movies = [ 4. Define the pipeline: -```python +```py pipeline = dlt.pipeline( pipeline_name="movies", destination="qdrant", @@ -63,7 +63,7 @@ pipeline = dlt.pipeline( 5. Run the pipeline: -```python +```py info = pipeline.run( qdrant_adapter( movies, @@ -74,7 +74,7 @@ info = pipeline.run( 6. 
Check the results: -```python +```py print(info) ``` @@ -86,7 +86,7 @@ To use vector search after the data has been loaded, you must specify which fiel The `qdrant_adapter` is a helper function that configures the resource for the Qdrant destination: -```python +```py qdrant_adapter(data, embed) ``` @@ -99,7 +99,7 @@ Returns: [DLT resource](../../general-usage/resource.md) object that you can pas Example: -```python +```py qdrant_adapter( resource, embed=["title", "description"], @@ -122,7 +122,7 @@ The [replace](../../general-usage/full-loading.md) disposition replaces the data In the movie example from the [setup guide](#setup-guide), we can use the `replace` disposition to reload the data every time we run the pipeline: -```python +```py info = pipeline.run( qdrant_adapter( movies, @@ -137,7 +137,7 @@ info = pipeline.run( The [merge](../../general-usage/incremental-loading.md) write disposition merges the data from the resource with the data at the destination. For the `merge` disposition, you need to specify a `primary_key` for the resource: -```python +```py info = pipeline.run( qdrant_adapter( movies, @@ -170,7 +170,7 @@ However, if you prefer to have class names without the dataset prefix, skip the For example: -```python +```py pipeline = dlt.pipeline( pipeline_name="movies", destination="qdrant", diff --git a/docs/website/docs/dlt-ecosystem/destinations/redshift.md b/docs/website/docs/dlt-ecosystem/destinations/redshift.md index bc03dbbbeb..7b56377f3b 100644 --- a/docs/website/docs/dlt-ecosystem/destinations/redshift.md +++ b/docs/website/docs/dlt-ecosystem/destinations/redshift.md @@ -8,7 +8,7 @@ keywords: [redshift, destination, data warehouse] ## Install dlt with Redshift **To install the DLT library with Redshift dependencies:** -``` +```sh pip install dlt[redshift] ``` @@ -17,13 +17,13 @@ pip install dlt[redshift] Let's start by initializing a new dlt project as follows: -```bash +```sh dlt init chess redshift ``` > 💡 This command will initialize your pipeline with chess as the source and Redshift as the destination. The above command generates several files and directories, including `.dlt/secrets.toml` and a requirements file for Redshift. You can install the necessary dependencies specified in the requirements file by executing it as follows: -```bash +```sh pip install -r requirements.txt ``` or with `pip install dlt[redshift]`, which installs the `dlt` library and the necessary dependencies for working with Amazon Redshift as a destination. @@ -52,7 +52,7 @@ To load data into Redshift, you need to create a Redshift cluster and enable acc 2. The "host" is derived from the cluster endpoint specified in the “General Configuration.” For example: - ```bash + ```sh # If the endpoint is: redshift-cluster-1.cv3cmsy7t4il.us-east-1.redshift.amazonaws.com:5439/your_database_name # Then the host is: @@ -108,7 +108,7 @@ staging_iam_role="arn:aws:iam::..." 
### Redshift/S3 staging example code -```python +```py # Create a dlt pipeline that will load # chess player data to the redshift destination # via staging on s3 diff --git a/docs/website/docs/dlt-ecosystem/destinations/snowflake.md b/docs/website/docs/dlt-ecosystem/destinations/snowflake.md index a6058a255e..a65eaec267 100644 --- a/docs/website/docs/dlt-ecosystem/destinations/snowflake.md +++ b/docs/website/docs/dlt-ecosystem/destinations/snowflake.md @@ -8,19 +8,19 @@ keywords: [Snowflake, destination, data warehouse] ## Install dlt with Snowflake **To install the DLT library with Snowflake dependencies, run:** -``` +```sh pip install dlt[snowflake] ``` ## Setup Guide **1. Initialize a project with a pipeline that loads to Snowflake by running:** -``` +```sh dlt init chess snowflake ``` **2. Install the necessary dependencies for Snowflake by running:** -``` +```sh pip install -r requirements.txt ``` This will install `dlt` with the `snowflake` extra, which contains the Snowflake Python dbapi client. @@ -162,12 +162,12 @@ To prevent dlt from forwarding the S3 bucket credentials on every command, and s ```toml [destination] -stage_name=PUBLIC.my_s3_stage +stage_name="PUBLIC.my_s3_stage" ``` To run Snowflake with S3 as the staging destination: -```python +```py # Create a dlt pipeline that will load # chess player data to the Snowflake destination # via staging on S3 @@ -191,12 +191,12 @@ Please refer to the [Google Storage filesystem documentation](./filesystem.md#go ```toml [destination] -stage_name=PUBLIC.my_gcs_stage +stage_name="PUBLIC.my_gcs_stage" ``` To run Snowflake with GCS as the staging destination: -```python +```py # Create a dlt pipeline that will load # chess player data to the Snowflake destination # via staging on GCS @@ -222,12 +222,12 @@ Please consult the Snowflake Documentation on [how to create a stage for your Az ```toml [destination] -stage_name=PUBLIC.my_azure_stage +stage_name="PUBLIC.my_azure_stage" ``` To run Snowflake with Azure as the staging destination: -```python +```py # Create a dlt pipeline that will load # chess player data to the Snowflake destination # via staging on Azure diff --git a/docs/website/docs/dlt-ecosystem/destinations/synapse.md b/docs/website/docs/dlt-ecosystem/destinations/synapse.md index bac184fd41..d803b88a2c 100644 --- a/docs/website/docs/dlt-ecosystem/destinations/synapse.md +++ b/docs/website/docs/dlt-ecosystem/destinations/synapse.md @@ -8,7 +8,7 @@ keywords: [synapse, destination, data warehouse] ## Install dlt with Synapse **To install the DLT library with Synapse dependencies:** -``` +```sh pip install dlt[synapse] ``` @@ -32,12 +32,12 @@ pip install dlt[synapse] ### Steps **1. Initialize a project with a pipeline that loads to Synapse by running** -``` +```sh dlt init chess synapse ``` **2. Install the necessary dependencies for Synapse by running** -``` +```sh pip install -r requirements.txt ``` This will install `dlt` with the **synapse** extra that contains all dependencies required for the Synapse destination. @@ -86,7 +86,7 @@ destination.synapse.credentials = "synapse://loader:your_loader_password@your_sy ``` To pass credentials directly you can use the `credentials` argument of `dlt.destinations.synapse(...)`: -```python +```py pipeline = dlt.pipeline( pipeline_name='chess', destination=dlt.destinations.synapse( @@ -117,7 +117,7 @@ Data is loaded via `INSERT` statements by default. 
## Table index type The [table index type](https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-index) of the created tables can be configured at the resource level with the `synapse_adapter`: -```python +```py info = pipeline.run( synapse_adapter( data=your_resource, @@ -156,7 +156,7 @@ Please refer to the [Azure Blob Storage filesystem documentation](./filesystem.m To run Synapse with staging on Azure Blob Storage: -```python +```py # Create a dlt pipeline that will load # chess player data to the snowflake destination # via staging on Azure Blob Storage diff --git a/docs/website/docs/dlt-ecosystem/destinations/weaviate.md b/docs/website/docs/dlt-ecosystem/destinations/weaviate.md index 6bd52acd35..fb87ccfa6f 100644 --- a/docs/website/docs/dlt-ecosystem/destinations/weaviate.md +++ b/docs/website/docs/dlt-ecosystem/destinations/weaviate.md @@ -13,7 +13,7 @@ This destination helps you load data into Weaviate from [dlt resources](../../ge 1. To use Weaviate as a destination, make sure dlt is installed with the 'weaviate' extra: -```bash +```sh pip install dlt[weaviate] ``` @@ -41,7 +41,7 @@ The `url` will default to **http://localhost:8080** and `api_key` is not defined 3. Define the source of the data. For starters, let's load some data from a simple data structure: -```python +```py import dlt from dlt.destinations.adapters import weaviate_adapter @@ -63,7 +63,7 @@ movies = [ 4. Define the pipeline: -```python +```py pipeline = dlt.pipeline( pipeline_name="movies", destination="weaviate", @@ -73,7 +73,7 @@ pipeline = dlt.pipeline( 5. Run the pipeline: -```python +```py info = pipeline.run( weaviate_adapter( movies, @@ -84,7 +84,7 @@ info = pipeline.run( 6. Check the results: -```python +```py print(info) ``` @@ -96,7 +96,7 @@ Weaviate destination is different from other [dlt destinations](../destinations/ The `weaviate_adapter` is a helper function that configures the resource for the Weaviate destination: -```python +```py weaviate_adapter(data, vectorize, tokenization) ``` @@ -109,7 +109,7 @@ Returns: a [dlt resource](../../general-usage/resource.md) object that you can p Example: -```python +```py weaviate_adapter( resource, vectorize=["title", "description"], @@ -133,7 +133,7 @@ The [replace](../../general-usage/full-loading.md) disposition replaces the data In the movie example from the [setup guide](#setup-guide), we can use the `replace` disposition to reload the data every time we run the pipeline: -```python +```py info = pipeline.run( weaviate_adapter( movies, @@ -148,7 +148,7 @@ info = pipeline.run( The [merge](../../general-usage/incremental-loading.md) write disposition merges the data from the resource with the data in the destination. For the `merge` disposition, you would need to specify a `primary_key` for the resource: -```python +```py info = pipeline.run( weaviate_adapter( movies, @@ -203,7 +203,7 @@ However, if you prefer to have class names without the dataset prefix, skip the For example: -```python +```py pipeline = dlt.pipeline( pipeline_name="movies", destination="weaviate", @@ -246,7 +246,7 @@ You can configure an alternative naming convention which will lowercase all prop {"camelCase": 1, "CamelCase": 2} ``` it will be normalized to: -``` +```json {"camelcase": 2} ``` so your best course of action is to clean up the data yourself before loading and use the default naming convention. 
Nevertheless, you can configure the alternative in `config.toml`: diff --git a/docs/website/docs/dlt-ecosystem/file-formats/insert-format.md b/docs/website/docs/dlt-ecosystem/file-formats/insert-format.md index ff73e3741e..641be9a106 100644 --- a/docs/website/docs/dlt-ecosystem/file-formats/insert-format.md +++ b/docs/website/docs/dlt-ecosystem/file-formats/insert-format.md @@ -25,6 +25,6 @@ It is also supported by: **filesystem**. By setting the `loader_file_format` argument to `insert_values` in the run command, the pipeline will store your data in the INSERT format at the destination: -```python +```py info = pipeline.run(some_source(), loader_file_format="insert_values") ``` diff --git a/docs/website/docs/dlt-ecosystem/file-formats/jsonl.md b/docs/website/docs/dlt-ecosystem/file-formats/jsonl.md index 130464578e..7467c6f639 100644 --- a/docs/website/docs/dlt-ecosystem/file-formats/jsonl.md +++ b/docs/website/docs/dlt-ecosystem/file-formats/jsonl.md @@ -27,6 +27,6 @@ This format is used by default by: **BigQuery**, **Snowflake**, **filesystem**. By setting the `loader_file_format` argument to `jsonl` in the run command, the pipeline will store your data in the jsonl format at the destination: -```python +```py info = pipeline.run(some_source(), loader_file_format="jsonl") ``` diff --git a/docs/website/docs/dlt-ecosystem/file-formats/parquet.md b/docs/website/docs/dlt-ecosystem/file-formats/parquet.md index cc2fcfb200..94aaaf4884 100644 --- a/docs/website/docs/dlt-ecosystem/file-formats/parquet.md +++ b/docs/website/docs/dlt-ecosystem/file-formats/parquet.md @@ -20,7 +20,7 @@ Supported by: **BigQuery**, **DuckDB**, **Snowflake**, **filesystem**, **Athena* By setting the `loader_file_format` argument to `parquet` in the run command, the pipeline will store your data in the parquet format at the destination: -```python +```py info = pipeline.run(some_source(), loader_file_format="parquet") ``` @@ -53,7 +53,7 @@ timestamp_timezone="Europe/Berlin" Or using environment variables: -``` +```sh NORMALIZE__DATA_WRITER__FLAVOR NORMALIZE__DATA_WRITER__VERSION NORMALIZE__DATA_WRITER__DATA_PAGE_SIZE diff --git a/docs/website/docs/dlt-ecosystem/staging.md b/docs/website/docs/dlt-ecosystem/staging.md index d2ed03a2a2..e3a60dfa51 100644 --- a/docs/website/docs/dlt-ecosystem/staging.md +++ b/docs/website/docs/dlt-ecosystem/staging.md @@ -48,7 +48,7 @@ In essence, you need to set up two destinations and then pass them to `dlt.pipel 4. **Chain staging to destination and request `parquet` file format.** Pass the `staging` argument to `dlt.pipeline`. It works like the destination `argument`: - ```python + ```py # Create a dlt pipeline that will load # chess player data to the redshift destination # via staging on s3 @@ -60,7 +60,7 @@ In essence, you need to set up two destinations and then pass them to `dlt.pipel ) ``` `dlt` will automatically select an appropriate loader file format for the staging files. 
Below we explicitly specify `parquet` file format (just to demonstrate how to do it): - ```python + ```py info = pipeline.run(chess(), loader_file_format="parquet") ``` diff --git a/docs/website/docs/dlt-ecosystem/transformations/dbt/dbt.md b/docs/website/docs/dlt-ecosystem/transformations/dbt/dbt.md index 1cf7a91bfb..42f31d4875 100644 --- a/docs/website/docs/dlt-ecosystem/transformations/dbt/dbt.md +++ b/docs/website/docs/dlt-ecosystem/transformations/dbt/dbt.md @@ -33,7 +33,7 @@ Included below is another example where we run a `dlt` pipeline and then a dbt p > 💡 Docstrings are available to read in your IDE. -```python +```py # load all pipedrive endpoints to pipedrive_raw dataset pipeline = dlt.pipeline( pipeline_name='pipedrive', diff --git a/docs/website/docs/dlt-ecosystem/transformations/dbt/dbt_cloud.md b/docs/website/docs/dlt-ecosystem/transformations/dbt/dbt_cloud.md index 43321aab97..d15c4eb84c 100644 --- a/docs/website/docs/dlt-ecosystem/transformations/dbt/dbt_cloud.md +++ b/docs/website/docs/dlt-ecosystem/transformations/dbt/dbt_cloud.md @@ -11,7 +11,7 @@ keywords: [transform, sql] The DBT Cloud Client is a Python class designed to interact with the dbt Cloud API (version 2). It provides methods to perform various operations on dbt Cloud, such as triggering job runs and retrieving job run statuses. -```python +```py from dlt.helpers.dbt_cloud import DBTCloudClientV2 # Initialize the client @@ -36,7 +36,7 @@ They simplify the process of triggering and monitoring job runs in dbt Cloud. This function triggers a job run in dbt Cloud using the specified configuration. It supports various customization options and allows for monitoring the job's status. -```python +```py from dlt.helpers.dbt_cloud import run_dbt_cloud_job # Trigger a job run with default configuration @@ -58,7 +58,7 @@ If you have already started a job run and have a run ID, then you can use the `g This function retrieves the full information about a specific dbt Cloud job run. It also supports options for waiting until the run is complete. -```python +```py from dlt.helpers.dbt_cloud import get_dbt_cloud_run_status # Retrieve status for a specific run @@ -96,7 +96,7 @@ For environment variables, all names are capitalized and sections are separated For example, for the above secrets, we would need to put into the environment: -``` +```sh DBT_CLOUD__API_TOKEN DBT_CLOUD__ACCOUNT_ID DBT_CLOUD__JOB_ID diff --git a/docs/website/docs/dlt-ecosystem/transformations/pandas.md b/docs/website/docs/dlt-ecosystem/transformations/pandas.md index dc2fc6d40a..5a82d8be66 100644 --- a/docs/website/docs/dlt-ecosystem/transformations/pandas.md +++ b/docs/website/docs/dlt-ecosystem/transformations/pandas.md @@ -11,7 +11,7 @@ natively (i.e., BigQuery and DuckDB), `dlt` uses the native method. Thanks to th dataframes can be really fast! The example below reads GitHub reactions data from the `issues` table and counts the reaction types. -```python +```py pipeline = dlt.pipeline( pipeline_name="github_pipeline", destination="duckdb", diff --git a/docs/website/docs/dlt-ecosystem/transformations/sql.md b/docs/website/docs/dlt-ecosystem/transformations/sql.md index 6131cac85a..ad37c61bd8 100644 --- a/docs/website/docs/dlt-ecosystem/transformations/sql.md +++ b/docs/website/docs/dlt-ecosystem/transformations/sql.md @@ -12,22 +12,24 @@ including statements that change the database schema or data in the tables. In t insert a row into the `customers` table. Note that the syntax is the same as for any standard `dbapi` connection. 
-```python +```py pipeline = dlt.pipeline(destination="bigquery", dataset_name="crm") try: with pipeline.sql_client() as client: client.sql_client.execute_sql( - f"INSERT INTO customers VALUES (%s, %s, %s)", + "INSERT INTO customers VALUES (%s, %s, %s)", 10, "Fred", "fred@fred.com" ) +except Exception: + ... ``` In the case of SELECT queries, the data is returned as a list of rows, with the elements of a row corresponding to selected columns. -```python +```py try: with pipeline.sql_client() as client: res = client.execute_sql( @@ -36,6 +38,8 @@ try: ) # prints column values of the first row print(res[0]) +except Exception: + ... ``` ## Other transforming tools diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/airtable.md b/docs/website/docs/dlt-ecosystem/verified-sources/airtable.md index 0baf1917d1..a920b21a03 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/airtable.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/airtable.md @@ -45,7 +45,7 @@ Sources and resources that can be loaded using this verified source are: Upon logging into Airtable and accessing your base or table, you'll notice a URL in your browser's address bar resembling: -```bash +```sh https://airtable.com/appve10kl227BIT4GV/tblOUnZVLFWbemTP1/viw3qtF76bRQC3wKx/rec9khXgeTotgCQ62?blocks=hide ``` @@ -67,7 +67,7 @@ To get started with your data pipeline, follow these steps: 1. Enter the following command: - ```bash + ```sh dlt init airtable duckdb ``` @@ -116,20 +116,20 @@ For more information, read the [General Usage: Credentials.](../../general-usage 1. Before running the pipeline, ensure that you have installed all the necessary dependencies by running the command: - ```bash + ```sh pip install -r requirements.txt ``` 1. You're now ready to run the pipeline! To get started, run the following command: - ```bash + ```sh python airtable_pipeline.py ``` 1. Once the pipeline has finished running, you can verify that everything loaded correctly by using the following command: - ```bash + ```sh dlt pipeline show ``` @@ -147,13 +147,14 @@ For more information, read the guide on [how to run a pipeline](../../walkthroug This function retrieves tables from given Airtable base. -```python +```py @dlt.source def airtable_source( base_id: str = dlt.config.value, table_names: Optional[List[str]] = None, access_token: str = dlt.secrets.value, ) -> Iterable[DltResource]: + ... ``` `base_id`: The base's unique identifier. @@ -167,12 +168,13 @@ tables in the schema are loaded. This function retrieves data from a single Airtable table. -```python +```py def airtable_resource( api: pyairtable.Api, base_id: str, table: Dict[str, Any], ) -> DltResource: + ... ``` `table`: Airtable metadata, excluding actual records. @@ -186,7 +188,7 @@ verified source. 1. Configure the pipeline by specifying the pipeline name, destination, and dataset as follows: - ```python + ```py pipeline = dlt.pipeline( pipeline_name="airtable", # Use a custom name if desired destination="duckdb", # Choose the appropriate destination (e.g., duckdb, redshift, post) @@ -196,16 +198,16 @@ verified source. 1. To load the entire base: - ```python + ```py base_id = "Please set me up!" # The id of the base. - airtables = airtable_source(base_id=base_id)) + airtables = airtable_source(base_id=base_id) load_info = pipeline.run(load_data, write_disposition="replace") ``` 1. To load selected tables from a base table: - ```python + ```py base_id = "Please set me up!" # The id of the base. 
table_names = ["Table1","Table2"] # A list of table IDs or table names to load. @@ -221,7 +223,7 @@ verified source. 1. To load data and apply hints to a specific column: - ```python + ```py base_id = "Please set me up!" # The id of the base. table_names = ["Table1","Table2"] # A list of table IDs or table names to load. resource_name = "Please set me up!" # The table name we want to apply hints. diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/amazon_kinesis.md b/docs/website/docs/dlt-ecosystem/verified-sources/amazon_kinesis.md index 4118902a6c..2894c15b5e 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/amazon_kinesis.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/amazon_kinesis.md @@ -57,7 +57,7 @@ To get started with your data pipeline, follow these steps: 1. Enter the following command: - ```bash + ```sh dlt init kinesis duckdb ``` @@ -110,16 +110,16 @@ For more information, read [Credentials](../../general-usage/credentials). 1. Before running the pipeline, ensure that you have installed all the necessary dependencies by running the command: - ```bash + ```sh pip install -r requirements.txt ``` 1. You're now ready to run the pipeline! To get started, run the following command: - ```bash + ```sh python kinesis_pipeline.py ``` 1. Once the pipeline has finished running, you can verify that everything loaded correctly by using the following command: - ```bash + ```sh dlt pipeline show ``` For example, the `pipeline_name` for the above pipeline example is `kinesis_pipeline`. You may @@ -138,7 +138,7 @@ This resource reads a Kinesis stream and yields messages. It supports [incremental loading](../../general-usage/incremental-loading) and parses messages as json by default. -```python +```py @dlt.resource( name=lambda args: args["stream_name"], primary_key="_kinesis_msg_id", @@ -156,6 +156,7 @@ def kinesis_stream( parse_json: bool = True, chunk_size: int = 1000, ) -> Iterable[TDataItem]: + ... ``` `stream_name`: Name of the Kinesis stream. Defaults to config/secrets if unspecified. @@ -212,7 +213,7 @@ verified source. 1. Configure the [pipeline](../../general-usage/pipeline) by specifying the pipeline name, destination, and dataset as follows: - ```python + ```py pipeline = dlt.pipeline( pipeline_name="kinesis_pipeline", # Use a custom name if desired destination="duckdb", # Choose the appropriate destination (e.g., duckdb, redshift, post) @@ -222,7 +223,7 @@ verified source. 1. To load messages from a stream from the last one hour: - ```python + ```py # the resource below will take its name from the stream name, # it can be used multiple times by default it assumes that Data is json and parses it, # here we disable that to just get bytes in data elements of the message @@ -237,7 +238,7 @@ verified source. 1. For incremental Kinesis streams, to fetch only new messages: - ```python + ```py #running pipeline will get only new messages info = pipeline.run(kinesis_stream_data) message_counts = pipeline.last_trace.last_normalize_info.row_counts @@ -249,7 +250,7 @@ verified source. 1. To parse json with a simple decoder: - ```python + ```py def _maybe_parse_json(item: TDataItem) -> TDataItem: try: item.update(json.loadb(item["data"])) @@ -263,7 +264,7 @@ verified source. 1. 
To read Kinesis messages and send them somewhere without using a pipeline: - ```python + ```py from dlt.common.configuration.container import Container from dlt.common.pipeline import StateInjectableContext diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/arrow-pandas.md b/docs/website/docs/dlt-ecosystem/verified-sources/arrow-pandas.md index df968422d7..915a9d297a 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/arrow-pandas.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/arrow-pandas.md @@ -25,7 +25,7 @@ To write an Arrow source, pass any `pyarrow.Table`, `pyarrow.RecordBatch` or `pa This example loads a Pandas dataframe to a Snowflake table: -```python +```py import dlt from dlt.common import pendulum import pandas as pd @@ -45,7 +45,7 @@ pipeline.run(df, table_name="orders") A `pyarrow` table can be loaded in the same way: -```python +```py import pyarrow as pa # Create dataframe and pipeline same as above @@ -96,7 +96,7 @@ Usage is the same as without other dlt resources. Refer to the [incremental load Example: -```python +```py import dlt from dlt.common import pendulum import pandas as pd @@ -144,7 +144,7 @@ All struct types are represented as `complex` and will be loaded as JSON (if des even if they are present in the destination. If you want to represent nested data as separated tables, you must yield panda frames and arrow tables as records. In the examples above: -```python +```py # yield panda frame as records pipeline.run(df.to_dict(orient='records'), table_name="orders") diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/asana.md b/docs/website/docs/dlt-ecosystem/verified-sources/asana.md index 8554cdd376..9e3ee9c8fe 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/asana.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/asana.md @@ -56,7 +56,7 @@ To get started with your data pipeline, follow these steps: 1. Enter the following command: - ```bash + ```sh dlt init asana_dlt duckdb ``` @@ -94,16 +94,16 @@ For more information, read the [General Usage: Credentials.](../../general-usage 1. Before running the pipeline, ensure that you have installed all the necessary dependencies by running the command: - ```bash + ```sh pip install -r requirements.txt ``` 1. You're now ready to run the pipeline! To get started, run the following command: - ```bash + ```sh python asana_dlt_pipeline.py ``` 1. Once the pipeline has finished running, you can verify that everything loaded correctly by using the following command: - ```bash + ```sh dlt pipeline show ``` For example, the `pipeline_name` for the above pipeline example is `asana`, you may also use any @@ -127,7 +127,7 @@ it is important to note the complete list of the default endpoints given in This is a `dlt.source` function, which returns a list of DltResource objects: "workspaces", "projects", "sections","tags","tasks","stories", "teams", and "users". -```python +```py @dlt.source def asana_source(access_token: str = dlt.secrets.value) -> Any: return [ @@ -142,7 +142,7 @@ def asana_source(access_token: str = dlt.secrets.value) -> Any: This is a `dlt.resource` function, which returns collections of tasks and related information. -```python +```py @dlt.resource(write_disposition="replace") def workspaces( access_token: str = dlt.secrets.value, @@ -171,7 +171,7 @@ transformer functions transform or process data from one or more resources. The transformer function `projects` process data from the `workspaces` resource. 
It fetches and returns a list of projects for a given workspace from Asana. -```python +```py @dlt.transformer( data_from=workspaces, write_disposition="replace", @@ -200,7 +200,7 @@ It uses `@dlt.defer` decorator to enable parallel run in thread pool. This [incremental](../../general-usage/incremental-loading.md) resource-transformer fetches all tasks for a given project from Asana. -```python +```py @dlt.transformer(data_from=projects, write_disposition="merge", primary_key="gid") def tasks( project_array: t.List[TDataItem], @@ -235,7 +235,7 @@ these steps: 1. Configure the pipeline by specifying the pipeline name, destination, and dataset as follows: - ```python + ```py pipeline = dlt.pipeline( pipeline_name="asana_pipeline", # Use a custom name if desired destination="duckdb", # Choose the appropriate destination (e.g., duckdb, redshift, post) @@ -248,13 +248,13 @@ these steps: 1. To load the data from all the fields, you can utilise the `asana_source` method as follows: - ```python + ```py load_data = asana_source() ``` 1. Use the method `pipeline.run()` to execute the pipeline. - ```python + ```py load_info = pipeline.run(load_data) # print the information on data that was loaded print(load_info) @@ -263,7 +263,7 @@ these steps: 1. To use the method `pipeline.run()` to load custom endpoints “workspaces” and “projects”, the above script may be modified as: - ```python + ```py load_info = pipeline.run(load_data.with_resources("workspaces", "projects")) # print the information on data that was loaded print(load_info) diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/chess.md b/docs/website/docs/dlt-ecosystem/verified-sources/chess.md index 7f01b83f08..2341680d97 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/chess.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/chess.md @@ -36,7 +36,7 @@ To get started with your data pipeline, follow these steps: 1. Enter the following command: - ```bash + ```sh dlt init chess duckdb ``` @@ -66,20 +66,20 @@ For more information, read the [General Usage: Credentials.](../../general-usage 1. Before running the pipeline, ensure that you have installed all the necessary dependencies by running the command: - ```bash + ```sh pip install -r requirements.txt ``` 1. You're now ready to run the pipeline! To get started, run the following command: - ```bash + ```sh python chess_pipeline.py ``` 1. Once the pipeline has finished running, you can verify that everything loaded correctly by using the following command: - ```bash + ```sh dlt pipeline show ``` @@ -98,7 +98,7 @@ For more information, read the guide on [how to run a pipeline](../../walkthroug This is a `dlt.source` function for the Chess.com API named "chess", which returns a sequence of DltResource objects. That we'll discuss in subsequent sections as resources. -```python +```py dlt.source(name="chess") def source( players: List[str], start_month: str = None, end_month: str = None @@ -120,7 +120,7 @@ to fetch game data (in "YYYY/MM" format). This is a `dlt.resource` function, which returns player profiles for a list of player usernames. -```python +```py @dlt.resource(write_disposition="replace") def players_profiles(players: List[str]) -> Iterator[TDataItem]: @@ -138,7 +138,7 @@ It uses `@dlt.defer` decorator to enable parallel run in thread pool. This is a `dlt.resource` function, which returns url to game archives for specified players. 
-```python +```py @dlt.resource(write_disposition="replace", selected=False) def players_archives(players: List[str]) -> Iterator[List[TDataItem]]: ... @@ -154,7 +154,7 @@ runs. This incremental resource takes data from players and returns games for the last month if not specified otherwise. -```python +```py @dlt.resource(write_disposition="append") def players_games( players: List[str], start_month: str = None, end_month: str = None @@ -186,7 +186,7 @@ To create your data loading pipeline for players and load data, follow these ste 1. Configure the pipeline by specifying the pipeline name, destination, and dataset as follows: - ```python + ```py pipeline = dlt.pipeline( pipeline_name="chess_pipeline", # Use a custom name if desired destination="duckdb", # Choose the appropriate destination (e.g., duckdb, redshift, post) @@ -199,7 +199,7 @@ To create your data loading pipeline for players and load data, follow these ste 1. To load the data from all the resources for specific players (e.g. for November), you can utilise the `source` method as follows: - ```python + ```py # Loads games for Nov 2022 data = source( ["magnuscarlsen", "vincentkeymer", "dommarajugukesh", "rpragchess"], @@ -210,7 +210,7 @@ To create your data loading pipeline for players and load data, follow these ste 1. Use the method `pipeline.run()` to execute the pipeline. - ```python + ```py info = pipeline.run(data) # print the information on data that was loaded print(info) @@ -219,7 +219,7 @@ To create your data loading pipeline for players and load data, follow these ste 1. To load data from specific resources like "players_games" and "player_profiles", modify the above code as: - ```python + ```py info = pipeline.run(data.with_resources("players_games", "players_profiles")) # print the information on data that was loaded print(info) diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/facebook_ads.md b/docs/website/docs/dlt-ecosystem/verified-sources/facebook_ads.md index dea97921b4..0a0c64fb30 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/facebook_ads.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/facebook_ads.md @@ -66,9 +66,9 @@ By default, Facebook access tokens have a short lifespan of one hour. To exchang Facebook access token for a long-lived token, update the `.dlt/secrets.toml` with client_id, and client_secret and execute the provided Python code. -```python +```py from facebook_ads import get_long_lived_token -print(get_long_lived_token("your short-lived token") +print(get_long_lived_token("your short-lived token")) ``` Replace the `access_token` in the `.dlt/secrets.toml` file with the long-lived token obtained from @@ -77,7 +77,7 @@ the above code snippet. To retrieve the expiry date and the associated scopes of the token, you can use the following command: -```python +```py from facebook_ads import debug_access_token debug_access_token() ``` @@ -88,7 +88,7 @@ level. In `config.toml` / `secrets.toml`: ```toml [sources.facebook_ads] -access_token_expires_at=1688821881... +access_token_expires_at=1688821881 ``` > Note: The Facebook UI, which is described here, might change. @@ -101,7 +101,7 @@ To get started with your data pipeline, follow these steps: 1. Enter the following command: - ```bash + ```sh dlt init facebook_ads duckdb ``` @@ -158,16 +158,16 @@ For more information, read the [General Usage: Credentials.](../../general-usage 1. 
Before running the pipeline, ensure that you have installed all the necessary dependencies by running the command: - ```bash + ```sh pip install -r requirements.txt ``` 1. You're now ready to run the pipeline! To get started, run the following command: - ```bash + ```sh python facebook_ads_pipeline.py ``` 1. Once the pipeline has finished running, you can verify that everything loaded correctly by using the following command: - ```bash + ```sh dlt pipeline show ``` For example, the `pipeline_name` for the above pipeline example is `facebook_ads`, you may also @@ -191,7 +191,7 @@ it is important to note the complete list of the default endpoints given in This function returns a list of resources to load campaigns, ad sets, ads, creatives, and ad leads data from Facebook Marketing API. -```python +```py @dlt.source(name="facebook_ads") def facebook_ads_source( account_id: str = dlt.config.value, @@ -200,6 +200,7 @@ def facebook_ads_source( request_timeout: float = 300.0, app_api_version: str = None, ) -> Sequence[DltResource]: + ... ``` `account_id`: Account id associated with add manager, configured in "config.toml". @@ -220,7 +221,7 @@ were issued i.e. 'v17.0'. Defaults to the _facebook_business_ library default ve The ads function fetches ad data. It retrieves ads from a specified account with specific fields and states. -```python +```py @dlt.resource(primary_key="id", write_disposition="replace") def ads( fields: Sequence[str] = DEFAULT_AD_FIELDS, @@ -254,7 +255,7 @@ The default fields are defined in This function returns a list of resources to load facebook_insights. -```python +```py @dlt.source(name="facebook_ads") def facebook_insights_source( account_id: str = dlt.config.value, @@ -271,6 +272,7 @@ def facebook_insights_source( request_timeout: int = 300, app_api_version: str = None, ) -> DltResource: + ... ``` `account_id`: Account id associated with ads manager, configured in _config.toml_. @@ -315,13 +317,14 @@ were issued i.e. 'v17.0'. Defaults to the facebook_business library default vers This function fetches Facebook insights data incrementally from a specified start date until the current date, in day steps. -```python +```py @dlt.resource(primary_key=INSIGHTS_PRIMARY_KEY, write_disposition="merge") def facebook_insights( date_start: dlt.sources.incremental[str] = dlt.sources.incremental( "date_start", initial_value=initial_load_start_date_str ) ) -> Iterator[TDataItems]: + ... ``` `date_start`: Parameter sets the initial value for the "date_start" parameter in @@ -337,7 +340,7 @@ verified source. 1. Configure the pipeline by specifying the pipeline name, destination, and dataset as follows: - ```python + ```py pipeline = dlt.pipeline( pipeline_name="facebook_ads", # Use a custom name if desired destination="duckdb", # Choose the appropriate destination (e.g., duckdb, redshift, post) @@ -350,7 +353,7 @@ verified source. 1. To load all the data from, campaigns, ad sets, ads, ad creatives and leads. - ```python + ```py load_data = facebook_ads_source() load_info = pipeline.run(load_data) print(load_info) @@ -359,7 +362,7 @@ verified source. 1. To merge the Facebook Ads with the state “DISAPPROVED” and with ads state “PAUSED” you can do the following: - ```python + ```py load_data = facebook_ads_source() # It is recommended to enable root key propagation on a source that is not a merge one by default. this is not required if you always use merge but below we start with replace load_data.root_key = True @@ -382,7 +385,7 @@ verified source. 1. 
To load data with a custom field, for example, to load only “id” from Facebook ads, you can do the following: - ```python + ```py load_data = facebook_ads_source() # Only loads add ids, works the same for campaigns, leads etc. load_data.ads.bind(fields=("id",)) @@ -395,7 +398,7 @@ verified source. demonstrates how to enrich objects by adding an enrichment transformation that includes additional fields. - ```python + ```py # You can reduce the chunk size for smaller requests load_data = facebook_ads_source(chunk_size=2) @@ -429,7 +432,7 @@ verified source. breakdowns, etc. As defined in the `facebook_insights_source`. This function generates daily reports for a specified number of past days. - ```python + ```py load_data = facebook_insights_source( initial_load_past_days=30, attribution_window_days_lag= 7, diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/filesystem.md b/docs/website/docs/dlt-ecosystem/verified-sources/filesystem.md index aed19838ef..bf3d23d0a3 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/filesystem.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/filesystem.md @@ -81,7 +81,7 @@ To get started with your data pipeline, follow these steps: 1. Enter the following command: - ```bash + ```sh dlt init filesystem duckdb ``` @@ -150,32 +150,32 @@ For more information, read the 1. Before running the pipeline, ensure that you have installed all the necessary dependencies by running the command: - ```bash + ```sh pip install -r requirements.txt ``` 1. Install optional modules: - For AWS S3: - ```bash + ```sh pip install s3fs ``` - For Azure blob: - ```bash + ```sh pip install adlfs>=2023.9.0 ``` - GCS storage: No separate module needed. 1. You're now ready to run the pipeline! To get started, run the following command: - ```bash + ```sh python filesystem_pipeline.py ``` 1. Once the pipeline has finished running, you can verify that everything loaded correctly by using the following command: - ```bash + ```sh dlt pipeline show ``` @@ -197,13 +197,14 @@ This source offers chunked file readers as resources, which can be optionally cu - `read_jsonl()` - `read_parquet()` -```python +```py @dlt.source(_impl_cls=ReadersSource, spec=FilesystemConfigurationResource) def readers( bucket_url: str = dlt.secrets.value, credentials: Union[FileSystemCredentials, AbstractFileSystem] = dlt.secrets.value, file_glob: Optional[str] = "*", ) -> Tuple[DltResource, ...]: + ... ``` - `bucket_url`: The url to the bucket. @@ -225,7 +226,7 @@ This resource lists files in `bucket_url` based on the `file_glob` pattern, retu [FileItem](https://github.com/dlt-hub/dlt/blob/devel/dlt/common/storages/fsspec_filesystem.py#L22) with data access methods. These can be paired with transformers for enhanced processing. -```python +```py @dlt.resource( primary_key="file_url", spec=FilesystemConfigurationResource, standalone=True ) @@ -236,6 +237,7 @@ def filesystem( files_per_page: int = DEFAULT_CHUNK_SIZE, extract_content: bool = False, ) -> Iterator[List[FileItem]]: + ... ``` - `bucket_url`: URL of the bucket. @@ -256,9 +258,9 @@ in bucket URL. To load data into a specific table (instead of the default filesystem table), see the snippet below: -```python +```py @dlt.transformer(standalone=True) -def read_csv(items, chunksize: int = 15) ->: +def read_csv(items, chunksize: int = 15): """Reads csv file with Pandas chunk by chunk.""" ... 
@@ -275,7 +277,7 @@ Use the [standalone filesystem](../../general-usage/resource#declare-a-standalone-resource) resource to list files in s3, GCS, and Azure buckets. This allows you to customize file readers or manage files using [fsspec](https://filesystem-spec.readthedocs.io/en/latest/index.html). -```python +```py files = filesystem(bucket_url="s3://my_bucket/data", file_glob="csv_folder/*.csv") pipeline.run(files) ``` @@ -327,7 +329,7 @@ verified source. 1. Configure the pipeline by specifying the pipeline name, destination, and dataset as follows: - ```python + ```py pipeline = dlt.pipeline( pipeline_name="standard_filesystem", # Use a custom name if desired destination="duckdb", # Choose the appropriate destination (e.g., duckdb, redshift, post) @@ -337,17 +339,17 @@ verified source. 1. To read and load CSV files: - ```python + ```py BUCKET_URL = "YOUR_BUCKET_PATH_HERE" # path of the bucket url or local destination met_files = readers( bucket_url=BUCKET_URL, file_glob="directory/*.csv" - ).read_csv() - # tell dlt to merge on date - met_files.apply_hints(write_disposition="merge", merge_key="date") - # We load the data into the met_csv table - load_info = pipeline.run(met_files.with_name("table_name")) - print(load_info) - print(pipeline.last_trace.last_normalize_info) + ).read_csv() + # tell dlt to merge on date + met_files.apply_hints(write_disposition="merge", merge_key="date") + # We load the data into the met_csv table + load_info = pipeline.run(met_files.with_name("table_name")) + print(load_info) + print(pipeline.last_trace.last_normalize_info) ``` - The `file_glob` parameter targets all CSVs in the "met_csv/A801" directory. @@ -358,7 +360,7 @@ verified source. ::: 1. To load only new CSV files with [incremental loading](../../general-usage/incremental-loading): - ```python + ```py # This configuration will only consider new csv files new_files = filesystem(bucket_url=BUCKET_URL, file_glob="directory/*.csv") # add incremental on modification time @@ -369,7 +371,7 @@ verified source. ``` 1. To read and load Parquet and JSONL from a bucket: - ```python + ```py jsonl_reader = readers(BUCKET_URL, file_glob="**/*.jsonl").read_jsonl( chunksize=10000 ) @@ -391,7 +393,7 @@ verified source. 1. To set up a pipeline that reads from an Excel file using a standalone transformer: - ```python + ```py # Define a standalone transformer to read data from an Excel file. @dlt.transformer(standalone=True) def read_excel( @@ -427,7 +429,7 @@ verified source. 1. To copy files locally, add a step in the filesystem resource and then load the listing to the database: - ```python + ```py def _copy(item: FileItemDict) -> FileItemDict: # instantiate fsspec and copy file dest_file = os.path.join(local_folder, item["file_name"]) @@ -459,7 +461,7 @@ verified source. You can get a fsspec client from filesystem resource after it was extracted i.e. in order to delete processed files etc. The filesystem module contains a convenient method `fsspec_from_resource` that can be used as follows: - ```python + ```py from filesystem import filesystem, fsspec_from_resource # get filesystem source gs_resource = filesystem("gs://ci-test-bucket/") diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/github.md b/docs/website/docs/dlt-ecosystem/verified-sources/github.md index 2fd0277500..4c9a322760 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/github.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/github.md @@ -67,7 +67,7 @@ To get started with your data pipeline, follow these steps: 1. 
Enter the following command: - ```bash + ```sh dlt init github duckdb ``` @@ -110,16 +110,16 @@ For more information, read the [General Usage: Credentials.](../../general-usage 1. Before running the pipeline, ensure that you have installed all the necessary dependencies by running the command: - ```bash + ```sh pip install -r requirements.txt ``` 1. You're now ready to run the pipeline! To get started, run the following command: - ```bash + ```sh python github_pipeline.py ``` 1. Once the pipeline has finished running, you can verify that everything loaded correctly by using the following command: - ```bash + ```sh dlt pipeline show ``` For example, the `pipeline_name` for the above pipeline example is `github_reactions`, you may @@ -137,7 +137,7 @@ For more information, read the guide on [how to run a pipeline](../../walkthroug This `dlt.source` function uses GraphQL to fetch DltResource objects: issues and pull requests along with associated reactions, comments, and reactions to comments. -```python +```py @dlt.source def github_reactions( owner: str, @@ -147,6 +147,7 @@ def github_reactions( max_items: int = None, max_item_age_seconds: float = None, ) -> Sequence[DltResource]: + ... ``` `owner`: Refers to the owner of the repository. @@ -169,7 +170,7 @@ yet to be implemented. Defaults to None. The `dlt.resource` function employs the `_get_reactions_data` method to retrieve data about issues, their associated comments, and subsequent reactions. -```python +```py dlt.resource( _get_reactions_data( "issues", @@ -193,11 +194,12 @@ on event type. It loads new events only and appends them to tables. > Note: Github allows retrieving up to 300 events for public repositories, so frequent updates are > recommended for active repos. -```python +```py @dlt.source(max_table_nesting=2) def github_repo_events( owner: str, name: str, access_token: str = None ) -> DltResource: + ... ``` `owner`: Refers to the owner of the repository. @@ -216,13 +218,14 @@ Read more about [nesting levels](../../general-usage/source#reduce-the-nesting-l This `dlt.resource` function serves as the resource for the `github_repo_events` source. It yields repository events as data items. -```python +```py dlt.resource(primary_key="id", table_name=lambda i: i["type"]) # type: ignore def repo_events( last_created_at: dlt.sources.incremental[str] = dlt.sources.incremental( "created_at", initial_value="1970-01-01T00:00:00Z", last_value_func=max ) ) -> Iterator[TDataItems]: + ... ``` `primary_key`: Serves as the primary key, instrumental in preventing data duplication. @@ -244,7 +247,7 @@ verified source. 1. Configure the pipeline by specifying the pipeline name, destination, and dataset as follows: - ```python + ```py pipeline = dlt.pipeline( pipeline_name="github_pipeline", # Use a custom name if desired destination="duckdb", # Choose the appropriate destination (e.g., duckdb, redshift, post) @@ -258,7 +261,7 @@ verified source. 1. To load all the data from repo on issues, pull requests, their comments and reactions, you can do the following: - ```python + ```py load_data = github_reactions("duckdb", "duckdb") load_info = pipeline.run(load_data) print(load_info) @@ -267,7 +270,7 @@ verified source. 1. To load only the first 100 issues, you can do the following: - ```python + ```py load_data = github_reactions("duckdb", "duckdb", max_items=100) load_info = pipeline.run(load_data.with_resources("issues")) print(load_info) @@ -276,7 +279,7 @@ verified source. 1. You can use fetch and process repo events data incrementally. 
It loads all data during the first run and incrementally in subsequent runs. - ```python + ```py load_data = github_repo_events( "duckdb", "duckdb", access_token=os.getenv(ACCESS_TOKEN) ) diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/google_analytics.md b/docs/website/docs/dlt-ecosystem/verified-sources/google_analytics.md index b6a3a0a5a8..2d8be0b15d 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/google_analytics.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/google_analytics.md @@ -84,7 +84,7 @@ follow these steps: 1. Add the following scope: - ``` + ```text "https://www.googleapis.com/auth/analytics.readonly" ``` @@ -93,7 +93,7 @@ follow these steps: After configuring "client_id", "client_secret", and "project_id" in "secrets.toml", to generate the refresh token, run the following script from the root folder: -```bash +```sh python google_analytics/setup_script_gcp_oauth.py ``` @@ -128,7 +128,7 @@ To get started with your data pipeline, follow these steps: 1. Enter the following command: - ```bash + ```sh dlt init google_analytics duckdb ``` @@ -214,16 +214,16 @@ For more information, read the [General Usage: Credentials.](../../general-usage 1. Before running the pipeline, ensure that you have installed all the necessary dependencies by running the command: - ```bash + ```sh pip install -r requirements.txt ``` 1. You're now ready to run the pipeline! To get started, run the following command: - ```bash + ```sh python google_analytics_pipeline.py ``` 1. Once the pipeline has finished running, you can verify that everything loaded correctly by using the following command: - ```bash + ```sh dlt pipeline show ``` For example, the `pipeline_name` for the above pipeline example is @@ -241,7 +241,7 @@ For more information, read the guide on [how to run a pipeline](../../walkthroug This function returns a list of resources including metadata, metrics, and dimensions data from the Google Analytics API. -```python +```py @dlt.source(max_table_nesting=2) def google_analytics( credentials: Union[ GcpOAuthCredentials, GcpServiceAccountCredential ] = dlt.secrets.value, @@ -250,6 +250,7 @@ def google_analytics( start_date: Optional[str] = START_DATE, rows_per_page: int = 1000, ) -> List[DltResource]: + ... ``` `credentials`: GCP OAuth or service account credentials. @@ -269,9 +270,10 @@ set to 1000. This function retrieves all the metrics and dimensions for a report from a Google Analytics project. -```python +```py @dlt.resource(selected=False) def get_metadata(client: Resource, property_id: int) -> Iterator[Metadata]: + ... ``` `client`: This is the Google Analytics client used to make requests. @@ -284,7 +286,7 @@ def get_metadata(client: Resource, property_id: int) -> Iterator[Metadata]: This transformer function extracts data using metadata and populates a table called "metrics" with the data from each metric. -```python +```py @dlt.transformer(data_from=get_metadata, write_disposition="replace", name="metrics") def metrics_table(metadata: Metadata) -> Iterator[TDataItem]: for metric in metadata.metrics: @@ -304,7 +306,7 @@ verified source. 1. Configure the pipeline by specifying the pipeline name, destination, and dataset as follows: - ```python + ```py pipeline = dlt.pipeline( pipeline_name="google_analytics", # Use a custom name if desired destination="duckdb", # Choose the appropriate destination (e.g., duckdb, redshift, post) @@ -317,7 +319,7 @@ verified source. 1. 
To load all the data from metrics and dimensions: - ```python + ```py load_data = google_analytics() load_info = pipeline.run(load_data) print(load_info) @@ -328,7 +330,7 @@ verified source. 1. To load data from a specific start date: - ```python + ```py load_data = google_analytics(start_date='2023-01-01') load_info = pipeline.run(load_data) print(load_info) diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/google_sheets.md b/docs/website/docs/dlt-ecosystem/verified-sources/google_sheets.md index 2a5d4b03ab..be12f5aea4 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/google_sheets.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/google_sheets.md @@ -87,7 +87,7 @@ follow these steps: 1. Add the following scope: - ``` + ```text "https://www.googleapis.com/auth/spreadsheets.readonly" ``` @@ -98,7 +98,7 @@ follow these steps: After configuring "client_id", "client_secret" and "project_id" in "secrets.toml". To generate the refresh token, run the following script from the root folder: - ```bash + ```sh python google_sheets/setup_script_gcp_oauth.py ``` @@ -128,13 +128,13 @@ following: When setting up the pipeline, you can use either the browser-copied URL of your spreadsheet: -```bash +```sh https://docs.google.com/spreadsheets/d/1VTtCiYgxjAwcIw7UM1_BSaxC3rzIpr0HwXZwd2OlPD4/edit?usp=sharing ``` or spreadsheet id (which is a part of the url) -```bash +```sh 1VTtCiYgxjAwcIw7UM1_BSaxC3rzIpr0HwXZwd2OlPD4 ``` @@ -183,7 +183,7 @@ converted into tables, named after them and stored in the destination. 1. In range_names, you can enter as follows: - ``` + ```text range_names = ["Range_1","Range_2","Sheet1!A1:D10"] ``` @@ -214,7 +214,7 @@ To get started with your data pipeline, follow these steps: 1. Enter the following command: - ```bash + ```sh dlt init google_sheets duckdb ``` @@ -296,20 +296,20 @@ For more information, read the [General Usage: Credentials.](../../general-usage 1. Before running the pipeline, ensure that you have installed all the necessary dependencies by running the command: - ```bash + ```sh pip install -r requirements.txt ``` 1. You're now ready to run the pipeline! To get started, run the following command: - ```bash + ```sh python google_sheets_pipeline.py ``` 1. Once the pipeline has finished running, you can verify that everything loaded correctly by using the following command: - ```bash + ```sh dlt pipeline show ``` @@ -328,7 +328,7 @@ Also, since recently `dlt`'s no longer recognizing date and time types, so you h Use the `apply_hints` method on the resource to achieve this. Here's how you can do it: -```python +```py for resource in resources: resource.apply_hints(columns={ "total_amount": {"data_type": "double"}, @@ -340,7 +340,7 @@ This will ensure that all values in the `total_amount` column are treated as `do And `date` column will be represented as dates, not integers. For a single resource (e.g. `Sheet1`), you can simply use: -```python +```py source.Sheet1.apply_hints(columns={ "total_amount": {"data_type": "double"}, "date": {"data_type": "timestamp"}, @@ -348,7 +348,7 @@ source.Sheet1.apply_hints(columns={ ``` To get the name of resources, you can use: -```python +```py print(source.resources.keys()) ``` @@ -371,7 +371,7 @@ or set `full_refresh=True`. This function loads data from a Google Spreadsheet. It retrieves data from all specified ranges, whether explicitly defined or named, and obtains metadata for the first two rows within each range. 
-```python +```py def google_spreadsheet( spreadsheet_url_or_id: str = dlt.config.value, range_names: Sequence[str] = dlt.config.value, @@ -381,6 +381,7 @@ def google_spreadsheet( get_sheets: bool = False, get_named_ranges: bool = True, ) -> Iterable[DltResource]: + ... ``` `spreadsheet_url_or_id`: ID or URL of the Google Spreadsheet. @@ -399,7 +400,7 @@ def google_spreadsheet( This function processes each range name provided by the source function, loading its data into separate tables in the destination. -```python +```py dlt.resource( process_range(rows_data, headers=headers, data_types=data_types), name=name, @@ -429,7 +430,7 @@ This table refreshes after each load, storing information on loaded ranges: - Range name as given to the source. - String and parsed representation of the loaded range. -```python +```py dlt.resource( metadata_table, write_disposition="merge", @@ -457,7 +458,7 @@ verified source. 1. Configure the pipeline by specifying the pipeline name, destination, and dataset as follows: - ```python + ```py pipeline = dlt.pipeline( pipeline_name="google_sheets", # Use a custom name if desired destination="duckdb", # Choose the appropriate destination (e.g., duckdb, redshift, post) @@ -467,7 +468,7 @@ verified source. 1. To load data from explicit range names: - ```python + ```py load_data = google_spreadsheet( "https://docs.google.com/spreadsheets/d/1HhWHjqouQnnCIZAFa2rL6vT91YRN8aIhts22SUUR580/edit#gid=0", #Spreadsheet URL range_names=["range_name1", "range_name2"], # Range names @@ -483,7 +484,7 @@ verified source. 1. To load all the range_names from spreadsheet: - ```python + ```py load_data = google_spreadsheet( "https://docs.google.com/spreadsheets/d/1HhWHjqouQnnCIZAFa2rL6vT91YRN8aIhts22SUUR580/edit#gid=0", #Spreadsheet URL get_sheets=False, @@ -497,7 +498,7 @@ verified source. 1. To load all the sheets from spreadsheet: - ```python + ```py load_data = google_spreadsheet( "https://docs.google.com/spreadsheets/d/1HhWHjqouQnnCIZAFa2rL6vT91YRN8aIhts22SUUR580/edit#gid=0", #Spreadsheet URL get_sheets=True, @@ -511,7 +512,7 @@ verified source. 1. To load all the sheets and range_names: - ```python + ```py load_data = google_spreadsheet( "https://docs.google.com/spreadsheets/d/1HhWHjqouQnnCIZAFa2rL6vT91YRN8aIhts22SUUR580/edit#gid=0", #Spreadsheet URL get_sheets=True, @@ -525,7 +526,7 @@ verified source. 1. To load data from multiple spreadsheets: - ```python + ```py load_data1 = google_spreadsheet( "https://docs.google.com/spreadsheets/d/43lkHjqouQnnCIZAFa2rL6vT91YRN8aIhts22SUUR580/edit#gid=0", #Spreadsheet URL range_names=["Sheet 1!A1:B10"], @@ -543,7 +544,7 @@ verified source. 1. To load with table rename: - ```python + ```py load_data = google_spreadsheet( "https://docs.google.com/spreadsheets/d/43lkHjqouQnnCIZAFa2rL6vT91YRN8aIhts22SUUR580/edit#gid=0", #Spreadsheet URL range_names=["Sheet 1!A1:B10"], @@ -554,7 +555,6 @@ verified source. load_info = pipeline.run(load_data) print(load_info) - } ``` ### Using Airflow with Google Spreadsheets: @@ -583,7 +583,7 @@ Below is the correct way to set up an Airflow DAG for this purpose: - When adding the Google Spreadsheet task to the pipeline, avoid decomposing it; run it as a single task for efficiency. 
-```python +```py @dag( schedule_interval='@daily', start_date=pendulum.datetime(2023, 2, 1), diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/hubspot.md b/docs/website/docs/dlt-ecosystem/verified-sources/hubspot.md index 3a623c7b49..8a6e1d1bb3 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/hubspot.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/hubspot.md @@ -55,7 +55,7 @@ Follow these steps: - Read scopes for CMS, CRM, and Settings. - Permissions for: - ``` + ```text business-intelligence, actions, crm.export, e-commerce, oauth, tickets ``` @@ -74,7 +74,7 @@ To get started with your data pipeline, follow these steps: 1. Enter the following command: - ```bash + ```sh dlt init hubspot duckdb ``` @@ -115,16 +115,16 @@ For more information, read the [General Usage: Credentials.](../../general-usage 1. Before running the pipeline, ensure that you have installed all the necessary dependencies by running the command: - ```bash + ```sh pip install -r requirements.txt ``` 1. You're now ready to run the pipeline! To get started, run the following command: - ```bash + ```sh python hubspot_pipeline.py ``` 1. Once the pipeline has finished running, you can verify that everything loaded correctly by using the following command: - ```bash + ```sh dlt pipeline show ``` For example, the `pipeline_name` for the above pipeline example is `hubspot_pipeline`, you may @@ -148,12 +148,13 @@ it is important to note the complete list of the default endpoints given in This function returns a list of resources to load companies, contacts, deals, tickets, products, and web analytics events data into the destination. -```python +```py @dlt.source(name="hubspot") def hubspot( api_key: str = dlt.secrets.value, include_history: bool = False, ) -> Sequence[DltResource]: + ... ``` `api_key`: The key used to authenticate with the HubSpot API. Configured in "secrets.toml". @@ -166,7 +167,7 @@ specified entities. This resource function fetches data from the "companies" endpoint and loads it to the destination, replacing any existing data. -```python +```py @dlt.resource(name="companies", write_disposition="replace") def companies( api_key: str = api_key, @@ -195,7 +196,7 @@ in addition to the custom properties. Similar to this, resource functions "conta This function loads web analytics events for specific objects from Hubspot API into the destination. -```python +```py @dlt.resource def hubspot_events_for_objects( object_type: THubspotObjectType, @@ -203,6 +204,7 @@ def hubspot_events_for_objects( api_key: str = dlt.secrets.value, start_date: pendulum.DateTime = STARTDATE, ) -> DltResource: + ... ``` `object_type`: One of the Hubspot object types as defined in @@ -225,7 +227,7 @@ verified source. 1. Configure the pipeline by specifying the pipeline name, destination, and dataset as follows: - ```python + ```py pipeline = dlt.pipeline( pipeline_name="hubspot", # Use a custom name if desired destination="duckdb", # Choose the appropriate destination (e.g., duckdb, redshift, post) @@ -238,7 +240,7 @@ verified source. 1. To load all the data from contacts, companies, deals, products, tickets, and quotes into the destination. - ```python + ```py load_data = hubspot() load_info = pipeline.run(load_data) print(load_info) @@ -246,7 +248,7 @@ verified source. 1. To load data from contacts and companies, with time history using "with_resources" method. 
- ```python + ```py load_data = hubspot(include_history=True).with_resources("companies","contacts") load_info = pipeline.run(load_data) print(load_info) @@ -256,7 +258,7 @@ verified source. 1. By default, all the custom properties of a CRM object are extracted. If you want only particular fields, set the flag `include_custom_props=False` and add a list of properties with the `props` arg. - ```python + ```py load_data = hubspot() load_data.contacts.bind(props=["date_of_birth", "degree"], include_custom_props=False) load_info = pipeline.run(load_data.with_resources("contacts")) @@ -264,7 +266,7 @@ verified source. 1. If you want to read all the custom properties of CRM objects and some additional (e.g. Hubspot driven) properties. - ```python + ```py load_data = hubspot() load_data.contacts.bind(props=["hs_content_membership_email", "hs_content_membership_email_confirmed"]) load_info = pipeline.run(load_data.with_resources("contacts")) @@ -273,7 +275,7 @@ verified source. 1. To load the web analytics events of a given object type. - ```python + ```py resource = hubspot_events_for_objects("company", ["7086461639", "7086464459"]) # Here, object type : company, and object ids : 7086461639 and 7086464459 load_info = pipeline.run([resource]) diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/inbox.md b/docs/website/docs/dlt-ecosystem/verified-sources/inbox.md index 75106df609..668d1ec470 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/inbox.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/inbox.md @@ -67,7 +67,7 @@ To get started with your data pipeline, follow these steps: 1. Enter the following command: - ```bash + ```sh dlt init inbox duckdb ``` @@ -112,7 +112,7 @@ For more information, read the 1. Before running the pipeline, ensure that you have installed all the necessary dependencies by running the command: - ```bash + ```sh pip install -r requirements.txt ``` @@ -128,7 +128,7 @@ For more information, read the 2. Once the pipeline has finished running, you can verify that everything loaded correctly by using the following command: - ```bash + ```sh dlt pipeline show ``` For example, the `pipeline_name` for the above pipeline example is `standard_inbox`, you may also @@ -145,7 +145,7 @@ For more information, read the [Walkthrough: Run a pipeline.](../../walkthroughs This function fetches inbox emails, saves attachments locally, and returns uids, messages, and attachments as resources. -```python +```py @dlt.source def inbox_source( host: str = dlt.secrets.value, @@ -158,6 +158,7 @@ def inbox_source( filter_by_mime_type: Sequence[str] = None, chunksize: int = DEFAULT_CHUNK_SIZE, ) -> Sequence[DltResource]: + ... ``` `host` : IMAP server hostname. Default: 'dlt.secrets.value'. @@ -182,13 +183,14 @@ def inbox_source( This resource collects email message UIDs (Unique IDs) from the mailbox. -```python +```py @dlt.resource(name="uids") def get_messages_uids( initial_message_num: Optional[ dlt.sources.incremental[int] ] = dlt.sources.incremental("message_uid", initial_value=1), ) -> TDataItem: + ... ``` `initial_message_num`: provides incremental loading on UID. @@ -197,12 +199,13 @@ def get_messages_uids( This resource retrieves emails by UID (Unique IDs), yielding a dictionary with metadata like UID, ID, sender, subject, dates, content type, and body. -```python +```py @dlt.transformer(name="messages", primary_key="message_uid") def get_messages( items: TDataItems, include_body: bool = True, ) -> TDataItem: + ... 
``` `items`: An iterable containing dictionaries with 'message_uid' representing the email message UIDs. @@ -214,7 +217,7 @@ def get_messages( Similar to the previous resources, resource `get_attachments` extracts email attachments by UID from the IMAP server. It yields file items with attachments in the file_content field and the original email in the message field. -```python +```py @dlt.transformer( name="attachments", primary_key="file_hash", @@ -222,6 +225,7 @@ It yields file items with attachments in the file_content field and the original def get_attachments( items: TDataItems, ) -> Iterable[List[FileItem]]: + ... ``` `items`: An iterable containing dictionaries with 'message_uid' representing the email message UIDs. @@ -236,7 +240,7 @@ verified source. 1. Configure the pipeline by specifying the pipeline name, destination, and dataset as follows: - ```python + ```py pipeline = dlt.pipeline( pipeline_name="standard_inbox", # Use a custom name if desired destination="duckdb", # Choose the appropriate destination (e.g., duckdb, redshift, post) @@ -250,7 +254,7 @@ verified source. - Set `DEFAULT_START_DATE = pendulum.datetime(2023, 10, 1)` in `./inbox/settings.py`. - Use the following code: - ```python + ```py # Retrieve messages from the specified email address. messages = inbox_source(filter_emails=("mycreditcard@bank.com",)).messages # Configure messages to exclude body and name the result "my_inbox". @@ -263,7 +267,7 @@ verified source. > Please refer to inbox_source() docstring for email filtering options by sender, date, or mime type. 3. To load messages from multiple emails, including "community@dlthub.com": - ```python + ```py messages = inbox_source( filter_emails=("mycreditcard@bank.com", "community@dlthub.com.") ).messages @@ -272,7 +276,7 @@ verified source. 4. In `inbox_pipeline.py`, the `pdf_to_text` transformer extracts text from PDFs, treating each page as a separate data item. Using the `pdf_to_text` function to load parsed pdfs from mail to the database: - ```python + ```py filter_emails = ["mycreditcard@bank.com", "community@dlthub.com."] # Email senders attachments = inbox_source( filter_emails=filter_emails, filter_by_mime_type=["application/pdf"] diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/jira.md b/docs/website/docs/dlt-ecosystem/verified-sources/jira.md index c796014835..068251a927 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/jira.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/jira.md @@ -51,7 +51,7 @@ To get started with your data pipeline, follow these steps: 1. Enter the following command: - ```bash + ```sh dlt init jira duckdb ``` @@ -102,16 +102,16 @@ For more information, read [General Usage: Credentials.](../../general-usage/cre 1. Before running the pipeline, ensure that you have installed all the necessary dependencies by running the command: - ```bash + ```sh pip install -r requirements.txt ``` 1. You're now ready to run the pipeline! To get started, run the following command: - ```bash + ```sh python jira_pipeline.py ``` 1. Once the pipeline has finished running, you can verify that everything loaded correctly by using the following command: - ```bash + ```sh dlt pipeline show ``` For example, the `pipeline_name` for the above pipeline example is `jira_pipeline`. You may also @@ -134,13 +134,14 @@ it is important to note the complete list of the default endpoints given in This source function creates a list of resources to load data into the destination. 
-```python +```py @dlt.source def jira( subdomain: str = dlt.secrets.value, email: str = dlt.secrets.value, api_token: str = dlt.secrets.value, ) -> Iterable[DltResource]: + ... ``` - `subdomain`: The subdomain of the Jira account. Configured in ".dlt/secrets.toml". @@ -152,13 +153,14 @@ def jira( This function returns a resource for querying issues using JQL [(Jira Query Language)](https://support.atlassian.com/jira-service-management-cloud/docs/use-advanced-search-with-jira-query-language-jql/). -```python +```py @dlt.source def jira_search( subdomain: str = dlt.secrets.value, email: str = dlt.secrets.value, api_token: str = dlt.secrets.value, ) -> Iterable[DltResource]: + ... ``` The above function uses the same arguments `subdomain`, `email`, and `api_token` as described above @@ -168,7 +170,7 @@ for the [jira source](jira.md#source-jira). The resource function searches issues using JQL queries and then loads them to the destination. -```python +```py @dlt.resource(write_disposition="replace") def issues(jql_queries: List[str]) -> Iterable[TDataItem]: api_path = "rest/api/3/search" @@ -186,7 +188,7 @@ above. about pipeline configuration, please refer to our documentation [here](https://dlthub.com/docs/general-usage/pipeline): - ```python + ```py pipeline = dlt.pipeline( pipeline_name="jira_pipeline", # Use a custom name if desired destination="duckdb", # Choose the appropriate destination (e.g., duckdb, redshift, post) @@ -196,7 +198,7 @@ above. 2. To load custom endpoints such as “issues” and “users” using the jira source function: - ```python + ```py #Run the pipeline load_info = pipeline.run(jira().with_resources("issues","users")) print(f"Load Information: {load_info}") @@ -205,11 +207,11 @@ above. 3. To load the custom issues using JQL queries, you can use custom queries. Here is an example below: - ```python + ```py # Define the JQL queries as follows queries = [ "created >= -30d order by created DESC", - "created >= -30d AND project = DEV AND issuetype = Epic AND status = "In Progress" order by created DESC", + 'created >= -30d AND project = DEV AND issuetype = Epic AND status = "In Progress" order by created DESC', ] # Run the pipeline load_info = pipeline.run(jira_search().issues(jql_queries=queries)) diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/kafka.md b/docs/website/docs/dlt-ecosystem/verified-sources/kafka.md index 5bff03e357..87fa2d6927 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/kafka.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/kafka.md @@ -38,7 +38,7 @@ To get started with your data pipeline, follow these steps: 1. Enter the following command: - ```bash + ```sh dlt init kafka duckdb ``` @@ -80,20 +80,20 @@ sasl_password="example_secret" 1. Before running the pipeline, ensure that you have installed all the necessary dependencies by running the command: - ```bash + ```sh pip install -r requirements.txt ``` 2. You're now ready to run the pipeline! To get started, run the following command: - ```bash + ```sh python kafka_pipeline.py ``` 3. Once the pipeline has finished running, you can verify that everything loaded correctly by using the following command: - ```bash + ```sh dlt pipeline show ``` @@ -108,7 +108,7 @@ For more information, read the [Walkthrough: Run a pipeline](../../walkthroughs/ This function retrieves messages from the given Kafka topics. 
-```python +```py @dlt.resource(name="kafka_messages", table_name=lambda msg: msg["_kafka"]["topic"], standalone=True) def kafka_consumer( topics: Union[str, List[str]], @@ -118,6 +118,7 @@ def kafka_consumer( batch_timeout: Optional[int] = 3, start_from: Optional[TAnyDateTime] = None, ) -> Iterable[TDataItem]: + ... ``` `topics`: A list of Kafka topics to be extracted. @@ -151,7 +152,7 @@ this offset. 1. Configure the pipeline by specifying the pipeline name, destination, and dataset as follows: - ```python + ```py pipeline = dlt.pipeline( pipeline_name="kafka", # Use a custom name if desired destination="duckdb", # Choose the appropriate destination (e.g., duckdb, redshift, post) @@ -161,7 +162,7 @@ this offset. 2. To extract several topics: - ```python + ```py topics = ["topic1", "topic2", "topic3"] source = kafka_consumer(topics) @@ -170,7 +171,7 @@ this offset. 3. To extract messages and process them in a custom way: - ```python + ```py def custom_msg_processor(msg: confluent_kafka.Message) -> Dict[str, Any]: return { "_kafka": { @@ -187,7 +188,7 @@ this offset. 4. To extract messages, starting from a timestamp: - ```python + ```py data = kafka_consumer("topic", start_from=pendulum.datetime(2023, 12, 15)) pipeline.run(data) ``` diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/matomo.md b/docs/website/docs/dlt-ecosystem/verified-sources/matomo.md index 45841850c6..8be748b1a3 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/matomo.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/matomo.md @@ -44,7 +44,7 @@ To get started with your data pipeline, follow these steps: 1. Enter the following command: - ```bash + ```sh dlt init matomo duckdb ``` @@ -102,16 +102,16 @@ For more information, read the [General Usage: Credentials.](../../general-usage 1. Before running the pipeline, ensure that you have installed all the necessary dependencies by running the command: - ```bash + ```sh pip install -r requirements.txt ``` 1. You're now ready to run the pipeline! To get started, run the following command: - ```bash + ```sh python matomo_pipeline.py ``` 1. Once the pipeline has finished running, you can verify that everything loaded correctly by using the following command: - ```bash + ```sh dlt pipeline show ``` For example, the `pipeline_name` for the above pipeline example is `matomo`, you may also @@ -128,7 +128,7 @@ For more information, read the guide on [how to run a pipeline](../../walkthroug This function executes and loads a set of reports defined in "queries" for a specific Matomo site identified by "site_id". -```python +```py @dlt.source(max_table_nesting=2) def matomo_reports( api_token: str = dlt.secrets.value, @@ -136,6 +136,7 @@ def matomo_reports( queries: List[DictStrAny] = dlt.config.value, site_id: int = dlt.config.value, ) -> Iterable[DltResource]: + ... ``` `api_token`: API access token for Matomo server authentication, defaults to "./dlt/secrets.toml" @@ -152,7 +153,7 @@ def matomo_reports( The function loads visits from current day and the past `initial_load_past_days` in first run. In subsequent runs it continues from last load and skips active visits until closed. -```python +```py def matomo_visits( api_token: str = dlt.secrets.value, url: str = dlt.config.value, @@ -162,6 +163,7 @@ def matomo_visits( visit_max_duration_seconds: int = 3600, get_live_event_visitors: bool = False, ) -> List[DltResource]: + ... ``` `api_token`: API token for authentication, defaulting to "./dlt/secrets.toml". 
@@ -184,7 +186,7 @@ def matomo_visits( This function retrieves site visits within a specified timeframe. If a start date is given, it begins from that date. If not, it retrieves all visits up until now. -```python +```py @dlt.resource( name="visits", write_disposition="append", primary_key="idVisit", selected=True ) @@ -196,6 +198,7 @@ def get_last_visits( visit_max_duration_seconds: int = 3600, rows_per_page: int = 2000, ) -> Iterator[TDataItem]: + ... ``` `site_id`: Unique ID for each Matomo site. @@ -215,7 +218,7 @@ def get_last_visits( This function, retrieves unique visit information from get_last_visits. -```python +```py @dlt.transformer( data_from=get_last_visits, write_disposition="merge", @@ -225,6 +228,7 @@ This function, retrieves unique visit information from get_last_visits. def get_unique_visitors( visits: List[DictStrAny], client: MatomoAPIClient, site_id: int ) -> Iterator[TDataItem]: + ... ``` `visits`: Recent visit data within the specified timeframe. @@ -242,7 +246,7 @@ verified source. 1. Configure the pipeline by specifying the pipeline name, destination, and dataset as follows: - ```python + ```py pipeline = dlt.pipeline( pipeline_name="matomo", # Use a custom name if desired destination="duckdb", # Choose the appropriate destination (e.g., duckdb, redshift, post) @@ -255,7 +259,7 @@ verified source. 1. To load the data from reports. - ```python + ```py data_reports = matomo_reports() load_info = pipeline_reports.run(data_reports) print(load_info) @@ -264,7 +268,7 @@ verified source. 1. To load custom data from reports using queries. - ```python + ```py queries = [ { "resource_name": "custom_report_name", @@ -285,7 +289,7 @@ verified source. 1. To load data from reports and visits. - ```python + ```py data_reports = matomo_reports() data_events = matomo_visits() load_info = pipeline_reports.run([data_reports, data_events]) @@ -294,7 +298,7 @@ verified source. 1. To load data on live visits and visitors, and only retrieve data from today. - ```python + ```py load_data = matomo_visits(initial_load_past_days=1, get_live_event_visitors=True) load_info = pipeline_events.run(load_data) print(load_info) diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/mongodb.md b/docs/website/docs/dlt-ecosystem/verified-sources/mongodb.md index 9178d2ab6d..a30eb3f248 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/mongodb.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/mongodb.md @@ -66,30 +66,30 @@ Here are the typical ways to configure MongoDB and their connection URLs: 1. Connect to MongoDB: - ```bash + ```sh mongo "mongodb://dbuser:passwd@your_host:27017" ``` 1. List all Databases: - ```bash + ```sh show dbs ``` 1. View Collections in a Database: 1. Switch to Database: - ```bash + ```sh use your_database_name ``` 1. Display its Collections: - ```bash + ```sh show collections ``` 1. Disconnect: - ```bash + ```sh exit ``` @@ -115,7 +115,7 @@ To get started with your data pipeline, follow these steps: 1. Enter the following command: - ```bash + ```sh dlt init mongodb duckdb ``` @@ -174,16 +174,16 @@ For more information, read the [General Usage: Credentials.](../../general-usage 1. Before running the pipeline, ensure that you have installed all the necessary dependencies by running the command: - ```bash + ```sh pip install -r requirements.txt ``` 1. You're now ready to run the pipeline! To get started, run the following command: - ```bash + ```sh python mongodb_pipeline.py ``` 1. 
Once the pipeline has finished running, you can verify that everything loaded correctly by using the following command: - ```bash + ```sh dlt pipeline show ``` For example, the `pipeline_name` for the above pipeline example is `local_mongo`, you may also @@ -200,7 +200,7 @@ For more information, read the guide on [how to run a pipeline](../../walkthroug This function loads data from a MongoDB database, yielding one or multiple collections to be retrieved. -```python +```py @dlt.source def mongodb( connection_url: str = dlt.secrets.value, @@ -209,6 +209,7 @@ def mongodb( incremental: Optional[dlt.sources.incremental] = None, # type: ignore[type-arg] write_disposition: Optional[str] = dlt.config.value, ) -> Iterable[DltResource]: + ... ``` `connection_url`: MongoDB connection URL. @@ -226,7 +227,7 @@ def mongodb( This function fetches a single collection from a MongoDB database using PyMongo. -```python +```py def mongodb_collection( connection_url: str = dlt.secrets.value, database: Optional[str] = dlt.config.value, @@ -234,6 +235,7 @@ def mongodb_collection( incremental: Optional[dlt.sources.incremental] = None, # type: ignore[type-arg] write_disposition: Optional[str] = dlt.config.value, ) -> Any: + ... ``` `collection`: Name of the collection to load. @@ -247,7 +249,7 @@ verified source. 1. Configure the pipeline by specifying the pipeline name, destination, and dataset as follows: - ```python + ```py pipeline = dlt.pipeline( pipeline_name="mongodb_pipeline", # Use a custom name if desired destination="duckdb", # Choose the appropriate destination (e.g., duckdb, redshift, post) @@ -257,7 +259,7 @@ verified source. 1. To load all the collections in a database: - ```python + ```py load_data = mongodb() load_info = pipeline.run(load_data, write_disposition="replace") print(load_info) @@ -265,7 +267,7 @@ verified source. 1. To load a specific collections from the database: - ```python + ```py load_data = mongodb().with_resources("collection_1", "collection_2") load_info = pipeline.run(load_data, write_disposition="replace") print(load_info) @@ -273,7 +275,7 @@ verified source. 1. To load specific collections from the source incrementally: - ```python + ```py load_data = mongodb(incremental=dlt.sources.incremental("date")).with_resources("collection_1") load_info = pipeline.run(load_data, write_disposition = "merge") print(load_info) @@ -282,7 +284,7 @@ verified source. 1. To load data from a particular collection say "movies" incrementally: - ```python + ```py load_data = mongodb_collection( collection="movies", incremental=dlt.sources.incremental( @@ -300,7 +302,7 @@ verified source. 1. To incrementally load a table with an append-only disposition using hints: - ```python + ```py # Suitable for tables where new rows are added, but existing rows aren't updated. # Load data from the 'listingsAndReviews' collection in MongoDB, using 'last_scraped' for incremental addition. airbnb = mongodb().with_resources("listingsAndReviews") @@ -317,7 +319,7 @@ verified source. 1. 
To load a selected collection and rename it in the destination: - ```python + ```py # Create the MongoDB source and select the "collection_1" collection source = mongodb().with_resources("collection_1") diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/mux.md b/docs/website/docs/dlt-ecosystem/verified-sources/mux.md index a713121f29..338611e657 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/mux.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/mux.md @@ -46,7 +46,7 @@ To get started with your data pipeline, follow these steps: 1. Enter the following command: - ```bash + ```sh dlt init mux duckdb ``` @@ -88,16 +88,16 @@ For more information, read the [General Usage: Credentials.](../../general-usage 1. Before running the pipeline, ensure that you have installed all the necessary dependencies by running the command: - ```bash + ```sh pip install -r requirements.txt ``` 1. You're now ready to run the pipeline! To get started, run the following command: - ```bash + ```sh python mux_pipeline.py ``` 1. Once the pipeline has finished running, you can verify that everything loaded correctly by using the following command: - ```bash + ```sh dlt pipeline show ``` For example, the `pipeline_name` for the above pipeline example is @@ -115,7 +115,7 @@ For more information, read the guide on [how to run a pipeline](../../walkthroug This function yields resources "asset_resource" and "views_resource" to load video assets and views. -```python +```py @dlt.source def mux_source() -> Iterable[DltResource]: yield assets_resource @@ -126,13 +126,14 @@ def mux_source() -> Iterable[DltResource]: The assets_resource function fetches metadata about video assets from the Mux API's "assets" endpoint. -```python +```py @dlt.resource(write_disposition="merge") def assets_resource( mux_api_access_token: str = dlt.secrets.value, mux_api_secret_key: str = dlt.secrets.value, limit: int = DEFAULT_LIMIT, ) -> Iterable[TDataItem]: + ... ``` `mux_api_access_token`: Mux API token for authentication, defaults to ".dlt/secrets.toml". @@ -145,13 +146,14 @@ def assets_resource( This function yields data about every video view from yesterday to be loaded. -```python +```py @dlt.resource(write_disposition="append") def views_resource( mux_api_access_token: str = dlt.secrets.value, mux_api_secret_key: str = dlt.secrets.value, limit: int = DEFAULT_LIMIT, ) -> Iterable[DltResource]: + ... ``` The arguments `mux_api_access_token`, `mux_api_secret_key` and `limit` are the same as described [above](#resource-assets_resource) in "asset_resource". @@ -165,7 +167,7 @@ verified source. 1. Configure the pipeline by specifying the pipeline name, destination, and dataset as follows: - ```python + ```py pipeline = dlt.pipeline( pipeline_name="mux_pipeline", # Use a custom name if desired destination="bigquery", # Choose the appropriate destination (e.g., duckdb, redshift, post) @@ -175,21 +177,21 @@ verified source. 1. To load metadata about every asset to be loaded: - ```python - load_info = pipeline.run(mux_source().with_resources("assets_resource") + ```py + load_info = pipeline.run(mux_source().with_resources("assets_resource")) print(load_info) ``` 1. To load data for each video view from yesterday: - ```python - load_info = pipeline.run(mux_source().with_resources("views_resource") + ```py + load_info = pipeline.run(mux_source().with_resources("views_resource")) print(load_info) ``` 1. 
To load both metadata about assets and video views from yesterday: - ```python + ```py load_info = pipeline.run(mux_source()) print(load_info) ``` diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/notion.md b/docs/website/docs/dlt-ecosystem/verified-sources/notion.md index ffb0becfbb..650fc10fde 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/notion.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/notion.md @@ -50,7 +50,7 @@ To get started with your data pipeline, follow these steps: 1. Enter the following command: - ```bash + ```sh dlt init notion duckdb ``` @@ -93,16 +93,16 @@ For more information, read the [General Usage: Credentials.](../../general-usage 1. Before running the pipeline, ensure that you have installed all the necessary dependencies by running the command: - ```bash + ```sh pip install -r requirements.txt ``` 1. You're now ready to run the pipeline! To get started, run the following command: - ```bash + ```sh python notion_pipeline.py ``` 1. Once the pipeline has finished running, you can verify that everything loaded correctly by using the following command: - ```bash + ```sh dlt pipeline show ``` For example, the `pipeline_name` for the above pipeline example is `notion`, you may also use any @@ -119,12 +119,13 @@ For more information, read the guide on [how to run a pipeline](../../walkthroug This function loads notion databases from notion into the destination. -```python +```py @dlt.source def notion_databases( database_ids: Optional[List[Dict[str, str]]] = None, api_key: str = dlt.secrets.value, ) -> Iterator[DltResource]: + ... ``` `database_ids`: A list of dictionaries each containing a database id and a name. @@ -146,7 +147,7 @@ verified source. 1. Configure the pipeline by specifying the pipeline name, destination, and dataset as follows: - ```python + ```py pipeline = dlt.pipeline( pipeline_name="notion", # Use a custom name if desired destination="duckdb", # Choose the appropriate destination (e.g., duckdb, redshift, post) @@ -159,7 +160,7 @@ verified source. 1. To load all the integrated databases: - ```python + ```py load_data = notion_databases() load_info = pipeline.run(load_data) print(load_info) @@ -167,7 +168,7 @@ verified source. 1. To load the custom databases: - ```python + ```py selected_database_ids = [{"id": "0517dae9409845cba7d","use_name":"db_one"}, {"id": "d8ee2d159ac34cfc"}] load_data = notion_databases(database_ids=selected_database_ids) load_info = pipeline.run(load_data) @@ -176,7 +177,7 @@ verified source. The Database ID can be retrieved from the URL. For example if the URL is: - ```shell + ```sh https://www.notion.so/d8ee2d159ac34cfc85827ba5a0a8ae71?v=c714dec3742440cc91a8c38914f83b6b ``` diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/personio.md b/docs/website/docs/dlt-ecosystem/verified-sources/personio.md index 6fae36d0ec..af951bd21a 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/personio.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/personio.md @@ -57,7 +57,7 @@ To get started with your data pipeline, follow these steps: 1. Enter the following command: - ```bash + ```sh dlt init personio duckdb ``` @@ -102,16 +102,16 @@ For more information, read [Credentials](../../general-usage/credentials). 1. Before running the pipeline, ensure that you have installed all the necessary dependencies by running the command: - ```bash + ```sh pip install -r requirements.txt ``` 1. You're now ready to run the pipeline! 
To get started, run the following command: - ```bash + ```sh python personio_pipeline.py ``` 1. Once the pipeline has finished running, you can verify that everything loaded correctly by using the following command: - ```bash + ```sh dlt pipeline show ``` For example, the `pipeline_name` for the above pipeline example is `personio`, you may also use @@ -127,7 +127,7 @@ For more information, read [Run a pipeline.](../../walkthroughs/run-a-pipeline) ### Source `personio_source` This `dlt` source returns data resources like `employees`, `absences`, `absence_types`, etc. -```python +```py @dlt.source(name="personio") def personio_source( client_id: str = dlt.secrets.value, @@ -158,8 +158,8 @@ def personio_source( This resource retrieves data on all the employees in a company. -```python - @dlt.resource(primary_key="id", write_disposition="merge") +```py +@dlt.resource(primary_key="id", write_disposition="merge") def employees( updated_at: dlt.sources.incremental[ pendulum.DateTime @@ -185,9 +185,10 @@ data incrementally from the Personio API to your preferred destination. ### Resource `absence_types` Simple resource, which retrieves a list of various types of employee absences. -```python +```py @dlt.resource(primary_key="id", write_disposition="replace") def absence_types(items_per_page: int = items_per_page) -> Iterable[TDataItem]: + ... ... ``` @@ -209,7 +210,7 @@ The transformer functions transform or process data from resources. The transformer function `employees_absences_balance` process data from the `employees` resource. It fetches and returns a list of the absence balances for each employee. -```python +```py @dlt.transformer( data_from=employees, write_disposition="merge", @@ -232,7 +233,7 @@ verified source. 1. Configure the [pipeline](../../general-usage/pipeline) by specifying the pipeline name, destination, and dataset as follows: - ```python + ```py pipeline = dlt.pipeline( pipeline_name="personio", # Use a custom name if desired destination="duckdb", # Choose the appropriate destination (e.g., duckdb, redshift, post) @@ -242,14 +243,14 @@ verified source. 1. To load employee data: - ```python + ```py load_data = personio_source().with_resources("employees") print(pipeline.run(load_data)) ``` 1. To load data from all supported endpoints: - ```python + ```py load_data = personio_source() print(pipeline.run(load_data)) ``` diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/pipedrive.md b/docs/website/docs/dlt-ecosystem/verified-sources/pipedrive.md index 1da5205471..9b2c8a640f 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/pipedrive.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/pipedrive.md @@ -53,7 +53,7 @@ To get started with your data pipeline, follow these steps: 1. Enter the following command: - ```bash + ```sh dlt init pipedrive duckdb ``` @@ -93,16 +93,16 @@ For more information, read the [General Usage: Credentials.](../../general-usage 1. Before running the pipeline, ensure that you have installed all the necessary dependencies by running the command: - ```bash + ```sh pip install -r requirements.txt ``` 1. You're now ready to run the pipeline! To get started, run the following command: - ```bash + ```sh python pipedrive_pipeline.py ``` 1. Once the pipeline has finished running, you can verify that everything loaded correctly by using the following command: - ```bash + ```sh dlt pipeline show ``` For example, the `pipeline_name` for the above pipeline example is `pipedrive`, but you may also use @@ -138,12 +138,13 @@ Pipedrive API. 
This function returns a list of resources including activities, deals, custom_fields_mapping and other resources data from Pipedrive API. -```python +```py @dlt.source(name="pipedrive") def pipedrive_source( pipedrive_api_key: str = dlt.secrets.value, since_timestamp: Optional[Union[pendulum.DateTime, str]] = dlt.config.value, ) -> Iterator[DltResource]: + ... ``` `pipedrive_api_key`: Authentication token for Pipedrive, configured in ".dlt/secrets.toml". @@ -159,7 +160,7 @@ This code generates resources for each entity in [RECENTS_ENTITIES](https://github.com/dlt-hub/verified-sources/blob/master/sources/pipedrive/settings.py), stores them in endpoints_resources, and then loads data from each endpoint to the destination. -```python +```py endpoints_resources = {} for entity, resource_name in RECENTS_ENTITIES.items(): endpoints_resources[resource_name] = dlt.resource( @@ -186,7 +187,7 @@ for entity, resource_name in RECENTS_ENTITIES.items(): This function gets the participants of deals from the Pipedrive API and yields the result. -```python +```py def pipedrive_source(args): # Rest of function yield endpoints_resources["deals"] | dlt.transformer( @@ -209,12 +210,13 @@ further processing or loading. This function preserves the mapping of custom fields across different pipeline runs. It is used to create and store a mapping of custom fields for different entities in the source state. -```python +```py @dlt.resource(selected=False) def create_state(pipedrive_api_key: str) -> Iterator[Dict[str, Any]]: def _get_pages_for_rename( entity: str, fields_entity: str, pipedrive_api_key: str ) -> Dict[str, Any]: + ... ``` It processes each entity in ENTITY_MAPPINGS, updating the custom fields mapping if a related fields @@ -238,7 +240,7 @@ verified source. 1. Configure the pipeline by specifying the pipeline name, destination, and dataset as follows: - ```python + ```py pipeline = dlt.pipeline( pipeline_name="pipedrive", # Use a custom name if desired destination="duckdb", # Choose the appropriate destination (e.g., duckdb, redshift, post) @@ -251,7 +253,7 @@ verified source. 1. To print source info: - ```python + ```py pipedrive_data = pipedrive_source() #print source info print(pipedrive_data) @@ -263,7 +265,7 @@ verified source. 1. To load all the data in Pipedrive: - ```python + ```py load_data = pipedrive_source() # calls the source function load_info = pipeline.run(load_data) #runs the pipeline with selected source configuration print(load_info) @@ -271,7 +273,7 @@ verified source. 1. To load data from selected resources: - ```python + ```py #To load custom fields, include custom_fields_mapping for hash to name mapping. load_data = pipedrive_source().with_resources("products", "deals", "deals_participants", "custom_fields_mapping") load_info = pipeline.run(load_data) #runs the pipeline loading selected data @@ -280,7 +282,7 @@ verified source. 1. To load data from a start date: - ```python + ```py # Configure a source for 'activities' starting from the specified date. # The 'custom_fields_mapping' is incorporated to convert custom field hashes into their respective names. activities_source = pipedrive_source( diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/salesforce.md b/docs/website/docs/dlt-ecosystem/verified-sources/salesforce.md index aa8fbe10d4..7d6b6e036a 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/salesforce.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/salesforce.md @@ -63,7 +63,7 @@ To get started with your data pipeline, follow these steps: 1. 
Enter the following command: - ```bash + ```sh dlt init salesforce duckdb ``` @@ -110,16 +110,16 @@ For more information, read the [General Usage: Credentials.](../../general-usage 1. Before running the pipeline, ensure that you have installed all the necessary dependencies by running the command: - ```bash + ```sh pip install -r requirements.txt ``` 1. You're now ready to run the pipeline! To get started, run the following command: - ```bash + ```sh python salesforce_pipeline.py ``` 1. Once the pipeline has finished running, you can verify that everything loaded correctly by using the following command: - ```bash + ```sh dlt pipeline show ``` For example, the `pipeline_name` for the above pipeline example is `salesforce`, you may also use @@ -137,13 +137,14 @@ For more information, read the guide on [how to run a pipeline](../../walkthroug This function returns a list of resources to load users, user_role, opportunity, opportunity_line_item, account etc. data from Salesforce API. -```python +```py @dlt.source(name="salesforce") def salesforce_source( user_name: str = dlt.secrets.value, password: str = dlt.secrets.value, security_token: str = dlt.secrets.value, ) ->Iterable[DltResource]: + ... ``` - `user_name`: Your Salesforce account username. @@ -156,7 +157,7 @@ def salesforce_source( This resource function retrieves records from the Salesforce "User" endpoint. -```python +```py @dlt.resource(write_disposition="replace") def sf_user() -> Iterator[Dict[str, Any]]: yield from get_records(client, "User") @@ -176,7 +177,7 @@ the "user_role" endpoint. This resource function retrieves records from the Salesforce "Opportunity" endpoint in incremental mode. -```python +```py @dlt.resource(write_disposition="merge") def opportunity( last_timestamp: Incremental[str] = dlt.sources.incremental( @@ -215,7 +216,7 @@ To create your data pipeline using single loading and 1. Configure the pipeline by specifying the pipeline name, destination, and dataset as follows: - ```python + ```py pipeline = dlt.pipeline( pipeline_name="salesforce_pipeline", # Use a custom name if desired destination="duckdb", # Choose the appropriate destination (e.g., duckdb, redshift, post) @@ -228,7 +229,7 @@ To create your data pipeline using single loading and 1. To load data from all the endpoints, use the `salesforce_source` method as follows: - ```python + ```py load_data = salesforce_source() source.schema.merge_hints({"not_null": ["id"]}) # Hint for id field not null load_info = pipeline.run(load_data) @@ -241,7 +242,7 @@ To create your data pipeline using single loading and 1. To use the method `pipeline.run()` to load custom endpoints “candidates” and “members”: - ```python + ```py load_info = pipeline.run(load_data.with_resources("opportunity", "contact")) # print the information on data that was loaded print(load_info) @@ -260,7 +261,7 @@ To create your data pipeline using single loading and 1. 
To load data from the “contact” in replace mode and “task” incrementally merge mode endpoints: - ```python + ```py load_info = pipeline.run(load_data.with_resources("contact", "task")) # pretty print the information on data that was loaded print(load_info) diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/shopify.md b/docs/website/docs/dlt-ecosystem/verified-sources/shopify.md index 09dc392c87..af00b17703 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/shopify.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/shopify.md @@ -61,7 +61,7 @@ To get started with your data pipeline, follow these steps: 1. Enter the following command: - ```bash + ```sh dlt init shopify_dlt duckdb ``` @@ -125,16 +125,16 @@ For more information, read the [General Usage: Credentials.](../../general-usage 1. Before running the pipeline, ensure that you have installed all the necessary dependencies by running the command: - ```bash + ```sh pip install -r requirements.txt ``` 1. You're now ready to run the pipeline! To get started, run the following command: - ```bash + ```sh python shopify_dlt_pipeline.py ``` 1. Once the pipeline has finished running, you can verify that everything loaded correctly by using the following command: - ```bash + ```sh dlt pipeline show ``` For example, the `pipeline_name` for the above pipeline example is `shopify_data`, you may also @@ -152,7 +152,7 @@ For more information, read the guide on [how to run a pipeline](../../walkthroug This function returns a list of resources to load products, orders, and customers data from Shopify API. -```python +```py def shopify_source( private_app_password: str = dlt.secrets.value, api_version: str = DEFAULT_API_VERSION, @@ -163,6 +163,7 @@ def shopify_source( items_per_page: int = DEFAULT_ITEMS_PER_PAGE, order_status: TOrderStatus = "any", ) -> Iterable[DltResource]: + ... ``` `private_app_password`: App's password for your shop. @@ -188,7 +189,7 @@ incremental loading if unspecified. This resource loads products from your Shopify shop into the destination. It supports incremental loading and pagination. -```python +```py @dlt.resource(primary_key="id", write_disposition="merge") def products( updated_at: dlt.sources.incremental[ @@ -202,6 +203,7 @@ def products( created_at_min: pendulum.DateTime = created_at_min_obj, items_per_page: int = items_per_page, ) -> Iterable[TDataItem]: + ... ``` `updated_at`: The saved [state](../../general-usage/state) of the last 'updated_at' value. @@ -212,7 +214,7 @@ support incremental loading and pagination. ### Resource `shopify_partner_query`: This resource can be used to run custom GraphQL queries to load paginated data. -```python +```py @dlt.resource def shopify_partner_query( query: str, @@ -224,6 +226,7 @@ def shopify_partner_query( organization_id: str = dlt.config.value, api_version: str = DEFAULT_PARTNER_API_VERSION, ) -> Iterable[TDataItem]: + ... ``` `query`: The GraphQL query for execution. @@ -251,7 +254,7 @@ verified source. 1. Configure the pipeline by specifying the pipeline name, destination, and dataset as follows: - ```python + ```py pipeline = dlt.pipeline( pipeline_name="shopify", # Use a custom name if desired destination="duckdb", # Choose the appropriate destination (e.g., duckdb, redshift, post) @@ -264,7 +267,7 @@ verified source. 1. To load data from "products", "orders" and "customers" from 1st Jan 2023. - ```python + ```py # Add your desired resources to the list... 
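   # Only resource names exposed by shopify_source can be selected below;
   # "products", "orders" and "customers" are among the endpoints it provides.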
resources = ["products", "orders", "customers"] start_date="2023-01-01" @@ -278,7 +281,7 @@ verified source. minimizes potential failure during large data loads. Running chunks and incremental loads in parallel accelerates the initial load. - ```python + ```py # Load all orders from 2023-01-01 to now min_start_date = current_start_date = pendulum.datetime(2023, 1, 1) max_end_date = pendulum.now() @@ -310,7 +313,7 @@ verified source. print(load_info) ``` 1. To load the first 10 transactions via GraphQL query from the Shopify Partner API. - ```python + ```py # Construct query to load transactions 100 per page, the `$after` variable is used to paginate query = """query Transactions($after: String) { transactions(after: $after, first: 10) { diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/slack.md b/docs/website/docs/dlt-ecosystem/verified-sources/slack.md index 647e39a427..104eeff388 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/slack.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/slack.md @@ -67,7 +67,7 @@ To get started with your data pipeline, follow these steps: 1. Enter the following command: - ```bash + ```sh dlt init slack duckdb ``` @@ -107,20 +107,20 @@ For more information, read the [General Usage: Credentials.](../../general-usage 1. Before running the pipeline, ensure that you have installed all the necessary dependencies by running the command: - ```bash + ```sh pip install -r requirements.txt ``` 1. You're now ready to run the pipeline! To get started, run the following command: - ```bash + ```sh python slack_pipeline.py ``` 1. Once the pipeline has finished running, you can verify that everything loaded correctly by using the following command: - ```bash + ```sh dlt pipeline show ``` @@ -138,7 +138,7 @@ For more information, read the [General Usage: Credentials.](../../general-usage It retrieves data from Slack's API and fetches the Slack data such as channels, messages for selected channels, users, logs. -```python +```py @dlt.source(name="slack", max_table_nesting=2) def slack_source( page_size: int = MAX_PAGE_SIZE, @@ -147,6 +147,7 @@ def slack_source( end_date: Optional[TAnyDateTime] = None, selected_channels: Optional[List[str]] = dlt.config.value, ) -> Iterable[DltResource]: + ... ``` `page_size`: Maximum items per page (default: 1000). @@ -163,25 +164,27 @@ def slack_source( This function yields all the channels data as a `dlt` resource. -```python +```py @dlt.resource(name="channels", primary_key="id", write_disposition="replace") def channels_resource() -> Iterable[TDataItem]: + ... ``` ### Resource `users` This function yields all the users data as a `dlt` resource. -```python +```py @dlt.resource(name="users", primary_key="id", write_disposition="replace") def users_resource() -> Iterable[TDataItem]: + ... ``` ### Resource `get_messages_resource` This method fetches messages for a specified channel from the Slack API. It creates a resource for each channel with the channel's name. -```python +```py def get_messages_resource( channel_data: Dict[str, Any], created_at: dlt.sources.incremental[DateTime] = dlt.sources.incremental( @@ -191,6 +194,7 @@ def get_messages_resource( allow_external_schedulers=True, ), ) -> Iterable[TDataItem]: + ... ``` `channel_data`: A dictionary detailing a specific channel to determine where messages are fetched from. @@ -209,7 +213,7 @@ def get_messages_resource( This method retrieves access logs from the Slack API. 
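Because the resource is declared with `selected=False` (see the signature below), it is skipped unless you enable it explicitly. A minimal sketch, assuming the source was scaffolded with `dlt init slack duckdb` and credentials are configured in `.dlt/secrets.toml`; the pipeline and dataset names are illustrative:

```py
import dlt
from datetime import datetime
from slack import slack_source  # assumed import path inside the scaffolded verified source

pipeline = dlt.pipeline(
    pipeline_name="slack", destination="duckdb", dataset_name="slack_data"
)

source = slack_source(start_date=datetime(2023, 9, 1), end_date=datetime(2023, 9, 8))
# access_logs is skipped by default; the endpoint is available on paid Slack plans only
source.access_logs.selected = True

print(pipeline.run(source))
```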
-```python +```py @dlt.resource( name="access_logs", selected=False, @@ -218,6 +222,7 @@ This method retrieves access logs from the Slack API. ) # it is not an incremental resource it just has a end_date filter def logs_resource() -> Iterable[TDataItem]: + ... ``` `selected`: A boolean set to False, indicating the resource isn't loaded by default. @@ -235,7 +240,7 @@ verified source. 1. Configure the pipeline by specifying the pipeline name, destination, and dataset as follows: - ```python + ```py pipeline = dlt.pipeline( pipeline_name="slack", # Use a custom name if desired destination="duckdb", # Choose the appropriate destination (e.g., duckdb, redshift, post) @@ -244,7 +249,7 @@ verified source. ``` 1. To load Slack resources from the specified start date: - ```python + ```py source = slack_source(page_size=1000, start_date=datetime(2023, 9, 1), end_date=datetime(2023, 9, 8)) # Enable below to load only 'access_logs', available for paid accounts only. @@ -258,7 +263,7 @@ verified source. 1. To load data from selected Slack channels from the specified start date: - ```python + ```py # To load data from selected channels. selected_channels=["general", "random"] # Enter the channel names here. @@ -275,7 +280,7 @@ verified source. 1. To load only messages from selected Slack resources: - ```python + ```py # To load data from selected channels. selected_channels=["general", "random"] # Enter the channel names here. diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/sql_database.md b/docs/website/docs/dlt-ecosystem/verified-sources/sql_database.md index 67965863ce..56fc826ce8 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/sql_database.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/sql_database.md @@ -58,8 +58,8 @@ The database above doesn't require a password. The connection URL can be broken down into: -```python -connection_url = "connection_string = f"{drivername}://{username}:{password}@{host}:{port}/{database}" +```py +connection_url = connection_string = f"{drivername}://{username}:{password}@{host}:{port}{database}" ``` `drivername`: Indicates both the database system and driver used. @@ -116,7 +116,7 @@ To get started with your data pipeline, follow these steps: 1. Enter the following command: - ```bash + ```sh dlt init sql_database duckdb ``` @@ -158,7 +158,7 @@ For more information, read the guide on [how to add a verified source](../../wal 1. You can also pass credentials in the pipeline script the following way: - ```python + ```py credentials = ConnectionStringCredentials( "mysql+pymysql://rfamro@mysql-rfam-public.ebi.ac.uk:4497/Rfam" ) @@ -176,19 +176,19 @@ For more information, read the [General Usage: Credentials.](../../general-usage 1. Install the necessary dependencies by running the following command: - ```bash + ```sh pip install -r requirements.txt ``` 1. Run the verified source by entering: - ```bash + ```sh python sql_database_pipeline.py ``` 1. Make sure that everything is loaded as expected with: - ```bash + ```sh dlt pipeline show ``` @@ -208,7 +208,7 @@ For more information, read the [General Usage: Credentials.](../../general-usage This function loads data from an SQL database via SQLAlchemy and auto-creates resources for each table or from a specified list of tables. 
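Before the full signature below, here is a minimal sketch of the most common call, assuming the source was scaffolded with `dlt init sql_database duckdb` and the connection string is configured in `.dlt/secrets.toml`; the pipeline and dataset names are illustrative:

```py
import dlt
from sql_database import sql_database  # assumed import path inside the scaffolded verified source

pipeline = dlt.pipeline(
    pipeline_name="rfam", destination="duckdb", dataset_name="rfam_data"
)

# Credentials are read from .dlt/secrets.toml; every reflected table becomes a resource
source = sql_database()
print(pipeline.run(source, write_disposition="replace"))
```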
-```python +```py @dlt.source def sql_database( credentials: Union[ConnectionStringCredentials, Engine, str] = dlt.secrets.value, @@ -220,6 +220,7 @@ def sql_database( defer_table_reflect: Optional[bool] = dlt.config.value, table_adapter_callback: Callable[[Table], None] = None, ) -> Iterable[DltResource]: + ... ``` `credentials`: Database details or an 'sqlalchemy.Engine' instance. @@ -244,7 +245,7 @@ remove certain columns to be selected. This function loads data from specific database tables. -```python +```py @dlt.common.configuration.with_config( sections=("sources", "sql_database"), spec=SqlTableResourceConfiguration ) @@ -259,6 +260,7 @@ def sql_table( defer_table_reflect: Optional[bool] = dlt.config.value, table_adapter_callback: Callable[[Table], None] = None, ) -> DltResource: + ... ``` `incremental`: Optional, enables incremental loading. @@ -284,7 +286,7 @@ certain range. 1. Consider a table with a `last_modified` timestamp column. By setting this column as your cursor and specifying an initial value, the loader generates a SQL query filtering rows with `last_modified` values greater than the specified initial value. - ```python + ```py from sql_database import sql_table from datetime import datetime @@ -303,7 +305,7 @@ certain range. 1. To incrementally load the "family" table using the sql_database source method: - ```python + ```py source = sql_database().with_resources("family") #using the "updated" field as an incremental field using initial value of January 1, 2022, at midnight source.family.apply_hints(incremental=dlt.sources.incremental("updated"),initial_value=pendulum.DateTime(2022, 1, 1, 0, 0, 0)) @@ -315,7 +317,7 @@ certain range. 1. To incrementally load the "family" table using the 'sql_table' resource. - ```python + ```py family = sql_table( table="family", incremental=dlt.sources.incremental( @@ -342,7 +344,7 @@ When running on Airflow ### Parallel extraction You can extract each table in a separate thread (no multiprocessing at this point). This will decrease loading time if your queries take time to execute or your network latency/speed is low. -```python +```py database = sql_database().parallelize() table = sql_table().parallelize() ``` @@ -358,7 +360,7 @@ To create your own pipeline, use source and resource methods from this verified 1. Configure the pipeline by specifying the pipeline name, destination, and dataset as follows: - ```python + ```py pipeline = dlt.pipeline( pipeline_name="rfam", # Use a custom name if desired destination="duckdb", # Choose the appropriate destination (e.g., duckdb, redshift, post) @@ -370,7 +372,7 @@ To create your own pipeline, use source and resource methods from this verified 1. To load the entire database, use the `sql_database` source as: - ```python + ```py source = sql_database() info = pipeline.run(source, write_disposition="replace") print(info) @@ -378,7 +380,7 @@ To create your own pipeline, use source and resource methods from this verified 1. If you just need the "family" table, use: - ```python + ```py source = sql_database().with_resources("family") #running the pipeline info = pipeline.run(source, write_disposition="replace") @@ -389,7 +391,7 @@ To create your own pipeline, use source and resource methods from this verified [documentation](https://dlthub.com/docs/general-usage/customising-pipelines/pseudonymizing_columns). 
As an example, here's how to pseudonymize the "rfam_acc" column in the "family" table: - ```python + ```py import hashlib def pseudonymize_name(doc): @@ -421,7 +423,7 @@ To create your own pipeline, use source and resource methods from this verified 1. To exclude columns, such as the "rfam_id" column from the "family" table before loading: - ```python + ```py def remove_columns(doc): del doc["rfam_id"] return doc diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/strapi.md b/docs/website/docs/dlt-ecosystem/verified-sources/strapi.md index 4ddf20aa78..0ac1fe7acf 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/strapi.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/strapi.md @@ -50,7 +50,7 @@ To get started with your data pipeline, follow these steps: 1. Enter the following command: - ```bash + ```sh dlt init strapi duckdb ``` @@ -73,7 +73,7 @@ For more information, read the guide on [how to add a verified source](../../wal information securely, like access tokens. Keep this file safe. Here's its format for service account authentication: - ```python + ```py # put your secret values and credentials here. do not share this file and do not push it to github [sources.strapi] api_secret_key = "api_secret_key" # please set me up! @@ -96,13 +96,13 @@ For more information, read the [General Usage: Credentials.](../../general-usage 1. Before running the pipeline, ensure that you have installed all the necessary dependencies by running the command: - ```bash + ```sh pip install -r requirements.txt ``` 1. You're now ready to run the pipeline! To get started, run the following command: - ```bash + ```sh python strapi_pipeline.py ``` @@ -113,7 +113,7 @@ For more information, read the [General Usage: Credentials.](../../general-usage 1. Once the pipeline has finished running, you can verify that everything loaded correctly by using the following command: - ```bash + ```sh dlt pipeline show ``` @@ -131,13 +131,14 @@ For more information, read the guide on [how to run a pipeline](../../walkthroug This function retrives data from Strapi. -```python +```py @dlt.source def strapi_source( endpoints: List[str], api_secret_key: str = dlt.secrets.value, domain: str = dlt.secrets.value, ) -> Iterable[DltResource]: + ... ``` `endpoints`: Collections to fetch data from. @@ -155,7 +156,7 @@ verified source. 1. Configure the pipeline by specifying the pipeline name, destination, and dataset as follows: - ```python + ```py pipeline = dlt.pipeline( pipeline_name="strapi", # Use a custom name if desired destination="duckdb", # Choose the appropriate destination (e.g., duckdb, redshift, post) @@ -165,7 +166,7 @@ verified source. 1. To load the specified endpoints: - ```python + ```py endpoints = ["athletes"] load_data = strapi_source(endpoints=endpoints) diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/stripe.md b/docs/website/docs/dlt-ecosystem/verified-sources/stripe.md index 0b172dc3be..118c0e6511 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/stripe.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/stripe.md @@ -56,7 +56,7 @@ To get started with your data pipeline, follow these steps: 1. Enter the following command: - ```bash + ```sh dlt init stripe_analytics duckdb ``` @@ -96,20 +96,20 @@ For more information, read the [General Usage: Credentials.](../../general-usage 1. Before running the pipeline, ensure that you have installed all the necessary dependencies by running the command: - ```bash + ```sh pip install -r requirements.txt ``` 1. 
You're now ready to run the pipeline! To get started, run the following command: - ```bash + ```sh python stripe_analytics_pipeline.py ``` 1. Once the pipeline has finished running, you can verify that everything loaded correctly by using the following command: - ```bash + ```sh dlt pipeline show ``` @@ -127,7 +127,7 @@ For more information, read the guide on [how to run a pipeline](../../walkthroug You can write your own pipelines to load data to a destination using this verified source. However, it is important to note is how the `ENDPOINTS` and `INCREMENTAL_ENDPOINTS` tuples are defined in `stripe_analytics/settings.py`. -```python +```py # The most popular Stripe API's endpoints ENDPOINTS = ("Subscription", "Account", "Coupon", "Customer", "Product", "Price") # Possible incremental endpoints @@ -140,7 +140,7 @@ INCREMENTAL_ENDPOINTS = ("Event", "Invoice", "BalanceTransaction") This function retrieves data from the Stripe API for the specified endpoint: -```python +```py @dlt.source def stripe_source( endpoints: Tuple[str, ...] = ENDPOINTS, @@ -148,6 +148,7 @@ def stripe_source( start_date: Optional[DateTime] = None, end_date: Optional[DateTime] = None, ) -> Iterable[DltResource]: + ... ``` - `endpoints`: Tuple containing endpoint names. @@ -159,7 +160,7 @@ def stripe_source( This source loads data in 'append' mode from incremental endpoints. -```python +```py @dlt.source def incremental_stripe_source( endpoints: Tuple[str, ...] = INCREMENTAL_ENDPOINTS, @@ -167,6 +168,7 @@ def incremental_stripe_source( initial_start_date: Optional[DateTime] = None, end_date: Optional[DateTime] = None, ) -> Iterable[DltResource]: + ... ``` `endpoints`: Tuple containing incremental endpoint names. @@ -183,9 +185,10 @@ For more information, read the [General Usage: Incremental loading](../../genera This function loads a dictionary with calculated metrics, including MRR and Churn rate, along with the current timestamp. -```python +```py @dlt.resource(name="Metrics", write_disposition="append", primary_key="created") def metrics_resource() -> Iterable[TDataItem]: + ... ``` Abrevations MRR and Churn rate are as follows: @@ -203,7 +206,7 @@ verified source. 1. Configure the pipeline by specifying the pipeline name, destination, and dataset as follows: - ```python + ```py pipeline = dlt.pipeline( pipeline_name="stripe_pipeline", # Use a custom name if desired destination="duckdb", # Choose the appropriate destination (e.g., duckdb, redshift, post) @@ -213,7 +216,7 @@ verified source. 1. To load endpoints like "Plan" and "Charge" in replace mode, retrieve all data for the year 2022: - ```python + ```py source_single = stripe_source( endpoints=("Plan", "Charge"), start_date=datetime(2022, 1, 1), @@ -225,7 +228,7 @@ verified source. 1. To load data from the "Invoice" endpoint, which has static data, using incremental loading: - ```python + ```py # Load all data on the first run that was created after start_date and before end_date source_incremental = incremental_stripe_source( endpoints=("Invoice", ), @@ -239,7 +242,7 @@ verified source. 1. To load data created after December 31, 2022, adjust the data range for stripe_source to prevent redundant loading. For incremental_stripe_source, the initial_start_date will auto-update to the last loaded date from the previous run. - ```python + ```py source_single = stripe_source( endpoints=("Plan", "Charge"), start_date=datetime(2022, 12, 31), @@ -254,7 +257,7 @@ verified source. 1. 
To load important metrics and store them in database: - ```python + ```py # Event is an endpoint with uneditable data, so we can use 'incremental_stripe_source'. source_event = incremental_stripe_source(endpoints=("Event",)) # Subscription is an endpoint with editable data, use stripe_source. diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/workable.md b/docs/website/docs/dlt-ecosystem/verified-sources/workable.md index 8701db7db8..dc4c1936f9 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/workable.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/workable.md @@ -65,7 +65,7 @@ To get started with your data pipeline, follow these steps: 1. Enter the following command: - ```bash + ```sh dlt init workable duckdb ``` @@ -117,20 +117,20 @@ For more information, read the [General Usage: Credentials.](../../general-usage 1. Before running the pipeline, ensure that you have installed all the necessary dependencies by running the command: - ```bash + ```sh pip install -r requirements.txt ``` 1. You're now ready to run the pipeline! To get started, run the following command: - ```bash + ```sh python workable_pipeline.py ``` 1. Once the pipeline has finished running, you can verify that everything loaded correctly by using the following command: - ```bash + ```sh dlt pipeline show ``` @@ -146,7 +146,7 @@ For more information, read the guide on [how to run a pipeline](../../walkthroug Note the default definitions of DEFAULT_ENDPOINTS and DEFAULT_DETAILS in "workable/settings.py". -```python +```py DEFAULT_ENDPOINTS = ("members", "recruiters", "stages", "requisitions", "jobs", "custom_attributes","events") DEFAULT_DETAILS = { @@ -164,7 +164,7 @@ endpoints allow incremental 'merge' mode loading. This source returns a sequence of dltResources that correspond to the endpoints. -```python +```py @dlt.source(name="workable") def workable_source( access_token: str = dlt.secrets.value, @@ -172,6 +172,7 @@ def workable_source( start_date: Optional[DateTime] = None, load_details: bool = False, ) -> Iterable[DltResource]: + ... ``` `access_token`: Authenticate the Workable API using the token specified in ".dlt/secrets.toml". @@ -187,13 +188,14 @@ def workable_source( This function is used to retrieve "candidates" endpoints. -```python +```py @dlt.resource(name="candidates", write_disposition="merge", primary_key="id") def candidates_resource( updated_at: Optional[Any] = dlt.sources.incremental( "updated_at", initial_value=workable.start_date_iso ) ) -> Iterable[TDataItem]: + ... ``` `updated_at`: Uses the dlt.sources.incremental method. Defaults to the function's start_date or Jan @@ -211,7 +213,7 @@ To create your data pipeline using single loading and 1. Configure the pipeline by specifying the pipeline name, destination, and dataset as follows: - ```python + ```py pipeline = dlt.pipeline( pipeline_name="workable", # Use a custom name if desired destination="duckdb", # Choose the appropriate destination (e.g., duckdb, redshift, post) @@ -221,7 +223,7 @@ To create your data pipeline using single loading and 1. To load all data: - ```python + ```py load_data = workable_source() load_info = pipeline.run(load_data) print(load_info) @@ -232,7 +234,7 @@ To create your data pipeline using single loading and 1. 
To load data from a specific date, including dependent endpoints: - ```python + ```py load_data = workable_source(start_date=datetime(2022, 1, 1), load_details=True) load_info = pipeline.run(load_data) print(load_info) @@ -244,8 +246,8 @@ To create your data pipeline using single loading and 1. To load custom endpoints “candidates” and “members”: - ```python - load_info = pipeline.run(load_data.with_resources("candidates", "members") + ```py + load_info = pipeline.run(load_data.with_resources("candidates", "members")) # print the information on data that was loaded print(load_info) ``` @@ -255,7 +257,7 @@ To create your data pipeline using single loading and 1. To load data from the “jobs” endpoint and its dependent endpoints like "activities" and "application_form": - ```python + ```py load_data = workable_source(start_date=datetime(2022, 2, 1), load_details=True) # Set the load_details as True to load all the dependent endpoints. load_info = pipeline.run(load_data.with_resources("jobs","jobs_activities","jobs_application_form")) diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/zendesk.md b/docs/website/docs/dlt-ecosystem/verified-sources/zendesk.md index 234483dca0..11567306d9 100644 --- a/docs/website/docs/dlt-ecosystem/verified-sources/zendesk.md +++ b/docs/website/docs/dlt-ecosystem/verified-sources/zendesk.md @@ -84,7 +84,7 @@ Here's a summarized version: 1. To get full token using the client id obtained above, you can follow the [instructions here.](https://developer.zendesk.com/documentation/ticketing/working-with-oauth/creating-and-using-oauth-tokens-with-the-api/#creating-the-access-token) - ```curl + ```sh curl https://{subdomain}.zendesk.com/api/v2/oauth/tokens.json \ -X POST \ -v -u {email_address}:{password} \ @@ -129,7 +129,7 @@ To generate Zendesk chat OAuth token, please refer to this 1. Record the "CLIENT_ID" and "SUBDOMAIN". 1. Format the below URL with your own CLIENT_ID and SUBDOMAIN, paste it into a new browser tab, and press Enter. - ```bash + ```sh https://www.zopim.com/oauth2/authorizations/new?response_type=token&client_id=CLIENT_ID&scope=read%20write&subdomain=SUBDOMAIN ``` 1. The call will be made, possibly asking you to log in and select 'Allow' to generate the token. @@ -160,7 +160,7 @@ To get started with your data pipeline, follow these steps: 1. Enter the following command: - ```bash + ```sh dlt init zendesk duckdb ``` @@ -183,7 +183,7 @@ For more information, read the guide on [how to add a verified source.](../../wa information securely, like access tokens. Keep this file safe. Here's its format for service account authentication: - ```python + ```py #Zendesk support credentials [sources.zendesk.credentials] subdomain = "subdomain" # Zendesk subdomain @@ -215,20 +215,20 @@ For more information, read the [General Usage: Credentials.](../../general-usage 1. Before running the pipeline, ensure that you have installed all the necessary dependencies by running the command: - ```bash + ```sh pip install -r requirements.txt ``` 1. You're now ready to run the pipeline! To get started, run the following command: - ```bash + ```sh python zendesk_pipeline.py ``` 1. Once the pipeline has finished running, you can verify that everything loaded correctly by using the following command: - ```bash + ```sh dlt pipeline show ``` @@ -246,13 +246,14 @@ For more information, read the guide on [how to run a pipeline](../../walkthroug This function retrieves data from Zendesk Talk for phone calls and voicemails. 
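A minimal sketch of loading this source on its own, assuming it was scaffolded with `dlt init zendesk duckdb` and credentials are configured in `.dlt/secrets.toml`; the pipeline and dataset names are illustrative:

```py
import dlt
from zendesk import zendesk_talk  # assumed import path inside the scaffolded verified source

pipeline = dlt.pipeline(
    pipeline_name="dlt_zendesk_pipeline", destination="duckdb", dataset_name="sample_zendesk_data"
)

# Calls and voicemails are loaded from the default start date; credentials come from .dlt/secrets.toml
print(pipeline.run(zendesk_talk()))
```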
-```python +```py @dlt.source(max_table_nesting=2) def zendesk_talk( credentials: TZendeskCredentials = dlt.secrets.value, start_date: Optional[TAnyDateTime] = DEFAULT_START_DATE, end_date: Optional[TAnyDateTime] = None, ) -> Iterable[DltResource]: + ... ``` `credentials`: Authentication credentials. @@ -266,13 +267,14 @@ run. This function loads data from Zendesk talk endpoint. -```python +```py def talk_resource( zendesk_client: ZendeskAPIClient, talk_endpoint_name: str, talk_endpoint: str, pagination_type: PaginationType, ) -> Iterator[TDataItem]: + ... ``` `zendesk_client`: An instance of ZendeskAPIClient for making API calls to Zendesk Talk. @@ -305,7 +307,7 @@ verified source. 1. Configure the pipeline by specifying the pipeline name, destination, and dataset as follows: - ```python + ```py pipeline = dlt.pipeline( pipeline_name="dlt_zendesk_pipeline", # Use a custom name if desired destination="duckdb", # Choose the appropriate destination (e.g., duckdb, redshift, post) @@ -315,7 +317,7 @@ verified source. 1. To load data related to support, talk and chat: - ```python + ```py #zendesk support source function data_support = zendesk_support(load_all=True) # zendesk chat source function @@ -324,23 +326,23 @@ verified source. data_talk = zendesk_talk() # run pipeline with all 3 sources info = pipeline.run([data_support,data_chat,data_talk]) - return info + print(info) ``` 1. To load data related to support, chat and talk in incremental mode: - ```python - pipeline = dlt.pipeline( - pipeline_name="dlt_zendesk_pipeline", # Use a custom name if desired - destination="duckdb", # Choose the appropriate destination (e.g., duckdb, redshift, post) - full_refresh = Fasle - dataset_name="sample_zendesk_data" # Use a custom name if desired + ```py + pipeline = dlt.pipeline( + pipeline_name="dlt_zendesk_pipeline", # Use a custom name if desired + destination="duckdb", # Choose the appropriate destination (e.g., duckdb, redshift, post) + full_refresh = False, + dataset_name="sample_zendesk_data" # Use a custom name if desired ) - data = zendesk_support(load_all=True, start_date=start_date) - data_chat = zendesk_chat(start_date=start_date) - data_talk = zendesk_talk(start_date=start_date) - info = pipeline.run(data=[data, data_chat, data_talk]) - return info + data = zendesk_support(load_all=True, start_date=start_date) + data_chat = zendesk_chat(start_date=start_date) + data_talk = zendesk_talk(start_date=start_date) + info = pipeline.run(data=[data, data_chat, data_talk]) + print(info) ``` > Supports incremental loading for Support, Chat, and Talk Endpoints. By default, it fetches data @@ -350,7 +352,7 @@ verified source. 1. To load historical data in weekly ranges from Jan 1st, 2023, then switch to incremental loading for new tickets. - ```python + ```py # Load ranges of dates to load between January 1st 2023 and today min_start_date = pendulum.DateTime(year=2023, month=1, day=1).in_timezone("UTC") max_end_date = pendulum.today() diff --git a/docs/website/docs/dlt-ecosystem/visualizations/exploring-the-data.md b/docs/website/docs/dlt-ecosystem/visualizations/exploring-the-data.md index c61805423b..ffe0abd082 100644 --- a/docs/website/docs/dlt-ecosystem/visualizations/exploring-the-data.md +++ b/docs/website/docs/dlt-ecosystem/visualizations/exploring-the-data.md @@ -12,7 +12,7 @@ To do so, run the [cli command](../../reference/command-line-interface.md#show-t below with your pipeline name. 
The pipeline name is the name of the Python file where your pipeline is defined and also displayed in your terminal when loading: -```bash +```sh dlt pipeline {pipeline_name} show ``` @@ -33,7 +33,7 @@ pipeline and hide many intricacies of correctly setting up the connection to you Execute any SQL query and get results following the Python [dbapi](https://peps.python.org/pep-0249/) spec. Below we fetch data from the customers table: -```python +```py pipeline = dlt.pipeline(destination="bigquery", dataset_name="crm") with pipeline.sql_client() as client: with client.execute_query( @@ -54,7 +54,7 @@ natively (i.e. BigQuery and DuckDB), `dlt` uses the native method. Thanks to tha frames may be really fast! The example below reads GitHub reactions data from the `issues` table and counts reaction types. -```python +```py pipeline = dlt.pipeline( pipeline_name="github_pipeline", destination="duckdb", @@ -79,14 +79,14 @@ The native connection to your destination like BigQuery `Client` or DuckDB `Duck available in case you want to do anything special. Below we take the native connection to `duckdb` to get `DuckDBPyRelation` from a query: -```python +```py import dlt import duckdb pipeline = dlt.pipeline(destination="duckdb", dataset_name="github_reactions") with pipeline.sql_client() as client: conn = client.native_connection - rel = conn.sql('SELECT * FROM issues'); + rel = conn.sql('SELECT * FROM issues') rel.limit(3).show() ``` diff --git a/docs/website/docs/examples/chess_production/index.md b/docs/website/docs/examples/chess_production/index.md index d80558e745..ac305e943b 100644 --- a/docs/website/docs/examples/chess_production/index.md +++ b/docs/website/docs/examples/chess_production/index.md @@ -179,7 +179,7 @@ def load_data_with_retry(pipeline, data): :::warning To run this example you need to provide Slack incoming hook in `.dlt/secrets.toml`: -```python +```py [runtime] slack_incoming_hook="https://hooks.slack.com/services/***" ``` diff --git a/docs/website/docs/examples/google_sheets/index.md b/docs/website/docs/examples/google_sheets/index.md index 4af35f6dac..3bf3f858d8 100644 --- a/docs/website/docs/examples/google_sheets/index.md +++ b/docs/website/docs/examples/google_sheets/index.md @@ -27,7 +27,7 @@ This example is for educational purposes. For best practices, we recommend using ### Install Google client library -```shell +```sh pip install google-api-python-client ``` diff --git a/docs/website/docs/examples/nested_data/index.md b/docs/website/docs/examples/nested_data/index.md index b2b5ee2792..8a5c17604c 100644 --- a/docs/website/docs/examples/nested_data/index.md +++ b/docs/website/docs/examples/nested_data/index.md @@ -26,7 +26,7 @@ We'll learn how to: ### Install pymongo -```shell +```sh pip install pymongo>=4.3.3 ``` diff --git a/docs/website/docs/examples/pdf_to_weaviate/index.md b/docs/website/docs/examples/pdf_to_weaviate/index.md index cc2ef01e33..5b889b858d 100644 --- a/docs/website/docs/examples/pdf_to_weaviate/index.md +++ b/docs/website/docs/examples/pdf_to_weaviate/index.md @@ -14,7 +14,7 @@ import Header from '../_examples-header.md'; Additionally we'll use PyPDF2 to extract text from PDFs. 
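The extraction relies on PyPDF2; here is a minimal sketch of pulling text out of a PDF with its `PdfReader` API, using an illustrative file name:

```py
from PyPDF2 import PdfReader

reader = PdfReader("invoice.pdf")  # illustrative file name
# Join the extracted text of every page into a single string
text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(text[:200])
```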
Make sure you have it installed: -```shell +```sh pip install PyPDF2 ``` diff --git a/docs/website/docs/examples/qdrant_zendesk/index.md b/docs/website/docs/examples/qdrant_zendesk/index.md index 7920619b26..b71840073b 100644 --- a/docs/website/docs/examples/qdrant_zendesk/index.md +++ b/docs/website/docs/examples/qdrant_zendesk/index.md @@ -28,7 +28,7 @@ First, configure the destination credentials for [Qdrant](https://dlthub.com/doc Next, make sure you have the following dependencies installed: -```commandline +```sh pip install qdrant-client>=1.6.9 pip install fastembed>=0.1.1 ``` @@ -170,13 +170,13 @@ response = qdrant_client.query( The query above gives stores the following results in the `response` variable: -```json +```py [QueryResponse(id='6aeacd21-b3d0-5174-97ef-5aaa59486414', embedding=None, metadata={'_dlt_id': 'Nx3wBiL29xTgaQ', '_dlt_load_id': '1700130284.002391', 'allow_attachments': True, 'allow_channelback': False, 'assignee_id': 12765072569105, 'brand_id': 12765073054225, 'created_at': '2023-09-01T11:19:25+00:00', 'custom_status_id': 12765028278545, 'description': 'I have been trying to cancel my subscription but the system won’t let me do it. Can you please help?', 'from_messaging_channel': False, 'generated_timestamp': 1693567167, 'group_id': 12765036328465, 'has_incidents': False, 'id': 12, 'is_public': True, 'organization_id': 12765041119505, 'raw_subject': 'Unable to Cancel Subscription', 'requester_id': 12765072569105, 'status': 'open', 'subject': 'Unable to Cancel Subscription', 'submitter_id': 12765072569105, 'tags': ['test1'], 'test_field': 'test1', 'ticket_form_id': 12765054772497, 'updated_at': '2023-09-01T11:19:25+00:00', 'url': 'https://d3v-dlthub.zendesk.com/api/v2/tickets/12.json', 'via__channel': 'web'}, document='', score=0.89545774), QueryResponse(id='a22189c1-70ab-5421-938b-1caae3e7d6d8', embedding=None, metadata={'_dlt_id': 'bc/xloksL89EUg', '_dlt_load_id': '1700130284.002391', 'allow_attachments': True, 'allow_channelback': False, 'assignee_id': 12765072569105, 'brand_id': 12765073054225, 'created_at': '2023-07-18T17:23:42+00:00', 'custom_status_id': 12765028278545, 'description': 'ABCDEF', 'from_messaging_channel': False, 'generated_timestamp': 1689701023, 'group_id': 12765036328465, 'has_incidents': False, 'id': 4, 'is_public': True, 'organization_id': 12765041119505, 'raw_subject': 'What is this ticket', 'requester_id': 12765072569105, 'status': 'open', 'subject': 'What is this ticket', 'submitter_id': 12765072569105, 'tags': ['test1'], 'test_field': 'test1', 'ticket_form_id': 12765054772497, 'updated_at': '2023-07-18T17:23:42+00:00', 'url': 'https://d3v-dlthub.zendesk.com/api/v2/tickets/4.json', 'via__channel': 'web'}, document='', score=0.8643349), QueryResponse(id='ce2f1c5c-41c3-56c3-a31d-2399a7a9239d', embedding=None, metadata={'_dlt_id': 'ZMuFJZo0AJxV4A', '_dlt_load_id': '1700130284.002391', 'allow_attachments': True, 'allow_channelback': False, 'assignee_id': 12765072569105, 'brand_id': 12765073054225, 'created_at': '2023-03-14T10:52:28+00:00', 'custom_status_id': 12765028278545, 'description': 'X', 'from_messaging_channel': False, 'generated_timestamp': 1696163084, 'group_id': 12765036328465, 'has_incidents': False, 'id': 2, 'is_public': True, 'priority': 'high', 'raw_subject': 'SCRUBBED', 'requester_id': 13726460510097, 'status': 'deleted', 'subject': 'SCRUBBED', 'submitter_id': 12765072569105, 'tags': [], 'ticket_form_id': 13726337882769, 'type': 'question', 'updated_at': '2023-09-01T12:10:35+00:00', 'url': 
'https://d3v-dlthub.zendesk.com/api/v2/tickets/2.json', 'via__channel': 'web'}, document='', score=0.8467072)] ``` To get a closer look at what the Zendesk ticket was, and how dlt dealt with it, we can index into the metadata of the first `QueryResponse` object: -```json lines +```py {'_dlt_id': 'Nx3wBiL29xTgaQ', '_dlt_load_id': '1700130284.002391', 'allow_attachments': True, diff --git a/docs/website/docs/general-usage/credentials/config_providers.md b/docs/website/docs/general-usage/credentials/config_providers.md index 860370d38a..cf23b5d5dc 100644 --- a/docs/website/docs/general-usage/credentials/config_providers.md +++ b/docs/website/docs/general-usage/credentials/config_providers.md @@ -38,7 +38,7 @@ providers. ### Example -```python +```py @dlt.source def google_sheets( spreadsheet_id=dlt.config.value, @@ -133,7 +133,7 @@ current Working Directory**. Example: If your working directory is `my_dlt_project` and your project has the following structure: -``` +```text my_dlt_project: | pipelines/ diff --git a/docs/website/docs/general-usage/credentials/config_specs.md b/docs/website/docs/general-usage/credentials/config_specs.md index 07e56b3e14..e93e1c466a 100644 --- a/docs/website/docs/general-usage/credentials/config_specs.md +++ b/docs/website/docs/general-usage/credentials/config_specs.md @@ -21,7 +21,7 @@ service account credentials, while `ConnectionStringCredentials` handles databas As an example, let's use `ConnectionStringCredentials` which represents a database connection string. -```python +```py from dlt.sources.credentials import ConnectionStringCredentials @dlt.source @@ -60,17 +60,17 @@ dsn.password="loader" You can explicitly provide credentials in various forms: -```python +```py query("SELECT * FROM customers", "postgres://loader@localhost:5432/dlt_data") # or -query("SELECT * FROM customers", {"database": "dlt_data", "username": "loader"...}) +query("SELECT * FROM customers", {"database": "dlt_data", "username": "loader"}) ``` ## Built in credentials We have some ready-made credentials you can reuse: -```python +```py from dlt.sources.credentials import ConnectionStringCredentials from dlt.sources.credentials import OAuth2Credentials from dlt.sources.credentials import GcpServiceAccountCredentials, GcpOAuthCredentials @@ -87,7 +87,7 @@ and additional query parameters. This class provides methods for parsing and generating connection strings. #### Usage -```python +```py credentials = ConnectionStringCredentials() # Set the necessary attributes @@ -117,7 +117,7 @@ client secret, refresh token, and access token. It also allows for the addition of scopes and provides methods for client authentication. Usage: -```python +```py credentials = OAuth2Credentials( client_id="CLIENT_ID", client_secret="CLIENT_SECRET", @@ -153,7 +153,7 @@ This class provides methods to retrieve native credentials for Google clients. - You may just pass the `service.json` as string or dictionary (in code and via config providers). - Or default credentials will be used. -```python +```py credentials = GcpServiceAccountCredentials() # Parse a native value (ServiceAccountCredentials) # Accepts a native value, which can be either an instance of ServiceAccountCredentials @@ -163,7 +163,7 @@ native_value = {"private_key": ".."} # or "path/to/services.json" credentials.parse_native_representation(native_value) ``` or more preferred use: -```python +```py import dlt from dlt.sources.credentials import GcpServiceAccountCredentials @@ -204,7 +204,7 @@ serialized OAuth client secrets JSON. 
This class provides methods for authentication and obtaining access tokens. ##### Usage -```python +```py oauth_credentials = GcpOAuthCredentials() # Accepts a native value, which can be either an instance of GoogleOAuth2Credentials @@ -214,7 +214,7 @@ native_value_oauth = {"client_secret": ...} oauth_credentials.parse_native_representation(native_value_oauth) ``` or more preferred use: -```python +```py import dlt from dlt.sources.credentials import GcpOAuthCredentials @@ -277,7 +277,7 @@ It inherits the ability to manage default credentials and extends it with method for handling partial credentials and converting credentials to a botocore session. #### Usage -```python +```py credentials = AwsCredentials() # Set the necessary attributes credentials.aws_access_key_id = "ACCESS_KEY_ID" @@ -285,7 +285,7 @@ credentials.aws_secret_access_key = "SECRET_ACCESS_KEY" credentials.region_name = "us-east-1" ``` or -```python +```py # Imports an external boto3 session and sets the credentials properties accordingly. import botocore.session @@ -295,7 +295,7 @@ credentials.parse_native_representation(session) print(credentials.aws_access_key_id) ``` or more preferred use: -```python +```py @dlt.source def aws_readers( bucket_url: str = dlt.config.value, @@ -340,14 +340,14 @@ handling partial credentials and converting credentials to a format suitable for interacting with Azure Blob Storage using the adlfs library. #### Usage -```python +```py credentials = AzureCredentials() # Set the necessary attributes credentials.azure_storage_account_name = "ACCOUNT_NAME" credentials.azure_storage_account_key = "ACCOUNT_KEY" ``` or more preferred use: -```python +```py @dlt.source def azure_readers( bucket_url: str = dlt.config.value, @@ -388,7 +388,7 @@ decorated function. Example: -```python +```py @dlt.source def zen_source(credentials: Union[ZenApiKeyCredentials, ZenEmailCredentials, str] = dlt.secrets.value, some_option: bool = False): # depending on what the user provides in config, ZenApiKeyCredentials or ZenEmailCredentials will be injected in `credentials` argument @@ -432,7 +432,7 @@ This is used a lot in the `dlt` core and may become useful for complicated sourc In fact, for each decorated function a spec is synthesized. In case of `google_sheets` following class is created: -```python +```py from dlt.sources.config import configspec, with_config @configspec diff --git a/docs/website/docs/general-usage/credentials/configuration.md b/docs/website/docs/general-usage/credentials/configuration.md index 9b2d392883..ec8e5fe32a 100644 --- a/docs/website/docs/general-usage/credentials/configuration.md +++ b/docs/website/docs/general-usage/credentials/configuration.md @@ -25,7 +25,7 @@ When done right you'll be able to run the same pipeline script during developmen In the example below, the `google_sheets` source function is used to read selected tabs from Google Sheets. It takes several arguments that specify the spreadsheet, the tab names and the Google credentials to be used when extracting data. -```python +```py @dlt.source def google_sheets( spreadsheet_id=dlt.config.value, @@ -68,14 +68,14 @@ You are free to call the function above as usual and pass all the arguments in t Instead let `dlt` to do the work and leave it to [injection mechanism](#injection-mechanism) that looks for function arguments in the config files or environment variables and adds them to your explicit arguments during a function call. Below are two most typical examples: 1. 
Pass spreadsheet id and tab names in the code, inject credentials from the secrets: - ```python + ```py data_source = google_sheets("23029402349032049", ["tab1", "tab2"]) ``` `credentials` value will be injected by the `@source` decorator (e.g. from `secrets.toml`). `spreadsheet_id` and `tab_names` take values from the call arguments. 2. Inject all the arguments from config / secrets - ```python + ```py data_source = google_sheets() ``` `credentials` value will be injected by the `@source` decorator (e.g. from **secrets.toml**). @@ -97,16 +97,16 @@ Where do the configs and secrets come from? By default, `dlt` looks in two **con Secrets in **.dlt/secrets.toml**. `dlt` will look for `credentials`, ```toml [credentials] - client_email = - private_key = - project_id = + client_email = "" + private_key = "" + project_id = "" ``` Note that **credentials** will be evaluated as dictionary containing **client_email**, **private_key** and **project_id** as keys. It is standard TOML behavior. - [Environment Variables](config_providers#environment-provider): - ```python - CREDENTIALS= - SPREADSHEET_ID=1HhWHjqouQnnCIZAFa2rL6vT91YRN8aIhts22SUUR580 - TAB_NAMES=tab1,tab2 + ```toml + CREDENTIALS="" + SPREADSHEET_ID="1HhWHjqouQnnCIZAFa2rL6vT91YRN8aIhts22SUUR580" + TAB_NAMES=["tab1", "tab2"] ``` We pass the JSON contents of `service.json` file to `CREDENTIALS` and we specify tab names as comma-delimited values. Environment variables are always in **upper case**. @@ -123,7 +123,7 @@ There are many ways you can organize your configs and secrets. The example above ### Do not hardcode secrets You should never do that. Sooner or later your private key will leak. -```python +```py # WRONG!: # provide all values directly - wrong but possible. # secret values should never be present in the code! @@ -137,7 +137,7 @@ data_source = google_sheets( ### Pass secrets in code from external providers You can get the secret values from your own providers. Below we take **credentials** for our `google_sheets` source from Airflow base hook: -```python +```py from airflow.hooks.base_hook import BaseHook # get it from airflow connections or other credential store @@ -163,7 +163,7 @@ Doing so provides several benefits: 1. You can request [built-in and custom credentials](config_specs.md) (i.e. connection strings, AWS / GCP / Azure credentials). 1. You can specify a set of possible types via `Union` i.e. OAuth or API Key authorization. -```python +```py @dlt.source def google_sheets( spreadsheet_id: str = dlt.config.value, @@ -171,7 +171,7 @@ def google_sheets( credentials: GcpServiceAccountCredentials = dlt.secrets.value, only_strings: bool = False ): - ... + ... ``` Now: @@ -189,7 +189,7 @@ In case of `GcpServiceAccountCredentials`: ## Read configs and secrets yourself `dlt.secrets` and `dlt.config` provide dictionary-like access to configuration values and secrets, respectively. -```python +```py # use `dlt.secrets` and `dlt.config` to explicitly take # those values from providers from the explicit keys data_source = google_sheets( @@ -202,14 +202,14 @@ data_source.run(destination="bigquery") ``` `dlt.config` and `dlt.secrets` behave like dictionaries from which you can request a value with any key name. `dlt` will look in all [config providers](#injection-mechanism) - TOML files, env variables etc. just like it does with the standard section layout. You can also use `dlt.config.get()` or `dlt.secrets.get()` to request value cast to a desired type. 
For example: -```python +```py credentials = dlt.secrets.get("my_section.gcp_credentials", GcpServiceAccountCredentials) ``` Creates `GcpServiceAccountCredentials` instance out of values (typically a dictionary) under **my_section.gcp_credentials** key. ### Write configs and secrets in code **dlt.config** and **dlt.secrets** can be also used as setters. For example: -```python +```py dlt.config["sheet_id"] = "23029402349032049" dlt.secrets["destination.postgres.credentials"] = BaseHook.get_connection('postgres_dsn').extra ``` @@ -263,9 +263,9 @@ Here is the simplest default layout for our `google_sheets` example. ```toml [credentials] -client_email = -private_key = -project_id = +client_email = "" +private_key = "" +project_id = "" ``` **config.toml** @@ -284,9 +284,9 @@ This makes sure that `google_sheets` source does not share any secrets and confi ```toml [sources.google_sheets.credentials] -client_email = -private_key = -project_id = +client_email = "" +private_key = "" +project_id = "" ``` **config.toml** @@ -305,9 +305,9 @@ Use this if you want to read and pass the config/secrets yourself ```toml [my_section] - [my_section.gcp_credentials] - client_email = - private_key = +[my_section.gcp_credentials] +client_email = "" +private_key = "" ``` **config.toml** @@ -316,9 +316,9 @@ Use this if you want to read and pass the config/secrets yourself [my_section] tabs=["tab1", "tab2"] - [my_section.gcp_credentials] - # I prefer to keep my project id in config file and private key in secrets - project_id = +[my_section.gcp_credentials] +# I prefer to keep my project id in config file and private key in secrets +project_id = "" ``` ### Default layout and default key lookup during injection @@ -328,7 +328,7 @@ makes it easy to configure simple cases but also provides a room for more explic complex cases i.e. having several sources with different credentials or even hosting several pipelines in the same project sharing the same config and credentials. -``` +```text pipeline_name | |-sources @@ -368,15 +368,15 @@ Example: We use the `bigquery` destination and the `google_sheets` source. They ```toml # google sheet credentials [sources.credentials] -client_email = -private_key = -project_id = +client_email = "" +private_key = "" +project_id = "" # bigquery credentials [destination.credentials] -client_email = -private_key = -project_id = +client_email = "" +private_key = "" +project_id = "" ``` Now when `dlt` looks for destination credentials, it will start with `destination.bigquery.credentials`, eliminate `bigquery` and stop at `destination.credentials`. @@ -388,21 +388,21 @@ Example: let's be even more explicit and use a full section path possible. ```toml # google sheet credentials [sources.google_sheets.credentials] -client_email = -private_key = -project_id = +client_email = "" +private_key = "" +project_id = "" # google analytics credentials [sources.google_analytics.credentials] -client_email = -private_key = -project_id = +client_email = "" +private_key = "" +project_id = "" # bigquery credentials [destination.bigquery.credentials] -client_email = -private_key = -project_id = +client_email = "" +private_key = "" +project_id = "" ``` Now we can separate credentials for different sources as well. @@ -418,18 +418,18 @@ Example: the pipeline is named `ML_sheets`. 
```toml [ML_sheets.credentials] -client_email = -private_key = -project_id = +client_email = "" +private_key = "" +project_id = "" ``` or maximum path: ```toml [ML_sheets.sources.google_sheets.credentials] -client_email = -private_key = -project_id = +client_email = "" +private_key = "" +project_id = "" ``` ### The `sources` section @@ -455,7 +455,7 @@ Now we can finally understand the `ConfigFieldMissingException`. Let's run `chess.py` example without providing the password: -``` +```sh $ CREDENTIALS="postgres://loader@localhost:5432/dlt_data" python chess.py ... dlt.common.configuration.exceptions.ConfigFieldMissingException: Following fields are missing: ['password'] in configuration with spec PostgresCredentials diff --git a/docs/website/docs/general-usage/customising-pipelines/pseudonymizing_columns.md b/docs/website/docs/general-usage/customising-pipelines/pseudonymizing_columns.md index 3f665bd0fb..ba0b13636b 100644 --- a/docs/website/docs/general-usage/customising-pipelines/pseudonymizing_columns.md +++ b/docs/website/docs/general-usage/customising-pipelines/pseudonymizing_columns.md @@ -11,7 +11,7 @@ consistently achieve the same mapping. If instead you wish to anonymize, you can replace it with a constant. In the example below, we create a dummy source with a PII column called "name", which we replace with deterministic hashes (i.e. replacing the German umlaut). -```python +```py import dlt import hashlib diff --git a/docs/website/docs/general-usage/customising-pipelines/removing_columns.md b/docs/website/docs/general-usage/customising-pipelines/removing_columns.md index 8493ffaec5..3163062ced 100644 --- a/docs/website/docs/general-usage/customising-pipelines/removing_columns.md +++ b/docs/website/docs/general-usage/customising-pipelines/removing_columns.md @@ -14,7 +14,7 @@ Let's create a sample pipeline demonstrating the process of removing a column. 1. Create a source function that creates dummy data as follows: - ```python + ```py import dlt # This function creates a dummy data source. @@ -31,7 +31,7 @@ Let's create a sample pipeline demonstrating the process of removing a column. 1. Next, create a function to filter out columns from the data before loading it into a database as follows: - ```python + ```py from typing import Dict, List, Optional def remove_columns(doc: Dict, remove_columns: Optional[List[str]] = None) -> Dict: @@ -53,7 +53,7 @@ Let's create a sample pipeline demonstrating the process of removing a column. 1. Next, declare the columns to be removed from the table, and then modify the source as follows: - ```python + ```py # Example columns to remove: remove_columns_list = ["country_code"] @@ -67,7 +67,7 @@ Let's create a sample pipeline demonstrating the process of removing a column. ``` 1. You can optionally inspect the result: - ```python + ```py for row in data_source: print(row) #{'id': 0, 'name': 'Jane Washington 0'} @@ -77,7 +77,7 @@ Let's create a sample pipeline demonstrating the process of removing a column. 1. 
At last, create a pipeline: - ```python + ```py # Integrating with a DLT pipeline pipeline = dlt.pipeline( pipeline_name='example', diff --git a/docs/website/docs/general-usage/customising-pipelines/renaming_columns.md b/docs/website/docs/general-usage/customising-pipelines/renaming_columns.md index e58dae6d9d..04e4d33b13 100644 --- a/docs/website/docs/general-usage/customising-pipelines/renaming_columns.md +++ b/docs/website/docs/general-usage/customising-pipelines/renaming_columns.md @@ -12,7 +12,7 @@ In the example below, we create a dummy source with special characters in the na function that we intend to apply to the resource to modify its output (i.e. replacing the German umlaut): `replace_umlauts_in_dict_keys`. -```python +```py import dlt # create a dummy source with umlauts (special characters) in key names (um) diff --git a/docs/website/docs/general-usage/data-enrichments/currency_conversion_data_enrichment.md b/docs/website/docs/general-usage/data-enrichments/currency_conversion_data_enrichment.md index 6b09510f68..f8bd179422 100644 --- a/docs/website/docs/general-usage/data-enrichments/currency_conversion_data_enrichment.md +++ b/docs/website/docs/general-usage/data-enrichments/currency_conversion_data_enrichment.md @@ -77,7 +77,7 @@ currency_conversion_enrichment/ 1. Here's the resource that yields the sample data as discussed above: - ```python + ```py @dlt.resource() def enriched_data_part_two(): data_enrichment_part_one = [ @@ -113,14 +113,14 @@ API token. information securely, like access tokens. Keep this file safe. Here's its format for service account authentication: - ```python + ```py [sources] api_key= "Please set me up!" #ExchangeRate-API key ``` 1. Create the `converted_amount` function as follows: - ```python + ```py # @transformer(data_from=enriched_data_part_two) def converted_amount(record): """ @@ -210,7 +210,7 @@ API token. 1. Here, we create the pipeline and use the `add_map` functionality: - ```python + ```py # Create the pipeline pipeline = dlt.pipeline( pipeline_name="data_enrichment_two", @@ -229,7 +229,7 @@ API token. To do so, you need to add the transformer decorator at the top of the `converted_amount` function. For `pipeline.run`, you can use the following code: - ```python + ```py # using fetch_average_price as a transformer function load_info = pipeline.run( enriched_data_part_two | converted_amount, @@ -246,19 +246,19 @@ API token. 1. Install necessary dependencies for the preferred [destination](../../dlt-ecosystem/destinations/), For example, duckdb: - ``` + ```sh pip install dlt[duckdb] ``` 1. Run the pipeline with the following command: - ``` + ```sh python currency_enrichment_pipeline.py ``` 1. To ensure that everything loads as expected, use the command: - ``` + ```sh dlt pipeline show ``` diff --git a/docs/website/docs/general-usage/data-enrichments/url-parser-data-enrichment.md b/docs/website/docs/general-usage/data-enrichments/url-parser-data-enrichment.md index f4578d065f..ab71d3d1d0 100644 --- a/docs/website/docs/general-usage/data-enrichments/url-parser-data-enrichment.md +++ b/docs/website/docs/general-usage/data-enrichments/url-parser-data-enrichment.md @@ -29,7 +29,7 @@ you can use any API you prefer. 
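As an aside, if you prefer not to call an external service at all, a rough local equivalent can be sketched with Python's standard library (the field names below are illustrative and differ from the API response shown next):

```py
from urllib.parse import urlparse

def parse_url_locally(url):
    # split the URL into components similar to what a parsing API returns
    parts = urlparse(url)
    return {
        "scheme": parts.scheme,
        "hostname": parts.hostname,
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }

print(parse_url_locally("https://urlparse.com/docs?page=1"))
```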
By default the URL Parse API will return a JSON response like: -```text +```json { "authority": "urlparse.com", "domain": "urlparse.com", @@ -73,7 +73,7 @@ understanding, you may explore all three enrichments sequentially in the noteboo Alternatively, to create a data enrichment pipeline, you can start by creating the following directory structure: -```python +```text url_parser_enrichment/ ├── .dlt/ │ └── secrets.toml @@ -100,41 +100,41 @@ Let's examine a synthetic dataset created for this article. It includes: Here's the resource that yields the sample data as discussed above: -```python - import dlt +```py + import dlt - @dlt.resource(write_disposition="append") - def tracked_data(): - """ - A generator function that yields a series of dictionaries, each representing - user tracking data. + @dlt.resource(write_disposition="append") + def tracked_data(): + """ + A generator function that yields a series of dictionaries, each representing + user tracking data. - This function is decorated with `dlt.resource` to integrate into the DLT (Data - Loading Tool) pipeline. The `write_disposition` parameter is set to "append" to - ensure that data from this generator is appended to the existing data in the - destination table. + This function is decorated with `dlt.resource` to integrate into the DLT (Data + Loading Tool) pipeline. The `write_disposition` parameter is set to "append" to + ensure that data from this generator is appended to the existing data in the + destination table. - Yields: - dict: A dictionary with keys 'user_id', 'device_name', and 'page_referer', - representing the user's tracking data including their device and the page - they were referred from. - """ + Yields: + dict: A dictionary with keys 'user_id', 'device_name', and 'page_referer', + representing the user's tracking data including their device and the page + they were referred from. + """ - # Sample data representing tracked user data - sample_data = [ + # Sample data representing tracked user data + sample_data = [ { "user_id": 1, "device_name": "Sony Experia XZ", "page_referer": "https://b2venture.lightning.force.com/" }, - """ - Data for other users - """ - ] - - # Yielding each user's data as a dictionary - for user_data in sample_data: - yield user_data + """ + Data for other users + """ + ] + + # Yielding each user's data as a dictionary + for user_data in sample_data: + yield user_data ``` ### 2. Create `url_parser` function @@ -143,7 +143,7 @@ We use a free service called [URL Parse API](https://urlparse.com/), to parse th need to register to use this service neither get an API key. 1. Create a `url_parser` function as follows: - ```python + ```py # @dlt.transformer(data_from=tracked_data) def url_parser(record): """ @@ -195,7 +195,7 @@ need to register to use this service neither get an API key. 1. Here, we create the pipeline and use the `add_map` functionality: - ```python + ```py # Create the pipeline pipeline = dlt.pipeline( pipeline_name="data_enrichment_three", @@ -214,7 +214,7 @@ need to register to use this service neither get an API key. do so, you need to add the transformer decorator at the top of the `url_parser` function. For `pipeline.run`, you can use the following code: - ```python + ```py # using fetch_average_price as a transformer function load_info = pipeline.run( tracked_data | url_parser, @@ -230,19 +230,19 @@ need to register to use this service neither get an API key. 1. 
Install necessary dependencies for the preferred [destination](https://dlthub.com/docs/dlt-ecosystem/destinations/), For example, duckdb: - ``` + ```sh pip install dlt[duckdb] ``` 1. Run the pipeline with the following command: - ``` + ```sh python url_enrichment_pipeline.py ``` 1. To ensure that everything loads as expected, use the command: - ``` + ```sh dlt pipeline show ``` diff --git a/docs/website/docs/general-usage/data-enrichments/user_agent_device_data_enrichment.md b/docs/website/docs/general-usage/data-enrichments/user_agent_device_data_enrichment.md index 8b33a852a8..6b07845689 100644 --- a/docs/website/docs/general-usage/data-enrichments/user_agent_device_data_enrichment.md +++ b/docs/website/docs/general-usage/data-enrichments/user_agent_device_data_enrichment.md @@ -41,7 +41,7 @@ Here's the link to the notebook: ### B. Create a pipeline Alternatively, to create a data enrichment pipeline, you can start by creating the following directory structure: -```python +```text user_device_enrichment/ ├── .dlt/ │ └── secrets.toml @@ -67,42 +67,42 @@ user_device_enrichment/ Here's the resource that yields the sample data as discussed above: - ```python - import dlt - - @dlt.resource(write_disposition="append") - def tracked_data(): - """ - A generator function that yields a series of dictionaries, each representing - user tracking data. - - This function is decorated with `dlt.resource` to integrate into the DLT (Data - Loading Tool) pipeline. The `write_disposition` parameter is set to "append" to - ensure that data from this generator is appended to the existing data in the - destination table. - - Yields: - dict: A dictionary with keys 'user_id', 'device_name', and 'page_referer', - representing the user's tracking data including their device and the page - they were referred from. - """ - - # Sample data representing tracked user data - sample_data = [ - {"user_id": 1, "device_name": "Sony Experia XZ", "page_referer": - "https://b2venture.lightning.force.com/"}, - {"user_id": 2, "device_name": "Samsung Galaxy S23 Ultra 5G", - "page_referer": "https://techcrunch.com/2023/07/20/can-dlthub-solve-the-python-library-problem-for-ai-dig-ventures-thinks-so/"}, - {"user_id": 3, "device_name": "Apple iPhone 14 Pro Max", - "page_referer": "https://dlthub.com/success-stories/freelancers-perspective/"}, - {"user_id": 4, "device_name": "OnePlus 11R", - "page_referer": "https://www.reddit.com/r/dataengineering/comments/173kp9o/ideas_for_data_validation_on_data_ingestion/"}, - {"user_id": 5, "device_name": "Google Pixel 7 Pro", "page_referer": "https://pypi.org/"}, - ] - - # Yielding each user's data as a dictionary - for user_data in sample_data: - yield user_data + ```py + import dlt + + @dlt.resource(write_disposition="append") + def tracked_data(): + """ + A generator function that yields a series of dictionaries, each representing + user tracking data. + + This function is decorated with `dlt.resource` to integrate into the DLT (Data + Loading Tool) pipeline. The `write_disposition` parameter is set to "append" to + ensure that data from this generator is appended to the existing data in the + destination table. + + Yields: + dict: A dictionary with keys 'user_id', 'device_name', and 'page_referer', + representing the user's tracking data including their device and the page + they were referred from. 
+ """ + + # Sample data representing tracked user data + sample_data = [ + {"user_id": 1, "device_name": "Sony Experia XZ", "page_referer": + "https://b2venture.lightning.force.com/"}, + {"user_id": 2, "device_name": "Samsung Galaxy S23 Ultra 5G", + "page_referer": "https://techcrunch.com/2023/07/20/can-dlthub-solve-the-python-library-problem-for-ai-dig-ventures-thinks-so/"}, + {"user_id": 3, "device_name": "Apple iPhone 14 Pro Max", + "page_referer": "https://dlthub.com/success-stories/freelancers-perspective/"}, + {"user_id": 4, "device_name": "OnePlus 11R", + "page_referer": "https://www.reddit.com/r/dataengineering/comments/173kp9o/ideas_for_data_validation_on_data_ingestion/"}, + {"user_id": 5, "device_name": "Google Pixel 7 Pro", "page_referer": "https://pypi.org/"}, + ] + + # Yielding each user's data as a dictionary + for user_data in sample_data: + yield user_data ``` ### 2. Create `fetch_average_price` function @@ -118,7 +118,7 @@ The first step is to register on [SerpAPI](https://serpapi.com/) and obtain the information securely, like access tokens. Keep this file safe. Here's its format for service account authentication: - ```python + ```py [sources] api_key= "Please set me up!" #Serp Api key. ``` @@ -126,7 +126,7 @@ The first step is to register on [SerpAPI](https://serpapi.com/) and obtain the 1. Replace the value of the `api_key`. 1. Create `fetch_average_price()` function as follows: - ```python + ```py import datetime import requests @@ -247,7 +247,7 @@ The first step is to register on [SerpAPI](https://serpapi.com/) and obtain the 1. Here, we create the pipeline and use the `add_map` functionality: - ```python + ```py # Create the pipeline pipeline = dlt.pipeline( pipeline_name="data_enrichment_one", @@ -266,7 +266,7 @@ The first step is to register on [SerpAPI](https://serpapi.com/) and obtain the do so, you need to add the transformer decorator at the top of the `fetch_average_price` function. For `pipeline.run`, you can use the following code: - ```python + ```py # using fetch_average_price as a transformer function load_info = pipeline.run( tracked_data | fetch_average_price, @@ -283,19 +283,19 @@ The first step is to register on [SerpAPI](https://serpapi.com/) and obtain the 1. Install necessary dependencies for the preferred [destination](https://dlthub.com/docs/dlt-ecosystem/destinations/), For example, duckdb: - ``` + ```sh pip install dlt[duckdb] ``` 1. Run the pipeline with the following command: - ``` + ```sh python device_enrichment_pipeline.py ``` 1. 
To ensure that everything loads as expected, use the command: - ``` + ```sh dlt pipeline show ``` diff --git a/docs/website/docs/general-usage/destination.md b/docs/website/docs/general-usage/destination.md index c20aa62d16..b45ef39f3f 100644 --- a/docs/website/docs/general-usage/destination.md +++ b/docs/website/docs/general-usage/destination.md @@ -75,7 +75,7 @@ azure_storage_account_key="storage key" ``` or via environment variables: -``` +```sh DESTINATION__FILESYSTEM__BUCKET_URL=az://dlt-azure-bucket DESTINATION__FILESYSTEM__CREDENTIALS__AZURE_STORAGE_ACCOUNT_NAME=dltdata DESTINATION__FILESYSTEM__CREDENTIALS__AZURE_STORAGE_ACCOUNT_KEY="storage key" diff --git a/docs/website/docs/general-usage/full-loading.md b/docs/website/docs/general-usage/full-loading.md index 4651d156f0..320d0664f5 100644 --- a/docs/website/docs/general-usage/full-loading.md +++ b/docs/website/docs/general-usage/full-loading.md @@ -13,7 +13,7 @@ that are not selected while performing a full load will not replace any data in To perform a full load on one or more of your resources, choose the `write_disposition='replace'` for this resource: -```python +```py p = dlt.pipeline(destination="bigquery", dataset_name="github") issues = [] reactions = ["%2B1", "-1", "smile", "tada", "thinking_face", "heart", "rocket", "eyes"] diff --git a/docs/website/docs/general-usage/incremental-loading.md b/docs/website/docs/general-usage/incremental-loading.md index 144b176332..b815512070 100644 --- a/docs/website/docs/general-usage/incremental-loading.md +++ b/docs/website/docs/general-usage/incremental-loading.md @@ -64,7 +64,7 @@ child tables. Example below loads all the GitHub events and updates them in the destination using "id" as primary key, making sure that only a single copy of event is present in `github_repo_events` table: -```python +```py @dlt.resource(primary_key="id", write_disposition="merge") def github_repo_events(): yield from _get_event_pages() @@ -72,26 +72,28 @@ def github_repo_events(): You can use compound primary keys: -```python +```py @dlt.resource(primary_key=("id", "url"), write_disposition="merge") -... +def resource(): + ... ``` By default, `primary_key` deduplication is arbitrary. You can pass the `dedup_sort` column hint with a value of `desc` or `asc` to influence which record remains after deduplication. Using `desc`, the records sharing the same `primary_key` are sorted in descending order before deduplication, making sure the record with the highest value for the column with the `dedup_sort` hint remains. `asc` has the opposite behavior. -```python +```py @dlt.resource( primary_key="id", write_disposition="merge", columns={"created_at": {"dedup_sort": "desc"}} # select "latest" record ) -... +def resource(): + ... ``` Example below merges on a column `batch_day` that holds the day for which given record is valid. Merge keys also can be compound: -```python +```py @dlt.resource(merge_key="batch_day", write_disposition="merge") def get_daily_batch(day): yield _get_batch_from_bucket(day) @@ -101,7 +103,7 @@ As with any other write disposition you can use it to load data ad hoc. Below we top reactions for `duckdb` repo. The lists have, obviously, many overlapping issues, but we want to keep just one instance of each. 
-```python +```py p = dlt.pipeline(destination="bigquery", dataset_name="github") issues = [] reactions = ["%2B1", "-1", "smile", "tada", "thinking_face", "heart", "rocket", "eyes"] @@ -117,7 +119,7 @@ Example below dispatches GitHub events to several tables by event type, keeps on by "id" and skips loading of past records using "last value" incremental. As you can see, all of this we can just declare in our resource. -```python +```py @dlt.resource(primary_key="id", write_disposition="merge", table_name=lambda i: i['type']) def github_repo_events(last_created_at = dlt.sources.incremental("created_at", "1970-01-01T00:00:00Z")): """A resource taking a stream of github events and dispatching them to tables named by event type. Deduplicates be 'id'. Loads incrementally by 'created_at' """ @@ -134,7 +136,7 @@ Each record in the destination table with the same `primary_key` or `merge_key` Deletes are propagated to any child table that might exist. For each record that gets deleted in the root table, all corresponding records in the child table(s) will also be deleted. Records in parent and child tables are linked through the `root key` that is explained in the next section. #### Example: with primary key and boolean delete column -```python +```py @dlt.resource( primary_key="id", write_disposition="merge", @@ -157,11 +159,11 @@ def resource(): ``` #### Example: with merge key and non-boolean delete column -```python +```py @dlt.resource( merge_key="id", write_disposition="merge", - columns={"deleted_at_ts": {"hard_delete": True}}} + columns={"deleted_at_ts": {"hard_delete": True}}) def resource(): # this will insert two records yield [ @@ -175,11 +177,11 @@ def resource(): ``` #### Example: with primary key and "dedup_sort" hint -```python +```py @dlt.resource( primary_key="id", write_disposition="merge", - columns={"deleted_flag": {"hard_delete": True}, "lsn": {"dedup_sort": "desc"}} + columns={"deleted_flag": {"hard_delete": True}, "lsn": {"dedup_sort": "desc"}}) def resource(): # this will insert one record (the one with lsn = 3) yield [ @@ -204,7 +206,7 @@ tables. This concept is similar to foreign key which references a parent table, set. We do not enable it everywhere because it takes storage space. Nevertheless, is some cases you may want to permanently enable root key propagation. -```python +```py pipeline = dlt.pipeline( pipeline_name='facebook_insights', destination='duckdb', @@ -243,7 +245,7 @@ Once you've figured that out, `dlt` takes care of finding maximum/minimum cursor duplicates and managing the state with last values of cursor. Take a look at GitHub example below, where we request recently created issues. -```python +```py @dlt.resource(primary_key="id") def repo_issues( access_token, @@ -280,7 +282,7 @@ In the example below we incrementally load the GitHub events, where API does not let us filter for the newest events - it always returns all of them. Nevertheless, `dlt` will load only the new items, filtering out all the duplicates and past issues. 
-```python +```py # use naming function in table name to generate separate tables for each event @dlt.resource(primary_key="id", table_name=lambda i: i['type']) # type: ignore def repo_events( @@ -309,7 +311,7 @@ and lets you select nested and complex data (including the whole data item when Example below creates last value which is a dictionary holding a max `created_at` value for each created table name: -```python +```py def by_event_type(event): last_value = None if len(event) == 1: @@ -333,7 +335,7 @@ def get_events(last_created_at = dlt.sources.incremental("$", last_value_func=by ### Using `end_value` for backfill You can specify both initial and end dates when defining incremental loading. Let's go back to our Github example: -```python +```py @dlt.resource(primary_key="id") def repo_issues( access_token, @@ -354,7 +356,7 @@ Please note that when `end_date` is specified, `dlt` **will not modify the exist To define specific ranges to load, you can simply override the incremental argument in the resource, for example: -```python +```py july_issues = repo_issues( created_at=dlt.sources.incremental( initial_value='2022-07-01T00:00:00Z', end_value='2022-08-01T00:00:00Z' @@ -399,7 +401,7 @@ The github events example is exactly such case. The results are ordered on curso In the same fashion the `row_order` can be used to **optimize backfill** so we don't continue making unnecessary API requests after the end of range is reached. For example: -```python +```py @dlt.resource(primary_key="id") def tickets( zendesk_client, @@ -432,7 +434,7 @@ incremental and exit yield loop when true. The `dlt.sources.incremental` instance provides `start_out_of_range` and `end_out_of_range` attributes which are set when the resource yields an element with a higher/lower cursor value than the initial or end values. If you do not want `dlt` to stop processing automatically and instead to handle such events yourself, do not specify `row_order`: -```python +```py @dlt.transformer(primary_key="id") def tickets( zendesk_client, @@ -472,7 +474,7 @@ deduplicate and which does not become a table hint. The same setting lets you di deduplication altogether when empty tuple is passed. Below we pass `primary_key` directly to `incremental` to disable deduplication. That overrides `delta` primary_key set in the resource: -```python +```py @dlt.resource(primary_key="delta") # disable the unique value check by passing () as primary key to incremental def some_data(last_timestamp=dlt.sources.incremental("item.ts", primary_key=())): @@ -485,7 +487,7 @@ def some_data(last_timestamp=dlt.sources.incremental("item.ts", primary_key=())) When resources are [created dynamically](source.md#create-resources-dynamically) it is possible to use `dlt.sources.incremental` definition as well. -```python +```py @dlt.source def stripe(): # declare a generator function @@ -521,7 +523,7 @@ result in `IncrementalUnboundError` exception. ### Using Airflow schedule for backfill and incremental loading When [running in Airflow task](../walkthroughs/deploy-a-pipeline/deploy-with-airflow-composer.md#2-modify-dag-file), you can opt-in your resource to get the `initial_value`/`start_value` and `end_value` from Airflow schedule associated with your DAG. Let's assume that **Zendesk tickets** resource contains a year of data with thousands of tickets. We want to backfill the last year of data week by week and then continue incremental loading daily. 
-```python +```py @dlt.resource(primary_key="id") def tickets( zendesk_client, @@ -540,7 +542,7 @@ We opt-in to Airflow scheduler by setting `allow_external_schedulers` to `True`: 2. In all other environments, the `incremental` behaves as usual, maintaining `dlt` state. Let's generate a deployment with `dlt deploy zendesk_pipeline.py airflow-composer` and customize the dag: -```python +```py @dag( schedule_interval='@weekly', start_date=pendulum.datetime(2023, 2, 1), @@ -577,7 +579,7 @@ When you enable the DAG in Airflow, it will generate several runs and start exec subsequent weekly intervals starting with `2023-02-12, 00:00:00 UTC` to `2023-02-19, 00:00:00 UTC`. You can repurpose the DAG above to start loading new data incrementally after (or during) the backfill: -```python +```py @dag( schedule_interval='@daily', start_date=pendulum.datetime(2023, 2, 1), @@ -624,7 +626,7 @@ You may force a full refresh of a `merge` and `append` pipelines: Example: -```python +```py p = dlt.pipeline(destination="bigquery", dataset_name="dataset_name") # do a full refresh p.run(merge_source(), write_disposition="replace") @@ -655,7 +657,7 @@ is loaded, the yielded resource data will be loaded at the same time with the up In the two examples below you see how the `dlt.sources.incremental` is working under the hood. -```python +```py @resource() def tweets(): # Get a last value from loaded metadata. If not exist, get None @@ -670,7 +672,7 @@ def tweets(): If we keep a list or a dictionary in the state, we can modify the underlying values in the objects, and thus we do not need to set the state back explicitly. -```python +```py @resource() def tweets(): # Get a last value from loaded metadata. If not exist, get None @@ -708,7 +710,7 @@ data twice - even if the user makes a mistake and requests the same months range In the following example, we initialize a variable with an empty list as a default: -```python +```py @dlt.resource(write_disposition="append") def players_games(chess_url, players, start_month=None, end_month=None): loaded_archives_cache = dlt.current.resource_state().setdefault("archives", []) @@ -734,7 +736,7 @@ def players_games(chess_url, players, start_month=None, end_month=None): ### Advanced state usage: tracking the last value for all search terms in Twitter API -```python +```py @dlt.resource(write_disposition="append") def search_tweets(twitter_bearer_token=dlt.secrets.value, search_terms=None, start_time=None, end_time=None, last_value=None): headers = _headers(twitter_bearer_token) diff --git a/docs/website/docs/general-usage/pipeline.md b/docs/website/docs/general-usage/pipeline.md index 095e03e96d..53eca2e59a 100644 --- a/docs/website/docs/general-usage/pipeline.md +++ b/docs/website/docs/general-usage/pipeline.md @@ -15,7 +15,7 @@ Example: This pipeline will load a list of objects into `duckdb` table with a name "three": -```python +```py import dlt pipeline = dlt.pipeline(destination="duckdb", dataset_name="sequence") @@ -53,7 +53,7 @@ Arguments: Example: This pipeline will load the data the generator `generate_rows(10)` produces: -```python +```py import dlt def generate_rows(nr): @@ -110,7 +110,7 @@ pipeline run is progressing. `dlt` supports 4 progress monitors out of the box: You pass the progress monitor in `progress` argument of the pipeline. 
You can use a name from the list above as in the following example: -```python +```py # create a pipeline loading chess data that dumps # progress to stdout each 10 seconds (the default) pipeline = dlt.pipeline( @@ -123,7 +123,7 @@ pipeline = dlt.pipeline( You can fully configure the progress monitor. See two examples below: -```python +```py # log each minute to Airflow task logger ti = get_current_context()["ti"] pipeline = dlt.pipeline( @@ -134,7 +134,7 @@ pipeline = dlt.pipeline( ) ``` -```python +```py # set tqdm bar color to yellow pipeline = dlt.pipeline( pipeline_name="chess_pipeline", diff --git a/docs/website/docs/general-usage/resource.md b/docs/website/docs/general-usage/resource.md index 9b8d45982d..e2e95d937f 100644 --- a/docs/website/docs/general-usage/resource.md +++ b/docs/website/docs/general-usage/resource.md @@ -19,7 +19,7 @@ Commonly used arguments: Example: -```python +```py @dlt.resource(name='table_name', write_disposition='replace') def generate_rows(): for i in range(10): @@ -32,7 +32,7 @@ def source_name(): To get the data of a resource, we could do: -```python +```py for row in generate_rows(): print(row) @@ -57,7 +57,7 @@ accepts following arguments: `dlt` that column `tags` (containing a list of tags) in `user` table should have type `complex` which means that it will be loaded as JSON/struct and not as child table. - ```python + ```py @dlt.resource(name="user", columns={"tags": {"data_type": "complex"}}) def get_users(): ... @@ -82,7 +82,7 @@ You can alternatively use a [Pydantic](https://pydantic-docs.helpmanual.io/) mod For example: -```python +```py from pydantic import BaseModel @@ -119,7 +119,7 @@ Things to note: You can override this by configuring the Pydantic model -```python +```py from typing import ClassVar from dlt.common.libs.pydantic import DltConfig @@ -146,7 +146,7 @@ argument and the `table_name` string as a return value. For example, a resource that loads GitHub repository events wants to send `issue`, `pull request`, and `comment` events to separate tables. The type of the event is in the "type" field. -```python +```py # send item to a table with name item["type"] @dlt.resource(table_name=lambda event: event['type']) def repo_events() -> Iterator[TDataItems]: @@ -154,13 +154,13 @@ def repo_events() -> Iterator[TDataItems]: # the `table_schema` method gets table schema generated by a resource and takes optional # data item to evaluate dynamic hints -print(repo_events().table_schema({"type": "WatchEvent", id=...})) +print(repo_events().table_schema({"type": "WatchEvent", id:...})) ``` In more advanced cases, you can dispatch data to different tables directly in the code of the resource function: -```python +```py @dlt.resource def repo_events() -> Iterator[TDataItems]: # mark the "item" to be sent to table with name item["type"] @@ -172,7 +172,7 @@ def repo_events() -> Iterator[TDataItems]: You can add arguments to your resource functions like to any other. Below we parametrize our `generate_rows` resource to generate the number of rows we request: -```python +```py @dlt.resource(name='table_name', write_disposition='replace') def generate_rows(nr): for i in range(nr): @@ -195,7 +195,7 @@ that returns a list of objects (i.e. users) in one endpoint and user details in with this by declaring a resource that obtains a list of users and another resource that receives items from the list and downloads the profiles. 
-```python +```py @dlt.resource(write_disposition="replace") def users(limit=None): for u in _get_users(limit): @@ -215,7 +215,7 @@ pipeline.run(user_details) ``` In the example above, `user_details` will receive data from default instance of `users` resource (with `limit` set to `None`). You can also use **pipe |** operator to bind resources dynamically -```python +```py # you can be more explicit and use a pipe operator. # with it you can create dynamic pipelines where the dependencies # are set at run time and resources are parametrized i.e. @@ -225,7 +225,7 @@ pipeline.run(users(limit=100) | user_details) :::tip Transformers are allowed not only to **yield** but also to **return** values and can decorate **async** functions and [**async generators**](../reference/performance.md#extract). Below we decorate an async function and request details on two pokemons. Http calls are made in parallel via httpx library. -```python +```py import dlt import httpx @@ -245,7 +245,7 @@ print(list([1,2] | pokemon())) A standalone resource is defined on a function that is top level in a module (not inner function) that accepts config and secrets values. Additionally if `standalone` flag is specified, the decorated function signature and docstring will be preserved. `dlt.resource` will just wrap the decorated function and user must call the wrapper to get the actual resource. Below we declare a `filesystem` resource that must be called before use. -```python +```py @dlt.resource(standalone=True) def filesystem(bucket_url=dlt.config.value): """list and yield files in `bucket_url`""" @@ -256,7 +256,7 @@ pipeline.run(filesystem("s3://my-bucket/reports"), table_name="reports") ``` Standalone may have dynamic name that depends on the arguments passed to the decorated function. For example:: -```python +```py @dlt.resource(standalone=True, name=lambda args: args["stream_name"]) def kinesis(stream_name: str): ... @@ -271,7 +271,7 @@ You can extract multiple resources in parallel threads or with async IO. To enable this for a sync resource you can set the `parallelized` flag to `True` in the resource decorator: -```python +```py @dlt.resource(parallelized=True) def get_users(): for u in _get_users(): @@ -288,7 +288,7 @@ pipeline.run(get_users(), get_orders()) Async generators are automatically extracted concurrently with other resources: -```python +```py @dlt.resource async def get_users(): async for u in _get_users(): # Assuming _get_users is an async generator @@ -317,7 +317,7 @@ so: Here's our resource: -```python +```py import dlt @dlt.resource(write_disposition="replace") @@ -330,7 +330,7 @@ def users(): Here's our script that defines transformations and loads the data: -```python +```py from pipedrive import users def anonymize_user(user_data): @@ -351,7 +351,7 @@ example data and test your transformations etc. In order to do that, you limit h be yielded by a resource by calling `resource.add_limit` method. In the example below we load just 10 first items from and infinite counter - that would otherwise never end. -```python +```py r = dlt.resource(itertools.count(), name="infinity").add_limit(10) assert list(r) == list(range(10)) ``` @@ -375,7 +375,7 @@ that will keep just one updated record per `user_id`. 
It also adds ["last value" incremental loading](incremental-loading.md#incremental_loading-with-last-value) on `created_at` column to prevent requesting again the already loaded records: -```python +```py tables = sql_database() tables.users.apply_hints( write_disposition="merge", @@ -386,7 +386,7 @@ pipeline.run(tables) ``` To just change a name of a table to which resource will load data, do the following: -```python +```py tables = sql_database() tables.users.table_name = "other_users" ``` @@ -398,7 +398,7 @@ with the existing schema in the same way `apply_hints` method above works. There should avoid lengthy operations (ie. reflecting database tables) during creation of the DAG so it is better do do it when DAG executes. You may also emit partial hints (ie. precision and scale for decimal types) for column to help `dlt` type inference. -```python +```py @dlt.resource def sql_table(credentials, schema, table): # create sql alchemy engine @@ -432,7 +432,7 @@ You can emit columns as Pydantic model and use dynamic hints (ie. lambda for tab ### Duplicate and rename resources There are cases when you your resources are generic (ie. bucket filesystem) and you want to load several instances of it (ie. files from different folders) to separate tables. In example below we use `filesystem` source to load csvs from two different folders into separate tables: -```python +```py @dlt.resource(standalone=True) def filesystem(bucket_url): # list and yield files in bucket_url @@ -463,7 +463,7 @@ You can pass individual resources or list of resources to the `dlt.pipeline` obj loaded outside the source context, will be added to the [default schema](schema.md) of the pipeline. -```python +```py @dlt.resource(name='table_name', write_disposition='replace') def generate_rows(nr): for i in range(nr): @@ -485,6 +485,6 @@ To do a full refresh of an `append` or `merge` resources you temporarily change disposition to replace. You can use `apply_hints` method of a resource or just provide alternative write disposition when loading: -```python +```py p.run(merge_source(), write_disposition="replace") ``` diff --git a/docs/website/docs/general-usage/schema-contracts.md b/docs/website/docs/general-usage/schema-contracts.md index 764b565beb..1b5e67357a 100644 --- a/docs/website/docs/general-usage/schema-contracts.md +++ b/docs/website/docs/general-usage/schema-contracts.md @@ -49,7 +49,7 @@ The `schema_contract` argument accepts two forms: 2. **shorthand** a contract mode (string) that will be applied to all schema entities. For example setting `schema_contract` to *freeze* will expand to the full form: -```python +```py {"tables": "freeze", "columns": "freeze", "data_type": "freeze"} ``` @@ -65,7 +65,7 @@ You can change the contract on the **source** instance via `schema_contract` pro Pydantic models can be used to [define table schemas and validate incoming data](resource.md#define-a-schema-with-pydantic). You can use any model you already have. `dlt` will internally synthesize (if necessary) new models that conform with the **schema contract** on the resource. Just passing a model in `column` argument of the [dlt.resource](resource.md#define-a-schema-with-pydantic) sets a schema contract that conforms to default Pydantic behavior: -```python +```py { "tables": "evolve", "columns": "discard_value", @@ -121,10 +121,10 @@ Here's how `dlt` deals with column modes: When contract is violated in freeze mode, `dlt` raises `DataValidationError` exception. 
This exception gives access to the full context and passes the evidence to the caller. As with any other exception coming from pipeline run, it will be re-raised via `PipelineStepFailed` exception which you should catch in except: -```python +```py try: pipeline.run() -except as pip_ex: +except Exception as pip_ex: if pip_ex.step == "normalize": if isinstance(pip_ex.__context__.__context__, DataValidationError): ... @@ -195,7 +195,7 @@ def items(): def other_items(): ... -@dlt.source(schema_contract={"columns": "freeze", "data_type": "freeze"}): +@dlt.source(schema_contract={"columns": "freeze", "data_type": "freeze"}) def source(): return [items(), other_items()] diff --git a/docs/website/docs/general-usage/schema.md b/docs/website/docs/general-usage/schema.md index 7ce1d959c9..164814010d 100644 --- a/docs/website/docs/general-usage/schema.md +++ b/docs/website/docs/general-usage/schema.md @@ -149,7 +149,7 @@ Now imagine the data has changed and `id` field also contains strings ```py data = [ - {"id": 1, "human_name": "Alice"} + {"id": 1, "human_name": "Alice"}, {"id": "idx-nr-456", "human_name": "Bob"} ] ``` @@ -308,7 +308,7 @@ schema available via `dlt.current.source_schema()`. Example: -```python +```py @dlt.source def textual(nesting_level: int): # get the source schema from the `current` context diff --git a/docs/website/docs/general-usage/source.md b/docs/website/docs/general-usage/source.md index 1b3d1ce0cc..bcdd137dce 100644 --- a/docs/website/docs/general-usage/source.md +++ b/docs/website/docs/general-usage/source.md @@ -26,7 +26,7 @@ You declare source by decorating an (optionally async) function that return or y You can create resources by using `dlt.resource` as a function. In an example below we reuse a single generator function to create a list of resources for several Hubspot endpoints. -```python +```py @dlt.source def hubspot(api_key=dlt.secrets.value): @@ -59,7 +59,7 @@ If this is impractical (for example you want to reflect a database to create res You can access resources present in a source and select which of them you want to load. In case of `hubspot` resource above we could select and load "companies", "deals" and "products" resources: -```python +```py from hubspot import hubspot source = hubspot() @@ -73,7 +73,7 @@ pipeline.run(source.with_resources("companies", "deals")) Resources can be individually accessed and selected: -```python +```py # resources are accessible as attributes of a source for c in source.companies: # enumerate all data in companies resource print(c) @@ -89,7 +89,7 @@ source.deals.selected = False You can modify and filter data in resources, for example if we want to keep only deals after certain date: -```python +```py source.deals.add_filter(lambda deal: deal["created_at"] > yesterday) ``` @@ -103,7 +103,7 @@ You can easily get your test dataset in a few minutes, when otherwise you'd need the full loading to complete. Below we limit the `pipedrive` source to just get 10 pages of data from each endpoint. Mind that the transformers will be evaluated fully: -```python +```py from pipedrive import pipedrive_source pipeline = dlt.pipeline(pipeline_name='pipedrive', destination='duckdb', dataset_name='pipedrive_data') @@ -121,7 +121,7 @@ declare a new [transformer that takes the data from](resource.md#feeding-data-from-one-resource-into-another) `deals` resource and add it to the source. 
-```python
+```py
import dlt
from hubspot import hubspot
@@ -140,11 +140,11 @@ source.resources.add(source.deals | deal_scores)
pipeline.run(source)
```
You can also set the resources in the source as follows:
-```python
+```py
source.deal_scores = source.deals | deal_scores
```
or
-```python
+```py
source.resources["deal_scores"] = source.deals | deal_scores
```
:::note
@@ -156,7 +156,7 @@ When adding resource to the source, `dlt` clones the resource so your existing i
You can limit how deep `dlt` goes when generating child tables. By default, the library will descend and generate child tables for all nested lists, without limit.
-```python
+```py
@dlt.source(max_table_nesting=1)
def mongo_db():
    ...
@@ -172,7 +172,7 @@ tables of child tables). Typical settings:
You can achieve the same effect after the source instance is created:
-```python
+```py
from mongo_db import mongo_db
source = mongo_db()
@@ -202,7 +202,7 @@ You are also free to decompose a single source into several ones. For example, y
down a 50-table copy job into an Airflow DAG with high parallelism to load the data faster. To do so, you could get the list of resources as:
-```python
+```py
# get a list of resources' names
resource_list = sql_source().resources.keys()
@@ -216,12 +216,12 @@ for res in resource_list:
You can temporarily change the "write disposition" to `replace` on all (or selected) resources within a source to force a full refresh:
-```python
+```py
p.run(merge_source(), write_disposition="replace")
```
With selected resources:
-```python
+```py
p.run(tables.with_resources("users"), write_disposition="replace")
```
diff --git a/docs/website/docs/general-usage/state.md b/docs/website/docs/general-usage/state.md
index 23625db27c..0ab2b8a658 100644
--- a/docs/website/docs/general-usage/state.md
+++ b/docs/website/docs/general-usage/state.md
@@ -15,7 +15,7 @@ You read and write the state in your resources. Below we use the state to create
game archives which we then use to [prevent requesting duplicates](incremental-loading.md#advanced-state-usage-storing-a-list-of-processed-entities).
-```python
+```py
@dlt.resource(write_disposition="append")
def players_games(chess_url, player, start_month=None, end_month=None):
    # create or request a list of archives from resource scoped state
diff --git a/docs/website/docs/getting-started.md b/docs/website/docs/getting-started.md
index cd121b0ad5..ecaa78c949 100644
--- a/docs/website/docs/getting-started.md
+++ b/docs/website/docs/getting-started.md
@@ -20,13 +20,13 @@ Let's get started!
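For orientation, the quick start script referenced in the steps below boils down to something like the following minimal sketch. The inline `data` list and the `users` table name are illustrative assumptions; the pipeline, destination and dataset names are taken from the example output shown further down:
```py
import dlt

# sample data to load; in practice this could come from an API or any other producer
data = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]

pipeline = dlt.pipeline(
    pipeline_name="quick_start",  # matches the pipeline name in the output below
    destination="duckdb",
    dataset_name="mydata",
)

# "users" is an illustrative table name, not mandated by the tutorial
load_info = pipeline.run(data, table_name="users")
print(load_info)
```
Running a script like this creates a local DuckDB file and prints a load summary similar to the output shown below.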
Install dlt using `pip`:
-```bash
+```sh
pip install -U dlt
```
The command above installs (or upgrades) the library core. In the example below we use DuckDB as a destination, so let's add the `duckdb` dependency:
-```bash
+```sh
pip install "dlt[duckdb]"
```
@@ -63,13 +63,13 @@ When you look at the code above, you can see that we:
Save this Python script with the name `quick_start_pipeline.py` and run the following command:
-```bash
+```sh
python quick_start_pipeline.py
```
The output should look like:
-```bash
+```sh
Pipeline quick_start completed in 0.59 seconds
1 load package(s) were loaded to destination duckdb and into dataset mydata
The duckdb destination used duckdb:////home/user-name/quick_start/quick_start.duckdb location to store data
@@ -82,13 +82,13 @@ Load package 1692364844.460054 is LOADED and contains no failed jobs
To allow a sneak peek and basic discovery you can take advantage of the [built-in integration with Streamlit](reference/command-line-interface#show-tables-and-data-in-the-destination):
-```bash
+```sh
dlt pipeline quick_start show
```
**quick_start** is the name of the pipeline from the script above. If you do not have Streamlit installed yet, do:
-```bash
+```sh
pip install streamlit
```
diff --git a/docs/website/docs/reference/command-line-interface.md b/docs/website/docs/reference/command-line-interface.md
index b37a3a118e..599ffd3ebd 100644
--- a/docs/website/docs/reference/command-line-interface.md
+++ b/docs/website/docs/reference/command-line-interface.md
@@ -8,7 +8,7 @@ keywords: [command line interface, cli, dlt init]
## `dlt init`
-```shell
+```sh
dlt init
```
This command creates a new dlt pipeline script that loads data from `source` to `destination`. When you run the command:
@@ -26,7 +26,7 @@ version if run again with existing `source` name. You are warned if files will b
You can use the `--location ` option to specify your own repository with sources. Typically you would [fork ours](https://github.com/dlt-hub/verified-sources) and start customizing and adding sources ie. to use them for your team or organization. You can also specify a branch with `--branch ` ie. to test a version being developed.
### List all verified sources
-```shell
+```sh
dlt init --list-verified-sources
```
Shows all available verified sources and their short descriptions. For each source, checks if your local `dlt` version requires update
@@ -43,7 +43,7 @@ that will add additional packages to current environment.
### github-action
-```shell
+```sh
dlt deploy