This dbt package contains macros that:
- can be (re)used across dbt projects running on Spark
- define Spark-specific implementations of dispatched macros from other packages
Check dbt Hub for the latest installation instructions, or read the docs for more information on installing packages.
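For reference, a typical packages.yml entry looks like the sketch below; the version shown is only a placeholder, so check dbt Hub for the latest release:

```yaml
# packages.yml — the version below is a placeholder; see dbt Hub for the
# latest released version of spark_utils.
packages:
  - package: dbt-labs/spark_utils
    version: 0.3.0
```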
This package provides "shims" for:
- dbt_utils, except for:
  - dbt_utils.get_relations_by_pattern
  - dbt_utils.groupby
  - dbt_utils.recency
  - dbt_utils.any_value
  - dbt_utils.listagg
  - dbt_utils.pivot with apostrophe(s) in the values
- snowplow (tested on Databricks only)
In order to use these "shims," you should set a dispatch config in your root project (on dbt v0.20.0 and newer). For example, with this project setting, dbt will first search for macro implementations inside the spark_utils package when resolving macros from the dbt_utils namespace:
```yaml
dispatch:
  - macro_namespace: dbt_utils
    search_order: ['spark_utils', 'dbt_utils']
```
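With that config in place, models call dbt_utils macros exactly as before; dbt simply checks spark_utils first for a Spark-specific implementation and falls back to the dbt_utils default otherwise. A minimal illustration, where the model and column names are made up for this example:

```sql
-- models/example_model.sql (illustrative; model and column names are hypothetical)
select
    -- resolved via dispatch: a spark_utils implementation is used when one exists
    {{ dbt_utils.generate_surrogate_key(['user_id', 'event_time']) }} as event_key,
    *
from {{ ref('raw_events') }}
```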
The spark-utils package may be able to provide compatibility for your package, especially if your package leverages dbt-utils macros for cross-database compatibility. This package does not need to be specified as a dependency of your package in packages.yml. Instead, you should encourage anyone using your package on Apache Spark / Databricks to:
- Install spark_utils alongside your package
- Add a dispatch config in their root project, like the one above
Caveat: These macros are not tested in CI, nor guaranteed to work on all platforms.
Each of these macros accepts a regex pattern, finds tables with names matching the pattern, and loops over those tables to perform a maintenance operation (see the example invocation after this list):
- spark_optimize_delta_tables: Runs optimize for all matched Delta tables
- spark_vacuum_delta_tables: Runs vacuum for all matched Delta tables
- spark_analyze_tables: Computes statistics for all matched tables
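These maintenance macros can be invoked ad hoc with dbt run-operation. The invocation below is only a sketch: the argument name shown is hypothetical, so check the macro definitions in this package for the exact parameter names and defaults.

```shell
# Sketch only — the argument name "table_regex_pattern" is hypothetical;
# check the macro source in this package for the real signature.
dbt run-operation spark_optimize_delta_tables --args '{"table_regex_pattern": "fct_.*"}'
```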
We welcome contributions to this repo! To contribute a new feature or a fix, please open a Pull Request with 1) your changes and 2) updated documentation in the README.md file.
The macros are tested with pytest and pytest-dbt-core. For example, the get_tables macro is tested by the following steps (a sketch of a reusable setup/cleanup fixture appears after the list):
- Create a test table (test setup):
  spark_session.sql(f"CREATE TABLE {table_name} (id int) USING parquet")
- Call the macro generator:
  tables = macro_generator()
- Assert test condition:
  assert simple_table in tables
- Delete the test table (test cleanup):
  spark_session.sql(f"DROP TABLE IF EXISTS {table_name}")
A macro is fetched using the macro_generator fixture, providing the macro name through indirect parameterization:
```python
@pytest.mark.parametrize(
    "macro_generator", ["macro.spark_utils.get_tables"], indirect=True
)
def test_create_table(
    macro_generator: MacroGenerator, simple_table: str
) -> None:
    tables = macro_generator()
    assert simple_table in tables
```
Getting started with dbt and Spark:
- What is dbt?
- Installation
- Join the #spark channel in dbt Slack
Everyone interacting in the dbt project's codebases, issue trackers, chat rooms, and mailing lists is expected to follow the PyPA Code of Conduct.