Merge pull request #20 from techindicium/major/change_models_to_get_data_from_system_tables

Major/change models to get data from system tables
cmagno-ind authored Sep 4, 2024
2 parents ea36223 + a9a2724 commit 6796ae9
Showing 22 changed files with 854 additions and 469 deletions.
29 changes: 29 additions & 0 deletions README.md
@@ -31,6 +31,31 @@ packages:
```
Then run `dbt deps` to finish the setup.

## About the sources

This package uses four main sources:

- list_prices
- usage
- warehouses
- clusters

`list_prices` and `usage` are Databricks system tables, located inside `system.billing`.

The `warehouses` and `clusters` tables contain information captured via REST API requests.

More information about the endpoints can be found here:

[Databricks REST API endpoints](https://docs.databricks.com/api/workspace/introduction)

[Warehouses endpoint](https://docs.databricks.com/api/workspace/warehouses/list)

[Clusters endpoint](https://docs.databricks.com/api/workspace/clusters/list)

You can use our adf tap to extract this information:

[Platform Meltano on Databricks](https://bitbucket.org/indiciumtech/platform_meltano_on_databricks/src/main/)
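
Once the four tables are available, they can be declared as dbt sources. Below is a minimal, illustrative sketch — the source names, catalog, and schema here are assumptions, not the package's actual `sources.yml`; adjust them to where your data lands:

```yaml
# Hypothetical sources.yml sketch; names and locations are illustrative.
version: 2

sources:
  - name: billing
    database: system          # Databricks system catalog
    schema: billing
    tables:
      - name: list_prices
      - name: usage

  - name: databricks_api      # raw tables landed by the extraction tap
    database: raw             # assumption: your raw catalog
    schema: databricks
    tables:
      - name: warehouses
      - name: clusters
```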

## Define database and schema
The location of the raw data used by this package is configurable, so it's important to set the following variables in `dbt_project.yml`:
```yaml
@@ -46,6 +71,10 @@ vars:
databricks_billing_schema: # name of the schema
```

Here, set the catalog and schema names to the location of the `warehouses` and `clusters` tables.
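
For example, with illustrative values (the catalog/schema names below are assumptions for your environment, not defaults shipped with the package):

```yaml
# Illustrative dbt_project.yml fragment; values are assumptions.
vars:
  databricks_billing_catalog: raw        # catalog containing warehouses/clusters
  databricks_billing_schema: databricks  # schema containing warehouses/clusters
```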

## Recommendation

We strongly recommend running this package in a job separate from the one that runs your models in production.

This prevents errors in the package from crashing your production models.
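
One way to achieve this isolation is dbt's `package:` selector method, run as its own scheduled job. The package name below is an assumption — substitute the name from this package's `dbt_project.yml`:

```shell
# Dedicated job: run only this package's models,
# keeping any failures away from the production run.
dbt run --select package:dbt_databricks_analytics

# Production job: run everything except this package.
dbt run --exclude package:dbt_databricks_analytics
```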
158 changes: 0 additions & 158 deletions models/calendar/dim_dates.sql

This file was deleted.

24 changes: 0 additions & 24 deletions models/marts/dim_databricks_analytics_cluster.sql

This file was deleted.

26 changes: 0 additions & 26 deletions models/marts/dim_databricks_analytics_cluster.yml

This file was deleted.

37 changes: 37 additions & 0 deletions models/marts/dim_databricks_analytics_clusters.sql
@@ -0,0 +1,37 @@
with
clusters as (
select *
from {{ ref('stg_databricks_analytics_clusters') }}
)

, create_surrogate_key as (
select
{{ dbt_utils.generate_surrogate_key(['cluster_id']) }} as cluster_sk
, cluster_id
, cluster_name
, cluster_source
, creator_user_name
, autotermination_minutes
, driver_node_type_id
, enable_elastic_disk
, enable_local_disk_encryption
, init_scripts_safe_mode
, instance_node_type_id
, last_state_loss_time
, node_type_id
, num_workers
, spark_context_id
, spark_version
, start_time
, cluster_state
, state_message
, terminated_time
, termination_reason_code
, termination_reason_type
, inserted_date
from clusters
)

select *
from create_surrogate_key
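
For context, `dbt_utils.generate_surrogate_key(['cluster_id'])` compiles to an `md5` hash over the null-coalesced column — roughly the following (a sketch; the exact cast type depends on the adapter and dbt_utils version):

```sql
-- Approximate compiled form of the macro above (adapter-dependent sketch).
md5(
    coalesce(cast(cluster_id as string), '_dbt_utils_surrogate_key_null_')
) as cluster_sk
```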

55 changes: 55 additions & 0 deletions models/marts/dim_databricks_analytics_clusters.yml
@@ -0,0 +1,55 @@
version: 2

models:
- name: dim_databricks_analytics_clusters
description: Information about the clusters of Databricks.
columns:
- name: cluster_sk
description: Primary key of the table
data_tests:
- unique
- not_null
- name: cluster_id
description: Unique identifier of the cluster
- name: cluster_name
description: Name of the cluster
- name: cluster_source
description: Source of the cluster
- name: creator_user_name
description: Name of the user who created the cluster
- name: autotermination_minutes
description: Number of minutes the cluster will wait for a job to run before terminating
- name: driver_node_type_id
description: Node type of the driver
- name: enable_elastic_disk
description: Whether the cluster is using elastic disk
- name: enable_local_disk_encryption
description: Whether the cluster is using local disk encryption
- name: init_scripts_safe_mode
description: Whether the cluster is using safe mode for init scripts
- name: instance_node_type_id
description: Node type of the instances
- name: last_state_loss_time
description: Date of the last time the cluster entered a state loss
- name: node_type_id
description: Node type of the cluster
- name: num_workers
description: Number of workers in the cluster
- name: spark_context_id
description: ID of the Spark context
- name: spark_version
description: Version of Spark
- name: start_time
description: Date when the cluster was started
- name: cluster_state
description: State of the cluster
- name: state_message
description: Message describing the state of the cluster
- name: terminated_time
description: Date when the cluster was terminated
- name: termination_reason_code
description: Code of the reason why the cluster was terminated
- name: termination_reason_type
description: Type of the reason why the cluster was terminated
- name: inserted_date
description: Date when the information was inserted in the table
18 changes: 0 additions & 18 deletions models/marts/dim_databricks_analytics_sku.sql

This file was deleted.

17 changes: 0 additions & 17 deletions models/marts/dim_databricks_analytics_sku.yml

This file was deleted.
