Merge pull request #20 from techindicium/major/change_models_to_get_data_from_system_tables

Major/change models to get data from system tables
cmagno-ind authored Sep 4, 2024
2 parents ea36223 + a9a2724 commit 6796ae9
Showing 22 changed files with 854 additions and 469 deletions.
29 changes: 29 additions & 0 deletions README.md
@@ -31,6 +31,31 @@ packages:
```
Then run `dbt deps` to finish the setup.

## About the sources

This package uses four main sources:

- list_prices
- usage
- warehouses
- clusters

`list_prices` and `usage` are Databricks system tables, located inside `system.billing`.

The `warehouses` and `clusters` tables contain information captured via REST API requests.

More information about the endpoints can be found here:

[Databricks REST API endpoints](https://docs.databricks.com/api/workspace/introduction)

[Warehouses endpoint](https://docs.databricks.com/api/workspace/warehouses/list)

[Clusters endpoint](https://docs.databricks.com/api/workspace/clusters/list)

You can use our adf tap to extract this information:

[Platform Meltano on Databricks](https://bitbucket.org/indiciumtech/platform_meltano_on_databricks/src/main/)
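
Once the four tables are available, they can be declared as dbt sources. Below is a minimal, illustrative sketch — the source names, catalog, and schema here are assumptions, not the package's actual `sources.yml`; adjust them to where your data lands:

```yaml
# Hypothetical sources.yml sketch; names and locations are illustrative.
version: 2

sources:
  - name: billing
    database: system          # Databricks system catalog
    schema: billing
    tables:
      - name: list_prices
      - name: usage

  - name: databricks_api      # raw tables landed by the extraction tap
    database: raw             # assumption: your raw catalog
    schema: databricks
    tables:
      - name: warehouses
      - name: clusters
```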

## Define database and schema
The location of the raw data used by this package is configurable, so it's important to set the following variables in `dbt_project.yml`:
```yaml
@@ -46,6 +71,10 @@ vars:
databricks_billing_schema: # name of the schema
```

Here, set the catalog and schema names to the location of the `warehouses` and `clusters` tables.
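
For example, with illustrative values (the catalog/schema names below are assumptions for your environment, not defaults shipped with the package):

```yaml
# Illustrative dbt_project.yml fragment; values are assumptions.
vars:
  databricks_billing_catalog: raw        # catalog containing warehouses/clusters
  databricks_billing_schema: databricks  # schema containing warehouses/clusters
```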

## Recommendation

We strongly recommend running this package in a job separate from the one that runs your models in production.

This prevents errors in the package from crashing your production models.
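
One way to achieve this isolation is dbt's `package:` selector method, run as its own scheduled job. The package name below is an assumption — substitute the name from this package's `dbt_project.yml`:

```shell
# Dedicated job: run only this package's models,
# keeping any failures away from the production run.
dbt run --select package:dbt_databricks_analytics

# Production job: run everything except this package.
dbt run --exclude package:dbt_databricks_analytics
```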
158 changes: 0 additions & 158 deletions models/calendar/dim_dates.sql

This file was deleted.

24 changes: 0 additions & 24 deletions models/marts/dim_databricks_analytics_cluster.sql

This file was deleted.

26 changes: 0 additions & 26 deletions models/marts/dim_databricks_analytics_cluster.yml

This file was deleted.

37 changes: 37 additions & 0 deletions models/marts/dim_databricks_analytics_clusters.sql
@@ -0,0 +1,37 @@
with
clusters as (
select *
from {{ ref('stg_databricks_analytics_clusters') }}
)

, create_surrogate_key as (
select
{{ dbt_utils.generate_surrogate_key(['cluster_id']) }} as cluster_sk
, cluster_id
, cluster_name
, cluster_source
, creator_user_name
, autotermination_minutes
, driver_node_type_id
, enable_elastic_disk
, enable_local_disk_encryption
, init_scripts_safe_mode
, instance_node_type_id
, last_state_loss_time
, node_type_id
, num_workers
, spark_context_id
, spark_version
, start_time
, cluster_state
, state_message
, terminated_time
, termination_reason_code
, termination_reason_type
, inserted_date
from clusters
)

select *
from create_surrogate_key
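
For context, `dbt_utils.generate_surrogate_key(['cluster_id'])` compiles to an `md5` hash over the null-coalesced column — roughly the following (a sketch; the exact cast type depends on the adapter and dbt_utils version):

```sql
-- Approximate compiled form of the macro above (adapter-dependent sketch).
md5(
    coalesce(cast(cluster_id as string), '_dbt_utils_surrogate_key_null_')
) as cluster_sk
```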

55 changes: 55 additions & 0 deletions models/marts/dim_databricks_analytics_clusters.yml
@@ -0,0 +1,55 @@
version: 2

models:
- name: dim_databricks_analytics_clusters
description: Information about the clusters of Databricks.
columns:
- name: cluster_sk
description: Primary key of the table
data_tests:
- unique
- not_null
- name: cluster_id
description: Unique identifier of the cluster
- name: cluster_name
description: Name of the cluster
- name: cluster_source
description: Source of the cluster
- name: creator_user_name
description: Name of the user who created the cluster
- name: autotermination_minutes
description: Number of minutes the cluster will wait for a job to run before terminating
- name: driver_node_type_id
description: Node type of the driver
- name: enable_elastic_disk
description: Whether the cluster is using elastic disk
- name: enable_local_disk_encryption
description: Whether the cluster is using local disk encryption
- name: init_scripts_safe_mode
description: Whether the cluster is using safe mode for init scripts
- name: instance_node_type_id
description: Node type of the instances
- name: last_state_loss_time
description: Date of the last time the cluster entered a state loss
- name: node_type_id
description: Node type of the cluster
- name: num_workers
description: Number of workers in the cluster
- name: spark_context_id
description: ID of the Spark context
- name: spark_version
description: Version of Spark
- name: start_time
description: Date when the cluster was started
- name: cluster_state
description: State of the cluster
- name: state_message
description: Message describing the state of the cluster
- name: terminated_time
description: Date when the cluster was terminated
- name: termination_reason_code
description: Code of the reason why the cluster was terminated
- name: termination_reason_type
description: Type of the reason why the cluster was terminated
- name: inserted_date
description: Date when the information was inserted in the table
18 changes: 0 additions & 18 deletions models/marts/dim_databricks_analytics_sku.sql

This file was deleted.

17 changes: 0 additions & 17 deletions models/marts/dim_databricks_analytics_sku.yml

This file was deleted.
