Merge pull request #80 from dfe-analytical-services/databricks_fundamentals

Databricks fundamentals
jen-machin authored Sep 6, 2024
2 parents 17a2926 + aecd5d4 commit 02bed1d
Showing 33 changed files with 898 additions and 0 deletions.
234 changes: 234 additions & 0 deletions ADA/databricks_fundamentals.qmd

Large diffs are not rendered by default.

60 changes: 60 additions & 0 deletions ADA/databricks_notebooks.qmd
@@ -0,0 +1,60 @@
---
title: "Databricks Notebooks"
---

------------------------------------------------------------------------

## Notebooks

------------------------------------------------------------------------

Notebooks are a special kind of script that Databricks supports. They consist of code blocks and markdown blocks, which can contain formatted text, links and images. Because they combine markdown with code, they are very well suited to creating and documenting data pipelines, such as creating a core dataset that underpins your other products. They are particularly powerful when parameterised and used in conjunction with [Workflows](ADA/Databricks_workflows.qmd).

You can create a notebook in your workspace, either in a folder or a repository.\
To do this, locate the folder or repository you want to create the notebook in, then click the 'Create' button and select 'Notebook'.

::: callout-tip
Any notebooks used for core business processes should be created in a repository linked to GitHub/DevOps, where they can be version controlled.
:::

Once you've created a notebook it will automatically be opened. Any changes you make are saved in real time, so the notebook will always keep the latest version of its contents. To 'save' a snapshot of your work, it is recommended to use git commits.

You can change the title from 'Untitled Notebook *\<timestamp\>*' (1), and set its default language in the drop down immediately to the right of the notebook title (2).

![](/images/ada-notebook.png)

The default language is the language the notebook will assume all code chunks are written in. In the screenshot above the default language is 'R', so all chunks will be assumed to be written in R unless otherwise specified.

You can also add markdown cells to add text, links and graphics to your notebook in order to document the processing done within it.

To add a new code or markdown chunk, move the mouse above or below another chunk and the buttons '+Code' and '+Text' will appear.

![](/images/ada-notebook-add-chunk.png)

To run code chunks you'll first need to attach a compute resource to the notebook by clicking the 'Connect' button in the top right hand side of the page.

![](/images/ada-notebook-attach.png)

You can run a code chunk either by pressing the play button in the top left corner of the chunk, or by pressing Ctrl + Return/Enter on the keyboard. Any outputs that result from the code will be displayed underneath the chunk.

::: callout-tip
## In R

If you try to `View()` a data frame in R you'll notice that the function doesn't work within Databricks. Instead, Databricks provides the `display()` function for R users to view their data with.
:::

![](/images/ada-notebook-chunk-output.png)

Everything run in a notebook shares its own 'session', meaning that later chunks have access to variables, functions, etc. that were defined above. Chunks can be run manually, however doing this risks running code out of order, which may produce unexpected results. To avoid this, all chunks can be run in order from the beginning of the notebook using the 'Run all' button at the top of the page. Alternatively, you can 'Run all above' or 'Run all below' from any code chunk.

Notebooks cannot share a session with another notebook, so bear this in mind when constructing your workflows. If you need to pass data between notebooks, it can be written out to a table in the Unity Catalog using SQL / R / Python / Scala, as you would write to a SQL table in SQL Server. This table can then be accessed from later notebooks.

Notebooks can be parameterised using '[widgets](https://docs.Databricks.com/en/notebooks/widgets.html)', meaning a single notebook can be re-used with different inputs. This means they can be used in a similar way to a function in R/Python or a stored procedure in SQL.

::: callout-note
## Don't repeat yourself (DRY)

A coding best-practice is to build components that can be re-used to perform many similar tasks rather than writing repetitive code.

This applies equally to notebooks or scripts you create within Databricks which can be made re-usable through parameters. To reference a parameter within a notebook you can use the syntax `:parameter_name` from Databricks Runtime 15+. In previous DBR versions the syntax was `${parameter_name}`.
:::
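
As a rough illustration of the re-use described above, the sketch below reads a notebook parameter in R. It assumes the widget functions are available as `dbutils.widgets.text()` and `dbutils.widgets.get()`, as they are used elsewhere in this guide. The helper `get_param()` and the parameter name `academic_year` are hypothetical and only here for illustration.

``` r
# Hypothetical helper: read a widget value when running inside Databricks,
# or fall back to a default when the widget functions are not available.
get_param <- function(name, default) {
  if (exists("dbutils.widgets.get")) {
    dbutils.widgets.get(name)
  } else {
    default
  }
}

# Inside Databricks you would create the widget first, for example:
# dbutils.widgets.text("academic_year", "2023")
academic_year <- get_param("academic_year", default = "2023")

message("Processing data for academic year ", academic_year)
```

The same notebook could then be run for a different year by changing the widget value rather than editing the code.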
249 changes: 249 additions & 0 deletions ADA/databricks_workflow_script_databricks.qmd
@@ -0,0 +1,249 @@
---
title: "Scripting Workflows in Databricks"
---

Workflows can be constructed through the Databricks Workflows user interface (UI), however for large or complex workflows the UI can be a time-consuming way to build a workflow. In these scenarios it is quicker and more in line with RAP principles to script your workflow.

For a pipeline to be built there must be scripts, queries or notebooks available for Databricks to read, located either in your workspace or in a Git repository.

For this example we will create a folder in our workspace, create two test notebooks to comprise the workflow, and a third to script the job and set it off running. We'll also set it up to notify us by email when the workflow successfully completes.

1. Create a folder in your workspace to store your notebooks. First click 'Workspace' in the sidebar (1), then navigate to your user folder (2). Then click the blue 'Create' button (3) and select 'Folder'. Give the folder a name (4) such as 'Test workflow' and then click the blue 'Create' button (5).\
\
![](/images/ada-workflow-folder.png)

2. Once in your new folder, click the blue 'Create' button again, this time choosing 'Notebook'. Once in your new notebook, retitle it to 'Test task 1' (1), and set the default language to R (2). Then in the first code chunk write `print("This is a test task")` (3).\
![](/images/ada-workflow-notebook-test1.png)

3. Create a second notebook in the same folder titled 'Test task 2', and in the first code chunk write `print("This is another test task")`.\
![](/images/ada-workflow-notebook-test-2.png)

4. Create a third notebook and title it 'Create and run job'. In the first cell load the `tidyverse` package. Install the `devtools` package and load it, then use its `install_github()` function to install the `databricks` package. Load the newly installed `databricks` package.

``` r
library(tidyverse)

install.packages("devtools")
library(devtools)
install_github("databrickslabs/databricks-sdk-r")
library(databricks)
```

5. Create a new code chunk and create a [text widget](https://docs.databricks.com/en/notebooks/widgets.html) to contain your Databricks access token. Run this cell to create the widget at the top of the page. Once the widget is there, enter your personal access token into the text box.

``` r
dbutils.widgets.text("api_token", "")
```

::: callout-note
## Databricks access token

A personal access token (PAT) is a unique code that lets Databricks know who is accessing it from the outside. It functions as a password and so **must not be shared with anyone**.

If you haven't already generated a Databricks token you can find instructions on how to do so in the [Setup Databricks personal compute cluster with RStudio](databricks_rstudio_personal_cluster.qmd) article.
:::

::: callout-important
## Don't put your token into the code

The reason we're using a widget here for your access token is that we don't want to risk anyone else being able to view your PAT. If we were to hardcode it into the notebook, anyone with access to the code would be able to copy your PAT and 'masquerade' as you.
:::

6. We can now connect to the API through the `databricks` package using the `databricks::DatabricksClient()` function. It requires the `host`, which is the URL of the Databricks platform up to and including the first `/` after the domain name, and your token. We'll store the result in a variable called `client` as we need to pass this to the other functions in the `databricks` library.\
We can then use the `databricks::clustersList()` function to fetch a list of the clusters, which we can view using `display()`.

``` r
host <- "https://adb-5037484389568426.6.azuredatabricks.net/"
api_token <- dbutils.widgets.get("api_token")

client <- databricks::DatabricksClient(host = host, token = api_token)

clusters <- databricks::clustersList(client)

display(clusters %>% select(cluster_id, creator_user_name, cluster_name, spark_version, cluster_cores, cluster_memory_mb))
```

![](/images/ada-workflows-databricks-api-cluster.png)

::: callout-note
## Clusters

The `databricks::clustersList()` function will return any clusters that you have permission to see.

The data returned by the function is hierarchical, and a single 'column' may contain several other columns. As the `display()` function renders a table, you'll have to select only columns that `display()` knows how to show. Generally, these are the columns at the left-most position in the output of `str(databricks::clustersList(client))`, which shows the structure of the data.
:::
7. Make a note of your cluster ID and save it in a variable called `cluster_id`. You could automate this step by filtering the `clusters` data frame, as long as you ensure that it only returns a single `cluster_id`.
``` r
cluster_id <- "<your cluster id>"
```
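
If you do want to automate this step, one possible approach is sketched below. The `clusters` data frame here is a hypothetical stand-in with made-up IDs and names so the example runs on its own; in the notebook you would filter the data frame returned by `databricks::clustersList()` instead. The check on the number of rows is the important part, as it stops the code if the filter matches no clusters or more than one.

``` r
# Hypothetical stand-in for the data frame returned by clustersList(client)
clusters <- data.frame(
  cluster_id = c("0000-000000-aaaaaaaa", "0000-000000-bbbbbbbb"),
  cluster_name = c("Shared analysis cluster", "My personal cluster"),
  stringsAsFactors = FALSE
)

# Keep only the cluster with the name we are looking for
my_cluster <- clusters[clusters$cluster_name == "My personal cluster", ]

# Stop early unless exactly one cluster matched the filter
stopifnot(nrow(my_cluster) == 1)

cluster_id <- my_cluster$cluster_id
```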
8. Create a new code block and we'll start by setting some parameters for the job. Firstly we'll need a `job_name`, and the paths to the notebooks we want to include in the workflow. We'll also need to create a unique `task_key` for each of the notebook tasks we're going to set up.

``` r
job_name <- "test job"
# Replace these with the workspace paths of your own notebooks
first_notebook_path <- "/Users/<your email>/R SDK/Test Notebook"
second_notebook_path <- "/Users/<your email>/R SDK/Test Notebook 2"
task_key_1 <- "test_key"
task_key_2 <- "test_key_2"
```

We can then define the tasks as lists. There are many options available when creating a task, a full list of which can be found in the [tasks section of the job API documentation](https://docs.databricks.com/api/workspace/jobs/create#tasks). When reading this documentation any parameter that is marked as an *object* needs to be passed as a *list* (`list()`) in R, and anything marked as an *array* should be passed as a *vector* (`c()`).

For the first task we'll give it the first `task_key` we created above, and tell it to run on our existing cluster by passing the ID of our cluster to `existing_cluster_id`. We'll then specify that it is a `notebook_task` and pass that a list with the `notebook_path` and the `source`, which we will set to `WORKSPACE` (as opposed to a Git repository) for the purposes of this tutorial.

``` r
first_job_task <- list(
  task_key = task_key_1,
  existing_cluster_id = cluster_id,
  notebook_task = list(
    notebook_path = first_notebook_path,
    source = "WORKSPACE"
  )
)

For the second task we will do the same, but using the second `task_key` and `notebook_path` we defined. In addition, we'll also add a `depends_on` clause with the previous `task_key` (passed in a list), and specify it is only to `run_if` `ALL_SUCCESS`. This means that the second task won't begin processing unless all of the tasks it `depends_on` have completed successfully.

``` r
second_job_task <- list(
  task_key = task_key_2,
  existing_cluster_id = cluster_id,
  notebook_task = list(
    notebook_path = second_notebook_path,
    source = "WORKSPACE"
  ),
  depends_on = list(task_key = task_key_1),
  run_if = "ALL_SUCCESS"
)
```
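
Since both task definitions share most of their structure, they could also be built with a small helper function. The sketch below is one way this might look; `make_notebook_task()` and its `depends_on_key` argument are hypothetical names, not part of the `databricks` package, and the values passed in are placeholders. It mirrors the shape of the lists used in this step, including how `depends_on` is written, and uses `str()` to check the nesting before anything is sent to the API.

``` r
# Hypothetical helper: builds the list describing a single notebook task.
# If depends_on_key is supplied, the task only runs after that task succeeds.
make_notebook_task <- function(task_key, notebook_path, cluster_id,
                               depends_on_key = NULL) {
  task <- list(
    task_key = task_key,
    existing_cluster_id = cluster_id,
    notebook_task = list(
      notebook_path = notebook_path,
      source = "WORKSPACE"
    )
  )
  if (!is.null(depends_on_key)) {
    task$depends_on <- list(task_key = depends_on_key)
    task$run_if <- "ALL_SUCCESS"
  }
  task
}

# Placeholder values for illustration only
example_tasks <- list(
  make_notebook_task("test_key", "/path/to/first notebook",
                     "<your cluster id>"),
  make_notebook_task("test_key_2", "/path/to/second notebook",
                     "<your cluster id>", depends_on_key = "test_key")
)

# str() prints the nested structure so you can check what belongs where
str(example_tasks)
```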

9. Now we have both of our tasks defined we can create the job using the `databricks::jobsCreate()` function. We pass it the `client` as the first argument, then the job `name` we defined. The `tasks` are passed as a list which contains each of the task lists we built above.\
We'll also tell it to send us `email_notifications` by passing a list with an `on_success` value of email addresses.\
\
The function returns the ID of the job we just created, so we will want to store the response in a variable called `workflow` so we can refer to it later.

``` r
workflow <- jobsCreate(client,
                       name = job_name,
                       tasks = list(first_job_task,   #list
                                    second_job_task), #list
                       email_notifications = list(
                         on_success = c("your-email")
                       ))
```

::: callout-note
## Lists of lists

A `list()` in R is used to contain any number and type of data, including other `list()`s. This makes it excellent for storing hierarchical data in one place, however it can get quite confusing quite quickly.

**Sometimes it's easier to break these `list()`s up into pieces** by defining them separately, as we did above by defining the task lists separately then passing them to the `tasks` argument in the `jobsCreate()` function.
This often makes the code easier to think about and construct, and it certainly makes it easier to read. Consider the code below, which does exactly the same thing as the code above but is written all at once.
``` r
workflow <- jobsCreate(client,
  name = job_name,
  tasks = list(
    list(
      task_key = task_key_1,
      existing_cluster_id = cluster_id,
      notebook_task = list(
        notebook_path = first_notebook_path,
        source = "WORKSPACE"
      )
    ),
    list(
      task_key = task_key_2,
      existing_cluster_id = cluster_id,
      notebook_task = list(
        notebook_path = second_notebook_path,
        source = "WORKSPACE"
      ),
      depends_on = list(task_key = task_key_1),
      run_if = "ALL_SUCCESS"
    )
  ),
  email_notifications = list(
    on_success = c("your-email")
  )
)
```
We can see here that the code is getting very long, and it is also more difficult to see which options relate to which list. If it weren't for being diligent with indentation here, we'd have to resort to counting brackets to see what belonged where. This is especially problematic if you accidentally delete a bracket and need to work out where it was meant to go.
:::
10. We can now get the ID of the job that was created and tell the API to run the job. In a new code chunk we'll store the `job_id` from the `workflow` variable above. We'll then use the `databricks::jobsRunNow()` function to tell it to run the workflow we just created by passing it the `job_id` we just stored. We'll also store the `job_run_id` returned by the `databricks::jobsRunNow()` function.

``` r
job_id <- workflow$job_id

job_run <- jobsRunNow(client,
                      job_id = job_id)

job_run_id <- job_run$run_id
```

We will now use this to create links to the job and the specific run of the job we just set off.

11. In a new code cell, define a `job_link` by `paste0()`ing the `host` variable we passed to the `databricks::DatabricksClient()` function earlier, followed by `"jobs/"`, followed by the `job_id` defined above.\
We can then create a `job_run_link` by `paste0()`ing together the `job_link` followed by `"/runs/"` then the `job_run_id` from the previous step.\
We can then output the `job_link` as text at the bottom of the cell.

``` r
job_link <- paste0(host, "jobs/", job_id)
job_run_link <- paste0(job_link, "/runs/", job_run_id)
job_link
```
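
The link construction above relies on `host` ending in a `/`. If you would rather not depend on that, the sketch below shows one way to build both links so that they come out the same with or without the trailing slash. `build_job_links()` is a hypothetical helper, and the host and IDs in the usage example are made-up values.

``` r
# Hypothetical helper: builds the job URL and the run URL from their parts.
build_job_links <- function(host, job_id, run_id) {
  base <- sub("/+$", "", host)  # drop any trailing slashes from the host
  job_link <- paste0(base, "/jobs/", job_id)
  list(
    job = job_link,
    run = paste0(job_link, "/runs/", run_id)
  )
}

# Made-up values for illustration only
links <- build_job_links("https://example.azuredatabricks.net/", 123, 456)
links$job
links$run
```

In the notebook you would pass the `host`, `job_id` and `job_run_id` variables defined in the earlier steps.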

12. In a new cell, output the `job_run_link`.

![](/images/ada-workflow-output-link.png)

::: callout-note
## Output limits on code chunks

Each chunk will display an output (assuming there are any) underneath the chunk once it has been run. Each chunk is limited to a single output though, which defaults to the last output generated.

So if we had written a cell to output both links at the same time, we would still only see the `job_run_link`.

![](/images/ada-workflows-output-limitations.png)
:::

13. Now click on the links and check they work.

14. You've now created a workflow with code, and each time you re-run this notebook another workflow with the same name will be created. As this is a tutorial which most analysts may have to follow at some point, the logical conclusion is that we will end up with hundreds of 'test jobs' cluttering up the workflow page.\
\
To avoid that, let's use the `databricks::jobsDelete()` function to clean up after ourselves. All we need to do is pass the function the `client` and `job_id` variables from above.

``` r
jobsDelete(client, job_id)
```

::: callout-note
If you have been running and re-running bits of this code iteratively, there's a good chance you already have several instances of 'test job' listed under your name.
![](/images/ada-workflows-multiple-jobs.png)
If this is the case we'll want to clean up each of these, ideally without having to manually click through the UI process for each one.

To do this, firstly call the `databricks::jobsList()` function, passing it the `client` variable, and specifying the `name` of the jobs you want to list. Then filter the list to just the jobs with a `creator_user_name` of your email address. To see the resulting jobs use the `display()` function as below at the bottom of a code chunk.

``` r
my_jobs <- jobsList(client,
                    name = "test job") %>%
  filter(creator_user_name == "<your email>")

display(my_jobs %>% select(job_id, creator_user_name, run_as_user_name, created_time))
```

We can now loop through the individual `job_id`s contained in `my_jobs` and use the `databricks::jobsDelete()` function to remove them all at once, programmatically.

``` r
for (job_id in my_jobs$job_id) {
  jobsDelete(client, job_id)
}
```
:::
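
Because deleting jobs cannot easily be undone, it may be worth checking what a loop like this would remove before running it for real. The sketch below wraps the loop in a hypothetical `delete_jobs()` helper with a `dry_run` switch. The `fake_delete()` function is a stand-in so that the example runs outside Databricks; it is not part of the `databricks` package.

``` r
# Hypothetical helper: applies delete_fn to each job ID, or only reports
# which jobs would be deleted when dry_run is TRUE.
delete_jobs <- function(job_ids, delete_fn, dry_run = TRUE) {
  for (id in job_ids) {
    if (dry_run) {
      message("Would delete job ", id)
    } else {
      delete_fn(id)
    }
  }
  invisible(job_ids)
}

# Stand-in for the real delete call: it just records the IDs it receives
deleted <- c()
fake_delete <- function(id) deleted <<- c(deleted, id)

delete_jobs(c(101, 102), fake_delete, dry_run = TRUE)   # nothing is deleted
delete_jobs(c(101, 102), fake_delete, dry_run = FALSE)  # both IDs are passed on
```

In the notebook, `delete_fn` could be `function(id) jobsDelete(client, id)` and `job_ids` could be `my_jobs$job_id`.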