Docs: add a note to the Databricks docs on Azure (#1962)
burnash authored Oct 21, 2024
docs/website/docs/dlt-ecosystem/destinations/databricks.md

*Big thanks to Evan Phillips and [swishbi.com](https://swishbi.com/) for contributing code, time, and a test environment.*

## Install dlt with Databricks

**To install the dlt library with Databricks dependencies:**

```sh
pip install "dlt[databricks]"
```
If you already have your Databricks workspace set up, you can skip to the [Loader setup guide](#loader-setup-guide).

## Loader setup guide

**1. Initialize a project with a pipeline that loads to Databricks by running**

```sh
dlt init chess databricks
```

**2. Install the necessary dependencies for Databricks by running**

```sh
pip install -r requirements.txt
```

This will install dlt with the `databricks` extra, which contains the Databricks Python dbapi client.

**3. Enter your credentials into `.dlt/secrets.toml`.**
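As a sketch, the credentials section typically looks like the following (all values are placeholders for your workspace hostname, SQL warehouse HTTP path, personal access token, and Unity Catalog name):

```toml
[destination.databricks.credentials]
server_hostname = "MY_DATABRICKS.azuredatabricks.net"
http_path = "/sql/1.0/warehouses/abcdefgh1234"
access_token = "MY_ACCESS_TOKEN"
catalog = "my_catalog"
```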
For more information on staging, see the [staging support](#staging-support) section below.

## Supported file formats
* [insert-values](../file-formats/insert-format.md) is used by default.
* [JSONL](../file-formats/jsonl.md) is supported when staging is enabled (see limitations below).
* [Parquet](../file-formats/parquet.md) is supported when staging is enabled.

The JSONL format has some limitations when used with Databricks:

1. Compression must be disabled to load JSONL files into Databricks. Set `data_writer.disable_compression` to `true` in the dlt config when using this format (see the snippet after this list).
2. The following data types are not supported when using the JSONL format with `databricks`: `decimal`, `json`, `date`, `binary`. Use Parquet if your data contains these types.
3. The `bigint` data type with precision is not supported with the JSONL format.
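For reference, a minimal sketch of disabling compression in `.dlt/config.toml`, assuming the standard `normalize.data_writer` section (the equivalent environment variable would be `NORMALIZE__DATA_WRITER__DISABLE_COMPRESSION`):

```toml
[normalize.data_writer]
disable_compression=true
```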

## Staging support

Databricks supports Amazon S3, Azure Blob Storage, and Google Cloud Storage as staging locations. `dlt` will upload files in Parquet format to the staging location and will instruct Databricks to load data from there.

### Databricks and Amazon S3

Please refer to the [S3 documentation](./filesystem.md#aws-s3) for details on connecting your S3 bucket with the `bucket_url` and `credentials`.
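For example, a minimal sketch of the `.dlt/secrets.toml` entries for S3 staging (bucket name and keys are placeholders):

```toml
[destination.filesystem]
bucket_url = "s3://your_bucket_name"

[destination.filesystem.credentials]
aws_access_key_id = "your_access_key_id"
aws_secret_access_key = "your_secret_access_key"
```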

Example of setting up Databricks with S3 as a staging destination:

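A minimal sketch of such a pipeline, assuming the placeholder bucket URL `s3://your_bucket_name` and mirroring the Azure example below:

```py
import dlt

# Create a dlt pipeline that loads chess data to Databricks,
# staging the load files on S3 first
pipeline = dlt.pipeline(
    pipeline_name='chess_pipeline',
    destination='databricks',
    staging=dlt.destinations.filesystem('s3://your_bucket_name'), # add this to activate the staging location
    dataset_name='player_data'
)
```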

### Databricks and Azure Blob Storage

Refer to the [Azure Blob Storage filesystem documentation](./filesystem.md#azure-blob-storage) for details on connecting your Azure Blob Storage container with the `bucket_url` and `credentials`.
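A minimal sketch of the corresponding `.dlt/secrets.toml` entries (container, account name, and key are placeholders):

```toml
[destination.filesystem]
bucket_url = "abfss://container_name@storage_account_name.dfs.core.windows.net"

[destination.filesystem.credentials]
azure_storage_account_name = "storage_account_name"
azure_storage_account_key = "your_account_key"
```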

To enable support for Azure Blob Storage with dlt, make sure to install the necessary dependencies by running:

```sh
pip install "dlt[az]"
```

:::note
Databricks requires that you use ABFS URLs in the following format: `abfss://container_name@storage_account_name.dfs.core.windows.net/path`.
dlt can adapt the other representation (i.e., `az://container-name/path`), but we recommend that you use the ABFSS form shown above.
:::

Example of setting up Databricks with Azure Blob Storage as a staging destination:

```py
import dlt

# Create a dlt pipeline that loads chess data to Databricks,
# staging the load files on Azure Blob Storage first
pipeline = dlt.pipeline(
    pipeline_name='chess_pipeline',
    destination='databricks',
    staging=dlt.destinations.filesystem('abfss://container_name@storage_account_name.dfs.core.windows.net'), # add this to activate the staging location
    dataset_name='player_data'
)
```

### Databricks and Google Cloud Storage