Docs: add a note to the Databricks docs on Azure (#1962)
Showing 1 changed file with 20 additions and 10 deletions.

@@ -10,7 +10,9 @@ keywords: [Databricks, destination, data warehouse]
*Big thanks to Evan Phillips and [swishbi.com](https://swishbi.com/) for contributing code, time, and a test environment.*

## Install dlt with Databricks

**To install the dlt library with Databricks dependencies:**

```sh
pip install "dlt[databricks]"
```

@@ -91,14 +93,17 @@ If you already have your Databricks workspace set up, you can skip to the [Loade
## Loader setup guide

**1. Initialize a project with a pipeline that loads to Databricks by running**

```sh
dlt init chess databricks
```

**2. Install the necessary dependencies for Databricks by running**

```sh
pip install -r requirements.txt
```

This will install dlt with the `databricks` extra, which contains the Databricks Python dbapi client.

**4. Enter your credentials into `.dlt/secrets.toml`.**

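For reference, a minimal `secrets.toml` sketch, assuming the standard Databricks SQL warehouse connection fields (all values below are placeholders):

```toml
[destination.databricks.credentials]
server_hostname = "my-workspace.cloud.databricks.com"  # placeholder workspace hostname
http_path = "/sql/1.0/warehouses/abcdef1234567890"     # placeholder SQL warehouse HTTP path
access_token = "MY_ACCESS_TOKEN"                       # personal access token
catalog = "my_catalog"                                 # Unity Catalog to load into
```
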
@@ -130,22 +135,22 @@ For more information on staging, see the [staging support](#staging-support) sec

## Supported file formats
* [insert-values](../file-formats/insert-format.md) is used by default.
* [JSONL](../file-formats/jsonl.md) is supported when staging is enabled (see limitations below).
* [Parquet](../file-formats/parquet.md) is supported when staging is enabled.

The JSONL format has some limitations when used with Databricks:

1. Compression must be disabled to load JSONL files into Databricks. Set `data_writer.disable_compression` to `true` in the dlt config when using this format (see the config sketch after this list).
2. The following data types are not supported when using the JSONL format with `databricks`: `decimal`, `json`, `date`, `binary`. Use `parquet` if your data contains these types.
3. The `bigint` data type with precision is not supported with the JSONL format.

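A minimal `config.toml` sketch for point 1, assuming the usual dlt layout where data writer options sit under the `normalize` section:

```toml
[normalize.data_writer]
disable_compression = true  # required to load uncompressed JSONL into Databricks
```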

## Staging support

Databricks supports Amazon S3, Azure Blob Storage, and Google Cloud Storage as staging locations. `dlt` will upload files in Parquet format to the staging location and will instruct Databricks to load data from there.

### Databricks and Amazon S3

Please refer to the [S3 documentation](./filesystem.md#aws-s3) for details on connecting your S3 bucket with the `bucket_url` and `credentials`.

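Typically this looks roughly like the following in `.dlt/secrets.toml` (bucket name and keys are placeholders; the filesystem docs linked above are authoritative for the exact field names):

```toml
[destination.filesystem]
bucket_url = "s3://your-bucket-name"  # placeholder bucket used for staging

[destination.filesystem.credentials]
aws_access_key_id = "please set me up!"
aws_secret_access_key = "please set me up!"
```
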
Example to set up Databricks with S3 as a staging destination:

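The full code block sits in lines not shown in this diff; a minimal sketch along the lines of the Azure example further down (pipeline name and bucket URL are placeholders):

```py
import dlt

# Load chess player data to Databricks, staging the load files on S3 first.
pipeline = dlt.pipeline(
    pipeline_name='chess_pipeline',
    destination='databricks',
    staging=dlt.destinations.filesystem('s3://your-bucket-name'),  # add this to activate the staging location
    dataset_name='player_data'
)
```
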
@@ -165,12 +170,18 @@ pipeline = dlt.pipeline(

### Databricks and Azure Blob Storage

Refer to the [Azure Blob Storage filesystem documentation](./filesystem.md#azure-blob-storage) for details on connecting your Azure Blob Storage container with the `bucket_url` and `credentials`.

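A sketch of the corresponding `.dlt/secrets.toml` entries, assuming the account-key credential fields described in the filesystem docs (all values are placeholders):

```toml
[destination.filesystem]
bucket_url = "abfss://container_name@storage_account_name.dfs.core.windows.net/path"

[destination.filesystem.credentials]
azure_storage_account_name = "storage_account_name"
azure_storage_account_key = "account_key"
```
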
To enable support for Azure Blob Storage with dlt, make sure to install the necessary dependencies by running:

```sh
pip install "dlt[az]"
```

:::note
Databricks requires that you use ABFS URLs in the following format: `abfss://container_name@storage_account_name.dfs.core.windows.net/path`.
dlt is able to adapt the other representation (i.e., `az://container-name/path`), but we recommend that you use the correct form.
:::

Example to set up Databricks with Azure as a staging destination:

@@ -184,7 +195,6 @@ pipeline = dlt.pipeline(
```py
pipeline = dlt.pipeline(
    pipeline_name='chess_pipeline',  # placeholder pipeline name
    destination='databricks',
    staging=dlt.destinations.filesystem('abfss://container_name@storage_account_name.dfs.core.windows.net/path'),  # add this to activate the staging location
    dataset_name='player_data'
)
```

### Databricks and Google Cloud Storage