Commit

Updated layout for filesystem docs
VioletM committed May 27, 2024
1 parent edaa4bb commit 2672dba
Showing 1 changed file with 28 additions and 24 deletions.
52 changes: 28 additions & 24 deletions docs/website/docs/dlt-ecosystem/destinations/filesystem.md
```sh
pip install s3fs
```
so pip does not fail on backtracking.
:::

## Initialise the dlt project

Let's start by initializing a new dlt project as follows:
```sh
dlt init chess filesystem
```
This command will initialize your pipeline with chess as the source and the AWS S3 filesystem as the destination.
:::

## Set up bucket storage and credentials

### AWS S3
The command above creates a sample `secrets.toml` and a requirements file for the AWS S3 bucket. You can install the dependencies by running:
```sh
pip install -r requirements.txt
You need to create an S3 bucket and a user who can access that bucket.
5. To get the access and secret keys for the user, go to IAM > Users, open “Security Credentials”, click “Create Access Key”, preferably select “Command Line Interface”, and create the access key.
6. Copy the “Access Key” and “Secret Access Key”; they are to be used in `secrets.toml`.

#### Using S3 compatible storage

To use an S3 compatible storage other than AWS S3, such as [MinIO](https://min.io/) or [Cloudflare R2](https://www.cloudflare.com/en-ca/developer-platform/r2/), supply an `endpoint_url` in the config. This should be set along with AWS credentials:

```toml
[destination.filesystem.credentials]
aws_access_key_id = "please set me up!" # copy the access key here
aws_secret_access_key = "please set me up!" # copy the secret access key here
endpoint_url = "https://<account_id>.r2.cloudflarestorage.com" # copy your endpoint URL here
```

#### Adding Additional Configuration

To pass any additional arguments to `fsspec`, you may supply `kwargs` and `client_kwargs` in the config as a **stringified dictionary**:

```toml
[destination.filesystem]
kwargs = '{"use_ssl": true, "auto_mkdir": true}'
client_kwargs = '{"verify": "public.crt"}'
```

### Google Storage
Run `pip install "dlt[gs]"`, which will install the `gcsfs` package.

To edit the `dlt` credentials file with your secret info, open `.dlt/secrets.toml`.
If you have default Google Cloud credentials in your environment (i.e. on cloud functions or compute instances), you can omit the credentials section and `dlt` will fall back to the available default.

Use **Cloud Storage** admin to create a new bucket. Then assign the **Storage Object Admin** role to your service account.

### Azure Blob Storage
Run `pip install "dlt[az]"` which will install the `adlfs` package to interface with Azure Blob Storage.

Edit the credentials in `.dlt/secrets.toml`; you'll see AWS credentials by default. Replace them with your Azure credentials.

Two forms of Azure credentials are supported:

#### SAS token credentials

Supply the storage account name and either a SAS token or a storage account key:

If you have the correct Azure credentials set up on your machine (e.g. via the Azure CLI), you can omit both `azure_storage_account_key` and `azure_storage_sas_token`, and `dlt` will fall back to the available default.
Note that `azure_storage_account_name` is still required, as it can't be inferred from the environment.

#### Service principal credentials

Supply a client ID, client secret, and tenant ID for a service principal authorized to access your container:

```toml
[destination.filesystem.credentials]
azure_client_id = "client_id" # please set me up!
azure_client_secret = "client_secret" # please set me up!
azure_tenant_id = "tenant_id" # please set me up!
```

### Local file system
If for any reason you want to store these files in a local folder, set `bucket_url` as follows (you are free to use `config.toml` for this, as no secrets are required):

```toml
[destination.filesystem]
bucket_url = "file:///absolute/path" # three / for an absolute path
```

:::tip
For handling deeply nested layouts, consider enabling automatic directory creation for the local filesystem destination. This can be done by setting `kwargs` in `secrets.toml`:

```toml
[destination.filesystem]
kwargs = '{"auto_mkdir": true}'
```

Or by setting an environment variable:
```sh
export DESTINATION__FILESYSTEM__KWARGS='{"auto_mkdir": true}'
```
:::
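The environment variable name follows `dlt`'s general convention: configuration section parts are upper-cased and joined with double underscores. A rough sketch of that mapping (illustrative only, not a `dlt` API):

```python
def dlt_env_var(*sections: str) -> str:
    """Derive a dlt-style environment variable name from config section parts."""
    return "__".join(s.upper() for s in sections)

# [destination.filesystem] kwargs  ->  DESTINATION__FILESYSTEM__KWARGS
print(dlt_env_var("destination", "filesystem", "kwargs"))
```

The same pattern applies to other settings, e.g. credentials under `[destination.filesystem.credentials]`.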

`dlt` correctly handles native local file paths as well. Indeed, the `file://` scheme may not be intuitive, especially for Windows users:

```toml
[destination.unc_destination]
bucket_url = 'C:\a\b\c'
```
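The `file://` form can be derived from a native path with the Python standard library, which also shows why an absolute path ends up with three slashes (a quick illustration, not a `dlt` API):

```python
from pathlib import PurePosixPath, PureWindowsPath

# An absolute POSIX path becomes a file:// URI with three slashes:
posix_uri = PurePosixPath("/absolute/path").as_uri()
print(posix_uri)  # file:///absolute/path

# A Windows drive path gains a slash before the drive letter:
win_uri = PureWindowsPath(r"C:\a\b\c").as_uri()
print(win_uri)  # file:///C:/a/b/c
```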


The filesystem destination configuration supports advanced layout customization and the inclusion of additional placeholders. This can be done through `config.toml` or programmatically when initializing via a factory method.


#### Configuration via `config.toml`

To configure the layout and placeholders using `config.toml`, use the following format:

```toml
[destination.filesystem]
layout = "{table_name}/{test_placeholder}/{YYYY}-{MM}-{DD}/{ddd}/{mm}/{load_id}.{file_id}.{ext}"
extra_placeholders = { "test_placeholder" = "test_value" }
current_datetime = "2024-04-14T00:00:00"
# for automatic directory creation in the local filesystem
kwargs = '{"auto_mkdir": true}'
```
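To see roughly how such a layout resolves, here is a simplified simulation of the placeholder substitution using plain `str.format` (illustrative only — `dlt` performs the real substitution internally, and the sample values below are hypothetical):

```python
from datetime import datetime

# A simplified layout (the full one also supports {ddd}, {mm}, etc.)
layout = "{table_name}/{test_placeholder}/{YYYY}-{MM}-{DD}/{load_id}.{file_id}.{ext}"

dt = datetime(2024, 4, 14)
values = {
    "table_name": "players",
    "test_placeholder": "test_value",  # from extra_placeholders
    "YYYY": f"{dt:%Y}",
    "MM": f"{dt:%m}",
    "DD": f"{dt:%d}",
    "load_id": "1714000000.0",  # hypothetical load id
    "file_id": "abc123",        # hypothetical file id
    "ext": "jsonl",
}

path = layout.format(**values)
print(path)  # players/test_value/2024-04-14/1714000000.0.abc123.jsonl
```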

managed in the regular way by the final destination you have configured.
You will also notice `init` files being present in the root folder and the special `dlt` folders. In the absence of the concepts of schemas and tables
in blob storages and directories, `dlt` uses these special files to harmonize the behavior of the `filesystem` destination with the other implemented destinations.

<!--@@@DLT_TUBA filesystem-->