Adds support for COPY TO/FROM Azure Blob Storage
Supports the following Azure Blob URI forms:
- `az://{container}/key`
- `azure://{container}/key`
- `https://{account}.blob.core.windows.net/{container}/key`
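A minimal sketch of how the three forms could be told apart, using only the Rust standard library. `AzureBlobLocation` and `parse_azure_uri` are hypothetical names chosen for illustration; they are not taken from this commit, which enables the `azure` feature of the `object_store` crate instead.

```rust
/// Parts of a blob location, split out of one of the three URI forms.
#[derive(Debug, PartialEq)]
struct AzureBlobLocation {
    /// Only the https form carries the account name in the URI itself.
    account: Option<String>,
    container: String,
    key: String,
}

/// Hypothetical helper: split an Azure Blob URI into its parts.
/// Returns `None` for any URI that matches none of the three forms.
fn parse_azure_uri(uri: &str) -> Option<AzureBlobLocation> {
    // Short forms: az://{container}/key and azure://{container}/key
    for scheme in ["az://", "azure://"] {
        if let Some(rest) = uri.strip_prefix(scheme) {
            let (container, key) = rest.split_once('/')?;
            if container.is_empty() || key.is_empty() {
                return None;
            }
            return Some(AzureBlobLocation {
                account: None,
                container: container.to_string(),
                key: key.to_string(),
            });
        }
    }

    // Long form: https://{account}.blob.core.windows.net/{container}/key
    let rest = uri.strip_prefix("https://")?;
    let (host, path) = rest.split_once('/')?;
    let account = host.strip_suffix(".blob.core.windows.net")?;
    let (container, key) = path.split_once('/')?;
    if account.is_empty() || container.is_empty() || key.is_empty() {
        return None;
    }
    Some(AzureBlobLocation {
        account: Some(account.to_string()),
        container: container.to_string(),
        key: key.to_string(),
    })
}
```

With the short forms the account name is not part of the URI, so it has to come from the configuration described next.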

**Configuration**

The simplest way to configure object storage is by creating the standard [`~/.azure/config`](https://learn.microsoft.com/en-us/cli/azure/azure-cli-configuration?view=azure-cli-latest) file:

```bash
$ cat ~/.azure/config
[storage]
account = devstoreaccount1
key = Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==
```

Alternatively, you can use the following environment variables when starting postgres to configure the Azure Blob Storage client:
- `AZURE_STORAGE_ACCOUNT`: the storage account name of the Azure Blob
- `AZURE_STORAGE_KEY`: the storage key of the Azure Blob
- `AZURE_STORAGE_SAS_TOKEN`: the storage SAS token for the Azure Blob
- `AZURE_CONFIG_FILE`: an alternative location for the config file
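The lookup described above can be sketched as follows. This is an illustration under assumptions, not the commit's code: the commit adds the `rust-ini` crate, presumably for reading the config file, while this sketch hand-rolls a tiny parser to stay dependency-free. The helper names are hypothetical, and the precedence shown (an environment variable overrides the file) is an assumption that the description does not state.

```rust
use std::collections::HashMap;

/// Hypothetical helper: collect `key = value` pairs from the `[storage]`
/// section of an azure-cli style config file.
fn parse_storage_section(config_text: &str) -> HashMap<String, String> {
    let mut values = HashMap::new();
    let mut in_storage = false;
    for line in config_text.lines() {
        let line = line.trim();
        // Skip blank lines and comments.
        if line.is_empty() || line.starts_with('#') || line.starts_with(';') {
            continue;
        }
        // A section header switches the current section.
        if line.starts_with('[') && line.ends_with(']') {
            in_storage = line == "[storage]";
            continue;
        }
        if in_storage {
            // Split on the first '=' only: base64 keys end with '=' padding.
            if let Some((key, value)) = line.split_once('=') {
                values.insert(key.trim().to_string(), value.trim().to_string());
            }
        }
    }
    values
}

/// Assumed precedence: an environment value, when present, wins over the file.
fn resolve(
    env_value: Option<String>,
    file: &HashMap<String, String>,
    file_key: &str,
) -> Option<String> {
    env_value.or_else(|| file.get(file_key).cloned())
}
```

A caller would pass something like `std::env::var("AZURE_STORAGE_ACCOUNT").ok()` as `env_value` and `"account"` as `file_key`.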

**Bonus**
Additionally, this PR supports the following S3 URI forms:
- `s3://{bucket}/key`
- `s3a://{bucket}/key`
- `https://s3.amazonaws.com/{bucket}/key`
- `https://{bucket}.s3.amazonaws.com/key`
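The two `https` forms differ in where the bucket lives: in the first path segment (path style) or in the host name (virtual-hosted style). A minimal sketch of that distinction, with hypothetical helper names that are not taken from this commit:

```rust
/// Hypothetical helper: extract `(bucket, key)` from the four S3 URI forms.
fn parse_s3_uri(uri: &str) -> Option<(String, String)> {
    // s3://{bucket}/key and s3a://{bucket}/key
    for scheme in ["s3://", "s3a://"] {
        if let Some(rest) = uri.strip_prefix(scheme) {
            let (bucket, key) = rest.split_once('/')?;
            return non_empty(bucket, key);
        }
    }

    let rest = uri.strip_prefix("https://")?;
    let (host, path) = rest.split_once('/')?;
    if host == "s3.amazonaws.com" {
        // Path style: https://s3.amazonaws.com/{bucket}/key
        let (bucket, key) = path.split_once('/')?;
        non_empty(bucket, key)
    } else {
        // Virtual-hosted style: https://{bucket}.s3.amazonaws.com/key
        let bucket = host.strip_suffix(".s3.amazonaws.com")?;
        non_empty(bucket, path)
    }
}

/// Reject locations with an empty bucket or an empty key.
fn non_empty(bucket: &str, key: &str) -> Option<(String, String)> {
    if bucket.is_empty() || key.is_empty() {
        None
    } else {
        Some((bucket.to_string(), key.to_string()))
    }
}
```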

Closes #50
aykut-bozkurt committed Nov 28, 2024
1 parent fb34b7d commit 0a3281f
Showing 10 changed files with 557 additions and 56 deletions.
8 changes: 8 additions & 0 deletions .devcontainer/.env
@@ -6,6 +6,14 @@ AWS_S3_TEST_BUCKET=testbucket
MINIO_ROOT_USER=minioadmin
MINIO_ROOT_PASSWORD=minioadmin

# Azure Blob tests
AZURE_STORAGE_ACCOUNT=devstoreaccount1
AZURE_STORAGE_KEY="Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw=="
AZURE_STORAGE_CONNECTION_STRING="DefaultEndpointsProtocol=http;AccountName=devstoreaccount1;AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;BlobEndpoint=http://localhost:10000/devstoreaccount1;"
AZURE_TEST_CONTAINER_NAME=testcontainer
AZURE_TEST_READ_ONLY_SAS="se=2100-05-05&sp=r&sv=2022-11-02&sr=c&sig=YMPFnAHKe9y0o3hFegncbwQTXtAyvsJEgPB2Ne1b9CQ%3D"
AZURE_TEST_READ_WRITE_SAS="se=2100-05-05&sp=rcw&sv=2022-11-02&sr=c&sig=TPz2jEz0t9L651t6rTCQr%2BOjmJHkM76tnCGdcyttnlA%3D"

# Others
RUST_TEST_THREADS=1
PG_PARQUET_TEST=true
5 changes: 5 additions & 0 deletions .devcontainer/Dockerfile
@@ -12,6 +12,11 @@ RUN apt-get update && apt-get -y install build-essential libreadline-dev zlib1g-
curl lsb-release ca-certificates gnupg sudo git \
nano net-tools awscli

# install azure-cli
RUN curl -sL https://packages.microsoft.com/keys/microsoft.asc | gpg --dearmor | tee /etc/apt/keyrings/microsoft.gpg > /dev/null
RUN echo "deb [arch=`dpkg --print-architecture` signed-by=/etc/apt/keyrings/microsoft.gpg] https://packages.microsoft.com/repos/azure-cli/ `lsb_release -cs` main" | tee /etc/apt/sources.list.d/azure-cli.list
RUN apt-get update && apt-get install -y azure-cli

# install Postgres
RUN sh -c 'echo "deb https://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
RUN wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | apt-key add -
2 changes: 2 additions & 0 deletions .devcontainer/create-test-buckets.sh
@@ -1,3 +1,5 @@
#!/bin/bash

aws --endpoint-url http://localhost:9000 s3 mb s3://$AWS_S3_TEST_BUCKET

az storage container create -n $AZURE_TEST_CONTAINER_NAME --connection-string $AZURE_STORAGE_CONNECTION_STRING
17 changes: 16 additions & 1 deletion .devcontainer/docker-compose.yml
@@ -10,13 +10,16 @@ services:
- ${USERPROFILE}${HOME}/.ssh:/home/rust/.ssh:ro
- ${USERPROFILE}${HOME}/.ssh/known_hosts:/home/rust/.ssh/known_hosts:rw
- ${USERPROFILE}${HOME}/.gitconfig:/home/rust/.gitconfig:ro
- ${USERPROFILE}${HOME}/.aws:/home/rust/.aws:ro
- ${USERPROFILE}${HOME}/.aws:/home/rust/.aws:rw
- ${USERPROFILE}${HOME}/.azure:/home/rust/.azure:rw

env_file:
- .env
cap_add:
- SYS_PTRACE
depends_on:
- minio
- azurite

minio:
image: minio/minio
@@ -30,3 +33,15 @@ services:
interval: 6s
timeout: 2s
retries: 3

azurite:
image: mcr.microsoft.com/azure-storage/azurite
env_file:
- .env
network_mode: host
restart: unless-stopped
healthcheck:
test: ["CMD", "curl", "http://localhost:10000"]
interval: 6s
timeout: 2s
retries: 3
16 changes: 16 additions & 0 deletions .github/workflows/ci.yml
@@ -85,6 +85,11 @@ jobs:
postgresql-client-${{ env.PG_MAJOR }} \
libpq-dev
- name: Install azure-cli
run: |
curl -sL https://packages.microsoft.com/keys/microsoft.asc | gpg --dearmor | sudo tee /etc/apt/keyrings/microsoft.gpg > /dev/null
echo "deb [arch=`dpkg --print-architecture` signed-by=/etc/apt/keyrings/microsoft.gpg] https://packages.microsoft.com/repos/azure-cli/ `lsb_release -cs` main" | sudo tee /etc/apt/sources.list.d/azure-cli.list
sudo apt-get update && sudo apt-get install -y azure-cli
- name: Install and configure pgrx
run: |
@@ -116,6 +121,17 @@ jobs:
aws --endpoint-url http://localhost:9000 s3 mb s3://$AWS_S3_TEST_BUCKET
- name: Start Azurite for Azure Blob Storage emulator tests
run: |
docker run -d --env-file .devcontainer/.env -p 10000:10000 mcr.microsoft.com/azure-storage/azurite
while ! nc -z localhost 10000; do
echo "Waiting for localhost:10000..."
sleep 1
done
az storage container create -n $AZURE_TEST_CONTAINER_NAME --connection-string $AZURE_STORAGE_CONNECTION_STRING
- name: Run tests
run: |
# Run tests with coverage tool
38 changes: 38 additions & 0 deletions Cargo.lock

Generated file; diff not rendered.

4 changes: 3 additions & 1 deletion Cargo.toml
@@ -26,7 +26,8 @@ arrow-schema = {version = "53", default-features = false}
aws-config = { version = "1.5", default-features = false, features = ["rustls"]}
aws-credential-types = {version = "1.2", default-features = false}
futures = "0.3"
object_store = {version = "0.11", default-features = false, features = ["aws"]}
home = "0.5"
object_store = {version = "0.11", default-features = false, features = ["aws", "azure"]}
once_cell = "1"
parquet = {version = "53", default-features = false, features = [
"arrow",
@@ -38,6 +39,7 @@ parquet = {version = "53", default-features = false, features = [
"object_store",
]}
pgrx = "=0.12.8"
rust-ini = "0.21"
tokio = {version = "1", default-features = false, features = ["rt", "time", "macros"]}
url = "2"

38 changes: 34 additions & 4 deletions README.md
@@ -156,7 +156,13 @@ SELECT uri, encode(key, 'escape') as key, encode(value, 'escape') as value FROM
```

## Object Store Support
`pg_parquet` supports reading and writing Parquet files from/to `S3` object store. Only the uris with `s3://` scheme is supported.
`pg_parquet` supports reading and writing Parquet files from/to `S3` and `Azure Blob Storage` object stores.

> [!NOTE]
> To be able to write into an object store location, you need to grant `parquet_object_store_write` role to your current postgres user.
> Similarly, to read from an object store location, you need to grant `parquet_object_store_read` role to your current postgres user.
#### S3 Storage

The simplest way to configure object storage is by creating the standard `~/.aws/credentials` and `~/.aws/config` files:

@@ -179,9 +185,33 @@ Alternatively, you can use the following environment variables when starting pos
- `AWS_CONFIG_FILE`: an alternative location for the config file
- `AWS_PROFILE`: the name of the profile from the credentials and config file (default profile name is `default`)

> [!NOTE]
> To be able to write into an object store location, you need to grant `parquet_object_store_write` role to your current postgres user.
> Similarly, to read from an object store location, you need to grant `parquet_object_store_read` role to your current postgres user.
Supported S3 uri formats are shown below:
- s3:// \<bucket\> / \<path\>
- s3a:// \<bucket\> / \<path\>
- https:// \<bucket\>.s3.amazonaws.com / \<path\>
- https:// s3.amazonaws.com / \<bucket\> / \<path\>

#### Azure Blob Storage

The simplest way to configure object storage is by creating the standard [`~/.azure/config`](https://learn.microsoft.com/en-us/cli/azure/azure-cli-configuration?view=azure-cli-latest) file:

```bash
$ cat ~/.azure/config
[storage]
account = devstoreaccount1
key = Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==
```

Alternatively, you can use the following environment variables when starting postgres to configure the Azure Blob Storage client:
- `AZURE_STORAGE_ACCOUNT`: the storage account name of the Azure Blob
- `AZURE_STORAGE_KEY`: the storage key of the Azure Blob
- `AZURE_STORAGE_SAS_TOKEN`: the storage SAS token for the Azure Blob
- `AZURE_CONFIG_FILE`: an alternative location for the config file

Supported Azure Blob Storage uri formats are shown below:
- az:// \<container\> / \<path\>
- azure:// \<container\> / \<path\>
- https:// \<account\>.blob.core.windows.net / \<container\> / \<path\>

## Copy Options
`pg_parquet` supports the following options in the `COPY TO` command: