
gdubya/duckdb_delta

 
 


DuckDB Delta Extension

This is the experimental DuckDB extension for Delta. It is built on the (also experimental) Delta Kernel. The extension currently offers read support for Delta tables, both local and remote.

Supported platforms

The supported platforms are:

  • linux_amd64 and linux_amd64_gcc4 and linux_arm64
  • osx_amd64 and osx_arm64
  • windows_amd64

Support for the other DuckDB platforms is a work in progress.

How to use

Note

This extension requires DuckDB v0.10.3 or higher.

This extension is distributed as a binary extension. To use it, simply use one of its functions from DuckDB and the extension will be autoloaded:

FROM delta_scan('s3://some/delta/table');

To scan a local table, use the full path prefixed with file://:

FROM delta_scan('file:///some/path/on/local/machine');
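If autoloading is disabled or unavailable in your environment, the extension can also be installed and loaded explicitly. The sketch below assumes the extension is published under the name `delta`; the table path is the same placeholder used above.

```sql
-- Optional: install and load explicitly instead of relying on autoloading.
-- Assumes the extension is available under the name 'delta'.
INSTALL delta;
LOAD delta;

-- Count the rows of a local Delta table (placeholder path).
SELECT count(*) AS row_count
FROM delta_scan('file:///some/path/on/local/machine');

-- Inspect the column names and types that the scan exposes.
DESCRIBE SELECT * FROM delta_scan('file:///some/path/on/local/machine');
```

Because delta_scan is a table function, it can be used anywhere a table can appear in a query, including joins and subqueries.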

Cloud Storage authentication

DuckDB Secrets are supported for cloud storage authentication.

S3 Example

CREATE SECRET (
  TYPE S3,
  PROVIDER CREDENTIAL_CHAIN
);
FROM delta_scan('s3://some/delta/table/with/auth');
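When a credential chain is not available, a secret with explicit keys can be declared instead. The following is a minimal sketch: the secret name `my_delta_s3`, the key values, and the region are all hypothetical placeholders, and the scope assumes the bucket from the example URL above.

```sql
-- Hypothetical named secret with explicit credentials (placeholder values).
-- SCOPE limits the secret to paths under the given prefix.
CREATE SECRET my_delta_s3 (
    TYPE S3,
    KEY_ID 'my_key_id',
    SECRET 'my_secret_key',
    REGION 'us-east-1',
    SCOPE 's3://some'
);

-- List the secrets known to the current session to confirm registration.
FROM duckdb_secrets();
```

Scoping a secret is useful when different buckets need different credentials, since DuckDB picks the secret whose scope best matches the path being read.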

Azure Example

CREATE SECRET (
    TYPE AZURE,
    PROVIDER CREDENTIAL_CHAIN,
    CHAIN 'cli',
    ACCOUNT_NAME 'mystorageaccount'
);
FROM delta_scan('abfss://some/delta/table/with/auth');

GCS Example

You need to create HMAC keys and declare a secret. See https://duckdb.org/docs/guides/network_cloud_storage/gcs_import.html for details.

CREATE SECRET (
    TYPE GCS,
    KEY_ID 'xxxx',
    SECRET 'yyy'
);

Features

While still experimental, many (scanning) features/optimizations are already supported in this extension as it reuses most of DuckDB's regular parquet scanning logic:

  • multithreaded scans and parquet metadata reading
  • data skipping/filter pushdown
    • skipping row-groups in file (based on parquet metadata)
    • skipping complete files (based on delta partition info)
  • projection pushdown
  • scanning tables with deletion vectors
  • all primitive types
  • structs
  • Cloud storage (AWS, Azure, GCP) support with secrets
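As an illustration of how these optimizations might come into play, consider the query below. It assumes a hypothetical table partitioned by `event_date` with columns `user_id` and `amount`; none of these names come from this repository.

```sql
-- Hypothetical schema: partition column event_date, data columns user_id and amount.
-- EXPLAIN prints the query plan without executing the scan.
EXPLAIN
SELECT user_id, amount                      -- projection: only these columns are requested
FROM delta_scan('file:///some/path/on/local/machine')
WHERE event_date = '2024-01-01'             -- partition filter: candidate for skipping whole files
  AND amount > 100;                         -- column filter: candidate for skipping row groups
```

In the resulting plan, the filters and projected columns would be expected to appear on the scan operator itself rather than in separate operators above it, though the exact plan output depends on the DuckDB version.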

More features coming soon!

Building

See the Extension Template for generic build instructions.

Running tests

There are various tests available for the delta extension:

  1. Delta Acceptance Test (DAT) based tests in /test/sql/dat
  2. delta-kernel-rs based tests in /test/sql/delta_kernel_rs
  3. Generated data based tests in tests/sql/generated (generated using delta-rs, PySpark, and DuckDB)

To run the first 2 sets of tests:

make test_debug

or, in release mode:

make test

To also run the tests on generated data:

make generate-data
GENERATED_DATA_AVAILABLE=1 make test
