Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support to Delta files stored in ADLS Gen2 #21

Closed
jegranado opened this issue May 28, 2024 · 23 comments
Closed

Add support to Delta files stored in ADLS Gen2 #21

jegranado opened this issue May 28, 2024 · 23 comments

Comments

@jegranado
Copy link

select * FROM delta_scan('abfss://<account>.dfs.core.windows.net/<path>/<delta_table>/');

results in:
IOException: IO Error: Hit DeltaKernel FFI error (from: get_default_client in DeltaScanScanBind): Hit error: 5 (GenericError) with message (Generic delta kernel error: Error interacting with object store: Generic parse_url error: feature for MicrosoftAzure not enabled)

@mrjsj
Copy link

mrjsj commented May 29, 2024

Did you install and load the duckdb Azure extension?

@jegranado
Copy link
Author

Hello,

Yes. I can read .csv files in Azure, but delta_scan fails with the message above.

I can use delta_scan with local Delta tables.

@jegranado
Copy link
Author

just had a look at the code - looks like support for s3:// is hardcoded, so I suspect support for abfss:// would have to follow a similar path.

looking forward for this to be available

@jegranado
Copy link
Author

if I got it correctly...

static ffi::EngineBuilder* CreateBuilder have to be changed to add the abfs:// possiblity by adding that condition to the path parser.

Then, the builder would have to set specific attributes if Azure type secrets exists.
Looking at the duckdb_azure extension codebase, at least the connection_string would have to be setup (depending on which type of Azure secret was found)

I'd love to help myself, but I fall short of C++ skills here

@nfoerster2
Copy link
Contributor

I'm having the same problem, please add the support for azure.

@nfoerster2
Copy link
Contributor

@jegranado I tried to hack it into, build and loaded, however it doesn't change anything, do you have an idea?

https://github.com/nfoerster2/duckdb_delta/blob/a2ddb6c7d65cbae8e7466a4d35c315b2c34b8799/src/functions/delta_scan.cpp#L68

@sid2911
Copy link

sid2911 commented Jun 14, 2024

I'am having the same problem. It works well if I directly query a parquet file within delta table folder, however gives me the same error when I use delta_scan

@fibigerg
Copy link

fibigerg commented Jun 23, 2024

Hi, I tried this with GCS path "gs://" ... And indeed it does not work, while read_parquet works with the same credentials.

However, the error is very strange.

duckdb.duckdb.IOException: IO Error: Hit DeltaKernel FFI error (from: While trying to read from delta table: 'gs://my-bucket/my-table'): Hit error: 8 (ObjectStoreError) with message (Error interacting with object store: Generic GCS error: Error performing token request: Error after 10 retries in 6.122011584s, max_retries:10, retry_timeout:180s, source:error sending request for url (http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token?audience=https%3A%2F%2Fwww.googleapis.com%2Foauth2%2Fv4%2Ftoken): error trying to connect: tcp connect error: Host is down (os error 64))

@nfoerster2
Copy link
Contributor

if I got it correctly...

static ffi::EngineBuilder* CreateBuilder have to be changed to add the abfs:// possiblity by adding that condition to the path parser.

Then, the builder would have to set specific attributes if Azure type secrets exists. Looking at the duckdb_azure extension codebase, at least the connection_string would have to be setup (depending on which type of Azure secret was found)

I'd love to help myself, but I fall short of C++ skills here

I implemented Azure functionality, however I'm not sure if that is the right way. It uses duckdb Azure plugin and delta rs blob authentication, its complicated as they are both requiring different azure parameters. But its working on Azurite and with Blob access by CLI and connection string/access keys.

https://github.com/nfoerster2/duckdb_delta

@samansmink
Copy link
Collaborator

@nfoerster2 Thanks for that implementation. While I agree with you that this is not ideal due to complexity of using both DuckDB's filesystem and the Kernel's internal filesystems, I think its a good idea to merge this. The alternative would be to wait for delta-kernel-rs to support letting DuckDB fully handle all filesystem ops, but that could take a little while still and having some azure auth methods working seems like low hanging fruit in the mean time

I would propose to PR your code into the delta extension, I will add some CI jobs based on Azurite and Minio to ensure this can actually be tested. Feel free to open a PR with your code, I can review and merge it. Otherwise I will open one with your changes and some testing later this week.

@nfoerster2
Copy link
Contributor

Sure, I can create a PR. I think there are still some bugs. I added two test cases which are working fine, however I also tested some productive data, its a deltalake with around 1TB data and two layers of partitioning (Serialnumber SN as string and YYYYMM as int, so the pattern for one file of deltalake is partition_sn_yyymm_i5m_v15-3/SN=ZZZZ555/yyyymm=202406/blah.parquet, and it failes during partition discovery. I added below but anonymized the data. The Serialnumber column SN is a string, however it tries to interpret it as an int. I think more complex tests are needed.

D SELECT
      *
  FROM
      delta_scan('az://deltalake/delta/k8s/partition_sn_yyymm_i5m_v15-3/')
  WHERE SN='XYZ1234';

Invalid Input Error: Failed to cast value: Could not convert string 'ZZZZ555' to INT32

from delta_log:
{"add":{"path":"SN=8XYZ1337/yyyymm=202405/part-00001-d88a25a9-a3f8-4360-9b12-2e737820fa16-c000.zstd.parquet","partitionValues":{"SN":"8XYZ1337","yyyymm":"202405"},"size":...

Then if just go by second level it takes a huge amount of time, I think it does not push to predicate to the partitions correctly:

SELECT
      SN
  FROM
      delta_scan('az://deltalake/delta/k8s/partition_sn_yyymm_interval_5m/')
  where yyyymm=202212;

@samansmink
Copy link
Collaborator

PR is up #39

@mrjsj
Copy link

mrjsj commented Jun 28, 2024

Very nice @samansmink . Been looking forward to this. 😊

@dennis-barrett
Copy link

I'm also encountering this error. @samansmink thanks for the fix, is it in the nightly version of the extension? I'm still getting the error on nightly, so I'm guessing no?

@samansmink
Copy link
Collaborator

@dennis-barrett could you run update extensions in duckdb to confirm you are on latest nightly?

you should see something like:

┌────────────────┬──────────────┬──────────────────────┬──────────────────┬─────────────────┐
│ extension_name │  repository  │    update_result     │ previous_version │ current_version │
│    varchar     │   varchar    │       varchar        │     varchar      │     varchar     │
├────────────────┼──────────────┼──────────────────────┼──────────────────┼─────────────────┤
│ delta          │ core_nightly │ UPDATED              │ 82d2003          │ a392ea5         │
├────────────────┴──────────────┴──────────────────────┴──────────────────┴─────────────────┤
│ 1 row                                                                           5 columns │
└───────────────────────────────────────────────────────────────────────────────────────────┘

@kpapadatos
Copy link

kpapadatos commented Jul 10, 2024

I am getting:

[Error: IO Error: Hit DeltaKernel FFI error (from: While trying to read from delta table: 'abfss://<account>.dfs.core.windows.net/<path>/<delta_table>/'): Hit error: 8 (ObjectStoreError) with message (Error interacting with object store: Generic MicrosoftAzure error: URL did not match any known pattern for scheme: abfss://<account>.dfs.core.windows.net/<path>/<delta_table>/)]

Delta extension version is a392ea5 (installed from core_nightly).

@mrjsj
Copy link

mrjsj commented Jul 10, 2024

I am getting:

`

[Error: IO Error: Hit DeltaKernel FFI error (from: While trying to read from delta table: 'abfss://.dfs.core.windows.net//<delta_table>/'): Hit error: 8 (ObjectStoreError) with message (Error interacting with object store: Generic MicrosoftAzure error: URL did not match any known pattern for scheme: abfss://.dfs.core.windows.net//<delta_table>/)]

`

Delta extension version is a392ea5 (installed from core_nightly).

Try without account and dfs.core.windows.net.

Just use abfss://container/path/to/table

Make sure you created an Azure secret with the respective storage account name.

@kpapadatos
Copy link

I am getting:
[Error: IO Error: Hit DeltaKernel FFI error (from: While trying to read from delta table: 'abfss://.dfs.core.windows.net//<delta_table>/'): Hit error: 8 (ObjectStoreError) with message (Error interacting with object store: Generic MicrosoftAzure error: URL did not match any known pattern for scheme: abfss://.dfs.core.windows.net//<delta_table>/)]
Delta extension version is a392ea5 (installed from core_nightly).

Try without account and dfs.core.windows.net.

Just use abfss://container/path/to/table

Make sure you created an Azure secret with the respective storage account name.

It worked! Thanks!

@dennis-barrett
Copy link

@samansmink yep I've got the latest nightly. @mrjsj's solution worked for me though, thanks!

@samansmink
Copy link
Collaborator

this is now supported with #39

@JangasCodingplace
Copy link

Hm ... it does not work for me:

CREATE SECRET az (
    TYPE AZURE,
    PROVIDER CREDENTIAL_CHAIN,
    ACCOUNT_NAME '<MY_ACCOUNT_NAME>'
);

SELECT * FROM delta_scan('az://<CONTAINER_NAME>/<DELTA_PATH>');

duckdb.duckdb.IOException: IO Error: Hit DeltaKernel FFI error (from: While trying to read from delta table: 'az://<CONTAINER_NAME>/<DELTA_PATH>/'): Hit error: 8 (ObjectStoreError) with message (Error interacting with object store: Generic MicrosoftAzure error: Error performing token request: Error after 10 retries in 7.158766792s, max_retries:10, retry_timeout:180s,

My Extensions:

┌────────────────┬──────────────┬─────────────────────┬──────────────────┬─────────────────┐
│ extension_name │  repository  │    update_result    │ previous_version │ current_version │
│    varchar     │   varchar    │       varchar       │     varchar      │     varchar     │
├────────────────┼──────────────┼─────────────────────┼──────────────────┼─────────────────┤
│ uc_catalog     │ core_nightly │ NO_UPDATE_AVAILABLE │ fc83d77          │ fc83d77         │
│ delta          │ core         │ NO_UPDATE_AVAILABLE │ f71402e          │ f71402e         │
│ azure          │ core         │ NO_UPDATE_AVAILABLE │ a40ecb7          │ a40ecb7         │
│ motherduck     │ core         │ REDOWNLOADED        │                  │                 │
└────────────────┴──────────────┴─────────────────────┴──────────────────┴─────────────────┘

Any ideas? :/

(I am able to read parquet files from blob storage by using the set secret)

@nfoerster2
Copy link
Contributor

Hm ... it does not work for me:

CREATE SECRET az (
    TYPE AZURE,
    PROVIDER CREDENTIAL_CHAIN,
    ACCOUNT_NAME '<MY_ACCOUNT_NAME>'
);

SELECT * FROM delta_scan('az://<CONTAINER_NAME>/<DELTA_PATH>');

duckdb.duckdb.IOException: IO Error: Hit DeltaKernel FFI error (from: While trying to read from delta table: 'az://<CONTAINER_NAME>/<DELTA_PATH>/'): Hit error: 8 (ObjectStoreError) with message (Error interacting with object store: Generic MicrosoftAzure error: Error performing token request: Error after 10 retries in 7.158766792s, max_retries:10, retry_timeout:180s,

My Extensions:

┌────────────────┬──────────────┬─────────────────────┬──────────────────┬─────────────────┐
│ extension_name │  repository  │    update_result    │ previous_version │ current_version │
│    varchar     │   varchar    │       varchar       │     varchar      │     varchar     │
├────────────────┼──────────────┼─────────────────────┼──────────────────┼─────────────────┤
│ uc_catalog     │ core_nightly │ NO_UPDATE_AVAILABLE │ fc83d77          │ fc83d77         │
│ delta          │ core         │ NO_UPDATE_AVAILABLE │ f71402e          │ f71402e         │
│ azure          │ core         │ NO_UPDATE_AVAILABLE │ a40ecb7          │ a40ecb7         │
│ motherduck     │ core         │ REDOWNLOADED        │                  │                 │
└────────────────┴──────────────┴─────────────────────┴──────────────────┴─────────────────┘

Any ideas? :/

(I am able to read parquet files from blob storage by using the set secret)

Can you try to set CHAIN as well? Are you trying to access by env or cli?

@JangasCodingplace
Copy link

Hm ... it does not work for me:

CREATE SECRET az (
    TYPE AZURE,
    PROVIDER CREDENTIAL_CHAIN,
    ACCOUNT_NAME '<MY_ACCOUNT_NAME>'
);

SELECT * FROM delta_scan('az://<CONTAINER_NAME>/<DELTA_PATH>');

duckdb.duckdb.IOException: IO Error: Hit DeltaKernel FFI error (from: While trying to read from delta table: 'az://<CONTAINER_NAME>/<DELTA_PATH>/'): Hit error: 8 (ObjectStoreError) with message (Error interacting with object store: Generic MicrosoftAzure error: Error performing token request: Error after 10 retries in 7.158766792s, max_retries:10, retry_timeout:180s,

My Extensions:

┌────────────────┬──────────────┬─────────────────────┬──────────────────┬─────────────────┐
│ extension_name │  repository  │    update_result    │ previous_version │ current_version │
│    varchar     │   varchar    │       varchar       │     varchar      │     varchar     │
├────────────────┼──────────────┼─────────────────────┼──────────────────┼─────────────────┤
│ uc_catalog     │ core_nightly │ NO_UPDATE_AVAILABLE │ fc83d77          │ fc83d77         │
│ delta          │ core         │ NO_UPDATE_AVAILABLE │ f71402e          │ f71402e         │
│ azure          │ core         │ NO_UPDATE_AVAILABLE │ a40ecb7          │ a40ecb7         │
│ motherduck     │ core         │ REDOWNLOADED        │                  │                 │
└────────────────┴──────────────┴─────────────────────┴──────────────────┴─────────────────┘

Any ideas? :/
(I am able to read parquet files from blob storage by using the set secret)

Can you try to set CHAIN as well? Are you trying to access by env or cli?

Yep, this was the mistake!
I was expecting that default would be sufficient for reading delta files (since it works on parquet files as well). But it's not.

Using CHAIN 'cli'; fixed that problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

9 participants