Add support to Delta files stored in ADLS Gen2 #21
Comments
Did you install and load the duckdb Azure extension? |
Hello, Yes. I can read .csv files in Azure, but I can use |
just had a look at the code - looks like support for looking forward for this to be available |
if I got it correctly...
Then, the I'd love to help myself, but I fall short of C++ skills here |
I'm having the same problem. Please add support for Azure. |
@jegranado I tried to hack it in, then built and loaded it, but it doesn't change anything. Do you have an idea? |
I'm having the same problem. It works well if I directly query a Parquet file within the Delta table folder, but it gives me the same error when I use delta_scan |
Hi, I tried this with GCS path "gs://" ... And indeed it does not work, while read_parquet works with the same credentials. However, the error is very strange.
|
I implemented Azure functionality, but I'm not sure if that is the right way. It uses the DuckDB Azure plugin and delta-rs blob authentication. It's complicated, as they both require different Azure parameters. But it's working on Azurite and with Blob access via CLI and connection string/access keys. |
@nfoerster2 Thanks for that implementation. While I agree with you that this is not ideal due to the complexity of using both DuckDB's filesystem and the Kernel's internal filesystems, I think it's a good idea to merge this. The alternative would be to wait for delta-kernel-rs to support letting DuckDB fully handle all filesystem ops, but that could take a little while still, and having some Azure auth methods working seems like low-hanging fruit in the meantime. I would propose to PR your code into the delta extension; I will add some CI jobs based on Azurite and Minio to ensure this can actually be tested. Feel free to open a PR with your code, I can review and merge it. Otherwise I will open one with your changes and some testing later this week. |
Sure, I can create a PR. I think there are still some bugs. I added two test cases which are working fine. However, I also tested some production data: it's a Delta Lake with around 1 TB of data and two layers of partitioning (serial number SN as string and YYYYMM as int), so the pattern for one file of the Delta Lake is partition_sn_yyymm_i5m_v15-3/SN=ZZZZ555/yyyymm=202406/blah.parquet, and it fails during partition discovery. I added it below, but anonymized the data. The serial number column SN is a string; however, it tries to interpret it as an int. I think more complex tests are needed.
from delta_log: Then, if I filter only by the second level, it takes a huge amount of time. I think it does not push the predicate down to the partitions correctly:
|
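To make the partition layout described above concrete, here is a minimal Python sketch that splits a Hive-style path into column/value pairs and keeps every value as a string, so a value such as `ZZZZ555` is never coerced to an integer. This is only an illustration, not the extension's partition discovery code; the function name `parse_partition_path` is hypothetical, and the example path is the anonymized one from the comment.

```python
def parse_partition_path(path: str) -> dict[str, str]:
    """Collect key=value directory segments from a Hive-style path.

    Hypothetical helper for illustration only. Values are kept as strings;
    the column type should come from the table schema, not from guessing.
    """
    partitions: dict[str, str] = {}
    for segment in path.split("/"):
        key, sep, value = segment.partition("=")
        if sep and key:  # only segments of the form key=value
            partitions[key] = value
    return partitions


example = "partition_sn_yyymm_i5m_v15-3/SN=ZZZZ555/yyyymm=202406/blah.parquet"
print(parse_partition_path(example))
```

Whether a value like `202406` should then be cast to an int depends on the column type declared in the table schema, not on how the value looks.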
PR is up #39 |
Very nice @samansmink . Been looking forward to this. 😊 |
I'm also encountering this error. @samansmink thanks for the fix, is it in the nightly version of the extension? I'm still getting the error on nightly, so I'm guessing no? |
@dennis-barrett could you run you should see something like:
|
I am getting:
Delta extension version is |
Try without the account and dfs.core.windows.net. Just use abfss://container/path/to/table. Make sure you created an Azure secret with the respective storage account name. |
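The path rewrite suggested above can be sketched in Python as follows. This is a rough illustration under one assumption: that the fully qualified form looks like `abfss://<account>.dfs.core.windows.net/<container>/<path>`. The helper name `shorten_abfss_path` is hypothetical and not part of DuckDB or the delta extension. The returned account name is the one the Azure secret would need to carry.

```python
from urllib.parse import urlsplit

# Host suffix of fully qualified ADLS Gen2 endpoints.
ADLS_SUFFIX = ".dfs.core.windows.net"


def shorten_abfss_path(url: str):
    """Return (account_name, short_url) for a fully qualified abfss URL.

    Hypothetical helper. Assumes the input looks like
    abfss://<account>.dfs.core.windows.net/<container>/<path>.
    Any other input is returned unchanged with account_name set to None.
    """
    parts = urlsplit(url)
    if parts.scheme != "abfss" or not parts.netloc.endswith(ADLS_SUFFIX):
        return None, url
    account = parts.netloc[: -len(ADLS_SUFFIX)]
    # Drop the host so the container becomes the first component.
    return account, "abfss://" + parts.path.lstrip("/")


account, short = shorten_abfss_path(
    "abfss://myaccount.dfs.core.windows.net/container/path/to/table"
)
print(account)
print(f"SELECT * FROM delta_scan('{short}');")
```

The printed statement uses the short form; the account name goes into the secret's ACCOUNT_NAME instead of the URL.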
It worked! Thanks! |
@samansmink yep I've got the latest nightly. @mrjsj's solution worked for me though, thanks! |
this is now supported with #39 |
Hm ... it does not work for me:
CREATE SECRET az (
TYPE AZURE,
PROVIDER CREDENTIAL_CHAIN,
ACCOUNT_NAME '<MY_ACCOUNT_NAME>'
);
SELECT * FROM delta_scan('az://<CONTAINER_NAME>/<DELTA_PATH>');
My Extensions:
Any ideas? :/ (I am able to read parquet files from blob storage by using the set secret) |
Can you try to set CHAIN as well? Are you trying to access by env or cli? |
Yep, this was the mistake! Using |
select * FROM delta_scan('abfss://<account>.dfs.core.windows.net/<path>/<delta_table>/');
results in:
IOException: IO Error: Hit DeltaKernel FFI error (from: get_default_client in DeltaScanScanBind): Hit error: 5 (GenericError) with message (Generic delta kernel error: Error interacting with object store: Generic parse_url error: feature for MicrosoftAzure not enabled)