AzureStorageFileSystem Directory Exists not implemented #50

Open
patialashahi31 opened this issue Mar 8, 2024 · 8 comments

@patialashahi31

What happens?

duckdb.duckdb.NotImplementedException: Not implemented Error: AzureStorageFileSystem: DirectoryExists is not implemented!

This happens while copying a DuckDB table to Azure.

To Reproduce

Copying the table to Azure reproduces the error.
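
The report does not include the exact statement; a hypothetical minimal reproduction (connection string and container are placeholders, and the partitioned COPY mirrors the one shown later in this thread) might look like:

import duckdb

con = duckdb.connect()
con.install_extension("azure")
con.load_extension("azure")

# Placeholder connection string for the target account
con.sql("CREATE SECRET az_secret (TYPE AZURE, CONNECTION_STRING '<connection-string>');")

con.sql("CREATE TABLE t AS SELECT 2019 AS year, 1 AS month, 42 AS value;")

# The partitioned write has to check/create target directories and fails with:
# duckdb.duckdb.NotImplementedException: Not implemented Error:
# AzureStorageFileSystem: DirectoryExists is not implemented!
con.sql("COPY t TO 'az://<container>/t' (FORMAT PARQUET, PARTITION_BY (year, month));")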

OS:

Ubuntu

DuckDB Version:

0.10.0

DuckDB Client:

Python

Full Name:

Tejinderpal Singh

Affiliation:

Atlan

Have you tried this on the latest nightly build?

I have not tested with any build

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • Yes, I have
@patialashahi31 patialashahi31 changed the title AzureFileSystem Directory Exists not implemented AzureStorageFileSystem Directory Exists not implemented Mar 8, 2024
@szarnyasg szarnyasg transferred this issue from duckdb/duckdb Mar 11, 2024
@quentingodeau
Contributor

Hello, yes, at the moment some features are not yet available.
This one, for example, is not implemented because the method's signature does not include the FileOpener, so we cannot access the context information that the extension requires. I will see if I can make that change.
That said, the notion of a directory doesn't really make sense for a blob storage account. It does for DFS, but for blob I think it will always return false :(

@quentingodeau
Contributor

Hello, to keep you updated on this issue: the long story is available here; the short one is that the DuckDB team will change the API of the DuckDB FileSystem class, which affects a lot of extensions. It will take some time, but it will arrive :)

@shaunakv1

I am getting the same error trying to write a Hive-partitioned GeoParquet to Azure Blob Storage. Is this currently not possible, or am I missing something?

write_query = f"""
    COPY
        (
            SELECT *,
                    ST_Point(longitude, latitude) AS geom,
                    year(base_date_time) AS year,
                    month(base_date_time) AS month
            FROM read_csv('az://ais/ais2019/csv2/ais-2019-01-*.csv.zst', ignore_errors = true)
        )
    TO 'abfs://ais/parquet' (
            FORMAT PARQUET, 
            COMPRESSION ZSTD, 
            ROW_GROUP_SIZE 122_880, 
            PARTITION_BY (year, month)
    );
"""

@samansmink
Collaborator

Azure writes are not yet supported, unfortunately.

@shaunakv1

@samansmink this comment and the following one on another issue made it seem like it works; that's what got me confused.

#44 (comment)

@shaunakv1

@samansmink In the meantime, I am considering using rclone: first generate the Hive-partitioned Parquet locally, then sync it over. However, we are working with many TBs of data that we have to keep updated.

Is there any way, while writing the Hive partitions locally, to get progress or a callback as each partition is written so I can sync just that partition over? In theory I could sync the entire directory structure, but given the volume of data I will never have the entire Hive tree locally (space constraints). Here's what I want to achieve (see the sketch after this list):

  1. Write partition ( CSVs are in glob pattern and one can generate multiple parquet files)
  2. Sync it over
  3. Delete it from local
  4. Loop to next partition write
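
A rough sketch of that loop, assuming DuckDB writes one partition at a time locally and rclone pushes it (the CSV glob, local path, connection string, and rclone remote name are all placeholders):

import duckdb
import shutil
import subprocess
from pathlib import Path

con = duckdb.connect()
con.install_extension("azure")
con.load_extension("azure")
# Placeholder connection string for the source account
con.sql("CREATE SECRET ais_src (TYPE AZURE, CONNECTION_STRING '<connection-string>');")

local_root = Path("/tmp/ais_parquet")
# Hypothetical list of (year, month) partitions, processed one at a time
partitions = [(2019, month) for month in range(1, 13)]

for year, month in partitions:
    part_dir = local_root / f"year={year}" / f"month={month}"
    part_dir.mkdir(parents=True, exist_ok=True)

    # 1. Write just this partition locally
    con.sql(f"""
        COPY (
            SELECT *
            FROM read_csv('az://ais/ais2019/csv2/ais-{year}-{month:02d}-*.csv.zst',
                          ignore_errors = true)
        ) TO '{part_dir}/data.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);
    """)

    # 2. Sync it over (the rclone remote name "azremote" is a placeholder)
    subprocess.run(
        ["rclone", "sync", str(part_dir),
         f"azremote:ais/parquet/year={year}/month={month}"],
        check=True,
    )

    # 3. Delete it from local before moving on to the next partition
    shutil.rmtree(part_dir)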

@samansmink
Collaborator

@shaunakv1 the comment you link uses fsspec, which is separate from the DuckDB Azure extension and is Python-only.

@shaunakv1

shaunakv1 commented Jan 10, 2025

@samansmink I am using the same. Here's my full code and I still get the same error:

import duckdb
from dotenv import load_dotenv
import os
from fsspec import filesystem

load_dotenv()

AIS_SRC_CONNECTION_STRING = os.getenv("AIS_SRC_CONNECTION_STRING")
AIS_DEST_CONNECTION_STRING = os.getenv("AIS_DEST_CONNECTION_STRING")

duckdb.register_filesystem(
    filesystem("abfs", connection_string=AIS_DEST_CONNECTION_STRING)
)
con = duckdb.connect()

con.install_extension("azure")
con.load_extension("azure")

con.install_extension("spatial")
con.load_extension("spatial")

con.install_extension("h3", repository="community")
con.load_extension("h3")


### Create secret
create_secret = f"""    
    CREATE SECRET ais_src (
    TYPE AZURE,
    CONNECTION_STRING '{AIS_SRC_CONNECTION_STRING}'
    );
"""
con.sql(create_secret)

### configure Duckdb performance params for azure
con.sql("SET azure_http_stats = true;")
con.sql("SET azure_read_transfer_concurrency = 8;")
con.sql("SET azure_read_transfer_chunk_size = 1_048_576;")
con.sql("SET azure_read_buffer_size = 1_048_576;")

count_query = f"""
    SELECT *
    FROM 'az://<redacted>/ais-2019-01-01.csv.zst'
    LIMIT 10
"""
con.sql(count_query).show()

print(f"Writing to parquet...")

write_query = f"""
    COPY
        (
            SELECT *,
                    ST_Point(longitude, latitude) AS geom,
                    year(base_date_time) AS year,
                    month(base_date_time) AS month
            FROM read_csv('az://<redacted>/ais-2019-01-*.csv.zst', ignore_errors = true)
        )
    TO 'abfs://ais/parquet' (
            FORMAT PARQUET, 
            COMPRESSION ZSTD, 
            ROW_GROUP_SIZE 122_880, 
            PARTITION_BY (year, month)
    );
"""

con.sql(write_query).show()
