Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Duckdb iceberg_scan can read directly from metadata.json #1

Closed
kevinjqliu opened this issue Jul 8, 2024 · 4 comments
Closed

Duckdb iceberg_scan can read directly from metadata.json #1

kevinjqliu opened this issue Jul 8, 2024 · 4 comments

Comments

@kevinjqliu
Copy link

Thanks for the great article for Iceberg + DuckDB https://www.definite.app/blog/iceberg-query-engine

I just want to mention that Duckdb's iceberg_scan can read Iceberg table using its metadata.json file directly.
Something like

metadata_location = table['metadata_location']
f"""
CREATE VIEW taxi.trips AS
SELECT * FROM iceberg_scan('{metadata_location}', allow_moved_paths = true);
"""

So you can skip creating version-hint.text files entirely.
Furthermore, Pyiceberg should always return a table's latest metadata location.

@kevinjqliu
Copy link
Author

Nevermind, I dont think the above will work currently. Originally based the above on this comment duckdb/duckdb-iceberg#29 (comment)

Looks like there are a few other issues with iceberg_scan. Likely it doesn't play nice with local file file://

Here's the part I tried to change

# initiate a duckdb connection which we will use to be the query engine for iceberg
con = duckdb.connect(database=':memory:', read_only=False)
setup_sql = '''
INSTALL iceberg;
LOAD iceberg;
'''
res = con.execute(setup_sql)

trips_iceberg_table = catalog.load_table(f"{name_space}.trips")
trips_metadata_location = trips_iceberg_table.metadata_location
print(trips_metadata_location)

# create the schema and views of iceberg tables in duckdb
database_path = f'{warehouse_path}/demo_db.db'

create_view_sql = f'''
CREATE SCHEMA IF NOT EXISTS taxi;

CREATE VIEW taxi.trips AS
SELECT * FROM iceberg_scan('{trips_metadata_location}');
'''

con.execute(create_view_sql)

Output and Error, (I change the warehouse path to /tmp/ warehouse_path = "/tmp/")

file:///tmp/demo_db.db/trips/metadata/00001-153f7be7-6f68-44f3-85c0-3bcfb92bd055.metadata.json
---------------------------------------------------------------------------
IOException                               Traceback (most recent call last)
Cell In[20], line 23
     14 database_path = f'{warehouse_path}/demo_db.db'
     16 create_view_sql = f'''
     17 CREATE SCHEMA IF NOT EXISTS taxi;
     18 
     19 CREATE VIEW taxi.trips AS
     20 SELECT * FROM iceberg_scan('{trips_metadata_location}');
     21 '''
---> 23 con.execute(create_view_sql)

IOException: IO Error: Cannot open file "file:///tmp/demo_db.db/trips/metadata/00001-153f7be7-6f68-44f3-85c0-3bcfb92bd055.metadata.json": No such file or directory

@kevinjqliu
Copy link
Author

Possibly related duckdb/duckdb-iceberg#38

Anyways, thanks for the great article!

@mike-luabase
Copy link

@kevinjqliu yeah, we had issues reading metadata.json directly. I don't like the version-hint.text hack either, but seems like the best solution at the moment.

@kevinjqliu
Copy link
Author

Here's what worked for me.
https://gist.github.com/kevinjqliu/b5da13e6fed0b17ec52ed43b09a25ed9

I'm using the docker container used for pyiceberg integration tests (make test-integration) which comes with minio setup.

I think the relevant changes are the in duckdb

set s3_endpoint='localhost:9000';
SET s3_url_style = 'path';
SET s3_use_ssl = false;
set s3_access_key_id='admin';
set s3_secret_access_key='password';

Cheers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants