Duckdb `iceberg_scan` can read directly from metadata.json #1

kevinjqliu · 2024-07-08T22:02:22Z

Thanks for the great article on Iceberg + DuckDB: https://www.definite.app/blog/iceberg-query-engine

I just want to mention that DuckDB's iceberg_scan can read an Iceberg table directly from its metadata.json file.
Something like:

metadata_location = table['metadata_location']
create_view_sql = f"""
CREATE VIEW taxi.trips AS
SELECT * FROM iceberg_scan('{metadata_location}', allow_moved_paths = true);
"""

So you can skip creating version-hint.text files entirely.
Furthermore, PyIceberg should always return a table's latest metadata location.
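
For illustration, the view definition suggested above can be wrapped in a small helper. This is a minimal sketch, not code from the article: `build_view_sql` and the example location are hypothetical names, and the sketch only builds the SQL text. Whether `iceberg_scan` accepts a given location depends on the installed DuckDB iceberg extension and on the path scheme.

```python
def build_view_sql(metadata_location: str, view_name: str = "taxi.trips") -> str:
    """Return SQL that defines a view over an Iceberg table via its metadata.json."""
    # Escape single quotes so the location is safe inside a SQL string literal.
    escaped = metadata_location.replace("'", "''")
    return (
        f"CREATE VIEW {view_name} AS\n"
        f"SELECT * FROM iceberg_scan('{escaped}', allow_moved_paths = true);"
    )


# Hypothetical location, used only for illustration.
example_location = "s3://warehouse/taxi/trips/metadata/00001-example.metadata.json"
view_sql = build_view_sql(example_location)
print(view_sql)
```

With PyIceberg, the location would come from the loaded table's `metadata_location` attribute, as in the snippet later in this thread.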

kevinjqliu · 2024-07-08T22:52:25Z

Never mind, I don't think the above will work currently. I originally based it on this comment: duckdb/duckdb-iceberg#29 (comment)

Looks like there are a few other issues with iceberg_scan. Most likely it doesn't play nicely with local file:// paths.

Here's the part I tried to change

# initiate a duckdb connection which we will use to be the query engine for iceberg
con = duckdb.connect(database=':memory:', read_only=False)
setup_sql = '''
INSTALL iceberg;
LOAD iceberg;
'''
res = con.execute(setup_sql)

trips_iceberg_table = catalog.load_table(f"{name_space}.trips")
trips_metadata_location = trips_iceberg_table.metadata_location
print(trips_metadata_location)

# create the schema and views of iceberg tables in duckdb
database_path = f'{warehouse_path}/demo_db.db'

create_view_sql = f'''
CREATE SCHEMA IF NOT EXISTS taxi;

CREATE VIEW taxi.trips AS
SELECT * FROM iceberg_scan('{trips_metadata_location}');
'''

con.execute(create_view_sql)

Output and error (I changed the warehouse path to /tmp/, i.e. warehouse_path = "/tmp/"):

file:///tmp/demo_db.db/trips/metadata/00001-153f7be7-6f68-44f3-85c0-3bcfb92bd055.metadata.json
---------------------------------------------------------------------------
IOException                               Traceback (most recent call last)
Cell In[20], line 23
     14 database_path = f'{warehouse_path}/demo_db.db'
     16 create_view_sql = f'''
     17 CREATE SCHEMA IF NOT EXISTS taxi;
     18 
     19 CREATE VIEW taxi.trips AS
     20 SELECT * FROM iceberg_scan('{trips_metadata_location}');
     21 '''
---> 23 con.execute(create_view_sql)

IOException: IO Error: Cannot open file "file:///tmp/demo_db.db/trips/metadata/00001-153f7be7-6f68-44f3-85c0-3bcfb92bd055.metadata.json": No such file or directory
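
One thing that could be tried here is stripping the `file://` scheme before handing the location to `iceberg_scan`. The sketch below is an untested assumption, not a fix confirmed in this thread: `to_local_path` is a hypothetical helper, it handles POSIX-style paths only, and it does not address any `file://` paths stored inside the metadata itself.

```python
import os
from urllib.parse import unquote, urlparse


def to_local_path(location: str) -> str:
    """Strip a file:// scheme from a location; return other locations unchanged."""
    parsed = urlparse(location)
    if parsed.scheme != "file":
        return location
    # For file:///tmp/x the netloc is empty and the path component is /tmp/x.
    return unquote(parsed.path)


# The metadata location printed in the output above.
location = (
    "file:///tmp/demo_db.db/trips/metadata/"
    "00001-153f7be7-6f68-44f3-85c0-3bcfb92bd055.metadata.json"
)
local_path = to_local_path(location)
print(local_path)
# Checking existence separates a scheme problem from a genuinely missing file.
print("exists on disk:", os.path.exists(local_path))
```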

kevinjqliu · 2024-07-08T22:53:33Z

Possibly related duckdb/duckdb-iceberg#38

Anyways, thanks for the great article!

mike-luabase · 2024-07-11T15:57:55Z

@kevinjqliu yeah, we had issues reading metadata.json directly. I don't like the version-hint.text hack either, but it seems like the best solution at the moment.

kevinjqliu · 2024-07-11T22:41:05Z

Here's what worked for me.
https://gist.github.com/kevinjqliu/b5da13e6fed0b17ec52ed43b09a25ed9

I'm using the Docker container from the PyIceberg integration tests (make test-integration), which comes with MinIO already set up.

I think the relevant changes are the S3 settings in DuckDB:

SET s3_endpoint = 'localhost:9000';
SET s3_url_style = 'path';
SET s3_use_ssl = false;
SET s3_access_key_id = 'admin';
SET s3_secret_access_key = 'password';

Cheers.
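
The S3 settings above can also be applied from Python. This is a minimal sketch under stated assumptions: `S3_SETTINGS`, `to_set_statements`, and `apply_settings` are hypothetical names, and the values are the local MinIO credentials quoted in the comment, not something to reuse outside a test setup.

```python
# Values copied from the comment above; they target a local MinIO test container.
S3_SETTINGS = {
    "s3_endpoint": "localhost:9000",
    "s3_url_style": "path",
    "s3_use_ssl": False,
    "s3_access_key_id": "admin",
    "s3_secret_access_key": "password",
}


def to_set_statements(settings):
    """Render each setting as a DuckDB SET statement."""
    statements = []
    for name, value in settings.items():
        if isinstance(value, bool):
            literal = "true" if value else "false"
        else:
            # Quote strings and escape any embedded single quotes.
            literal = "'" + str(value).replace("'", "''") + "'"
        statements.append(f"SET {name} = {literal};")
    return statements


def apply_settings(con, settings):
    """Run the SET statements on any connection object exposing execute()."""
    for statement in to_set_statements(settings):
        con.execute(statement)


statements = to_set_statements(S3_SETTINGS)
print("\n".join(statements))
```

With the duckdb package installed, `apply_settings(duckdb.connect(), S3_SETTINGS)` would run the same statements on a real connection; the extensions needed for S3 access are assumed to be loaded first.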

kevinjqliu mentioned this issue Jul 24, 2024

[feat] Ability to read table using version-hint.txt apache/iceberg-python#763

Open

steven-luabase closed this as completed Aug 12, 2024
