tbl.append(df): schema validation of tbl & df during compares the order & data types #1088

sivaraman-ai · 2024-08-22T11:16:18Z

Apache Iceberg version

0.6.1

Please describe the bug 🐞

While writing a dataframe to Iceberg through tbl.append(df), the table schema is validated against the df schema.

This call inside append, _check_schema_compatible(self.schema(), other_schema=df.schema), does the schema validation.

Here the table schema and the df schema are both converted to a struct type and compared, so the order of the dataframe columns is checked along with the data types.

this results in the following error:
Traceback (most recent call last):
  File "/Users/apple/Projects/bright/brightmoney_collections_system/utils/index.py", line 172, in <module>
    dff = write_to_iceberg(
  File "/Users/apple/Projects/bright/brightmoney_collections_system/utils/index.py", line 163, in write_to_iceberg
    table.append(pyarrow_df)
  File "/Users/apple/Projects/bright/brightmoney_collections_system/venv/lib/python3.9/site-packages/pyiceberg/table/__init__.py", line 1057, in append
    _check_schema_compatible(self.schema(), other_schema=df.schema)
  File "/Users/apple/Projects/bright/brightmoney_collections_system/venv/lib/python3.9/site-packages/pyiceberg/table/__init__.py", line 175, in _check_schema_compatible
    raise ValueError(f"Mismatch in fields:\n{console.export_text()}")
ValueError: Mismatch in fields:
┏━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃    ┃ Table field                  ┃ Dataframe field              ┃
┡━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ ✅ │ 1: a: optional timestamptz   │ 1: a: optional timestamptz   │
│ ✅ │ 2: b: optional timestamptz   │ 2: b: optional timestamptz   │
│ ✅ │ 3: x: optional string        │ 3: x: optional string        │
│ ✅ │ 4: y: optional string        │ 4: y: optional string        │
└────┴──────────────────────────────┴──────────────────────────────┘

Yet there is no mismatch between the fields of the table and the dataframe.

Ideally the schema compatibility check should not consider the order in which the dataframe columns are sent?
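Until the check itself ignores order, one possible workaround is to put the dataframe columns into the table's order before appending. The helper below is only a sketch: the name reorder_like_table is hypothetical, and it assumes the dataframe exposes column_names and select(names) the way a pyarrow.Table does.

```python
def reorder_like_table(df, table_column_names):
    """Return df with its columns in the table's column order.

    Hypothetical helper. Assumes df has `column_names` and
    `select(names)`, as a pyarrow.Table does.
    """
    missing = set(table_column_names) - set(df.column_names)
    extra = set(df.column_names) - set(table_column_names)
    if missing or extra:
        # Reordering cannot fix a real difference in the set of columns.
        raise ValueError(
            f"column mismatch: missing={sorted(missing)}, extra={sorted(extra)}"
        )
    # select() returns the columns in exactly the order requested.
    return df.select(list(table_column_names))
```

With the names from the traceback this might be used as table.append(reorder_like_table(pyarrow_df, table.schema().column_names)), assuming column_names on the table schema lists the top-level columns in table order.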


sivaraman-ai · 2024-08-22T11:22:47Z

Digging deeper, this condition compares the two schemas as structs, so it checks the column order as well as the data types:

if table_schema.as_struct() != task_schema.as_struct()

If the dataframe passed to append does not have its columns in the same order as the table schema, the write fails because the structs turn out to be:

table schema - struct<1: a: optional timestamptz, 2: b: optional timestamptz, 3: x: optional string, 4: y: optional string>
(table columns in this order: a, b, x, y)
dataframe schema - struct<1: a: optional timestamptz, 2: b: optional timestamptz, 4: y: optional string, 3: x: optional string>
(dataframe columns in this order: a, b, y, x)

I think schema validation could be applied to the data types of the columns rather than their order, or the error message could be more helpful. "Mismatch in fields" doesn't make sense here, since every field is marked as matching.
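To illustrate the difference between the two kinds of check, here is a small stand-alone sketch. It is not PyIceberg's code: Field, same_struct and same_fields_any_order are hypothetical stand-ins that mimic a struct as an ordered tuple of (id, name, type) fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    # Hypothetical stand-in for one field of a struct: id, name, type.
    field_id: int
    name: str
    type: str


# Table columns in the order a, b, x, y.
table_fields = (
    Field(1, "a", "timestamptz"),
    Field(2, "b", "timestamptz"),
    Field(3, "x", "string"),
    Field(4, "y", "string"),
)

# The same four fields, but the dataframe lists y before x.
df_fields = (table_fields[0], table_fields[1], table_fields[3], table_fields[2])


def same_struct(left, right):
    # Order-sensitive: equal only if the fields match position by position.
    return tuple(left) == tuple(right)


def same_fields_any_order(left, right):
    # Order-insensitive: compare the fields keyed by column name.
    return {f.name: f for f in left} == {f.name: f for f in right}


print(same_struct(table_fields, df_fields))            # False
print(same_fields_any_order(table_fields, df_fields))  # True
```

The first comparison fails purely because of position, which matches the behaviour described above; the second accepts the dataframe because every column has the same id and type as in the table.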

thanks

sungwy · 2024-08-22T14:55:40Z

Hi @sivaraman-ai - this was fixed in 0.7.x. Could you try using a newer version of PyIceberg? #921

The latest release is 0.7.1

sivaraman-ai · 2024-08-27T11:22:36Z

Hi @sungwy, thanks

will check with the latest version

kevinjqliu · 2024-08-31T13:31:01Z

We have improved _check_schema_compatible since 0.6.1 (see #921).
