Allow writing dataframes that are either a subset of table schema or in arbitrary order #829

Closed

Conversation

@kevinjqliu (Contributor) commented Jun 18, 2024:

Fixes #674

This PR does the following

@kevinjqliu changed the title from "Kevinjqliu/674 field mismatch" to "Allow writing dataframes that are either a subset of table schema or in arbitrary order" on Jun 19, 2024
Review comments on pyiceberg/table/__init__.py (outdated, resolved).
@kevinjqliu (Contributor Author) commented:

r? @Fokko / @HonahX

@Fokko (Contributor) left a comment:

First of all, sorry for the late reply. Feel free to ping me more aggressively.

This initial version looks good, thanks for working on this. One important thing I would love to see fixed in here as well: how about re-aligning the table before we write, otherwise we have to do all of this when reading. Most tables have far fewer writes than reads, so it is good to optimize for reads. I was hoping to re-use to_requested_schema for this. WDYT?

Review comment on pyiceberg/table/__init__.py (outdated), on this docstring:
    Two schemas are considered compatible when they are equal in terms of the Iceberg Schema type.
    The schemas are compatible if:
    - All fields in `other_schema` are present in `table_schema`. (other_schema <= table_schema)
    - All required fields in `table_schema` are present in `other_schema`.
A contributor commented:

Just a heads up: with V3 this changes, since it is allowed to add required fields with a default value.
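
To make the two rules concrete, here is a small illustrative sketch (hypothetical field names, plain PyArrow calls; not code from this PR):

    import pyarrow as pa

    # table_schema: a required "id" plus an optional "payload".
    table_schema = pa.schema([
        pa.field("id", pa.int64(), nullable=False),
        pa.field("payload", pa.string()),
    ])

    # Compatible: a subset of the table schema that still carries every required field.
    subset = pa.schema([pa.field("id", pa.int64(), nullable=False)])

    # Incompatible: drops the required "id", so the second rule fails.
    missing_required = pa.schema([pa.field("payload", pa.string())])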

    except ValueError as e:
        other_schema = _pyarrow_to_schema_without_ids(other_schema)
        additional_names = set(other_schema.column_names) - set(table_schema.column_names)
        raise ValueError(
            f"PyArrow table contains more columns: {', '.join(sorted(additional_names))}. Update the schema first (hint, use union_by_name)."
        ) from e

    if table_schema.as_struct() != task_schema.as_struct():
        fields_missing_from_table = {field for field in other_schema.fields if field not in table_schema.fields}
A contributor asked:

Can you check if it handles nested structs as well?
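
What makes nested structs tricky here: the set-difference check above compares whole fields, struct type included, and does not recurse. A sketch of the failure mode with hypothetical schemas (standard PyArrow, not code from this PR):

    import pyarrow as pa

    table_schema = pa.schema([
        pa.field("person", pa.struct([
            pa.field("name", pa.string()),
            pa.field("age", pa.int32()),
        ])),
    ])

    # Same leaf fields, just reordered inside the nested struct.
    df_schema = pa.schema([
        pa.field("person", pa.struct([
            pa.field("age", pa.int32()),
            pa.field("name", pa.string()),
        ])),
    ])

    # Whole-field equality flags the reordered "person" as "missing"
    # even though every leaf field is present.
    missing = [f for f in df_schema if f not in list(table_schema)]
    print(missing)  # [pyarrow.Field<person: struct<age: int32, name: string>>]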

@kevinjqliu (Contributor Author) replied:

> First of all, sorry for the late reply. Feel free to ping me more aggressively.

No worries at all, I forgot to ping about this PR

> How about re-aligning the table before we write, otherwise we have to do all of this when reading. Most tables have far fewer writes than reads, so it is good to optimize for reads.

Can you talk a bit more about "re-aligning"? Is it to match the parquet schema with that of Iceberg?
I see that to_requested_schema is currently used to coerce the data before it is written to parquet.
https://github.com/apache/iceberg-python/blame/7dff359e0515839fbe24fac2108dcb2d64694b7a/pyiceberg/io/pyarrow.py#L1915-L1918

Is the idea to do so for the entire arrow table before writing? If so, maybe we can push the to_requested_schema up the stack and simplify write_parquet. I also mentioned this in #786 (comment)

@Fokko

@kevinjqliu force-pushed the kevinjqliu/674-field-mismatch branch from e87a59a to 19598c3 on July 9, 2024 at 00:44
@pdpark commented Jul 9, 2024:

I still get the "Mismatch in fields" error when calling append with this fix. The problem appears to be that one of the schemas, after conversion, has a doc field and the other schema does not. (Hacked code removed - not relevant)

@kevinjqliu (Contributor Author) replied:

@pdpark can you share the iceberg table schema and the pyarrow table schema?

@pdpark commented Jul 9, 2024:

I can't share the schemas, but it's just a few fields with simple string and datetime data types. The names and data types are exactly the same in both schemas and in the same order. The only difference I could find was that the NestedField(s) of one of the schemas had a doc field defined while the other schema did not. Converting the schema fields to strings with the Python str function works around the issue, because NestedField.__str__ renders a missing doc and an empty doc identically:

    def __str__(self) -> str:
        """Return the string representation of the NestedField class."""
        doc = "" if not self.doc else f" ({self.doc})"
        req = "required" if self.required else "optional"
        return f"{self.field_id}: {self.name}: {req} {self.field_type}{doc}"

BTW: this function made it difficult to debug, because printing the schemas invoked __str__, so the doc fields looked identical in the output. I had to use the pydantic model_dump function to see the difference.
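
The masking effect is easy to reproduce; a minimal sketch, assuming pyiceberg's NestedField and a hypothetical field where one side has doc="" and the other has no doc at all:

    from pyiceberg.types import NestedField, StringType

    a = NestedField(field_id=1, name="id", field_type=StringType(), required=False, doc="")
    b = NestedField(field_id=1, name="id", field_type=StringType(), required=False)

    # __str__ renders a missing doc and an empty doc the same way...
    print(str(a) == str(b))  # True
    # ...but pydantic equality compares the doc attribute, and "" != None.
    print(a == b)            # False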

@pdpark commented Jul 9, 2024:

FYI: here's the model_dump of the two schemas - only showing the first field:

    table_schema: {'type': 'struct', 'fields': ({'id': 1, 'name': 'id', 'type': 'string', 'required': False, 'doc': ''}, ...), 'schema-id': 0, 'identifier-field-ids': []}

    task_schema: {'type': 'struct', 'fields': ({'id': 1, 'name': 'id', 'type': 'string', 'required': False}, ...), 'schema-id': 0, 'identifier-field-ids': []}

Note that the table_schema field has a doc (pydantic) field, but the corresponding task_schema field does not.
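
One way to pin down this kind of difference is to diff the model_dump output field by field; a sketch, assuming two pyiceberg Schema objects named table_schema and task_schema as in the dumps above:

    # str()/repr() can mask differences that model_dump exposes,
    # so compare the dumped dicts key by key.
    table_fields = table_schema.model_dump()["fields"]
    task_fields = task_schema.model_dump()["fields"]
    for t_field, o_field in zip(table_fields, task_fields):
        keys = set(t_field) | set(o_field)
        diff = {k: (t_field.get(k), o_field.get(k)) for k in keys if t_field.get(k) != o_field.get(k)}
        if diff:
            print(f"field id {t_field.get('id')}: {diff}")
    # e.g. field id 1: {'doc': ('', None)}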

@Fokko (Contributor) commented Jul 10, 2024:

> Can you talk a bit more about "re-aligning"?

Let me give two examples:

Out of order

    table {
      1: str foo
      2: int bar
    }

It is fine to write a parquet file to this table with:

    table {
      2: int bar
      1: str foo
    }

When the table is being read, the columns are re-ordered by to_requested_schema.

Casting

The same goes for casting:

    table {
      1: str foo
      2: long bar
    }

It is fine to write:

    table {
      1: str foo
      2: int bar
    }

The upcasting to a long will be done when the data is being read, but it is less efficient since we first let Arrow read the data as an int, and then it will do the cast to long in to_requested_schema to be able to append the files.
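
The re-alignment idea can be pictured with plain PyArrow; a sketch only, with hypothetical columns (PyIceberg's real write path goes through to_requested_schema and its own Schema types):

    import pyarrow as pa

    # Target (table) schema: foo first, bar as a 64-bit integer.
    target = pa.schema([("foo", pa.string()), ("bar", pa.int64())])

    # Incoming data: columns out of order and bar only 32-bit.
    df = pa.table({
        "bar": pa.array([1, 2], pa.int32()),
        "foo": pa.array(["a", "b"]),
    })

    # Re-align once, before writing: select() restores the table's column
    # order, cast() performs the int -> long upcast.
    aligned = df.select(target.names).cast(target)
    print(aligned.schema)  # foo: string, bar: int64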

@kevinjqliu force-pushed the kevinjqliu/674-field-mismatch branch from 19598c3 to 2ce6db3 on July 12, 2024 at 18:13
    @@ -2053,7 +2055,10 @@ def _check_schema_compatible(table_schema: Schema, other_schema: pa.Schema, downcast_ns_timestamp_to_us
                f"PyArrow table contains more columns: {', '.join(sorted(additional_names))}. Update the schema first (hint, use union_by_name)."
            ) from e

        if table_schema.as_struct() != task_schema.as_struct():
            fields_missing_from_table = {field for field in other_schema.fields if field not in table_schema.fields}
@kevinjqliu (Contributor Author) commented on the diff:

this doesn't work for nested structs, need a better solution
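
One possible direction, sketched with plain PyArrow types (a hypothetical helper, not the fix that eventually landed): apply the same subset-and-order-insensitive check recursively at every struct level.

    import pyarrow as pa

    def is_compatible(df_type: pa.DataType, table_type: pa.DataType) -> bool:
        """Order-insensitive subset check that recurses into structs."""
        if pa.types.is_struct(df_type) and pa.types.is_struct(table_type):
            table_fields = {table_type.field(i).name: table_type.field(i)
                            for i in range(table_type.num_fields)}
            for i in range(df_type.num_fields):
                field = df_type.field(i)
                if field.name not in table_fields:
                    return False  # dataframe has a field the table lacks
                if not is_compatible(field.type, table_fields[field.name].type):
                    return False
            return True
        # Leaf types: require an exact match (upcast rules would go here).
        return df_type.equals(table_type)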

    @@ -484,10 +484,6 @@ def append(self, df: pa.Table, snapshot_properties: Dict[str, str] = EMPTY_DICT)
        _check_schema_compatible(
            self._table.schema(), other_schema=df.schema, downcast_ns_timestamp_to_us=downcast_ns_timestamp_to_us
        )
        # cast if the two schemas are compatible but not equal
@kevinjqliu (Contributor Author) commented:

@syun64 I want to get your take on this part. Due to the timestamp change, do you know if the df needs to be cast? There are a couple of different parts involved in the write path; in particular, we need to look at the table schema, the df schema, and the df itself, as well as the bin-packing and other transformations.

@kevinjqliu (Contributor Author) added:

Happy to extract this conversation into an issue, to also continue the discussion from #786 (comment).

A collaborator replied:

> @syun64 I want to get your take on this part. Due to the timestamp change, do you know if the df needs to be cast? There are a couple of different parts involved in the write path; in particular, we need to look at the table schema, the df schema, and the df itself, as well as the bin-packing and other transformations.

I have a PR open to try to fix this behavior: #910 I think it's almost ready to merge 😄

@kevinjqliu (Contributor Author):

Fixed in #921

@kevinjqliu closed this on Jul 17, 2024
@kevinjqliu deleted the kevinjqliu/674-field-mismatch branch on July 17, 2024 at 16:49