On write operation, cast data to Iceberg Table's pyarrow schema #523

kevinjqliu · 2024-03-14T04:26:05Z

This PR resolves #520 by casting the incoming pyarrow Table to the same Schema as the Iceberg Table. This is safe since we first check for schema compatibility.

Renamed _check_schema to _check_schema_compatible.
Updated the PR to use Schema.as_arrow(), added in "Add as_arrow() to Schema class" (#532).
Added a test for the downcast schema.

Fokko

This looks good to me. I left one comment, but we can also defer that to a later PR.

Fokko · 2024-03-25T20:08:29Z

pyiceberg/table/__init__.py

-        _check_schema(self.schema(), other_schema=df.schema)
+        _check_schema_compatible(self.schema(), other_schema=df.schema)
+        # the two schemas are compatible so safe to cast
+        df = df.cast(self.schema().as_arrow())


Should _check_schema_compatible return a bool to indicate if the cast is needed? I'm not sure how costly the cast is. If we go from string to large_string then we might rewrite the Arrow buffers.

Yeah, I like that idea.
_check_schema_compatible could return a boolean, should_cast:

If the schema is exactly the same, return False and skip the cast.

If the schema is "compatible", return True and cast.

If the schema is not "compatible", raise an error.

Having _check_schema_compatible both return a boolean and raise an error turned out to be too complicated.
I ended up comparing the two Arrow schemas outside the function and casting only if necessary.

Fokko

Thanks for working on this @kevinjqliu

Fokko · 2024-03-28T08:01:16Z

pyiceberg/table/__init__.py

-        _check_schema(self.schema(), other_schema=df.schema)
+        _check_schema_compatible(self.schema(), other_schema=df.schema)
+        # cast if the two schemas are compatible but not equal
+        if self.schema().as_arrow() != df.schema:


nit: It would be good to call as_arrow() just once in case we need to cast.

Backport to 0.6.1

…ma (#559)
* Cast data to Iceberg Table's pyarrow schema (#523)
  Backport to 0.6.1
* use schema_to_pyarrow directly for backporting
* remove print in test
Co-authored-by: Kevin Liu

kevinjqliu mentioned this pull request Mar 14, 2024

Handle Arrow large_string data type #520

Closed

kevinjqliu mentioned this pull request Mar 24, 2024

Pyarrow type error #541

Closed

kevinjqliu added 2 commits March 24, 2024 12:37

cast to pyarrow schema

3dcc344

use Schema.as_arrow()

05e7444

kevinjqliu force-pushed the kevinjqliu/pyarrow-data-type branch from 24e7da0 to 05e7444 Compare March 24, 2024 19:38

kevinjqliu added 5 commits March 24, 2024 16:20

also for append

d231dbc

_check_schema_compatible

5b553ab

comment

5103d8a

use .as_arrow()

f565dc8

add test for downcast schema

1d6a08c

kevinjqliu changed the title from "[wip] cast to pyarrow schema" to "On write operation, cast data to Iceberg Table's pyarrow schema" Mar 25, 2024

kevinjqliu marked this pull request as ready for review March 25, 2024 16:45

Fokko approved these changes Mar 25, 2024


cast only when necessary

6c7ca99

kevinjqliu requested a review from Fokko March 26, 2024 03:22

Fokko reviewed Mar 28, 2024


Fokko merged commit 4c1cfdc into apache:main Mar 28, 2024
7 checks passed

kevinjqliu deleted the kevinjqliu/pyarrow-data-type branch March 28, 2024 17:57

Fokko added this to the PyIceberg 0.6.1 milestone Mar 28, 2024

HonahX pushed a commit to HonahX/iceberg-python that referenced this pull request Mar 29, 2024

Cast data to Iceberg Table's pyarrow schema (apache#523)

56899e6

Backport to 0.6.1

HonahX mentioned this pull request Mar 29, 2024

[0.6.x] Backport PR #523 to cast data to iceberg table's pyarrow schema #559

Merged

HonahX pushed a commit to HonahX/iceberg-python that referenced this pull request Mar 31, 2024

Cast data to Iceberg Table's pyarrow schema (apache#523)

e24541b

Backport to 0.6.1

kevinjqliu mentioned this pull request Mar 31, 2024

Minor fixes, #523 followup #563

Merged

HonahX pushed a commit that referenced this pull request Mar 31, 2024

Minor fixes, #523 followup (#563)

7e3e508

kevinjqliu mentioned this pull request May 6, 2024

ValueError: Mismatch in fields: ? #674

Closed

Fokko mentioned this pull request May 9, 2024

Compare Schema and StructType fields irrespective of ordering #700

Closed

kevinjqliu mentioned this pull request Jun 19, 2024

Allow writing dataframes that are either a subset of table schema or in arbitrary order #829

Closed
