On write operation, cast data to Iceberg Table's pyarrow schema #523

Merged: 8 commits into apache:main from the kevinjqliu/pyarrow-data-type branch, Mar 28, 2024

kevinjqliu
Contributor

@kevinjqliu kevinjqliu commented Mar 14, 2024

This PR resolves #520 by casting the incoming pyarrow Table to the same Schema as the Iceberg Table. This is safe since we first check for schema compatibility.

@kevinjqliu kevinjqliu force-pushed the kevinjqliu/pyarrow-data-type branch from 24e7da0 to 05e7444 Compare March 24, 2024 19:38
@kevinjqliu kevinjqliu changed the title [wip] cast to pyarrow schema On write operation, cast data to Iceberg Table's pyarrow schema Mar 25, 2024
@kevinjqliu kevinjqliu marked this pull request as ready for review March 25, 2024 16:45
Contributor

@Fokko Fokko left a comment

This looks good to me, I left one comment but we can also defer that to a later PR.

- _check_schema(self.schema(), other_schema=df.schema)
+ _check_schema_compatible(self.schema(), other_schema=df.schema)
+ # the two schemas are compatible so safe to cast
+ df = df.cast(self.schema().as_arrow())

Contributor

Should _check_schema_compatible return a bool to indicate if the cast is needed? I'm not sure how costly the cast is. If we go from string to large_string then we might rewrite the Arrow buffers.

Contributor Author

Yeah, I like that idea.
_check_schema_compatible returns a boolean, should_cast:

  • If the schemas are exactly the same, return False and skip the cast
  • If the schemas are compatible but not identical, return True and cast
  • If the schemas are not compatible, raise an error

Contributor Author

Having _check_schema_compatible both return a boolean and raise an error turned out to be too complicated.
I ended up comparing the two schemas as Arrow schemas outside the function and casting only when they differ.

@kevinjqliu kevinjqliu requested a review from Fokko March 26, 2024 03:22
Contributor

@Fokko Fokko left a comment

Thanks for working on this @kevinjqliu

- _check_schema(self.schema(), other_schema=df.schema)
+ _check_schema_compatible(self.schema(), other_schema=df.schema)
+ # cast if the two schemas are compatible but not equal
+ if self.schema().as_arrow() != df.schema:

Contributor

nit: It would be good to call as_arrow() just once in case we need to cast.

@Fokko Fokko merged commit 4c1cfdc into apache:main Mar 28, 2024
7 checks passed
@kevinjqliu kevinjqliu deleted the kevinjqliu/pyarrow-data-type branch March 28, 2024 17:57
@Fokko Fokko added this to the PyIceberg 0.6.1 milestone Mar 28, 2024
HonahX pushed a commit to HonahX/iceberg-python that referenced this pull request Mar 29, 2024
HonahX pushed a commit to HonahX/iceberg-python that referenced this pull request Mar 31, 2024
HonahX added a commit that referenced this pull request Mar 31, 2024
…ma (#559)

* Cast data to Iceberg Table's pyarrow schema (#523)

Backport to 0.6.1

* use schema_to_pyarrow directly for backporting

* remove print in test

---------

Co-authored-by: Kevin Liu <[email protected]>
HonahX pushed a commit that referenced this pull request Mar 31, 2024
Development

Successfully merging this pull request may close these issues.

Handle Arrow large_string data type
2 participants