
Handle Arrow large_string data type #520

Closed
kevinjqliu opened this issue Mar 12, 2024 · 3 comments · Fixed by #523

Comments

@kevinjqliu
Contributor

Feature Request / Improvement

Currently, the Arrow large_string data type is converted to string (link).

This breaks the Parquet writer when writing an Arrow table with a large_string column.

See pola-rs/polars#9795
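
A minimal sketch of the mismatch (the name column here is hypothetical, not taken from the linked report): a pyarrow table whose column is large_string does not match a schema that declares string, even though the values are identical, so the table has to be cast before it reaches the writer.

    import pyarrow as pa

    # Hypothetical column: an Arrow table with a large_string column
    df = pa.table({"name": pa.array(["a", "b"], type=pa.large_string())})

    # The schema the writer expects, with a plain string column
    expected = pa.schema([pa.field("name", pa.string())])

    print(df.schema.equals(expected))   # False: large_string != string

    # Casting the table to the expected schema resolves the mismatch
    df = df.cast(expected)
    print(df.schema.equals(expected))   # True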

@kevinjqliu
Contributor Author

Looks like it was added in #382 for #226
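
For context, a rough sketch (not the actual code from #382 or pyiceberg's schema visitor) of the mapping it introduces: both Arrow string flavors collapse to Iceberg's single string type, which is why the large/regular distinction is gone by write time.

    import pyarrow as pa
    from pyiceberg.types import IcebergType, StringType

    # Illustrative only: string and large_string both map to the one Iceberg string type
    def arrow_string_to_iceberg(dtype: pa.DataType) -> IcebergType:
        if pa.types.is_string(dtype) or pa.types.is_large_string(dtype):
            return StringType()
        raise ValueError(f"Type not covered by this sketch: {dtype}")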

@kevinjqliu
Contributor Author

On that note, should we review the pyarrow Schema to Iceberg Schema type mappings in the repository and ensure that every type supported by the existing parquet type -> Spark data type -> Iceberg data type conversion is also supported by the parquet type -> PyArrow data type -> Iceberg data type conversion?

#226 (comment)

++ to @syun64's comment

@kevinjqliu
Contributor Author

kevinjqliu commented Mar 14, 2024

I can think of two options:

  1. Add Arrow LargeString as an Iceberg data type, mapped 1:1 to the Arrow type. The physical representation would still be backed by string.
  2. Arrow large_string is already converted to the Iceberg string type in create_table by _convert_schema_if_needed (see Arrow: Support large-string #382). So when writing an Arrow table (in overwrite/append), cast the given Arrow table to the table's schema after checking that the two schemas are compatible.

Example:

        # Check that the Arrow schema is compatible with the table's Iceberg schema
        _check_schema(self.schema(), other_schema=df.schema)
        # Safe to cast: convert the Iceberg schema back to an Arrow schema and
        # cast the incoming table to it (e.g. large_string -> string)
        from pyiceberg.io.pyarrow import schema_to_pyarrow

        pyarrow_schema = schema_to_pyarrow(self.schema())
        df = df.cast(pyarrow_schema)

WIP example in #523

@Fokko @HonahX @syun64 would love your opinions on this
