
Handle Arrow large_string data type #520

Closed
kevinjqliu opened this issue Mar 12, 2024 · 3 comments · Fixed by #523

Comments

@kevinjqliu
Contributor

Feature Request / Improvement

Currently, the Arrow large_string data type is converted to string (link).

This breaks the Parquet writer when writing an Arrow table with a large_string column.

See pola-rs/polars#9795
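
A minimal sketch of the mismatch (the name column here is hypothetical, not taken from the linked report): a pyarrow table whose column is large_string does not match a schema that declares string, even though the values are identical, so the table has to be cast before it reaches the writer.

    import pyarrow as pa

    # Hypothetical column: an Arrow table with a large_string column
    df = pa.table({"name": pa.array(["a", "b"], type=pa.large_string())})

    # The schema the writer expects, with a plain string column
    expected = pa.schema([pa.field("name", pa.string())])

    print(df.schema.equals(expected))   # False: large_string != string

    # Casting the table to the expected schema resolves the mismatch
    df = df.cast(expected)
    print(df.schema.equals(expected))   # True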

@kevinjqliu
Contributor Author

Looks like it was added in #382 for #226
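
For context, a rough sketch (not the actual code from #382 or pyiceberg's schema visitor) of the mapping it introduces: both Arrow string flavors collapse to Iceberg's single string type, which is why the large/regular distinction is gone by write time.

    import pyarrow as pa
    from pyiceberg.types import IcebergType, StringType

    # Illustrative only: string and large_string both map to the one Iceberg string type
    def arrow_string_to_iceberg(dtype: pa.DataType) -> IcebergType:
        if pa.types.is_string(dtype) or pa.types.is_large_string(dtype):
            return StringType()
        raise ValueError(f"Type not covered by this sketch: {dtype}")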

@kevinjqliu
Contributor Author

On that note, should we review the pyarrow Schema to Iceberg Schema type mappings in the repository and ensure that every type supported by the existing parquet type -> Spark data type -> Iceberg data type conversion is also supported by the parquet type -> PyArrow data type -> Iceberg data type conversion?

#226 (comment)

++ to @syun64's comment

@kevinjqliu
Contributor Author

kevinjqliu commented Mar 14, 2024

I can think of two options:

  1. Add Arrow LargeString as an Iceberg data type, mapped 1:1 to the Arrow type. The physical representation would still be backed by string.
  2. Arrow large_string is already converted to the Iceberg string type in create_table by _convert_schema_if_needed (see Arrow: Support large-string #382). So when writing an Arrow table (in overwrite/append), cast the given Arrow table to the table's schema after checking that the two schemas are compatible.

Example:

        # Check that the Arrow schema is compatible with the table's Iceberg schema
        _check_schema(self.schema(), other_schema=df.schema)
        # Safe to cast: convert the Iceberg schema back to an Arrow schema and
        # cast the incoming table to it (e.g. large_string -> string)
        from pyiceberg.io.pyarrow import schema_to_pyarrow

        pyarrow_schema = schema_to_pyarrow(self.schema())
        df = df.cast(pyarrow_schema)

WIP example in #523

@Fokko @HonahX @syun64 would love your opinions on this
