Allow writing dataframes that are either a subset of table schema or in arbitrary order #829

Closed

Conversation

@kevinjqliu (Contributor) commented Jun 18, 2024:

Fixes #674

This PR does the following

@kevinjqliu changed the title from "Kevinjqliu/674 field mismatch" to "Allow writing dataframes that are either a subset of table schema or in arbitrary order" on Jun 19, 2024
Review comments on pyiceberg/table/__init__.py (outdated, resolved).
@kevinjqliu (Contributor Author) commented:

r? @Fokko / @HonahX

@Fokko (Contributor) left a comment:

First of all, sorry for the late reply. Feel free to ping me more aggressively.

This initial version looks good, thanks for working on this. One important thing I would love to see fixed in here as well: how about re-aligning the table before we write, otherwise we have to do all of this when reading. Most tables have far fewer writes than reads, so it is good to optimize for reads. I was hoping to re-use to_requested_schema for this. WDYT?

Review comment on pyiceberg/table/__init__.py (outdated), on this docstring:
    Two schemas are considered compatible when they are equal in terms of the Iceberg Schema type.
    The schemas are compatible if:
    - All fields in `other_schema` are present in `table_schema`. (other_schema <= table_schema)
    - All required fields in `table_schema` are present in `other_schema`.
A contributor commented:

Just a heads up: with V3 this changes, since it is allowed to add required fields with a default value.
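
To make the two rules concrete, here is a small illustrative sketch (hypothetical field names, plain PyArrow calls; not code from this PR):

    import pyarrow as pa

    # table_schema: a required "id" plus an optional "payload".
    table_schema = pa.schema([
        pa.field("id", pa.int64(), nullable=False),
        pa.field("payload", pa.string()),
    ])

    # Compatible: a subset of the table schema that still carries every required field.
    subset = pa.schema([pa.field("id", pa.int64(), nullable=False)])

    # Incompatible: drops the required "id", so the second rule fails.
    missing_required = pa.schema([pa.field("payload", pa.string())])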

    except ValueError as e:
        other_schema = _pyarrow_to_schema_without_ids(other_schema)
        additional_names = set(other_schema.column_names) - set(table_schema.column_names)
        raise ValueError(
            f"PyArrow table contains more columns: {', '.join(sorted(additional_names))}. Update the schema first (hint, use union_by_name)."
        ) from e

    if table_schema.as_struct() != task_schema.as_struct():
        fields_missing_from_table = {field for field in other_schema.fields if field not in table_schema.fields}
A contributor asked:

Can you check if it handles nested structs as well?
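
What makes nested structs tricky here: the set-difference check above compares whole fields, struct type included, and does not recurse. A sketch of the failure mode with hypothetical schemas (standard PyArrow, not code from this PR):

    import pyarrow as pa

    table_schema = pa.schema([
        pa.field("person", pa.struct([
            pa.field("name", pa.string()),
            pa.field("age", pa.int32()),
        ])),
    ])

    # Same leaf fields, just reordered inside the nested struct.
    df_schema = pa.schema([
        pa.field("person", pa.struct([
            pa.field("age", pa.int32()),
            pa.field("name", pa.string()),
        ])),
    ])

    # Whole-field equality flags the reordered "person" as "missing"
    # even though every leaf field is present.
    missing = [f for f in df_schema if f not in list(table_schema)]
    print(missing)  # [pyarrow.Field<person: struct<age: int32, name: string>>]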

@kevinjqliu (Contributor Author) replied:

> First of all, sorry for the late reply. Feel free to ping me more aggressively.

No worries at all, I forgot to ping about this PR

> How about re-aligning the table before we write, otherwise we have to do all of this when reading. Most tables have far fewer writes than reads, so it is good to optimize for reads.

Can you talk a bit more about "re-aligning"? Is it to match the parquet schema with that of Iceberg?
I see that to_requested_schema is currently used to coerce the data before it is written to parquet.
https://github.com/apache/iceberg-python/blame/7dff359e0515839fbe24fac2108dcb2d64694b7a/pyiceberg/io/pyarrow.py#L1915-L1918

Is the idea to do so for the entire arrow table before writing? If so, maybe we can push the to_requested_schema up the stack and simplify write_parquet. I also mentioned this in #786 (comment)

@Fokko

@kevinjqliu force-pushed the kevinjqliu/674-field-mismatch branch from e87a59a to 19598c3 on July 9, 2024 at 00:44
@pdpark commented Jul 9, 2024:

I still get the "Mismatch in fields" error when calling append with this fix. The problem appears to be that one of the schemas, after conversion, has a doc field and the other schema does not. (Hacked code removed - not relevant)

@kevinjqliu (Contributor Author) replied:

@pdpark can you share the iceberg table schema and the pyarrow table schema?

@pdpark commented Jul 9, 2024:

I can't share the schemas, but it's just a few fields with simple string and datetime data types. The names and data types are exactly the same in both schemas and in the same order. The only difference I could find was that the NestedField(s) of one of the schemas had a doc field defined while the other schema did not. Converting the schema fields to strings with the Python str function works around the issue, because NestedField.__str__ renders a missing doc and an empty doc identically:

    def __str__(self) -> str:
        """Return the string representation of the NestedField class."""
        doc = "" if not self.doc else f" ({self.doc})"
        req = "required" if self.required else "optional"
        return f"{self.field_id}: {self.name}: {req} {self.field_type}{doc}"

BTW: this function made it difficult to debug, because printing the schemas invoked __str__, so the doc fields looked identical in the output. I had to use the pydantic model_dump function to see the difference.
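
The masking effect is easy to reproduce; a minimal sketch, assuming pyiceberg's NestedField and a hypothetical field where one side has doc="" and the other has no doc at all:

    from pyiceberg.types import NestedField, StringType

    a = NestedField(field_id=1, name="id", field_type=StringType(), required=False, doc="")
    b = NestedField(field_id=1, name="id", field_type=StringType(), required=False)

    # __str__ renders a missing doc and an empty doc the same way...
    print(str(a) == str(b))  # True
    # ...but pydantic equality compares the doc attribute, and "" != None.
    print(a == b)            # False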

@pdpark commented Jul 9, 2024:

FYI: here's the model_dump of the two schemas - only showing the first field:

    table_schema: {'type': 'struct', 'fields': ({'id': 1, 'name': 'id', 'type': 'string', 'required': False, 'doc': ''}, ...), 'schema-id': 0, 'identifier-field-ids': []}

    task_schema: {'type': 'struct', 'fields': ({'id': 1, 'name': 'id', 'type': 'string', 'required': False}, ...), 'schema-id': 0, 'identifier-field-ids': []}

Note that the table_schema field has a doc (pydantic) field, but the corresponding task_schema field does not.
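
One way to pin down this kind of difference is to diff the model_dump output field by field; a sketch, assuming two pyiceberg Schema objects named table_schema and task_schema as in the dumps above:

    # str()/repr() can mask differences that model_dump exposes,
    # so compare the dumped dicts key by key.
    table_fields = table_schema.model_dump()["fields"]
    task_fields = task_schema.model_dump()["fields"]
    for t_field, o_field in zip(table_fields, task_fields):
        keys = set(t_field) | set(o_field)
        diff = {k: (t_field.get(k), o_field.get(k)) for k in keys if t_field.get(k) != o_field.get(k)}
        if diff:
            print(f"field id {t_field.get('id')}: {diff}")
    # e.g. field id 1: {'doc': ('', None)}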

@Fokko (Contributor) commented Jul 10, 2024:

> Can you talk a bit more about "re-aligning"?

Let me give two examples:

Out of order

    table {
      1: str foo
      2: int bar
    }

It is fine to write a parquet file to this table with:

    table {
      2: int bar
      1: str foo
    }

When the table is being read, the columns are re-ordered by to_requested_schema.

Casting

The same goes for casting:

    table {
      1: str foo
      2: long bar
    }

It is fine to write:

    table {
      1: str foo
      2: int bar
    }

The upcasting to a long will be done when the data is being read, but it is less efficient since we first let Arrow read the data as an int, and then it will do the cast to long in to_requested_schema to be able to append the files.
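
The re-alignment idea can be pictured with plain PyArrow; a sketch only, with hypothetical columns (PyIceberg's real write path goes through to_requested_schema and its own Schema types):

    import pyarrow as pa

    # Target (table) schema: foo first, bar as a 64-bit integer.
    target = pa.schema([("foo", pa.string()), ("bar", pa.int64())])

    # Incoming data: columns out of order and bar only 32-bit.
    df = pa.table({
        "bar": pa.array([1, 2], pa.int32()),
        "foo": pa.array(["a", "b"]),
    })

    # Re-align once, before writing: select() restores the table's column
    # order, cast() performs the int -> long upcast.
    aligned = df.select(target.names).cast(target)
    print(aligned.schema)  # foo: string, bar: int64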

@kevinjqliu force-pushed the kevinjqliu/674-field-mismatch branch from 19598c3 to 2ce6db3 on July 12, 2024 at 18:13
    @@ -2053,7 +2055,10 @@ def _check_schema_compatible(table_schema: Schema, other_schema: pa.Schema, downcast_ns_timestamp_to_us
                f"PyArrow table contains more columns: {', '.join(sorted(additional_names))}. Update the schema first (hint, use union_by_name)."
            ) from e

        if table_schema.as_struct() != task_schema.as_struct():
            fields_missing_from_table = {field for field in other_schema.fields if field not in table_schema.fields}
@kevinjqliu (Contributor Author) commented on the diff:

this doesn't work for nested structs, need a better solution
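
One possible direction, sketched with plain PyArrow types (a hypothetical helper, not the fix that eventually landed): apply the same subset-and-order-insensitive check recursively at every struct level.

    import pyarrow as pa

    def is_compatible(df_type: pa.DataType, table_type: pa.DataType) -> bool:
        """Order-insensitive subset check that recurses into structs."""
        if pa.types.is_struct(df_type) and pa.types.is_struct(table_type):
            table_fields = {table_type.field(i).name: table_type.field(i)
                            for i in range(table_type.num_fields)}
            for i in range(df_type.num_fields):
                field = df_type.field(i)
                if field.name not in table_fields:
                    return False  # dataframe has a field the table lacks
                if not is_compatible(field.type, table_fields[field.name].type):
                    return False
            return True
        # Leaf types: require an exact match (upcast rules would go here).
        return df_type.equals(table_type)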

    @@ -484,10 +484,6 @@ def append(self, df: pa.Table, snapshot_properties: Dict[str, str] = EMPTY_DICT)
        _check_schema_compatible(
            self._table.schema(), other_schema=df.schema, downcast_ns_timestamp_to_us=downcast_ns_timestamp_to_us
        )
        # cast if the two schemas are compatible but not equal
@kevinjqliu (Contributor Author) commented:

@syun64 I want to get your take on this part. Due to the timestamp change, do you know if the df needs to be cast? There are a couple of different parts involved in the write path; in particular, we need to look at the table schema, the df schema, and the df itself, as well as the bin-packing and other transformations.

@kevinjqliu (Contributor Author) added:

Happy to extract this conversation into an issue, to also continue the discussion from #786 (comment).

A collaborator replied:

> @syun64 I want to get your take on this part. Due to the timestamp change, do you know if the df needs to be cast? There are a couple of different parts involved in the write path; in particular, we need to look at the table schema, the df schema, and the df itself, as well as the bin-packing and other transformations.

I have a PR open to try to fix this behavior: #910 I think it's almost ready to merge 😄

@kevinjqliu (Contributor Author):

Fixed in #921

@kevinjqliu closed this on Jul 17, 2024
@kevinjqliu deleted the kevinjqliu/674-field-mismatch branch on July 17, 2024 at 16:49