Parsing of source Delta table with delete vectors fails or is incorrect #595

Closed
2 tasks done
ashvina opened this issue Dec 9, 2024 · 0 comments · Fixed by #596

ashvina commented Dec 9, 2024

Feature Request / Improvement

There are two issues with how XTable parses the commit log of source Delta tables that have the deletion vectors property set.

  1. Missing tightBounds Property: For Delta tables with deletion vectors, the file stats include an additional property called tightBounds. This property is missing in XTable's representation of the Delta stats. As a result, parsing commit logs fails.

Caused by: com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized field "tightBounds" (class org.apache.xtable.delta.DeltaStatsExtractor$DeltaStats), not marked as ignorable (4 known properties: "nullCount", "numRecords", "maxValues", "minValues"])

  2. Incorrect Handling of Delete Vectors: When a delete vector is added for a data file in Delta Lake, the commit log contains both a remove and an add entry for the same data file. This pair links the deletion vector file to the data file. However, XTable incorrectly adds the data file path to both the new and removed file sets in FileDiff. XTable should ignore this pair, since no new data file is generated. Instead, once a representation of deletion vectors is added, it should report the addition of a deletion vector. For example:

`{"add":{"path":"part-00000-26ca587d-ca81-4bd5-be69-7ea9bcab6a8f-c000.parquet","partitionValues":{},"size":10181,"modificationTime":1733718517192,"dataChange":true,"stats":..."tightBounds":false}", "deletionVector":{"storageType":"u","pathOrInlineDv":"XYZ","offset":40,"sizeInBytes":42,"cardinality":5}}}

{"remove":{"path":"part-00000-26ca587d-ca81-4bd5-be69-7ea9bcab6a8f-c000.parquet","deletionTimestamp":1733718522733,"dataChange":true,"extendedFileMetadata":true,"partitionValues":{},"size":10181}}`
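
To illustrate the expected behavior for the second issue, here is a minimal sketch in plain Java. It is not XTable's actual code: `DeletionVectorDiffSketch`, `Action`, and `Diff` are hypothetical stand-ins for the Delta commit actions and for FileDiff, and it assumes that a path appearing in both an add and a remove entry of the same commit means only the deletion vector changed.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch; these are not XTable's real FileDiff or Delta action classes. */
final class DeletionVectorDiffSketch {

  /** Simplified stand-in for one "add" or "remove" entry of a single Delta commit. */
  static final class Action {
    final boolean isAdd;
    final String path;

    private Action(boolean isAdd, String path) {
      this.isAdd = isAdd;
      this.path = path;
    }

    static Action add(String path) {
      return new Action(true, path);
    }

    static Action remove(String path) {
      return new Action(false, path);
    }
  }

  /** Simplified stand-in for the file diff reported for one commit. */
  static final class Diff {
    final Set<String> added = new LinkedHashSet<>();
    final Set<String> removed = new LinkedHashSet<>();
    // Data files that were both removed and re-added in the same commit.
    // Assumption: only their deletion vector changed.
    final Set<String> deletionVectorUpdated = new LinkedHashSet<>();
  }

  static Diff diff(List<Action> commitActions) {
    Diff result = new Diff();
    // First pass: collect paths by action type.
    for (Action action : commitActions) {
      if (action.isAdd) {
        result.added.add(action.path);
      } else {
        result.removed.add(action.path);
      }
    }
    // Paths present in both sets refer to the same, unchanged data file.
    result.deletionVectorUpdated.addAll(result.added);
    result.deletionVectorUpdated.retainAll(result.removed);
    // Drop them from both sets so no new or removed data file is reported.
    result.added.removeAll(result.deletionVectorUpdated);
    result.removed.removeAll(result.deletionVectorUpdated);
    return result;
  }
}
```

With the two log entries above as input, `diff` would leave the added and removed sets empty and list the parquet path only under `deletionVectorUpdated`, which is where a deletion vector addition could later be reported.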

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct
