Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix commit log parsing of Delta tables with delete vector #596

Merged
merged 1 commit into from
Dec 18, 2024

Conversation

ashvina
Copy link
Contributor

@ashvina ashvina commented Dec 9, 2024

This change fixes the issue of double counting a data file. When adding deletion vector information to an existing data file, Delta Lake adds two entries in the commit log: one to remove the old entry (which has empty or outdated delete vector information) and another to add the updated delete vector information. Both entries are for the same data file path which was already added by an existing commit. While XTable currently doesn’t convert deletion vectors, it can avoid duplicate counting of the data file by ignoring the entries created when delete vector information is updated.

Additionally, this change detects new data file stats in logs that are not recognized and translated by XTable. Name of such unrecognized stats are logged. The fix ensures that new stats does not break parsing of the json representation of the file stats when unknown stats are encountered. For e.g. when a Delta table has the deletion vectors property set, tightBounds property is included in the file stat's json.

New tests have been added to validate the changes.

Fixes #595

@ashvina ashvina linked an issue Dec 9, 2024 that may be closed by this pull request
2 tasks
addedFiles.remove(deletionVector);
removedFiles.remove(deletionVector);
} else {
log.warn(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we error out in this case? Just wondering if this will lead to data consistency issues

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think XTable should terminate translation if a RemoveFile action corresponding to a AddFile+DeleteVector action is not present. Mainly because I'm not certain if adding a file for the first time along with an associated delete vector violates the Delta spec. I believe the Delta API would allow it, but I need to validate this. It may be allowed to support use cases like a long-running transaction which first adds a file and then remove some entries from it, but I'm not sure. If the spec allows it, XTable should not terminate.

@@ -246,6 +246,7 @@ private static class DeltaStats {
Map<String, Object> minValues;
Map<String, Object> maxValues;
Map<String, Object> nullCount;
boolean tightBounds;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When parsing, you can add an option to the Jackson ObjectMapper to ignore unknown fields as well to avoid issues with future delta versions adding new fields our code is not aware of.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. I am assuming you are not suggesting removing tightBounds (and not taking any action for the comment) ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add the option to the ObjectMapper as part of this PR to prevent new fields from causing failures in the future?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added a catch all bucket for new stats that may exist in file stats. It should be worth reporting the new stats for debugging and for adding support in XTable sometime.

@ashvina ashvina force-pushed the 595-detect-delete-vectors-files branch from b190e66 to 2843ea0 Compare December 18, 2024 01:20
Copy link
Contributor

@the-other-tim-brown the-other-tim-brown left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, remember to squash before merging

- Correct handling of delete vectors to avoid adding data file paths to both new and removed file
  sets in `FileDiff`
- Detect new file stats in logs that are not supported by XTable and log them. New stats do not
  break parsing of the stats json. For e.g. `tightBounds` property in Delta stats if deletion
  vectors are enabled
@ashvina ashvina force-pushed the 595-detect-delete-vectors-files branch from 41575fd to 8ceb5e6 Compare December 18, 2024 18:26
@ashvina ashvina merged commit f676788 into main Dec 18, 2024
2 checks passed
@ashvina ashvina deleted the 595-detect-delete-vectors-files branch December 18, 2024 18:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parsing of source Delta table with delete vectors fails or is incorrect
2 participants