-
Notifications
You must be signed in to change notification settings - Fork 151
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix commit log parsing of Delta tables with delete vector #596
Conversation
xtable-core/src/main/java/org/apache/xtable/delta/DeltaConversionSource.java
Outdated
Show resolved
Hide resolved
addedFiles.remove(deletionVector); | ||
removedFiles.remove(deletionVector); | ||
} else { | ||
log.warn( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we error out in this case? Just wondering if this will lead to data consistency issues
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think XTable should terminate translation if a RemoveFile
action corresponding to a AddFile+DeleteVector
action is not present. Mainly because I'm not certain if adding a file for the first time along with an associated delete vector violates the Delta spec. I believe the Delta API would allow it, but I need to validate this. It may be allowed to support use cases like a long-running transaction which first adds a file and then remove some entries from it, but I'm not sure. If the spec allows it, XTable should not terminate.
@@ -246,6 +246,7 @@ private static class DeltaStats { | |||
Map<String, Object> minValues; | |||
Map<String, Object> maxValues; | |||
Map<String, Object> nullCount; | |||
boolean tightBounds; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When parsing, you can add an option to the Jackson ObjectMapper to ignore unknown fields as well to avoid issues with future delta versions adding new fields our code is not aware of.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. I am assuming you are not suggesting removing tightBounds
(and not taking any action for the comment) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add the option to the ObjectMapper as part of this PR to prevent new fields from causing failures in the future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added a catch all bucket for new stats that may exist in file stats. It should be worth reporting the new stats for debugging and for adding support in XTable sometime.
xtable-core/src/test/java/org/apache/xtable/delta/TestDeltaActionsConverter.java
Outdated
Show resolved
Hide resolved
b190e66
to
2843ea0
Compare
xtable-core/src/main/java/org/apache/xtable/delta/DeltaStatsExtractor.java
Outdated
Show resolved
Hide resolved
xtable-core/src/main/java/org/apache/xtable/delta/DeltaStatsExtractor.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, remember to squash before merging
- Correct handling of delete vectors to avoid adding data file paths to both new and removed file sets in `FileDiff` - Detect new file stats in logs that are not supported by XTable and log them. New stats do not break parsing of the stats json. For e.g. `tightBounds` property in Delta stats if deletion vectors are enabled
41575fd
to
8ceb5e6
Compare
This change fixes the issue of double counting a data file. When adding deletion vector information to an existing data file, Delta Lake adds two entries in the commit log: one to remove the old entry (which has empty or outdated delete vector information) and another to add the updated delete vector information. Both entries are for the same data file path which was already added by an existing commit. While XTable currently doesn’t convert deletion vectors, it can avoid duplicate counting of the data file by ignoring the entries created when delete vector information is updated.
Additionally, this change detects new data file stats in logs that are not recognized and translated by XTable. Name of such unrecognized stats are logged. The fix ensures that new stats does not break parsing of the json representation of the file stats when unknown stats are encountered. For e.g. when a Delta table has the deletion vectors property set,
tightBounds
property is included in the file stat's json.New tests have been added to validate the changes.
Fixes #595