Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delta Exporter action: Handle vacuumed objects correctly #8409

Merged

Conversation

Jonathan-Rosenberg
Copy link
Contributor

@Jonathan-Rosenberg Jonathan-Rosenberg commented Dec 9, 2024

When traversing a delta lake log, we might come across entries of objects that were vacuumed. Prior to this change, we would stat all of the entry paths to get their physical address, even if those paths were vacuumed. This would cause a 404 "not found" status code to be returned, and the hook would error.
This fix handles such cases.
The resulting exported delta log will include logical paths of vacuumed objects but that doesn't matter for either reading a table, nor future vacuums.

How was it tested?

Manually: created a table, added data, altered the table such that the delta log included a path under an add entry (in the 0000...0.json file) and the same path under a remove entry (in the 0000...1.json file). I've deleted the path manually, thus mimicking vacuum, and committed which made the hook run.
Finally, verified that the resulting log is as expected, and read the table using Spark.

@Jonathan-Rosenberg Jonathan-Rosenberg added include-changelog PR description should be included in next release changelog minor-change Used for PRs that don't require issue attached labels Dec 9, 2024
@Jonathan-Rosenberg Jonathan-Rosenberg self-assigned this Dec 9, 2024
Copy link

github-actions bot commented Dec 9, 2024

E2E Test Results - DynamoDB Local - Local Block Adapter

13 passed

Copy link

github-actions bot commented Dec 9, 2024

E2E Test Results - Quickstart

11 passed

Copy link
Contributor

@guy-har guy-har left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks Good
Thanks

if entry.remove ~= nil then
-- If the object is not found, and the entry is a remove entry, we can assume it was vacuumed
print(string.format("Object with path '%s' of a `remove` entry wasn't found. Assuming vacuum.", unescaped_path))
unfound_paths[unescaped_path] = nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this redundant?

Copy link
Contributor Author

@Jonathan-Rosenberg Jonathan-Rosenberg Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it was an add entry (meaning the clause beneath) and a stat object failed, there might still be a remove entry with the same path further ahead which will cause a vacuum to delete it, and in that case we don't want to fail the process, thus we add it to the unfound_paths table.
This line indicates that if a remove occurred and a stat object failed, we should treat any previously added entries as vacuumed objects.
The test on line 155 checks if there were any failed add entry path stat object requests (meaning that they weren't removed by a remove entry), and if that's the case throw an error.

@Jonathan-Rosenberg Jonathan-Rosenberg merged commit 290ddef into master Dec 10, 2024
48 checks passed
@Jonathan-Rosenberg Jonathan-Rosenberg deleted the fix/delta-exporter/ignore-unfound-removed-entries branch December 10, 2024 09:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
include-changelog PR description should be included in next release changelog minor-change Used for PRs that don't require issue attached
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants