Fix crc mismatch during deepstore upload retry task #14506
...ain/java/org/apache/pinot/controller/helix/core/realtime/PinotLLCRealtimeSegmentManager.java
pinot-server/src/main/java/org/apache/pinot/server/api/resources/TablesResource.java
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #14506 +/- ##
============================================
+ Coverage 61.75% 63.91% +2.15%
- Complexity 207 1570 +1363
============================================
Files 2436 2673 +237
Lines 133233 146814 +13581
Branches 20636 22513 +1877
============================================
+ Hits 82274 93831 +11557
- Misses 44911 46051 +1140
- Partials 6048 6932 +884
Flags with carried forward coverage won't be shown.
This doesn't solve the problem when the committing server is not up. A more robust approach is to allow the server to change the CRC when updating the download URL. Having an inconsistent CRC between the ZK metadata and the deep store segment is very risky.
Ideally we need to make the text index deterministic. A non-deterministic CRC can cause lots of problems. Re-downloading all segments during server startup is not acceptable.
Currently the server uploads the file with a UUID suffix inside the segmentUploadDir, and it is the controller that moves the zip file to the final deepstore location and then updates the download URL. If we move the logic of updating ZK to the server, that might become more risky: the controller <> deepstore move may fail, and we might be left with a non-empty downloadURL pointing to an empty path (which may cause a FileNotFound exception in other places).
Agreed! But this seems like a larger-scope problem than what we are doing here. Right now, given that we know replicas have divergent CRCs, having the deepstore-upload retry back up segments from a random replica seems problematic, and that is what we are trying to solve.
Let me clarify: we keep the existing update logic, but also update the CRC when modifying the download URL. Both the CRC and the download URL updates should be posted atomically as one ZK record change.
Makes sense! Updated the PR accordingly.
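The suggestion in this thread, writing the CRC and the download URL as one record change, can be sketched as follows. This is only an illustration under stated assumptions: `SegmentRecordStoreSketch`, `SegmentRecord`, and `recordUpload` are hypothetical names, and an in-memory `AtomicReference` stands in for a versioned ZK record. It is not the Pinot or Helix API and not the code in this PR.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical in-memory stand-in for a versioned ZK record; not Pinot or Helix API. */
final class SegmentRecordStoreSketch {

  /** Immutable snapshot, so the download URL and CRC always change together. */
  static final class SegmentRecord {
    final String downloadUrl; // null until a deep store copy exists
    final long crc;
    final int version;

    SegmentRecord(String downloadUrl, long crc, int version) {
      this.downloadUrl = downloadUrl;
      this.crc = crc;
      this.version = version;
    }
  }

  private final AtomicReference<SegmentRecord> current;

  SegmentRecordStoreSketch(long committedCrc) {
    current = new AtomicReference<>(new SegmentRecord(null, committedCrc, 0));
  }

  SegmentRecord read() {
    return current.get();
  }

  /**
   * Publishes the new URL and the CRC of the uploaded copy in a single swap.
   * Returns false if the record changed since {@code expected} was read,
   * in which case neither field is modified.
   */
  boolean recordUpload(SegmentRecord expected, String uploadedUrl, long uploadedCrc) {
    SegmentRecord updated = new SegmentRecord(uploadedUrl, uploadedCrc, expected.version + 1);
    return current.compareAndSet(expected, updated);
  }
}
```

Because both fields live in one immutable snapshot, a reader never observes a new download URL paired with the old CRC, which is the inconsistency the reviewer flags as risky.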
After #14406, we are able to successfully take deepstore backups, but now we see a lot of UpsertCompactionTask failures with the following error:
It seems that the deepstore-upload retry task can take a backup from any arbitrary replica, not necessarily the one whose CRC matches the value in ZK. This patch fixes the issue by allowing deepstore upload only from a replica whose CRC matches the CRC in the ZK metadata. If no such replica is found, we still take the deepstore backup from one random replica.
Divergent CRCs across replicas can be caused in particular by text indexes. This has been discussed in multiple issues: #13491, #11004
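The replica selection described above can be sketched roughly as below. This is a minimal illustration, not the PR's implementation: `DeepStoreUploadSourcePicker`, `Choice`, and `pick` are hypothetical names, the per-server CRC map is assumed to have been collected already, and the fallback takes the first replica in iteration order where the description says "random".

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper; the names are illustrative and not part of Pinot. */
final class DeepStoreUploadSourcePicker {

  /** The chosen replica, its CRC, and whether that CRC matches ZK metadata. */
  static final class Choice {
    final String server;
    final long crc;
    final boolean matchesZkCrc;

    Choice(String server, long crc, boolean matchesZkCrc) {
      this.server = server;
      this.crc = crc;
      this.matchesZkCrc = matchesZkCrc;
    }
  }

  private DeepStoreUploadSourcePicker() {
  }

  /**
   * Prefers a replica whose CRC equals the CRC stored in ZK. If none matches,
   * falls back to the first replica that reported a CRC. Returns empty when
   * no replica reported a CRC at all.
   */
  static Optional<Choice> pick(long zkCrc, Map<String, Long> crcByServer) {
    Choice fallback = null;
    for (Map.Entry<String, Long> entry : crcByServer.entrySet()) {
      Long crc = entry.getValue();
      if (crc == null) {
        continue; // this replica did not report a CRC
      }
      if (crc == zkCrc) {
        return Optional.of(new Choice(entry.getKey(), crc, true));
      }
      if (fallback == null) {
        fallback = new Choice(entry.getKey(), crc, false);
      }
    }
    return Optional.ofNullable(fallback);
  }
}
```

Returning the chosen replica's CRC together with the `matchesZkCrc` flag lets the caller tell when the fallback path was taken, so the CRC recorded alongside the download URL can be the one belonging to the copy that was actually uploaded.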