Only deduplicate the currently uploading tasks, rather than all previously uploaded tasks, when uploads are unified #531
Comments
I don't quite understand the issue you're describing. Can you clarify the relationship between the SDK and the k8s pod in your setup? Is the SDK running inside the pod, or is the pod running a CAS service that the SDK is connected to?
Yes! The SDK is connected to a local bazel-remote CAS as its cache service, deployed through k8s. We've noticed that this cache service sometimes restarts during builds, and each restart wipes its local disk cache. After bazel-remote restarts, RBE's remote compilation task calls FindMissingBlobs to get the list of missing files, but before the unified upload it consults the global cache casUploaders, sees that the file was already uploaded, and skips uploading it. The remote compilation cluster then fails the task with an error stating that the file is missing from bazel-remote's CAS.
Here is what I understood so far: you have a cache server that clears its state when it restarts. With unified uploads, it's possible for the SDK to skip an upload based on state the server has since discarded. I'm not sure what the SDK can do in this case. This failure mode is inherent in the system design: two subsequent calls are not guaranteed to see the same result from the same service. I don't think REAPI can work around such a limitation, as it assumes the CAS is stable long enough for two clients (build host and worker host) to see the same state. Perhaps configuring the pod for bazel-remote with persistent storage would be the best approach.
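For the persistent-storage suggestion, one way to keep bazel-remote's disk cache across pod restarts is to run it as a StatefulSet with a PersistentVolumeClaim. A minimal sketch (the names, sizes, and flags here are illustrative assumptions, not the reporter's actual deployment):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: bazel-remote
spec:
  serviceName: bazel-remote
  replicas: 1
  selector:
    matchLabels:
      app: bazel-remote
  template:
    metadata:
      labels:
        app: bazel-remote
    spec:
      containers:
        - name: bazel-remote
          image: buchgr/bazel-remote-cache   # image name is an assumption
          args: ["--dir=/data", "--max_size=100"]
          volumeMounts:
            - name: cache
              mountPath: /data   # cache dir now backed by the PVC below
  volumeClaimTemplates:
    - metadata:
        name: cache
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi
```

With the cache directory on a PersistentVolume, a pod restart no longer invalidates the SDK's assumption that previously uploaded blobs are still present.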
The current implementation of unified file uploading assumes that the cache cluster always retains previously uploaded files. Even if a subsequent compilation task calls FindMissingBlobs and discovers that a file is missing, the upload path still returns the previous upload result recorded in the global casUploaders cache, without actually performing the upload. However, when the cache service inside a particular Kubernetes (k8s) pod crashes and restarts, the local cache inside that pod is cleared, invalidating this assumption. In such cases, we need to deduplicate only the tasks that are currently in the uploading state, rather than deduplicating against all previously uploaded tasks.
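The proposed change, deduplicating only in-flight uploads instead of keeping a permanent "already uploaded" record, can be sketched as follows. This is a simplified illustration, not the SDK's actual casUploaders code; the type and function names are hypothetical. The key difference is that the digest is removed from the map as soon as the upload finishes, so a later FindMissingBlobs miss triggers a real re-upload:

```go
package main

import "sync"

// inflightUploader deduplicates only uploads that are currently in progress.
// Unlike a permanent upload cache, each entry is removed once its upload
// completes, so a digest reported missing later is uploaded again.
type inflightUploader struct {
	mu       sync.Mutex
	inflight map[string]*sync.WaitGroup
}

func newInflightUploader() *inflightUploader {
	return &inflightUploader{inflight: make(map[string]*sync.WaitGroup)}
}

// Upload runs doUpload for digest unless an identical upload is already in
// flight, in which case it waits for that concurrent upload instead.
func (u *inflightUploader) Upload(digest string, doUpload func() error) error {
	u.mu.Lock()
	if wg, ok := u.inflight[digest]; ok {
		u.mu.Unlock()
		wg.Wait() // deduplicate against the concurrent upload only
		return nil
	}
	wg := &sync.WaitGroup{}
	wg.Add(1)
	u.inflight[digest] = wg
	u.mu.Unlock()

	err := doUpload()

	u.mu.Lock()
	delete(u.inflight, digest) // forget the digest: the next request uploads again
	u.mu.Unlock()
	wg.Done()
	return err
}
```

A sequential caller therefore uploads the same digest twice, which is exactly the behavior needed after a bazel-remote restart: the second FindMissingBlobs round reaches the actual upload path instead of hitting a stale dedup entry.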