troubleshooting-rbe: document Missing Digest #7994

Merged 2 commits on Dec 4, 2024
46 changes: 46 additions & 0 deletions docs/troubleshooting-rbe.md
@@ -70,3 +70,49 @@ sudo sysctl -w net.ipv4.tcp_keepalive_probes=5
This will cause the Linux kernel to send keepalive probes earlier and more frequently, before the proxy/gateway in the middle detects and drops the idle connection.

The optimal values may depend on specific network conditions, but try these values as a starting point. Please [contact us](/contact/) if you have any questions / concerns.

## CacheNotFoundException: Missing digest

During remote build execution, Bazel may encounter a `CacheNotFoundException` error with the message `Missing digest`.

```bash
com.google.devtools.build.lib.remote.common.BulkTransferException: 3 errors during bulk transfer:
com.google.devtools.build.lib.remote.common.CacheNotFoundException: Missing digest: d0387e622e30ab61e39b1b91e54ea50f9915789dde7b950fafb0863db4a32ef8/17096
com.google.devtools.build.lib.remote.common.CacheNotFoundException: Missing digest: 9718647251c8d479142d459416079ff5cd9f45031a47aa346d8a6e719e374ffa/28630
com.google.devtools.build.lib.remote.common.CacheNotFoundException: Missing digest: 785e0ead607a37bd9a12179051e6efe53d7fb3eb05cc291e49ad6965ee2b613d/11504
```

This error occurs when Bazel cannot find one or more files in the BuildBuddy Remote Cache that it expects to exist.

To investigate, first copy the hash of a missing blob (the part of the digest before the `/`; the number after it is the blob size in bytes).
Then open the Invocation URL, go to Cache -> "Cache requests", and paste the hash into the Filter input.
The results show whether Bazel tried to upload the blob to the BuildBuddy Remote Cache.
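
When several digests are missing, a small helper can pull the hashes out of the captured Bazel output. The following is a minimal sketch, assuming a POSIX shell with `grep -E` and `sed -E` available; the function name `missing_digest_hashes` is hypothetical and not part of Bazel or BuildBuddy.

```bash
# Print the unique hashes found in "Missing digest: <hash>/<size>" lines.
# Reads the files given as arguments, or stdin when none are given.
# Assumes SHA-256 digests, i.e. 64 lowercase hex characters.
missing_digest_hashes() {
  grep -ohE 'Missing digest: [0-9a-f]{64}/[0-9]+' "$@" \
    | sed -E 's|^Missing digest: ([0-9a-f]{64})/[0-9]+$|\1|' \
    | sort -u
}
```

For example, pipe the build output into it with `bazel build //... 2>&1 | missing_digest_hashes` (where `//...` stands in for your own targets), then paste each printed hash into the Filter input.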

If Bazel attempted to upload the blob and failed, the list should show several retries for the same blob.
The retry behavior can be configured with the `--remote_retries` (default 5) and `--remote_retry_max_delay` (default 5s) flags.
Additionally, `--experimental_collect_system_network_usage` (default true since Bazel 8) collects network usage data on the machine running Bazel.
This data is displayed as a graph in the "Timing" tab of the Invocation page.
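
As a rough illustration, a build on an unreliable connection could be given more room by raising both retry settings on the command line. The values and the `//...` target below are placeholders chosen for this sketch, not recommendations.

```bash
# Illustrative values only; tune them for your own network.
bazel build //... \
  --remote_retries=10 \
  --remote_retry_max_delay=30s \
  --experimental_collect_system_network_usage
```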

If Bazel never attempted to upload the missing blob, the error is caused by a mismatch between Bazel's local state and the BuildBuddy Remote Cache.
In a previous invocation (usually with Build without the Bytes enabled), Bazel recorded in its local state that the blob was already in the Remote Cache.
Since then, the blob was evicted from the BuildBuddy Remote Cache without Bazel's knowledge.

The best solution in this scenario is for Bazel to either re-upload the missing blob or re-execute the action that created it.
In Bazel terminology, this is known as "Action Rewinding".
However, due to the complexity of Bazel's code base, this feature is not yet fully implemented.

The existing solution has two parts:

a. With the `--experimental_remote_cache_lease_extension` and `--experimental_remote_cache_ttl` flags, Bazel keeps track of all the blobs involved in the latest invocation in a side-car thread.
This side-car routinely "pings" the BuildBuddy Remote Cache to let the server know that Bazel is still using these blobs.
Our remote cache server updates the last-used timestamps of these blobs accordingly.

b. With the `--experimental_remote_cache_eviction_retries` flag (default 5), Bazel detects this specific error code, resets its local state, and retries the build.
On each retry, Bazel re-analyzes the repository to determine which blobs are missing and which actions need to be re-executed.
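
The two parts can be combined in a single invocation. This is only a sketch: the TTL and retry count are illustrative values, and `//...` is a placeholder for your own targets.

```bash
# Illustrative values only. //... is a placeholder target.
bazel build //... \
  --experimental_remote_cache_lease_extension \
  --experimental_remote_cache_ttl=3h \
  --experimental_remote_cache_eviction_retries=5
```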

If neither of these approaches works, try running `bazel clean --noasync` to clear the local state manually.
The Bazel JVM should be shut down by the time the clean finishes. You can check your process monitor to verify this.
Then re-run the build with the same flags as before.

While troubleshooting this type of issue, we also recommend disabling the local Disk Cache with `--disk_cache=''` and avoiding any remote cache proxy solutions.
This helps narrow down the root cause, because there is then only one source of cached data to reason about.
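
Put together, the manual recovery steps might look like the following sketch, where `//...` is a placeholder for the targets and flags used in the failing build.

```bash
# Clear Bazel's local state synchronously.
bazel clean --noasync

# Re-run with the same flags as before, plus the local Disk Cache disabled.
# //... is a placeholder; substitute your own targets and flags.
bazel build //... --disk_cache=''
```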