failed: I/O exception during sandboxed execution: No such file or directory #22151

Closed
Ryang20718 opened this issue Apr 26, 2024 · 16 comments
Labels: P2, team-Local-Exec, type: bug

Comments

@Ryang20718
Ryang20718 commented Apr 26, 2024

Description of the bug:

Periodically, the following error occurs when running tests:

 Testing <blah> failed: I/O exception during sandboxed execution: /dev/shm/bazel-sandbox.34e9fe25bb0c6624a8ba8f5a00a18c3243dfc943119e415e09934c77b955441f/linux-sandbox/148/stats.out (No such file or directory)

We're on Bazel 6.5.0 with the linux-sandbox spawn strategy, jobs set 1:1 with vCPUs, and the sandbox mounted at /dev/shm.

Whenever this occurs, we see system memory usage at 82-83% with CPU maxed at 100%.

Which category does this issue belong to?

No response

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

I don't have a reliable repro. It happens sporadically.

Which operating system are you running Bazel on?

ubuntu 20.04

What is the output of bazel info release?

release 6.5.0

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse HEAD ?

No response

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

This has been occurring more frequently since we switched from 6.3.2 to 6.5.0.

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

@iancha1992
Member

@Ryang20718 Could you please provide sample code and complete steps to reproduce this issue?
Also, please try updating your Bazel to one of our latest releases (See https://github.com/bazelbuild/bazel/releases). Thank you.

iancha1992 added the "more data needed" and "team-Local-Exec" labels Apr 26, 2024
@Ryang20718
Author

I don't have a reliable repro; it just happens periodically when running a large number of tests. We're in the process of upgrading to Bazel 7, but still need to upgrade some dependencies to get there.

meisterT added the "P2" label and removed the "untriaged" label Apr 30, 2024
@mattyclarkson
Contributor

To provide another data point: I have hit a similar error message in the rules_git log. That rule does the git checkout in the action, into a declared directory.

It is reproducible when the same cache is used but not consistent across executions. For example, the remote execution passed fine.

That run used Bazel 7.1.1.

@nikhilkalige
Contributor

#20976

@oquenchil
Contributor

Does everyone affected by this NOT have dynamic execution enabled?

@mattyclarkson
Contributor

No dynamic execution was enabled for rules_git.

@Ryang20718
Author

No dynamic execution (this is local execution with a remote cache).

@oquenchil
Contributor

Aha, the remote cache bit is interesting too. @mattyclarkson did you have remote cache enabled too?

@mattyclarkson
Contributor

The "remote" build, which used remote execution and a remote cache, passed.

The "local" build, which ran locally on the GitLab runner instance and used a disk cache (stored in and restored from the GitLab runner S3 bucket), failed.

@Ryang20718
Author

Adding some details here:

We've seen this same error in the following situations:

  1. System memory is close to OOM
  2. System memory is only at 50% usage

Originally I had thought it was a system error, but the second case indicates otherwise. (We also had plenty of inodes and disk space.)

@fmeum
Collaborator

fmeum commented Jun 18, 2024

@oquenchil While I don't understand exactly why this is happening, it looks like linux-sandbox.cc has a number of ways to exit abnormally (e.g. via DIE on syscall failures) that would result in the spawn failing without the stats.out file having been written. What do you think of making the error here recoverable, perhaps by showing or logging a warning?

@sluongng
Contributor

sluongng commented Jun 18, 2024

Agree with Fabian's diagnosis here. The most consistent theme in the reports on Slack was stats.out going missing, most likely because the sandbox process was killed abnormally (via OOM) or ran into a disk issue and failed to write out the stats file.

Currently, on the Java side, we are catching IOException here.
However, if the sandbox subprocess was executed normally and exited with a non-zero code, we will unconditionally look for the execution stats file at the end of this method. This calls into ExecutionStatistics.java, as Fabian linked above, and throws an exception for the file not being available.
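
To make the failure mode described above concrete, here is a minimal sketch. It is not Bazel's actual code: the class `UnconditionalStatsRead` and its methods are hypothetical names chosen for illustration. It assumes only that the caller reads the stats file regardless of how the wrapped process exited, so a missing file surfaces as an I/O error instead of the process's own exit code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical illustration of the failure mode; not Bazel's real implementation. */
final class UnconditionalStatsRead {

  /** Reads the stats file no matter how the wrapped process exited. */
  static byte[] readStats(Path statsFile) throws IOException {
    // Throws NoSuchFileException (an IOException) if the file was never written.
    return Files.readAllBytes(statsFile);
  }

  /** Returns what the caller would end up reporting for the given exit code. */
  static String describeOutcome(int exitCode, Path statsFile) {
    try {
      readStats(statsFile);
      return "exit code " + exitCode;
    } catch (IOException e) {
      // The real exit code is lost; the user only sees the I/O failure.
      return "I/O exception: " + e.getMessage();
    }
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("stats-demo");
    // stats.out is deliberately never created, as after an abnormal wrapper exit.
    System.out.println(describeOutcome(1, dir.resolve("stats.out")));
  }
}
```

In this sketch a non-zero exit combined with a missing stats file is reported as an I/O exception that names the stats file path, which matches the shape of the error message in the original report.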

Since stats collection should be a non-critical feature, it should be done on a best-effort basis. The fix should be for ExecutionStatistics.getResourceUsage() to catch the error, log a warning, and just return Optional.empty().
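
A minimal sketch of the best-effort pattern proposed above, assuming nothing about Bazel's real classes: `BestEffortStats` and `readStats` are hypothetical names, and the raw file bytes stand in for whatever parsed statistics type the real code returns. The only point illustrated is that a missing or unreadable stats file yields a warning and an empty `Optional` rather than a thrown exception.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;
import java.util.logging.Logger;

/** Hypothetical best-effort reader; not the actual Bazel change. */
final class BestEffortStats {
  private static final Logger logger = Logger.getLogger(BestEffortStats.class.getName());

  /** Returns the stats file contents, or empty if the file cannot be read. */
  static Optional<byte[]> readStats(Path statsFile) {
    try {
      return Optional.of(Files.readAllBytes(statsFile));
    } catch (IOException e) {
      // Statistics are non-critical: warn and carry on instead of failing the spawn.
      logger.warning("Execution statistics unavailable: " + e);
      return Optional.empty();
    }
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("stats-demo");
    // The file does not exist, so this logs a warning and prints "false".
    System.out.println(readStats(dir.resolve("stats.out")).isPresent());
  }
}
```

With this shape, the caller can still report the wrapped process's own exit status when the statistics are absent.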

bazel-io pushed a commit to bazel-io/bazel that referenced this issue Jun 18, 2024
This was already the case for "local" spawns. Statistics may be missing if the spawn wrapper exits abnormally.

Fixes bazelbuild#22151.

Closes bazelbuild#22780.

PiperOrigin-RevId: 644378541
Change-Id: Ia3d792f380b78945523f21875c593744b60f0c81
github-merge-queue bot pushed a commit that referenced this issue Jun 19, 2024
#22790)

This was already the case for "local" spawns. Statistics may be missing
if the spawn wrapper exits abnormally.

Fixes #22151.

Closes #22780.

PiperOrigin-RevId: 644378541
Change-Id: Ia3d792f380b78945523f21875c593744b60f0c81

Commit
ec41dd1

Co-authored-by: Fabian Meumertzheim <[email protected]>
github-merge-queue bot pushed a commit that referenced this issue Jun 19, 2024
#22791)

This was already the case for "local" spawns. Statistics may be missing
if the spawn wrapper exits abnormally.

Fixes #22151.

Closes #22780.

PiperOrigin-RevId: 644378541
Change-Id: Ia3d792f380b78945523f21875c593744b60f0c81

Commit
ec41dd1

Co-authored-by: Fabian Meumertzheim <[email protected]>
@iancha1992
Member

A fix for this issue has been included in Bazel 7.2.1 RC2. Please test out the release candidate and report any issues as soon as possible.
If you're using Bazelisk, you can point to the latest RC by setting USE_BAZEL_VERSION=7.2.1rc2. Thanks!

@mattyclarkson
Contributor

I've pinned rules_git and am using the RC in the BCR presubmit in bazelbuild/bazel-central-registry#1868. Seeing no issues.

@mattyclarkson
Contributor

Hit an issue on a CI run of rules_git when I rebased and the CI re-ran:

ERROR: github-mozilla-deepspeech/BUILD.bazel:3:20: Testing //github-mozilla-deepspeech:checkout failed: I/O exception during sandboxed execution: /var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/a1208da49aaa9451b147b4d0696a68a7/execroot/_main/bazel-out/k8-fastbuild/bin/external/_main~_repo_rules~github-mozilla-deepspeech-0.9.3/checkout/tensorflow/native_client/ctcdecode/third_party/openfst-1.6.7/src/include/fst/extensions/pdt (No such file or directory)

@fmeum
Collaborator

fmeum commented Jun 25, 2024

@mattyclarkson That's a different type of bug as it's not about stats.out. Could you file a separate issue for this?
