-
Notifications
You must be signed in to change notification settings - Fork 68
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Copy action #396
base: main
Are you sure you want to change the base?
Copy action #396
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,162 @@ | ||
--- | ||
created: 2024-08-14 | ||
last updated: 2024-11-22 | ||
status: To be reviewed | ||
reviewers: | ||
- ___ | ||
title: Copy Action | ||
authors: | ||
- Silic0nS0ldier | ||
--- | ||
|
||
|
||
# Abstract | ||
|
||
Copying of files and directories is a common need that is currently underserved in Bazel. This proposal seeks to make this a builtin capability to improve performance and simplify rule development. | ||
|
||
# Background | ||
|
||
The following actions (`ctx.actions.*`) already exist; | ||
- `args` - An abstraction for memory-efficient argument management. May produce a file. | ||
- `declare_directory` - Declares a directory output. | ||
- `declare_file` - Declares a file output. | ||
- `declare_symlink` - Declares a symlink output. | ||
- `do_nothing` - Does nothing, only serving as an insertion point for the deprecated 'extra actions' feature. | ||
- `expand_template` - Creates a file using a template. | ||
- `template_dict` - Abstraction for `expand_template` that allows deferring evaluation of values. | ||
- `run` and `run_shell` - Runs an executable/shell script. May produce file(s), director(y/ies) and/or symlink(s). | ||
- `symlink` - Creates a symlink. | ||
- `write` - Creates a file. | ||
|
||
The closest analogue builtin analogue to copying a file is currently using `ctx.actions.symlink`, which depending on the scenario can lead to different | ||
behaviour at runtime (e.g. NodeJS import resolution is affected, and package managers like [pnpm](https://pnpm.io/) (including rulesets like | ||
[Rules JS](https://github.com/aspect-build/rules_js)) rely on such behavioural differences). This makes for a poor subsitutite. | ||
|
||
For a true file or directory copy, one of the following API surfaces must be used; | ||
- `ctx.actions.run` or `ctx.actions.run_shell`. | ||
- `genrule` | ||
|
||
Some utility rulesets include their own implemenations using these API; | ||
- Skylib | ||
- [`copy_directory` rule](https://github.com/bazelbuild/bazel-skylib/blob/5c071b5006bb9799981d04d74a28bdee2f000d4a/docs/copy_directory_doc.md). | ||
- [`copy_file` rule](https://github.com/bazelbuild/bazel-skylib/blob/5c071b5006bb9799981d04d74a28bdee2f000d4a/docs/copy_file_doc.md). | ||
- Aspect's Bazel Helpers Library | ||
- [`copy_directory` rule and `copy_directory_action` action-generating function](https://github.com/aspect-build/bazel-lib/blob/5d09fc1b8352ef276dd4dd873b3dc5b0f5482f19/docs/copy_directory.md). | ||
- [`copy_file` rule and `copy_file_action` action-generating function](https://github.com/aspect-build/bazel-lib/blob/5d09fc1b8352ef276dd4dd873b3dc5b0f5482f19/docs/copy_file.md). | ||
- [`copy_to_bin` rule, `copy_file_to_bin_action` action-generating function and `copy_files_to_bin_actions` action-generating function](https://github.com/aspect-build/bazel-lib/blob/5d09fc1b8352ef276dd4dd873b3dc5b0f5482f19/docs/copy_to_bin.md) | ||
- [`copy_to_directory` rule and `copy_to_directory_bin_action` action-generating function](https://github.com/aspect-build/bazel-lib/blob/5d09fc1b8352ef276dd4dd873b3dc5b0f5482f19/docs/copy_to_directory.md) | ||
|
||
Because all of these implementations rely on a spawn, they inherit associated weaknesses including; | ||
- Overhead of spawn is disproportionately large relative to the task. This is even worse in conjunction with remote execution. | ||
- Increases size (and potentially volume) of merkle trees, increasing CPU and memory requirements of Bazel. | ||
- Attempts to optimise for remote execution by forcing copy-related spawns to run locally suffers from caching issues... | ||
- ...in scenarios where clients are forbidden from uploading `ActionResult` metadata (listed outputs can be arbitrary, representing a security issue). | ||
- ...when copying files generated by the remote as they must be downloaded first. | ||
- Implementation changes can lead to cache misses. | ||
- If copies are batched, odds are in most runs only 1 input will have actually changed (doing more work than necessary, becomes a performance and efficency concern once the batch hits a certain size). | ||
|
||
# Proposal | ||
|
||
Introduce a new `copy` action with the following behaviors; | ||
- Content hash of output is identical to input, only the name is different. | ||
- Realisation of output is deferred, similar to how remote spawn actions are handled. | ||
- Action result is empty, similar to symlink actions. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think what you mean here is that the result of the action isn't uploaded to a remote or disk cache? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I disagree with having an empty action result. I think it's still valuable to have one I think things might break badly if we don't have one to track the output directory life time. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I did a very poor job of describing the requirement here, and this is after having rewritten this several times (while failing to fully recall what I was on about). Thinking on this again now, I think my past self was concerned about how remote/BES APIs would respond to a large number of copy operations (10k+) that are unbatched. That is, it is very easy to process 10k copy actions in a second (or there abouts) so keeping the number of events/calls they produce as a side effect low is important for keeping connections from becoming flooded and remote services healthy. There is definitely a better way to phrase this requirement, I'll need to poke around at the remote/BES API surfaces to fill some knowledge gaps first I think. This may be a simple as batching (not of the actions themselves, more the side effects like BES events and remote interactions). |
||
|
||
The API looks like `actions.copy(in: File, out: File, path: String|None): None`. | ||
- `actions.copy(in, out)` is a simple same-type copy operation. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Does "same type" refer to the artifact type or to the filesystem type? The two don't necessarily match: artifacts of file or directory type can be materialized as symlinks in the filesystem. What should
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I was thinking artifact type when I wrote this, symlinks are something I've yet to figure out. It's very easy to break them, which to be fair is the case for the existing copy rules. The challenge here is that with some symlinks Bazel will actually dereference them and track the targeted artifact. Not sure what behaviour to go with here just yet. I feel like "copy-as-is" would be the least surprising option, keeps it simple.
Would make sense when in the source tree a file has been symlinked specifically so it's content can appear in multiple locations. Such cases would be ideally served by hardlinks (multiple canonical handles, same file) but that's not something version control systems keep track of. Symlinks to a directory ramps up the complexity. They can be used to present the same content in multiple locations, but they also have more novel uses like in Additionally symlinks can point to anything, including other symlinks (e.g. A case could also be made for reading the symlink target then creating the new "copy" with an adjusted target path. Opting to not do anything special for symlinks may be the best option, can always do the special niche operation (e.g. copying the file a symlink points to) in a separate action. |
||
Unanswered question: How are `actions.symlink` inputs handled? Update the target? Error? | ||
- `actions.copy(in, out, path = "foo/bar")` copies a file from the input directory. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. To be clear, I think you're referring to the case where There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. +1 to Tiago comment. I would prefer to see separate definitions for cases such as:
I am also ok if we decide to only support File to File initially. |
||
`path` cannot be used with generic symlinks (`actions.declare_symlink`, source files??? and within tree artifacts support is later added) as Bazel does not track their target. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Using it with source artifacts should be fine (what determines whether Bazel tracks the contents is the artifact type, not the filesystem type, and source artifacts are currently always of There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For symlink source files, I think this is a non-issue. If the symlink points to a file, Bazel will treat it as a file and that can't be passed to I worded it poorly but I believe tree artifacts cannot contain symlinks, and even if they do later they'll be appropriately classified (directory, file, generic symlink) once copied out so I think the last item can be removed there. |
||
|
||
Compared to existing solutions, `actions.copy` should; | ||
- Complete much more quickly, as spawn machinery is not used and no subprocess is used. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. An interesting question is whether the implementation is required to make an actual copy, or if hardlinks would suffice where supported. Do you know whether hardlinks also present problems for the use cases where There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hardlinks are probably fine for outputs, only surprising if output tree checking is turned off and a file is modified. Copy-on-write is also an option (macOS has There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would prefer if we start with an actual copy and add options to use hard link or copy-on-write(CoW) copy separately. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Btw, executable hardlinks have an issue with macOS Gatekeeper killing the wrong executables: |
||
- Allow other pending spawns to start sooner by sparing local resource pools. | ||
- Cut down on remote execution (TBD - fairly sure this shouldn't have an action metadata record) and BES (TBD - need to check protocols and get some traffic samples) traffic during batches of copy operations. | ||
|
||
## Example Usage | ||
|
||
### Skylib's [`copy_directory`](https://github.com/bazelbuild/bazel-skylib/blob/5c071b5006bb9799981d04d74a28bdee2f000d4a/rules/private/copy_directory_private.bzl) | ||
|
||
```starlark | ||
def _copy_directory_impl(ctx): | ||
dst = ctx.actions.declare_directory(ctx.attr.out) | ||
ctx.actions.copy(ctx.attr.src, dst) | ||
Comment on lines
+82
to
+83
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In this example, the desired output here would be a Directory which should be uploaded to remote cache CAS so that subsequent downstream actions could consume it. If the copy is large (i.e. Dir to Dir recursively), having an ActionResult would also save other Bazel clients from having to do the work. I also think that if |
||
|
||
files = depset(direct = [dst]) | ||
runfiles = ctx.runfiles(files = [dst]) | ||
|
||
return [DefaultInfo(files = files, runfiles = runfiles)] | ||
|
||
copy_directory = rule( | ||
implementation = _copy_directory_impl, | ||
provides = [DefaultInfo], | ||
attrs = { | ||
"src": attr.label(mandatory = True, allow_single_file = True), | ||
# Cannot declare out as an output here, because there's no API for declaring | ||
# TreeArtifact outputs. | ||
"out": attr.string(mandatory = True), | ||
}, | ||
) | ||
``` | ||
|
||
Compared to the canonical implementation; | ||
- Implementation is significantly smaller and drops a (fairly light) import. | ||
- Spawn action branches replaced by `copy` action. | ||
- `is_windows` attribute to pick spawn action branch is gone. | ||
- Macro to automatically set `is_windows` is gone. | ||
|
||
### Aspect's Bazel Helpers Library's [`copy_to_bin`](https://github.com/bazel-contrib/bazel-lib/blob/5d09fc1b8352ef276dd4dd873b3dc5b0f5482f19/lib/private/copy_file.bzl) | ||
|
||
```starlark | ||
def copy_file_to_bin_action(ctx, file): | ||
if not file.is_source: | ||
return file | ||
if ctx.label.workspace_name != file.owner.workspace_name: | ||
fail(_file_in_external_repo_error_msg(file)) | ||
if ctx.label.package != file.owner.package: | ||
fail(_file_in_different_package_error_msg(file, ctx.label)) | ||
|
||
if file.path.startswith("bazel-"): | ||
first = file.path.split("/")[0] | ||
suffix = first[len("bazel-"):] | ||
if suffix in ["testlogs", "bin", "out"]: | ||
print("__WARNING__") | ||
|
||
dst = ctx.actions.declare_file(file.basename, sibling = file) | ||
crx.actions.copy(file, dst) | ||
return dst | ||
|
||
def _file_in_external_repo_error_msg(file): | ||
return "__ERROR__" | ||
|
||
def _file_in_different_package_error_msg(file, curr_package_label): | ||
return "__ERROR__" | ||
|
||
def copy_files_to_bin_actions(ctx, files): | ||
return [copy_file_to_bin_action(ctx, file) for file in files] | ||
|
||
def _copy_to_bin_impl(ctx): | ||
files = copy_files_to_bin_actions(ctx, ctx.files.srcs) | ||
return DefaultInfo( | ||
files = depset(files), | ||
runfiles = ctx.runfiles(files = files), | ||
) | ||
|
||
copy_to_bin = rule( | ||
implementation = _copy_to_bin_impl, | ||
provides = [DefaultInfo], | ||
attrs = { | ||
"srcs": attr.label_list(mandatory = True, allow_files = True), | ||
}, | ||
) | ||
``` | ||
|
||
Compared to the canonical implementation; | ||
- Implementation remains about the same (common logic is in a separate file). | ||
- Common logic can be significantly simplified or removed in favor of direct `actions.copy` usage. | ||
- No toolchain resolution is necessary. | ||
- N/A for the macro since the the canonical implementation just fowards to the rule. | ||
|
||
# Backward-compatibility | ||
|
||
This proposal won't impact backward compatibility, although feature detection should be considered so that existing utility rulesets can optionally use the newer (and more efficent) API without needing to wait for supported Bazel versions to age out of the support matrix. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implementation-wise this is probably the most challenging/contentious part, because (1) the "build without the bytes" machinery currently only allows delayed materialization of artifacts that already exist remotely, and (2) architecturally we don't have a way for an action to say "copy metadata from input to output", only "collect output metadata from the filesystem".
(I reworded this from an earlier version to state the difficulties more clearly. These are not insurmountable problems, but they'd be responsible for most of the implementation complexity, so we should be clear that accepting this proposal would require to take them on.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm under no illusions as to the complexity this particular requirement brings. Unfortunately it is also necessary to avoid a potential performance regression (vs just throwing a bunch of concurrent copy spawns at the remote).
For context, internally we've patched
skylib
copy rules that are configured out of the box to run locally, so that they can run remotely. This is for performance primarily (cutting down on uploads/downloads from the remote), but also correctness in general (remote is Linux while host is either Linux or macOS, will be less of an issue if/when my other proposal's implementation gets in).