Define the cache key for v1 multi-output recipes #102
base: main
Thank you for writing this up, excited to see this moving forward! 🙏 😊
> Sometimes it is very useful to build some code once, and then split it into multiple build artifacts (such as a shared library, header files, etc.). For this reason, `conda-build` has a special, implicit top-level build.
>
> There are many downsides to the behavior of `conda-build`: it's very implicit, hard to understand and hard to debug (for example, if an output is defined with the same name as the top-level recipe, this output will get the same requirements attached as the top-level).
Not just that, tests under such an output will be silently skipped, which regularly trips people up: conda/conda-build#4172
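For illustration, a minimal conda-build `meta.yaml` sketch of this pitfall (the package name `foo` and its test command are made up): because the output shares its name with the top-level recipe, it inherits the top-level requirements, and its test section is silently skipped.

```yaml
package:
  name: foo
  version: "1.0"

requirements:
  build:
    - {{ compiler('c') }}   # implicitly attached to the output `foo` below

outputs:
  # same name as the top-level recipe: this output inherits the top-level
  # requirements, and (per conda/conda-build#4172) its tests are silently skipped
  - name: foo
    test:
      commands:
        - foo --version
```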
> When the cache build is done, the newly created files are moved outside of the `host-prefix`. Post-processing is not performed on the files beyond memoizing which files contain the `$PREFIX` (which is later replaced in binaries and text files with the actual build prefix with a simple 1-to-1 replacement, as the prefix lengths are guaranteed to be equal).
>
> The cache _only_ restores files that were added to the prefix (conda-build also restored source files). Sources are re-created from scratch and no intermediate state is kept.
That's actually a pretty serious constraint? Not being able to use a hot cache would kill certain recipes which need to run something on top of the `cache:` build - a classic example is building a library in the global stage, and the python bindings in the outputs (where the python bindings might otherwise rebuild the lib if not already present).

Another recent example would not work (though we decided against doing that for an unrelated reason, i.e. that installing clobbered files is not guaranteed to follow the dependency relationship of outputs).

I have several other feedstocks that are installing from the intermediate build in all of the outputs, and it's unmaintainable to move these to simple `files:` directives under the current circumstances.
So my vision for this was to have the build folder of cmake / meson etc. in a common place, and we could add a new build-time variable that would be constant (e.g. `$SCRATCH`).

Then your cmake invocation could look like:

```bash
cd $SCRATCH
cmake $SRC_DIR -GNinja -DCMAKE_INSTALL_PREFIX=$PREFIX
ninja install
```

The `$SCRATCH` dir could be on `/tmp` or in the work dir, but just not part of the `$SRC_DIR`, to keep it pristine.
I tried this in this repo and ran into an issue: we create independent build folders per output that have different names, and CMake detects that and then complains, when invoked with different options, that the source dir doesn't match anymore:

```
+ cmake $SRC_DIR -GNinja -DCMAKE_INSTALL_PREFIX=$PREFIX -DBUILD_PYTHON_BINDINGS=ON
CMake Error: The source "$SRC_DIR/CMakeLists.txt" does not match the source "/Users/wolfv/Programs/calculator/output/bld/rattler-build_libcalculator_1732782319/work/CMakeLists.txt" used to generate cache. Re-run cmake with a different source directory.
```
We have a few options:

- Keep the same path for the "work dir" for all outputs (this will prevent concurrent builds in the future). This does not need to imply that we restore the (dirty) work dir from the cache.
- We could also restore the "dirty" work dir from the cache. For that, we would need to keep a pristine copy of the work dir at cache time, because every output might modify or add files to the work dir. In this case, we would have to prevent every output from restoring the global sources, and instead only restore the dirty work dir, which already contains the global sources. We would also have to keep the path to the work dir the same.

We could still have parallel builds or actions if the outputs are merely splitting the existing cache files (we can detect whether they have a build script or not).
All that sounds reasonable to me. I think I'd prefer option 2, but I don't understand why

> we would have to prevent every output from restoring the global sources, and instead only restore the dirty work dir, which already contains the global sources.

is necessary. Some output may genuinely need to reconfigure something that needs the sources, even though it wants to use a hot cache of the dirty work dir.
This is unexpected. The cache only caches files that are installed to `$PREFIX`? If you want to cache build artifacts, then you need to store them in `$SCRATCH`?

What about projects that do build-time source-code generation in-place? For example, Cython projects which convert `.pyx` files into `.c` files, or projects like magma which at build time duplicate the entire source from the real float implementation into double and complex types (they don't use C++ templates). These generated sources will need to be regenerated if the outputs need to reuse them (e.g. if you are trying to reuse the cache to amortize work across builds for multiple Python versions)?
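To make the Cython case concrete, here is a hypothetical workaround under the proposed behaviour: only files installed into `$PREFIX` survive into the outputs, so generated sources would have to be stashed somewhere outside the work dir (the `/tmp/generated` path, the `foo.pyx` file, and the placement of `cython` in `build` requirements are all assumptions for illustration).

```yaml
cache:
  requirements:
    build:
      - ${{ compiler('c') }}
      - cython
  build:
    script:
      # in-place code generation: foo.pyx -> foo.c inside the work dir
      - cython foo.pyx -o foo.c
      # the work dir is not restored for the outputs, so stash the generated
      # sources somewhere that survives between outputs (path is just an example)
      - mkdir -p /tmp/generated
      - cp foo.c /tmp/generated/
```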
> In order to keep state from the build around in the outputs, one could exercise the build in the `$PREFIX` and ignore the files in the outputs, or the build folder could be placed in the `/tmp` folder:
>
> ```bash
> cmake -B /tmp/mybuild -S .
> cmake --build /tmp/mybuild
> ```
So `/tmp` is special-cased in that it's not touched by any of the other build logic?
Nah just any random folder outside the work dir would do.
> Since this is not extremely convenient, we could discuss adding a "scratchpad" outside the `src` dir to rattler-build that will be kept between outputs and removed at the end of the whole build.
To double-check, this would save `$SRC_DIR` to the scratchpad in between operations, and restore it for the installation of any output that needs it?

I would go as far as saying that `cache:` implies this behaviour. I don't want to cache stuff in `$PREFIX` (then I could just install it and call that an output), I want to cache it without installation.
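A rough sketch of how this could look for a recipe author, assuming a hypothetical `$SCRATCH` variable that points to the same persistent directory during the cache build and during every output build (the package name `py-foo` and the CMake flags are illustrative only):

```yaml
cache:
  build:
    script:
      # configure and build out-of-tree in the scratchpad, install into $PREFIX
      - cmake -S $SRC_DIR -B $SCRATCH/build -GNinja -DCMAKE_INSTALL_PREFIX=$PREFIX
      - cmake --build $SCRATCH/build --target install

outputs:
  - package:
      name: py-foo            # hypothetical bindings package
    build:
      script:
        # reconfigure on top of the hot CMake build tree instead of rebuilding the library
        - cmake -S $SRC_DIR -B $SCRATCH/build -DBUILD_PYTHON_BINDINGS=ON
        - cmake --build $SCRATCH/build --target install
```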
> ```yaml
>     - ${{ pin_subpackage("foo-headers") }}
> ```
>
> Special care must be taken for `run-exports`. For example, the `${{ compiler('c') }}` package used to build the cache is going to have run-exports that need to be present at runtime for the package. To compute run-exports for the outputs, we use the union of the requirements - so virtually, the host & build dependencies of the cache are injected into the outputs and will attach their run exports to each output.
I'm in favour of this, just noting that this diverges from conda-build, which currently does not inject REs from the global build-step to the outputs.
I am not in favor of this because it is not an explicit behavior, and not every output will need the run exports from the cache. For example, if your devel package doesn't have any binaries, it doesn't need to depend on any compiler package exports. If you split your binaries into multiple outputs, each output may only depend on a part of what was used in `cache.requirements.host`.

I suspect that this design decision (implicit inheritance of run exports from the cache) will lead to more overdepending warnings (which are ignored by most maintainers). In contrast, if we retain the current behavior, which is that each output must explicitly enumerate its dependencies, there will be overlinking errors (which are errors that stop the build) when dependencies are missing.
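For comparison, a minimal sketch (package names made up) of the explicit style described here: a headers-only output lists no compiler and therefore picks up no compiler run-exports, while the binary output explicitly repeats the compiler it actually needs.

```yaml
cache:
  requirements:
    build:
      - ${{ compiler('c') }}   # carries run-exports relevant only to binary outputs

outputs:
  # header-only output: no binaries, so under the explicit model it lists no
  # compiler and gets no compiler run-exports
  - package:
      name: foo-headers
    build:
      files:
        - include/**

  # binary output: explicitly repeats the compiler, so its run-exports apply
  - package:
      name: libfoo
    requirements:
      build:
        - ${{ compiler('c') }}
    build:
      files:
        - lib/**
```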
> Since this is not extremely convenient, we could discuss adding a "scratchpad" outside the `src` dir to rattler-build that will be kept between outputs and removed at the end of the whole build.
>
> The new files can then be used in the outputs with the `build.files` key:
How do I denote a dependence of a given output on the `cache:` phase? For example, we had a bunch of discussion in pytorch over conda-forge/pytorch-cpu-feedstock@981a04d (later reverted in conda-forge/pytorch-cpu-feedstock@54fa422), because it is undocumented under what conditions a build will "fan out" (e.g. build multiple python versions on top of the same library, rather than generate a job per python version).

In other words, we need to have some clarity whether an output like

```yaml
cache:
  requirements:
    build:
      - ${{ compiler('c') }}
    host:
      - # no python here

outputs:
  - package:
      name: foo
    requirements:
      host:
        - # depends on cache implicitly?
        - python
```

together with a CBC of

```yaml
python:
  - 3.9
  - 3.10
  - 3.11
  - 3.12
```

will actually "fold" the python dimension into `foo` or not.

In conda-build, all outputs implicitly depend on the global build stage; perhaps we should make this explicit? `build.files` already does this partway (for outputs that only use `files:`). On the other hand, it might be more intuitive to keep the implication that `cache:` affects everything else (see the discussion about `skip:` below).
In the example repo you can try: https://github.com/wolfv/rattler-build-cache-test

It creates one `libcalculator` and two `py-calculator` packages based on the `variants.yaml`.
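For reference, a `variants.yaml` along these lines (the exact Python versions are an assumption) is what drives that fan-out, i.e. one `libcalculator` build plus one `py-calculator` build per listed Python version:

```yaml
python:
  - "3.11"
  - "3.12"
```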
Also, right now, if there is a cache, all outputs are affected.
> ```yaml
> build:
>   # note: in the current implementation, all other keys for build
>   # are also valid, but unused
>   script: build.sh
> ```
One thing that's been a permanent annoyance with the way things have grown in conda-build is that a "global" `build.skip: ...` does not mean the entire build will get cancelled, because - in theory - the various outputs could do their own thing that doesn't rely on the package-level build. So in order to actually skip a given combination, it needs to be skipped on all outputs (which quickly gets annoying in large multi-output recipes, especially for quick debugging).

If we go with the "all outputs implicitly depend on `cache:`" approach, then I think it would be good to specify that a `skip: true` on the `cache:` level completely skips the build, without having to repeat it per output.
Hmm, I don't think the cache should have anything to do with skipping.

The top-level can skip fine (which is intersected with the outputs). I.e. this would skip everything on win:

```yaml
build:
  skip: [win]

outputs:
  - ...
```
@h-vetinari I played around with also caching the "source" in prefix-dev/rattler-build#1226 and the example in https://github.com/wolfv/rattler-build-cache-test. I now think that the cache should be able to define its own `source`. The example would then look like:
```yaml
recipe:
  name: calculator
  version: 1.0.0

cache:
  source:
    path: ../
  requirements:
    build:
      - ${{ compiler('cxx') }}
      - cmake
      - ninja
  build:
    script:
      # make sure that `alternative_name.md` is not present
      - test ! -f ./alternative_name.md
      - mkdir build
      - cd build
      - cmake $SRC_DIR -GNinja ${CMAKE_ARGS}
      - ninja install

outputs:
  # this first output will include all files installed during the cache build
  - package:
      name: libcalculator
    requirements:
      run_exports:
        - ${{ pin_subpackage('libcalculator') }}

  - package:
      name: py-calculator
    source:
      - path: ../README.md
        file_name: alternative_name.md
    requirements:
      build:
        - ${{ compiler('cxx') }}
        - cmake
        - ninja
      host:
        - pybind11
        - python
        - libcalculator
    build:
      script:
        # assert that the README.md file is present
        - test -f ./alternative_name.md
        - cd build
        - cmake $SRC_DIR -GNinja ${CMAKE_ARGS} -DBUILD_PYTHON_BINDINGS=ON
        - ninja install
```
That's so close to the idea of

```yaml
outputs:
  - cache:
      name: optional, only needed if there are several caches
  - package:
      name: my-output
    build:
      needs-cache: bool  # or cache.name
```
Yeah it could be done, but I think we can let it wait for v2 of the recipe format! :)
That sounds fine in principle. Could you explain a bit more what the thinking behind that is though? If there's already a
That would sound even better! More powerful and yet less divergence. (No opinion on whether this is v1 or v2 stuff.)

However, this I don't like:

```yaml
outputs:
  # this first output will include all files installed during the cache build
  - package:
      name: libcalculator
```

I think this should never be automatic, and always need an explicit opt-in. Either the output says
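For concreteness, one purely hypothetical shape such an explicit opt-in could take (neither the catch-all `"**"` convention nor the `needs-cache` key is part of the current spec; `needs-cache` was only floated above):

```yaml
outputs:
  - package:
      name: libcalculator
    build:
      # explicit opt-in instead of automatically receiving all cache files;
      # both spellings below are illustrative only
      files:
        - "**"              # take everything the cache installed
      # needs-cache: true   # or an explicit flag, as floated above
```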
@carterbox, please review
> ```yaml
> # run: ...
> # run_constraints: ...
> # ignore_run_exports: ...
> ```
It doesn't make sense that sections besides `cache.build.script`, `cache.requirements.build` and `cache.requirements.host` are allowed, because this is a cache that would never be installed to an end-user environment. The runtime dependencies don't need to be resolved for the cache because all of the relevant dependencies should be explicitly repeated in the actual output packages.
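Under that view, the cache section would be limited to roughly this shape (a sketch, not the final spec; the dependency names are placeholders):

```yaml
cache:
  requirements:
    build:
      - ${{ compiler('c') }}
      - cmake
    host:
      - zlib               # placeholder host dependency
  build:
    script: build.sh
  # no run, run_constraints or ignore_run_exports: the cache is never
  # installed into an end-user environment
```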
> <details>
> <summary>script build.sh default discussion</summary>
> We had some debate whether the cache output should _also_ default to `build.sh` or should not have any default value for the `script`. This is still undecided.
> </details>
In a multi-output recipe, do the outputs default to `build.sh`?
> ```yaml
>       name: libfoo
>     build:
>       files:
>         - lib/**
> ```
Are you going to support negative globs in `outputs.build.files` selection? For example:

```yaml
- package:
    name: foo-dev
  build:
    files:
      include:
        - lib/**
      exclude:
        - lib/**.so.*

- package:
    name: libfoo${{ soname }}
  build:
    files:
      include:
        - lib/**.so.*
```

This would be used to separate runtime libraries from development artifacts, because only versioned shared objects are needed at runtime.

More recently, I've also been working on recipes that separate optional plugins into their own output. These are also typically installed to `lib/`.
This starts the definition for the new `cache` output in v1 recipes. One can try the cache with `--experimental` in `rattler-build`. The documentation can be found here: https://prefix-dev.github.io/rattler-build/latest/multiple_output_cache/