Define the cache key for v1 multi-output recipes #102
base: main
Thank you for writing this up, excited to see this moving forward! 🙏 😊
> Sometimes it is very useful to build some code once, and then split it into multiple build artifacts (such as a shared library, header files, etc.). For this reason, `conda-build` has a special, implicit top-level build.
>
> There are many downsides to the behavior of `conda-build`: it's very implicit, hard to understand and hard to debug (for example, if an output is defined with the same name as the top-level recipe, this output will get the same requirements attached as the top-level).
Not just that, tests under such an output will be silently skipped, which regularly trips people up: conda/conda-build#4172
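For illustration, a minimal conda-build `meta.yaml` sketch of this pitfall (the package name `foo` and its test command are made up): because the output shares its name with the top-level recipe, it inherits the top-level requirements, and its test section is silently skipped.

```yaml
package:
  name: foo
  version: "1.0"

requirements:
  build:
    - {{ compiler('c') }}   # implicitly attached to the output `foo` below

outputs:
  # same name as the top-level recipe: this output inherits the top-level
  # requirements, and (per conda/conda-build#4172) its tests are silently skipped
  - name: foo
    test:
      commands:
        - foo --version
```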
> When the cache build is done, the newly created files are moved outside of the `host-prefix`. Post-processing is not performed on the files beyond memoizing which files contain the `$PREFIX` (which is later replaced in binaries and text files with the actual build prefix with a simple 1-to-1 replacement, as the prefix lengths are guaranteed to be equal).
>
> The cache _only_ restores files that were added to the prefix (conda-build also restored source files). Sources are re-created from scratch and no intermediate state is kept.
That's actually a pretty serious constraint? Not being able to use a hot cache would kill certain recipes which need to run something on top of the `cache:` build - a classic example is building a library in the global stage, and the python bindings in the outputs (where the python bindings might otherwise rebuild the lib if not already present).

Another recent example would not work (though we decided against doing that for an unrelated reason, i.e. that installing clobbered files is not guaranteed to follow the dependency relationship of outputs).

I have several other feedstocks that are installing from the intermediate build in all of the outputs, and it's unmaintainable to move these to simple `files:` directives under the current circumstances.
So my vision for this was to have the build folder of cmake / meson etc. in a common place, and we could add a new build-time variable that would be constant (e.g. `$SCRATCH`).

Then your cmake invocation could look like:

```bash
cd $SCRATCH
cmake $SRC_DIR -GNinja -DCMAKE_INSTALL_PREFIX=$PREFIX
ninja install
```

The `$SCRATCH` dir could be on `/tmp` or in the work dir, but just not part of the `$SRC_DIR`, to keep it pristine.
I tried this in this repo and ran into an issue: we create independent build folders per output that have different names, and CMake detects that and then complains, when invoked with different options, that the source dir doesn't match anymore:

```
+ cmake $SRC_DIR -GNinja -DCMAKE_INSTALL_PREFIX=$PREFIX -DBUILD_PYTHON_BINDINGS=ON
CMake Error: The source "$SRC_DIR/CMakeLists.txt" does not match the source "/Users/wolfv/Programs/calculator/output/bld/rattler-build_libcalculator_1732782319/work/CMakeLists.txt" used to generate cache. Re-run cmake with a different source directory.
```
We have a few options:

- Keep the same path for the "work dir" for all outputs (this will prevent concurrent builds in the future). This does not need to imply that we restore the (dirty) work dir from the cache.
- We could also restore the "dirty" work dir from the cache. For that, we would need to keep a pristine copy of the work dir at cache time, because every output might modify or add files to the work dir. In this case, we would have to prevent every output from restoring the global sources, and instead only restore the dirty work dir, which already contains the global sources. We would also have to keep the path to the work dir the same.

We could still have parallel builds or actions if the outputs are merely splitting the existing cache files (we can detect whether they have a build script or not).
All that sounds reasonable to me. I think I'd prefer option 2, but I don't understand why

> we would have to prevent every output from restoring the global sources, and instead only restore the dirty work dir, which already contains the global sources.

is necessary. Some output may genuinely need to reconfigure something that needs the sources, even though it wants to use a hot cache of the dirty work dir.
This is unexpected. The cache only caches files that are installed to `$PREFIX`? If you want to cache build artifacts, then you need to store them in `$SCRATCH`?

What about projects that do build-time source-code generation in-place? For example, Cython projects which convert `.pyx` files into `.c` files, or projects like magma which at build time duplicate the entire source from the real float implementation into double and complex types (they don't use C++ templates). These generated sources will need to be regenerated if the outputs need to reuse them (e.g. if you are trying to reuse the cache to amortize work across builds for multiple Python versions)?
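To make the Cython case concrete, here is a hypothetical workaround under the proposed behaviour: only files installed into `$PREFIX` survive into the outputs, so generated sources would have to be stashed somewhere outside the work dir (the `/tmp/generated` path, the `foo.pyx` file, and the placement of `cython` in `build` requirements are all assumptions for illustration).

```yaml
cache:
  requirements:
    build:
      - ${{ compiler('c') }}
      - cython
  build:
    script:
      # in-place code generation: foo.pyx -> foo.c inside the work dir
      - cython foo.pyx -o foo.c
      # the work dir is not restored for the outputs, so stash the generated
      # sources somewhere that survives between outputs (path is just an example)
      - mkdir -p /tmp/generated
      - cp foo.c /tmp/generated/
```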
> In order to keep state from the build around in the outputs, one could exercise the build in the `$PREFIX` and ignore the files in the outputs, or the build folder could be placed in the `/tmp` folder:
>
> ```bash
> cmake -B /tmp/mybuild -S .
> cmake --build /tmp/mybuild
> ```
So `/tmp` is special-cased in that it's not touched by any of the other build logic?
Nah just any random folder outside the work dir would do.
> Since this is not extremely convenient, we could discuss adding a "scratchpad" outside the `src` dir to rattler-build that will be kept between outputs and removed at the end of the whole build.
To double-check, this would save `$SRC_DIR` to the scratchpad in between operations, and restore it for the installation of any output that needs it?

I would go as far as saying that `cache:` implies this behaviour. I don't want to cache stuff in `$PREFIX` (then I could just install it and call that an output), I want to cache it without installation.
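A rough sketch of how this could look for a recipe author, assuming a hypothetical `$SCRATCH` variable that points to the same persistent directory during the cache build and during every output build (the package name `py-foo` and the CMake flags are illustrative only):

```yaml
cache:
  build:
    script:
      # configure and build out-of-tree in the scratchpad, install into $PREFIX
      - cmake -S $SRC_DIR -B $SCRATCH/build -GNinja -DCMAKE_INSTALL_PREFIX=$PREFIX
      - cmake --build $SCRATCH/build --target install

outputs:
  - package:
      name: py-foo            # hypothetical bindings package
    build:
      script:
        # reconfigure on top of the hot CMake build tree instead of rebuilding the library
        - cmake -S $SRC_DIR -B $SCRATCH/build -DBUILD_PYTHON_BINDINGS=ON
        - cmake --build $SCRATCH/build --target install
```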
> ```yaml
>     - ${{ pin_subpackage("foo-headers") }}
> ```
>
> Special care must be taken for `run-exports`. For example, the `${{ compiler('c') }}` package used to build the cache is going to have run-exports that need to be present at runtime for the package. To compute run-exports for the outputs, we use the union of the requirements - so virtually, the host & build dependencies of the cache are injected into the outputs and will attach their run exports to each output.
I'm in favour of this, just noting that this diverges from conda-build, which currently does not inject REs from the global build-step to the outputs.
I am not in favor of this because it is not an explicit behavior, and not every output will need the run exports from the cache. For example, if your devel package doesn't have any binaries, it doesn't need to depend on any compiler package exports. If you split your binaries into multiple outputs, each output may only depend on a part of what was used in `cache.requirements.host`.

I suspect that this design decision (implicit inheritance of run exports from the cache) will lead to more overdepending warnings (which are ignored by most maintainers). In contrast, if we retain the current behavior, which is that each output must explicitly enumerate its dependencies, there will be overlinking errors (which are errors that stop the build) when dependencies are missing.
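For comparison, a minimal sketch (package names made up) of the explicit style described here: a headers-only output lists no compiler and therefore picks up no compiler run-exports, while the binary output explicitly repeats the compiler it actually needs.

```yaml
cache:
  requirements:
    build:
      - ${{ compiler('c') }}   # carries run-exports relevant only to binary outputs

outputs:
  # header-only output: no binaries, so under the explicit model it lists no
  # compiler and gets no compiler run-exports
  - package:
      name: foo-headers
    build:
      files:
        - include/**

  # binary output: explicitly repeats the compiler, so its run-exports apply
  - package:
      name: libfoo
    requirements:
      build:
        - ${{ compiler('c') }}
    build:
      files:
        - lib/**
```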
> Since this is not extremely convenient, we could discuss adding a "scratchpad" outside the `src` dir to rattler-build that will be kept between outputs and removed at the end of the whole build.
>
> The new files can then be used in the outputs with the `build.files` key:
How do I denote a dependence of a given output on the `cache:` phase? For example, we had a bunch of discussion in pytorch over conda-forge/pytorch-cpu-feedstock@981a04d (later reverted in conda-forge/pytorch-cpu-feedstock@54fa422), because it is undocumented under what conditions a build will "fan out" (e.g. build multiple python versions on top of the same library, rather than generate a job per python version).

In other words, we need to have some clarity whether an output like

```yaml
cache:
  requirements:
    build:
      - ${{ compiler('c') }}
    host:
      - # no python here

outputs:
  - package:
      name: foo
    requirements:
      host:
        - # depends on cache implicitly?
        - python
```

together with a CBC of

```yaml
python:
  - 3.9
  - 3.10
  - 3.11
  - 3.12
```

will actually "fold" the python dimension into `foo` or not.

In conda-build, all outputs implicitly depend on the global build stage; perhaps we should make this explicit? `build.files` already does this partway (for outputs that only use `files:`). On the other hand, it might be more intuitive to keep the implication that `cache:` affects everything else (see the discussion about `skip:` below).
In the example repo you can try: https://github.com/wolfv/rattler-build-cache-test

It creates one `libcalculator` and two `py-calculator` packages based on the `variants.yaml`.
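For reference, a `variants.yaml` along these lines (the exact Python versions are an assumption) is what drives that fan-out, i.e. one `libcalculator` build plus one `py-calculator` build per listed Python version:

```yaml
python:
  - "3.11"
  - "3.12"
```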
Also, right now, if there is a cache, all outputs are affected.
> ```yaml
> build:
>   # note: in the current implementation, all other keys for build
>   # are also valid, but unused
>   script: build.sh
> ```
One thing that's been a permanent annoyance with the way things have grown in conda-build is that a "global" `build.skip: ...` does not mean the entire build will get cancelled, because - in theory - the various outputs could do their own thing that doesn't rely on the package-level build. So in order to actually skip a given combination, it needs to be skipped on all outputs (which quickly gets annoying in large multi-output recipes, especially for quick debugging).

If we go with the "all outputs implicitly depend on `cache:`" approach, then I think it would be good to specify that a `skip: true` on the `cache:` level completely skips the build, without having to repeat it per output.
Hmm, I don't think the cache should have anything to do with skipping.

The top-level can skip fine (which is intersected with the outputs). I.e. this would skip everything on win:

```yaml
build:
  skip: [win]

outputs:
  - ...
```
@h-vetinari I played around with also caching the "source" in prefix-dev/rattler-build#1226 and the example in https://github.com/wolfv/rattler-build-cache-test. I now think that the cache should be able to define its own `source`. The example would then look like:
```yaml
recipe:
  name: calculator
  version: 1.0.0

cache:
  source:
    path: ../
  requirements:
    build:
      - ${{ compiler('cxx') }}
      - cmake
      - ninja
  build:
    script:
      # make sure that `alternative_name.md` is not present
      - test ! -f ./alternative_name.md
      - mkdir build
      - cd build
      - cmake $SRC_DIR -GNinja ${CMAKE_ARGS}
      - ninja install

outputs:
  # this first output will include all files installed during the cache build
  - package:
      name: libcalculator
    requirements:
      run_exports:
        - ${{ pin_subpackage('libcalculator') }}

  - package:
      name: py-calculator
    source:
      - path: ../README.md
        file_name: alternative_name.md
    requirements:
      build:
        - ${{ compiler('cxx') }}
        - cmake
        - ninja
      host:
        - pybind11
        - python
        - libcalculator
    build:
      script:
        # assert that the README.md file is present
        - test -f ./alternative_name.md
        - cd build
        - cmake $SRC_DIR -GNinja ${CMAKE_ARGS} -DBUILD_PYTHON_BINDINGS=ON
        - ninja install
```
That's so close to the idea of

```yaml
outputs:
  - cache:
      name: optional, only needed if there are several caches
  - package:
      name: my-output
    build:
      needs-cache: bool  # or cache.name
```
Yeah it could be done, but I think we can let it wait for v2 of the recipe format! :)
That sounds fine in principle. Could you explain a bit more what the thinking behind that is though? If there's already a
That would sound even better! More powerful and yet less divergence. (No opinion on whether this is v1 or v2 stuff.)

However, this I don't like:

```yaml
outputs:
  # this first output will include all files installed during the cache build
  - package:
      name: libcalculator
```

I think this should never be automatic, and always need an explicit opt-in. Either the output says
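For concreteness, one purely hypothetical shape such an explicit opt-in could take (neither the catch-all `"**"` convention nor the `needs-cache` key is part of the current spec; `needs-cache` was only floated above):

```yaml
outputs:
  - package:
      name: libcalculator
    build:
      # explicit opt-in instead of automatically receiving all cache files;
      # both spellings below are illustrative only
      files:
        - "**"              # take everything the cache installed
      # needs-cache: true   # or an explicit flag, as floated above
```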
@carterbox, please review
> ```yaml
> # run: ...
> # run_constraints: ...
> # ignore_run_exports: ...
> ```
It doesn't make sense that sections besides `cache.build.script`, `cache.requirements.build` and `cache.requirements.host` are allowed, because this is a cache that would never be installed to an end-user environment. The runtime dependencies don't need to be resolved for the cache because all of the relevant dependencies should be explicitly repeated in the actual output packages.
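Under that view, the cache section would be limited to roughly this shape (a sketch, not the final spec; the dependency names are placeholders):

```yaml
cache:
  requirements:
    build:
      - ${{ compiler('c') }}
      - cmake
    host:
      - zlib               # placeholder host dependency
  build:
    script: build.sh
  # no run, run_constraints or ignore_run_exports: the cache is never
  # installed into an end-user environment
```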
> <details>
> <summary>script build.sh default discussion</summary>
> We had some debate whether the cache output should _also_ default to `build.sh` or should not have any default value for the `script`. This is still undecided.
> </details>
In a multi-output recipe, do the outputs default to `build.sh`?
> ```yaml
>       name: libfoo
>     build:
>       files:
>         - lib/**
> ```
Are you going to support negative globs in `outputs.build.files` selection? For example:

```yaml
- package:
    name: foo-dev
  build:
    files:
      include:
        - lib/**
      exclude:
        - lib/**.so.*

- package:
    name: libfoo${{ soname }}
  build:
    files:
      include:
        - lib/**.so.*
```

This would be used to separate runtime libraries from development artifacts, because only versioned shared objects are needed at runtime.

More recently, I've also been working on recipes that separate optional plugins into their own output. These are also typically installed to `lib/`.
This starts the definition for the new `cache` output in v1 recipes. One can try the cache with `--experimental` in `rattler-build`. The documentation can be found here: https://prefix-dev.github.io/rattler-build/latest/multiple_output_cache/