GPU access in the sandbox #256230
Conversation
A slightly more involved usage example: https://github.com/SomeoneSerge/pkgs/blob/a23cebfa6c183538a89b7aedceccd1702400b9e2/python-packages/by-name/om/omnimotion/package.nix#L270-L288
```python
proc = subprocess.run(
    [
        CONFIG["nixExe"],
        "show-derivation",
        drv_path,
    ],
    capture_output=True,
```
IIRC this requires the `nix-command` experimental feature. Should I just assume the 1st argument to always be the path to the `.drv` and read it directly?
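One way to sidestep the feature-availability question when shelling out is to enable `nix-command` explicitly per invocation. A sketch in Python (the `nix_exe` parameter stands in for the hook's `CONFIG["nixExe"]`; the helper names are illustrative, not part of this PR):

```python
import json
import subprocess


def show_derivation_argv(nix_exe: str, drv_path: str) -> list[str]:
    # Enable the experimental `nix-command` feature for this one call,
    # so the hook works even when the user's nix.conf doesn't set it.
    return [
        nix_exe,
        "--extra-experimental-features", "nix-command",
        "show-derivation",
        drv_path,
    ]


def show_derivation(nix_exe: str, drv_path: str) -> dict:
    # Run the query and decode the JSON description of the derivation.
    proc = subprocess.run(
        show_derivation_argv(nix_exe, drv_path),
        capture_output=True,
        check=True,
        text=True,
    )
    return json.loads(proc.stdout)
```

Reading the `.drv` file directly would avoid the subprocess entirely, at the cost of parsing the ATerm format by hand.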
```python
if not Path(drv_path).exists():
    print(
        f"[E] {drv_path} doesn't exist."
        " This may happen with the remote builds."
        " Exiting the hook",
        file=stderr,
    )
```
When using remote builds, the `.drv` apparently isn't available on the remote builder by the time the pre-build-hook executes. Thoughts?
I think it boils down to this feature/optimization:

When you are privileged (or building a content-addressed derivation), you can build a derivation without uploading all of the `.drv` closure first. This is of course less secure (because you can forge arbitrary output hashes) but faster if the closure contains very large files as inputs, or if the closure itself is large. The `.drv` you send is not trusted, nor trustable, and not written to the builder's store; it is just used for building and sending back the result.

When you are not privileged (and not building a content-addressed derivation), you need to send the whole closure, and the `.drv` file will be present, as its validity can be recursively verified.
See https://github.com/NixOS/nix/blob/d12c614ac75171421844f3706d89913c3d841460/src/build-remote/build-remote.cc#L302-L334 And the referenced comment https://github.com/NixOS/nix/blob/d12c614ac75171421844f3706d89913c3d841460/src/libstore/daemon.cc#L589-L622
It bugged me too much :-D NixOS/nix#9272
Oh gosh, thanks a bunch @layus!
you a real og for this one @SomeoneSerge !! 🚀
```nix
"blender-cuda-available"
{
  nativeBuildInputs = [ finalAttrs.finalPackage ];
  requiredSystemFeatures = [ "cuda" ];
```
is there a way we could avoid having a stringly-typed API here? e.g. something like

```diff
- requiredSystemFeatures = [ "cuda" ];
+ requiredSystemFeatures = [ system-features.cuda ];
```
I'm a fan of whatever works, but I think avoiding a stringly-typed interface would be preferable, ceteris paribus.
There's no precedent for that AFAIK, but we could introduce some attribute in `cudaPackages`.

A related concern that I have (mentioned in the linked issue) is that in principle I'd like to have the feature name indicate the relevant CUDA capabilities, but there are obvious complications...
Actually, let me clarify: I made a preset with `"cuda"` because it seemed like the simplest string I could come up with. I'm wary of introducing other names or more infra around these names, because it's easy to slip down the rabbit hole. One obvious rabbit hole is that you could start making derivations that ask for

```nix
requiredSystemFeatures =
  let formatFeature = capabilities: concatMapStringsSep "-or-" (x: "sm_${x}") capabilities;
  in [ (formatFeature cudaCapabilities) ]
```

and deploying hosts that declare

```nix
programs.nix-required-mounts.allowedPatterns.cuda.onFeatures =
  map formatFeature (genSubsets cudaCapabilities)
```
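To make the rabbit hole concrete, here is a hypothetical Python sketch of the combinatorics: if feature names encode capability sets, a host has to advertise one pattern per non-empty subset of its capabilities, i.e. 2^n − 1 names for n capabilities. The `sm_` encoding mirrors the pseudo-Nix above and is purely illustrative:

```python
from itertools import combinations


def format_feature(capabilities):
    # Hypothetical encoding from the comment above: "sm_86-or-sm_89"
    # for a derivation that accepts either capability.
    return "-or-".join(f"sm_{c.replace('.', '')}" for c in sorted(capabilities))


def all_feature_names(capabilities):
    # What a host would have to declare: one allowed pattern per
    # non-empty subset of its capabilities, 2^n - 1 names in total.
    return [
        format_feature(subset)
        for r in range(1, len(capabilities) + 1)
        for subset in combinations(sorted(capabilities), r)
    ]
```

Three capabilities already mean seven feature names; the list doubles with every capability added, which is why keeping a single `"cuda"` literal is tempting.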
The final value, at least for now, would still be a string, but we could define somewhere a dictionary of valid literals and have people reference them rather than write the strings directly. Could be as silly as `cudaPackages.cudaFeature = "cuda"` and `requiredSystemFeatures = [ cudaPackages.cudaFeature ]`. We then could even extend this to `cudaFeature = f cudaFlags`
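The "dictionary of valid literals" idea, sketched in Python for illustration (the names are hypothetical; in Nixpkgs the registry would live in `cudaPackages`):

```python
# Hypothetical registry of the valid system-feature literals.
SYSTEM_FEATURES = {
    "cuda": "cuda",
    "opengl": "opengl",  # illustrative second entry
}


def required_system_features(*names: str) -> list[str]:
    # A typo raises KeyError at evaluation time instead of silently
    # producing a feature string that no builder advertises.
    return [SYSTEM_FEATURES[name] for name in names]
```

The final values are still strings, but call sites reference a checked name, which is the whole benefit over a stringly-typed interface.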
But what are your thoughts on modeling the capabilities?
I think `(import <nixpkgs> { config.cudaCapabilities = [ "8.6" ]; }).blender.tests.cuda-available.requiredSystemFeatures = [ "cuda-sm_86" ]` would be really great, but we run into the limitations of this mechanism very soon.
(Nix with an external scheduler when)
I think we should model capabilities -- they're essentially features of the hardware which determine how different pieces of software should be built and whether or not they can run on the system.
I do think it's a bit out of scope of this PR though; @SomeoneSerge would you make an issue so we can track this and have a longer discussion there?
Otherwise we crash Ofborg
An unwrapped check for `nix run`-ing on the host platform, instead of `nix build`-ing in the sandbox. E.g.:

```
❯ nix run -f ./. --arg config '{ cudaSupport = true; cudaCapabilities = [ "8.6" ]; cudaEnableForwardCompat = false; allowUnfree = true; }' -L blender.gpuChecks.cudaAvailable.unwrapped
Blender 4.0.1
Read prefs: "/home/user/.config/blender/4.0/config/userpref.blend"
CUDA is available
Blender quit

❯ nix build -f ./. --arg config '{ cudaSupport = true; cudaCapabilities = [ "8.6" ]; cudaEnableForwardCompat = false; allowUnfree = true; }' -L blender.gpuChecks
blender> Blender 4.0.1
blender> could not get a list of mounted file-systems
blender> CUDA is available
blender> Blender quit
```
Wrong Monday, now fixed. Waiting for ofborg and merging in the morning.
Result: 2 packages blacklisted, 2 packages built.
Runtime tests (derivations asking for a relaxed sandbox) are now expected at `p.gpuCheck`, `p.gpuChecks.<name>`, or at `p.tests.<name>.gpuCheck`.
This time around Ofborg dismisses 🚅 🚄 🚄

In some locations
Description of changes

The PR introduces a way to expose devices and drivers (e.g. GPUs and `/run/opengl-driver/lib`) in the Nix build's sandbox for specially marked derivations. The approach is that described in #225912: we introduce a pre-build-hook which would scan the `.drv` being built for the presence of a chosen tag in the `requiredSystemFeatures` attribute and, upon a match, instruct Nix to mount a list of device-related paths. E.g. one could mark a derivation as `mkDerivation { ...; requiredSystemFeatures = [ "cuda" ]; }` to indicate that it needs to use CUDA during the build. Untagged derivations would still observe a "pure" environment without any devices. Further, the `requiredSystemFeatures` attribute is special: Nix uses it to determine whether a host is capable of building the derivation, and which remote builders to choose. A host that hasn't been set up with `nix.settings.system-features` (`system-features` in `nix.conf(5)`) would refuse to build the marked derivation.

One application of this hook is to implement GPU tests as normal derivations (cf. the torch example in the diff). Another use could be to run ML/DL/number-crunching pipelines that utilize GPUs with Nix (e.g. https://github.com/SomeoneSerge/pkgs/blob/5bf5ea2b16696e70ade5a68063042dcf2e8f033b/python-packages/by-name/om/omnimotion/package.nix#L270-L291). Similarly, one could write derivations that render things using OpenGL or Vulkan.
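A minimal sketch of the mechanism described above, assuming the pre-build-hook protocol documented in `nix.conf(5)` (the mount list, the `"cuda"` tag, and the naive `.drv` parsing are all illustrative; the actual hook in this PR is more careful):

```python
#!/usr/bin/env python3
"""Pre-build-hook sketch: if the derivation being built declares the
"cuda" system feature, ask Nix to bind-mount driver paths into the
sandbox by printing the `extra-sandbox-paths` directive."""
import re
import sys
from pathlib import Path

# Illustrative: the real hook derives this list from the host config.
MOUNTS = ["/run/opengl-driver/lib", "/dev/nvidia0", "/dev/nvidiactl"]


def required_features(drv_path: str) -> list[str]:
    # .drv files are ATerm-encoded; for a sketch we just regex the
    # space-separated requiredSystemFeatures value out of the raw text.
    text = Path(drv_path).read_text()
    m = re.search(r'"requiredSystemFeatures","([^"]*)"', text)
    return m.group(1).split() if m else []


def main(drv_path: str) -> None:
    if "cuda" in required_features(drv_path):
        # Protocol: first line is `extra-sandbox-paths`, then one path
        # per line, terminated by an empty line.
        print("extra-sandbox-paths")
        print("\n".join(MOUNTS))
        print()


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])  # Nix passes the .drv path as the 1st argument
```

Nix invokes the hook before each build; a derivation without the tag produces no output, so it keeps the default pure sandbox.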
How to test:

1. Take a NixOS machine with an nvidia GPU and `hardware.opengl.enable = true`.
2. Try building `python3Packages.torch.tests.cudaAvailable`; observe it's refused evaluation.
3. Rebuild with `programs.nix-required-mounts.enable = true`.
4. Build `python3Packages.torch.gpuChecks.cudaAvailable` successfully.
5. Try dropping `cuda` from `gpuChecks.cudaAvailable`'s `requiredSystemFeatures`, and observe the test fail.

CC @NixOS/cuda-maintainers
Alternatives and related work

- @tomberek suggests that one could have derivations ask for specific paths using `__impureHostDeps`, currently undocumented and used for Darwin builds.
- For GPU tests, @Kiskae suggests one could instead use NixOS tests with PCI passthrough.
- We could also consider supporting variables like `CUDA_VISIBLE_DEVICES` and/or re-using nvidia-docker/nvidia-container-toolkit, because that's the interface many tools (singularity, docker) rely on. These currently do not support any static configuration, and in particular they rely on glibc internals to discover the host drivers in a very non-cross-platform way.

Things done
- `sandbox = true` set in `nix.conf`? (See Nix manual)
- Tested via `nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD"`. Note: all changes have to be committed, also see nixpkgs-review usage.
- Tested basic functionality of the executables (in `./result/bin/`)