Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GPU access in the sandbox #256230

Merged
merged 28 commits into from
Jun 26, 2024
Merged

Conversation

SomeoneSerge
Copy link
Contributor

@SomeoneSerge SomeoneSerge commented Sep 20, 2023

Description of changes

The PR introduces a way to expose devices and drivers (e.g. GPUs and /run/opengl-driver/lib) in the Nix build's sandbox for specially marked derivations. The approach is that described in #225912: we introduce a pre-build-hook which would scan the .drv being built for the presence of a chosen tag in the requiredSystemFeatures attribute and, upon a match, instruct Nix to mount a list of device-related paths. E.g. one could mark the derivation as mkDerivation { ...; requiredSystemFeatures = [ "cuda" ]; } to indicate that it would need to use CUDA during the build. Untagged derivations would still observe "pure" environment without any devices. Further, the requiredSystemFeatures attribute is special, Nix uses it to determine whether a host is capable of building the derivation, and which remote builders to choose. A host that hasn't been set up with nix.settings.system-features (system-features in nix.conf(5)) would refuse to build the marked derivation

One application of this hook is to implement GPU tests as normal derivations (cf. the torch example in the diff).
Another use could be to use Nix to run ML/DL/NumberCrunching pipelines that utilize GPUs (e.g. https://github.com/SomeoneSerge/pkgs/blob/5bf5ea2b16696e70ade5a68063042dcf2e8f033b/python-packages/by-name/om/omnimotion/package.nix#L270-L291). Similarly, one could write derivations that render things using OpenGL or Vulkan

How to test:

  1. Take a NixOS machine with an nvidia GPU and hardware.opengl.enable = true.

  2. Try building python3Packages.torch.tests.cudaAvailable, observe it's refused evaluation

  3. Rebulid with programs.nix-required-mounts.enable = true (

    # Note: nix-required-sandbox-paths extends system-features,
    # but they won't be merged with the nixos defaults, because they got a different priority
    # -> set manually:
    nix.settings.system-features = [
      "nixos-test"
      "benchmark"
      "big-parallel"
      "kvm"
    ];
    programs.nix-required-sandbox-paths.enable = true;
    
  4. Build python3Packages.torch.gpuChecks.cudaAvailable successfully

  5. Try dropping cuda from gpuChecks.cudaAvailable's requiredSystemFeatures, and observe the test fail

CC @NixOS/cuda-maintainers

Alternatives and related work

@tomberek suggests that one could have derivations ask for specific paths using __impureHostDeps, currently undocumented and being used for Darwin builds

For GPU tests, @Kiskae suggests one could instead use NixOS tests with the PCI passthrough

We could also possibly consider supporting variables like CUDA_VISIBLE_DEVICES and/or re-using nvidia-docker/nvidia-container-toolkit because that's the interface many (singularity, docker) rely on. These currently do not support any static configuration, and in particular they rely on glibc internals to discover the host drivers in a very non-cross-platform way.

Things done

  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandbox = true set in nix.conf? (See Nix manual)
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 23.11 Release Notes (or backporting 23.05 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

@github-actions github-actions bot added 6.topic: python 6.topic: nixos Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS 8.has: module (update) This PR changes an existing module in `nixos/` labels Sep 20, 2023
@SomeoneSerge SomeoneSerge added the 6.topic: cuda Parallel computing platform and API label Sep 20, 2023
@ofborg ofborg bot added 8.has: package (new) This PR adds a new package 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin 10.rebuild-linux: 1-10 labels Sep 22, 2023
@SomeoneSerge
Copy link
Contributor Author

@Madouura Madouura mentioned this pull request Oct 10, 2023
34 tasks
@SomeoneSerge SomeoneSerge changed the title WIP: cudaPackages: GPU tests WIP: GPU access in the sandbox Oct 16, 2023
@SomeoneSerge
Copy link
Contributor Author

SomeoneSerge commented Oct 16, 2023

@SomeoneSerge SomeoneSerge force-pushed the feat/gpu-tests-py branch 3 times, most recently from 73fbb93 to c57b043 Compare October 17, 2023 12:36
@SomeoneSerge
Copy link
Contributor Author

  • Added a NixOS test verifying that the sandbox is relaxed if and only if the derivation has asked for the respective requiredSystemFeature

@SomeoneSerge SomeoneSerge changed the title WIP: GPU access in the sandbox GPU access in the sandbox Oct 17, 2023
@SomeoneSerge SomeoneSerge marked this pull request as ready for review October 17, 2023 12:53
@SomeoneSerge SomeoneSerge force-pushed the feat/gpu-tests-py branch 3 times, most recently from b9037bb to c3ab1c8 Compare October 17, 2023 17:56
Comment on lines 56 to 78
proc = subprocess.run(
[
CONFIG["nixExe"],
"show-derivation",
drv_path,
],
capture_output=True,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC this requires the nix-command experimental feature. Should I just assume the 1st argument to always be the path to the .drv and read it directly?

Comment on lines 48 to 70
if not Path(drv_path).exists():
print(
f"[E] {drv_path} doesn't exist."
" This may happen with the remote builds."
" Exiting the hook",
file=stderr,
)
Copy link
Contributor Author

@SomeoneSerge SomeoneSerge Oct 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When using remote builds, the .drv apparently isn't available on the remote builder by the time when the pre-build-hook executes.

Thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it boils down to this feature/optimization:

When you are privileged (or building a content-addressed derivation), you can build a derivation without uploading all the .drv closure first. This is of course less secure (because you can forge arbitrary output hashes) but faster if the closure contains very large files as inputs. Or if the closure is large. The .drv you send is not trusted, nor trustable, and not written to the builder store. Just used for building and sending back the result.

When you are not privileged (and not building a content-addressed derivation), you need to send the whole closure, and the .drv file will be present, as its validity can be recursively verified.

See https://github.com/NixOS/nix/blob/d12c614ac75171421844f3706d89913c3d841460/src/build-remote/build-remote.cc#L302-L334 And the referenced comment https://github.com/NixOS/nix/blob/d12c614ac75171421844f3706d89913c3d841460/src/libstore/daemon.cc#L589-L622

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It bugged me too much :-D NixOS/nix#9272

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh gosh, thanks a bunch @layus!

Copy link
Member

@samuela samuela left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you a real og for this one @SomeoneSerge !! 🚀

"blender-cuda-available"
{
nativeBuildInputs = [ finalAttrs.finalPackage ];
requiredSystemFeatures = [ "cuda" ];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a way we could avoid having a stringly-typed API here? eg something like

Suggested change
requiredSystemFeatures = [ "cuda" ];
requiredSystemFeatures = [ system-features.cuda ];

I'm a fan of whatever works, but I think avoiding a stringly-typed interface would be preferable, ceteris paribus.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no precedent for that AFAIK, but we could introduce some attribute in cudaPackages.
A related concern that I have (mentioned in the linked issue) is that in principle I'd like to have the feature name indicate the relevant cuda capabilities, but there are obvious complications...

Copy link
Contributor Author

@SomeoneSerge SomeoneSerge Oct 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, let me clarify: I made a preset with "cuda" because it seemed like the simplest string I could come up with. I'm wary of introducing other names or more infra around these names, because it's easy to slip down the rabbit hole. One obvious rabbit hole is that you could start making derivations that ask for requiredSystemFeatures = let formatFeature = capabilities: (concatStringsMapSep "-or-" (x: "sm_{x}") capabilities); [ (formatFeature cudaCapabilities) ] and deploying hosts that declare programs.nix-required-mounts.allowedPatterns.cuda.onFeatures = map allowformatFeature (genSubsets cudaCapabilities)

This comment was marked as outdated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The final value, at least for now, would still be a string, but we could define somewhere a dictionary of valid literals and have people reference them rather than write the strings directly. Could be as silly as cudaPackages.cudaFeature = "cuda" and requiredSystemFeatures = [ cudaPackages.cudaFeature ].

We then could even extend this to cudaFeature = f cudaFlags

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But what are your thoughts on modeling the capabilities?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think (import <nixpkgs> { config.cudaCapabilities = [ "8.6" ]; }).blender.tests.cuda-available.requiredSystemFeatures = [ "cuda-sm_86" ] would be really great, but we run into the limitations of this mechanism very soon

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Nix with an external scheduler when)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should model capabilities -- they're essentially features of the hardware which determine how different pieces of software should be built and whether or not they can run on the system.

I do think it's a bit out of scope of this PR though; @SomeoneSerge would you make an issue so we can track this and have a longer discussion there?

pkgs/by-name/ni/nix-required-mounts/main.py Outdated Show resolved Hide resolved
pkgs/by-name/ni/nix-required-mounts/main.py Outdated Show resolved Hide resolved
pkgs/by-name/ni/nix-required-mounts/main.py Outdated Show resolved Hide resolved
@SomeoneSerge SomeoneSerge force-pushed the feat/gpu-tests-py branch 2 times, most recently from 2d3a5fd to 6212ea2 Compare October 19, 2023 08:44
An unwrapped check for `nix run`-ing on the host platform,
instead of `nix build`-ing in the sandbox

E.g.:

```
❯ nix run -f ./. --arg config '{ cudaSupport = true; cudaCapabilities = [ "8.6" ]; cudaEnableForwardCompat = false; allowUnfree = true; }' -L blender.gpuChecks.cudaAvailable.unwrapped
Blender 4.0.1
Read prefs: "/home/user/.config/blender/4.0/config/userpref.blend"
CUDA is available
Blender quit
❯ nix build -f ./. --arg config '{ cudaSupport = true; cudaCapabilities = [ "8.6" ]; cudaEnableForwardCompat = false; allowUnfree = true; }' -L blender.gpuChecks
blender> Blender 4.0.1
blender> could not get a list of mounted file-systems
blender> CUDA is available
blender> Blender quit
```
@SomeoneSerge
Copy link
Contributor Author

...merge this PR by next Monday

Wrong Monday. I now fixed the /dev/video* issue and added release notes. I also changed slightly the paths for these "gpu tests" (cf the commit message for d6b3abc). The exportReferencesGraph logic can be rewritten later as part of the effort for making nvidia's CDI integration work.

Waiting for ofborg and merging in the morning.

@GaetanLepage
Copy link
Contributor

Result of nixpkgs-review pr 256230 run on x86_64-linux 1

2 packages blacklisted:
  • nixos-install-tools
  • tests.nixos-functions.nixos-test
2 packages built:
  • nix-required-mounts
  • nix-required-mounts.dist

Runtime tests (derivations asking for a relaxed sandbox) are now
expected at p.gpuCheck, p.gpuChecks.<name>, or at
p.tests.<name>.gpuCheck.
@SomeoneSerge
Copy link
Contributor Author

This time around Ofborg dismisses passthru.tests (in bulk) based on licenses rather than requiredSystemFeatures. There are precedents for this already in Nixpkgs, e.g. python3Packages.jax.tests, so I'm not going to delay anymore. Properly resolving this depends either on NixOS/ofborg#678 or on NixOS/ofborg#644

❯ nix build --impure .#blender.tests.tester-cudaAvailable.gpuCheck -L # --impure is for the global allowUnfreePredicate
test-cuda> Blender 4.1.1
test-cuda> could not get a list of mounted file-systems
test-cuda> CUDA is available
test-cuda> Blender quit

🚅 🚄 🚄

@SomeoneSerge SomeoneSerge merged commit cb69dc5 into NixOS:master Jun 26, 2024
21 of 25 checks passed
@trofi
Copy link
Contributor

trofi commented Jul 7, 2024

In some locations test-cuda.nix is not present, fails the eval as:

$ nix build --no-link -f. python3Packages.pytorch-bin.gpuChecks
error:
       … while evaluating a branch condition
         at lib/customisation.nix:263:8:
          262|
          263|     in if missingArgs == {}
             |        ^
          264|        then makeOverridable f allArgs

       … while calling the 'listToAttrs' builtin
         at lib/attrsets.nix:647:5:
          646|     set:
          647|     listToAttrs (concatMap (name: let v = set.${name}; in if pred name v then [(nameValuePair name v)] else []) (attrNames set));
             |     ^
          648|

       (stack trace truncated; use '--show-trace' to show the full, detailed trace)

       error: path 'pkgs/development/python-modules/torch/test-cuda.nix' does not exist

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
6.topic: cuda Parallel computing platform and API 6.topic: nixos Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS 6.topic: python 8.has: module (update) This PR changes an existing module in `nixos/` 8.has: package (new) This PR adds a new package 10.rebuild-darwin: 1-10 10.rebuild-darwin: 1 10.rebuild-linux: 1-10 11.by: package-maintainer This PR was created by the maintainer of the package it changes 12.approvals: 3+ This PR was reviewed and approved by three or more reputable people
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.