GPU access in the sandbox #256230
Conversation
A slightly more involved usage example: https://github.com/SomeoneSerge/pkgs/blob/a23cebfa6c183538a89b7aedceccd1702400b9e2/python-packages/by-name/om/omnimotion/package.nix#L270-L288
```python
proc = subprocess.run(
    [
        CONFIG["nixExe"],
        "show-derivation",
        drv_path,
    ],
    capture_output=True,
```
IIRC this requires the `nix-command` experimental feature. Should I just assume the 1st argument to always be the path to the `.drv` and read it directly?
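One way to sidestep the feature-availability question when shelling out is to enable `nix-command` explicitly per invocation. A sketch in Python (the `nix_exe` parameter stands in for the hook's `CONFIG["nixExe"]`; the helper names are illustrative, not part of this PR):

```python
import json
import subprocess


def show_derivation_argv(nix_exe: str, drv_path: str) -> list[str]:
    # Enable the experimental `nix-command` feature for this one call,
    # so the hook works even when the user's nix.conf doesn't set it.
    return [
        nix_exe,
        "--extra-experimental-features", "nix-command",
        "show-derivation",
        drv_path,
    ]


def show_derivation(nix_exe: str, drv_path: str) -> dict:
    # Run the query and decode the JSON description of the derivation.
    proc = subprocess.run(
        show_derivation_argv(nix_exe, drv_path),
        capture_output=True,
        check=True,
        text=True,
    )
    return json.loads(proc.stdout)
```

Reading the `.drv` file directly would avoid the subprocess entirely, at the cost of parsing the ATerm format by hand.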
```python
if not Path(drv_path).exists():
    print(
        f"[E] {drv_path} doesn't exist."
        " This may happen with the remote builds."
        " Exiting the hook",
        file=stderr,
    )
```
When using remote builds, the `.drv` apparently isn't available on the remote builder by the time the pre-build-hook executes. Thoughts?
I think it boils down to this feature/optimization:

When you are privileged (or building a content-addressed derivation), you can build a derivation without uploading all of the `.drv` closure first. This is of course less secure (because you can forge arbitrary output hashes) but faster if the closure contains very large files as inputs, or if the closure itself is large. The `.drv` you send is not trusted, nor trustable, and not written to the builder's store; it is just used for building and sending back the result.

When you are not privileged (and not building a content-addressed derivation), you need to send the whole closure, and the `.drv` file will be present, as its validity can be recursively verified.
See https://github.com/NixOS/nix/blob/d12c614ac75171421844f3706d89913c3d841460/src/build-remote/build-remote.cc#L302-L334 And the referenced comment https://github.com/NixOS/nix/blob/d12c614ac75171421844f3706d89913c3d841460/src/libstore/daemon.cc#L589-L622
It bugged me too much :-D NixOS/nix#9272
Oh gosh, thanks a bunch @layus!
you a real og for this one @SomeoneSerge !! 🚀
```nix
"blender-cuda-available"
{
  nativeBuildInputs = [ finalAttrs.finalPackage ];
  requiredSystemFeatures = [ "cuda" ];
```
is there a way we could avoid having a stringly-typed API here? e.g. something like

```diff
- requiredSystemFeatures = [ "cuda" ];
+ requiredSystemFeatures = [ system-features.cuda ];
```
I'm a fan of whatever works, but I think avoiding a stringly-typed interface would be preferable, ceteris paribus.
There's no precedent for that AFAIK, but we could introduce some attribute in `cudaPackages`.

A related concern that I have (mentioned in the linked issue) is that in principle I'd like to have the feature name indicate the relevant CUDA capabilities, but there are obvious complications...
Actually, let me clarify: I made a preset with `"cuda"` because it seemed like the simplest string I could come up with. I'm wary of introducing other names or more infra around these names, because it's easy to slip down the rabbit hole. One obvious rabbit hole is that you could start making derivations that ask for

```nix
requiredSystemFeatures =
  let formatFeature = capabilities: concatMapStringsSep "-or-" (x: "sm_${x}") capabilities;
  in [ (formatFeature cudaCapabilities) ]
```

and deploying hosts that declare

```nix
programs.nix-required-mounts.allowedPatterns.cuda.onFeatures =
  map formatFeature (genSubsets cudaCapabilities)
```
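To make the rabbit hole concrete, here is a hypothetical Python sketch of the combinatorics: if feature names encode capability sets, a host has to advertise one pattern per non-empty subset of its capabilities, i.e. 2^n − 1 names for n capabilities. The `sm_` encoding mirrors the pseudo-Nix above and is purely illustrative:

```python
from itertools import combinations


def format_feature(capabilities):
    # Hypothetical encoding from the comment above: "sm_86-or-sm_89"
    # for a derivation that accepts either capability.
    return "-or-".join(f"sm_{c.replace('.', '')}" for c in sorted(capabilities))


def all_feature_names(capabilities):
    # What a host would have to declare: one allowed pattern per
    # non-empty subset of its capabilities, 2^n - 1 names in total.
    return [
        format_feature(subset)
        for r in range(1, len(capabilities) + 1)
        for subset in combinations(sorted(capabilities), r)
    ]
```

Three capabilities already mean seven feature names; the list doubles with every capability added, which is why keeping a single `"cuda"` literal is tempting.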
The final value, at least for now, would still be a string, but we could define somewhere a dictionary of valid literals and have people reference them rather than write the strings directly. Could be as silly as `cudaPackages.cudaFeature = "cuda"` and `requiredSystemFeatures = [ cudaPackages.cudaFeature ]`. We then could even extend this to `cudaFeature = f cudaFlags`
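The "dictionary of valid literals" idea, sketched in Python for illustration (the names are hypothetical; in Nixpkgs the registry would live in `cudaPackages`):

```python
# Hypothetical registry of the valid system-feature literals.
SYSTEM_FEATURES = {
    "cuda": "cuda",
    "opengl": "opengl",  # illustrative second entry
}


def required_system_features(*names: str) -> list[str]:
    # A typo raises KeyError at evaluation time instead of silently
    # producing a feature string that no builder advertises.
    return [SYSTEM_FEATURES[name] for name in names]
```

The final values are still strings, but call sites reference a checked name, which is the whole benefit over a stringly-typed interface.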
But what are your thoughts on modeling the capabilities?
I think `(import <nixpkgs> { config.cudaCapabilities = [ "8.6" ]; }).blender.tests.cuda-available.requiredSystemFeatures = [ "cuda-sm_86" ]` would be really great, but we run into the limitations of this mechanism very soon.
(Nix with an external scheduler when)
I think we should model capabilities -- they're essentially features of the hardware which determine how different pieces of software should be built and whether or not they can run on the system.
I do think it's a bit out of scope of this PR though; @SomeoneSerge would you make an issue so we can track this and have a longer discussion there?
Otherwise we crash Ofborg
An unwrapped check for `nix run`-ing on the host platform, instead of `nix build`-ing in the sandbox. E.g.:

```
❯ nix run -f ./. --arg config '{ cudaSupport = true; cudaCapabilities = [ "8.6" ]; cudaEnableForwardCompat = false; allowUnfree = true; }' -L blender.gpuChecks.cudaAvailable.unwrapped
Blender 4.0.1
Read prefs: "/home/user/.config/blender/4.0/config/userpref.blend"
CUDA is available
Blender quit

❯ nix build -f ./. --arg config '{ cudaSupport = true; cudaCapabilities = [ "8.6" ]; cudaEnableForwardCompat = false; allowUnfree = true; }' -L blender.gpuChecks
blender> Blender 4.0.1
blender> could not get a list of mounted file-systems
blender> CUDA is available
blender> Blender quit
```
Wrong Monday, now fixed. Waiting for ofborg and merging in the morning.
Result: 2 packages blacklisted, 2 packages built.
Runtime tests (derivations asking for a relaxed sandbox) are now expected at `p.gpuCheck`, `p.gpuChecks.<name>`, or at `p.tests.<name>.gpuCheck`.
This time around Ofborg dismisses 🚅 🚄 🚄

In some locations
Description of changes

The PR introduces a way to expose devices and drivers (e.g. GPUs and `/run/opengl-driver/lib`) in the Nix build's sandbox for specially marked derivations. The approach is that described in #225912: we introduce a pre-build-hook which would scan the `.drv` being built for the presence of a chosen tag in the `requiredSystemFeatures` attribute and, upon a match, instruct Nix to mount a list of device-related paths. E.g. one could mark a derivation as `mkDerivation { ...; requiredSystemFeatures = [ "cuda" ]; }` to indicate that it needs to use CUDA during the build. Untagged derivations would still observe a "pure" environment without any devices. Further, the `requiredSystemFeatures` attribute is special: Nix uses it to determine whether a host is capable of building the derivation, and which remote builders to choose. A host that hasn't been set up with `nix.settings.system-features` (`system-features` in `nix.conf(5)`) would refuse to build the marked derivation.

One application of this hook is to implement GPU tests as normal derivations (cf. the torch example in the diff). Another use could be to run ML/DL/number-crunching pipelines that utilize GPUs with Nix (e.g. https://github.com/SomeoneSerge/pkgs/blob/5bf5ea2b16696e70ade5a68063042dcf2e8f033b/python-packages/by-name/om/omnimotion/package.nix#L270-L291). Similarly, one could write derivations that render things using OpenGL or Vulkan.
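A minimal sketch of the mechanism described above, assuming the pre-build-hook protocol documented in `nix.conf(5)` (the mount list, the `"cuda"` tag, and the naive `.drv` parsing are all illustrative; the actual hook in this PR is more careful):

```python
#!/usr/bin/env python3
"""Pre-build-hook sketch: if the derivation being built declares the
"cuda" system feature, ask Nix to bind-mount driver paths into the
sandbox by printing the `extra-sandbox-paths` directive."""
import re
import sys
from pathlib import Path

# Illustrative: the real hook derives this list from the host config.
MOUNTS = ["/run/opengl-driver/lib", "/dev/nvidia0", "/dev/nvidiactl"]


def required_features(drv_path: str) -> list[str]:
    # .drv files are ATerm-encoded; for a sketch we just regex the
    # space-separated requiredSystemFeatures value out of the raw text.
    text = Path(drv_path).read_text()
    m = re.search(r'"requiredSystemFeatures","([^"]*)"', text)
    return m.group(1).split() if m else []


def main(drv_path: str) -> None:
    if "cuda" in required_features(drv_path):
        # Protocol: first line is `extra-sandbox-paths`, then one path
        # per line, terminated by an empty line.
        print("extra-sandbox-paths")
        print("\n".join(MOUNTS))
        print()


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])  # Nix passes the .drv path as the 1st argument
```

Nix invokes the hook before each build; a derivation without the tag produces no output, so it keeps the default pure sandbox.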
How to test:

1. Take a NixOS machine with an nvidia GPU and `hardware.opengl.enable = true`.
2. Try building `python3Packages.torch.tests.cudaAvailable`; observe it's refused evaluation.
3. Rebuild with `programs.nix-required-mounts.enable = true`.
4. Build `python3Packages.torch.gpuChecks.cudaAvailable` successfully.
5. Try dropping `cuda` from `gpuChecks.cudaAvailable`'s `requiredSystemFeatures`, and observe the test fail.

CC @NixOS/cuda-maintainers
Alternatives and related work

- @tomberek suggests that one could have derivations ask for specific paths using `__impureHostDeps`, currently undocumented and used for Darwin builds.
- For GPU tests, @Kiskae suggests one could instead use NixOS tests with PCI passthrough.
- We could also consider supporting variables like `CUDA_VISIBLE_DEVICES` and/or re-using nvidia-docker/nvidia-container-toolkit, because that's the interface many tools (singularity, docker) rely on. These currently do not support any static configuration, and in particular they rely on glibc internals to discover the host drivers in a very non-cross-platform way.

Things done
- `sandbox = true` set in `nix.conf`? (See Nix manual)
- Tested via `nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD"`. Note: all changes have to be committed, also see nixpkgs-review usage.
- Tested basic functionality of the executables (in `./result/bin/`)