# cudaPackages: GPU-enabled tests #225912
@SomeoneSerge would you envision these as tests which would ideally be run during …? Or is this something different?
Consider something like this:

```nix
{ ..., torch }:

buildPythonPackage {
  pname = "torch";
  # ...
  passthru.tests.gpuTests = torch.overridePythonAttrs (_: {
    requiredSystemFeatures = [ "expose-cuda" ];
  });
  passthru.tests.cudaAvailable = buildPythonPackage {
    # ...
    requiredSystemFeatures = [ "expose-cuda" ];
    checkPhase = ''
      python << EOF
      import torch
      assert torch.cuda.is_available()
      EOF
    '';
  };
  # ...
}
```

Any normal Nix deployment should refuse to build derivations that require the `expose-cuda` feature, because builders do not advertise it by default. Pros: an easy way to maintain a basic test suite for our packages' GPU functionality within, and synchronized with, nixpkgs.
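To make the gating concrete, here is a minimal sketch (my assumption of how a builder would opt in; the comment above doesn't spell this out) of the NixOS setting that makes such test derivations schedulable, while every other deployment keeps rejecting them:

```nix
# Hypothetical builder configuration: advertise the proposed "expose-cuda"
# feature so Nix will schedule the GPU tests above on this machine.
# The stock features are listed explicitly because defining the option
# replaces its default value.
{
  nix.settings.system-features = [ "nixos-test" "benchmark" "big-parallel" "kvm" "expose-cuda" ];
}
```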
You can actually do that using the `pre-build-hook` feature. I haven't used it myself, so take this with a grain of salt, but I suspect that something like the script below would work:

```sh
#!/bin/sh
DRV="$1"
# Does the derivation ask for the "expose-cuda" required feature?
# jq's own output is silenced so that stdout only carries the hook protocol.
if nix derivation show "$DRV" \
    | jq --exit-status '.["'"$DRV"'"].env.requiredSystemFeatures | contains("expose-cuda")' > /dev/null; then
  echo "extra-sandbox-paths"
  echo "/run/opengl-driver/lib=/run/opengl-driver/lib"
  echo "/dev=/dev"
fi
```
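A sketch of how that hook could be wired up on a NixOS builder, assuming the script above is saved as `./expose-cuda-hook.sh` (a hypothetical path) and combined with the `system-features` setting shown earlier; using `nix.settings` for the `pre-build-hook` key is my assumption about where to put it:

```nix
{ pkgs, ... }:

{
  # Nix invokes this hook before every build; the script only emits
  # extra-sandbox-paths for derivations that request "expose-cuda".
  # nix and jq must be reachable from the hook's environment (e.g. by
  # substituting absolute store paths into the script).
  nix.settings.pre-build-hook =
    "${pkgs.writeShellScript "expose-cuda-hook" (builtins.readFile ./expose-cuda-hook.sh)}";
}
```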
I think this works! Cf. https://gist.github.com/SomeoneSerge/4832997ab09e4e71301e5469eec3066a

On a correctly configured builder, declaring `expose-cuda` in `requiredSystemFeatures` gives the test access to the GPU:

```
❯ nix build --file with-cuda.nix python3Packages.pynvml.tests.testNvmlInit -L
...
python3.10-pynvml> running install tests
python3.10-pynvml> enter: nvmlInit
python3.10-pynvml> pynvml.nvmlInit()=None
python3.10-pynvml> exit: nvmlInit
...
python3.10-pynvml> Check whether the following modules can be imported: pynvml pynvml.smi
```

A builder that doesn't declare `expose-cuda` refuses to build the test:

```
❯ nix build --file with-my-cuda.nix python3Packages.pynvml.tests.testNvmlInit -L --rebuild
error: a 'x86_64-linux' with features {expose-cuda} is required to build '/nix/store/94vw78sgh3y92bx3rmk62cdgg9nakkrx-python3.10-pynvml-11.5.0.drv', but I am a 'x86_64-linux' with features {benchmark, big-parallel, ca-derivations, kvm, nixos-test}
```

Behaviour when the feature is declared but the driver is not actually available in the sandbox:

```
❯ nix build --file with-cuda.nix python3Packages.pynvml.tests.testNvmlInit -L
python3.10-pynvml> enter: nvmlInit
python3.10-pynvml> Driver Not Loaded
python3.10-pynvml> exit: nvmlInit
...
python3.10-pynvml> pynvml.nvml.NVMLError_Uninitialized: Uninitialized
```
The next step could be to prepare a PR introducing the hook.
We should expect a "why don't you do it in a flake" response, among other things.
This issue has been mentioned on NixOS Discourse. There might be relevant details there: https://discourse.nixos.org/t/cuda-team-roadmap-and-call-for-sponsors/29495/1
Well... I created a NixOS module flake here anyway, but I would also prefer this to be available from the main repo.

I also tried to do this from a normal/package flake by setting `nixConfig.extra-sandbox-paths`:

```nix
{
  nixConfig = {
    extra-sandbox-paths = [
      "/dev/nvidia0"
      "/dev/nvidiactl"
      "/dev/nvidia-modeset"
      "/dev/nvidia-uvm"
      "/dev/nvidia-uvm-tools"
      "/run/opengl-driver"
      # build will fail without this:
      "/nix/store/fq7vp75q1f1yd5ypd0mxv1c935xl4j2b-nvidia-x11-535.54.03-6.1.38/"
    ];
  };

  inputs.nixpkgs.url = "github:nixos/nixpkgs/nixpkgs-unstable";

  outputs = { self, nixpkgs }:
    let
      system = "x86_64-linux";
      pkgs = import nixpkgs {
        inherit system;
        config.allowUnfree = true;
        config.cudaSupport = true;
      };
      pytorch = pkgs.python3.withPackages (p: [ p.pytorch ]);
      package = pkgs.stdenvNoCC.mkDerivation {
        name = "cuda-test";
        unpackPhase = "true";
        buildInputs = [ pytorch ];
        buildPhase = ''
          echo == torch check:
          python -c "import torch; print(torch.cuda.is_available())"
          echo
          echo == link check
          ls -l /run/opengl-driver/lib/libcuda.so
          echo
          echo == readlink will error if target is not accessible
          readlink -f /run/opengl-driver/lib/libcuda.so
        '';
      };
    in {
      packages.${system} = {
        default = package;
      };
    };
}
```

Does the sandbox exclude symlink targets because of security/performance concerns, or is following them a feature it should have?
🎉
Not even that, I'd say recursive resolution of symlinks would be non-trivial extra work and potentially surprising behaviour on the sandbox's part, which is a good enough reason not to have it?
Hiding the hardware by default really is more of a feature: many builds auto-detect CUDA capabilities and behave differently depending on the results.
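For illustration (my addition, not something the comment states): nixpkgs lets you pin the targeted capabilities explicitly, so the build output does not depend on whatever the builder happens to auto-detect:

```nix
# Instantiate nixpkgs with CUDA support and a fixed set of GPU architectures.
import <nixpkgs> {
  config = {
    allowUnfree = true;
    cudaSupport = true;
    cudaCapabilities = [ "8.6" "8.9" ];  # example values
  };
}
```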
I updated the issue description to reflect how I think this could be merged into nixpkgs in smaller steps. I thought I'd work on this issue myself this week, but I lag behind schedule and now won't be able to act on this until August. I'd be happy if you kept going with your work and got it accepted upstream.
I don't have enough experience with CUDA/Nix to help with these more granular features. But one thing I just noticed while experimenting with the jax library is that it also tries to access …
Is there any way a derivation can refer to the driver currently in use? In my previous example, I realized we can remove the extra path to …
"the driver currently in use" is a runtime (e.g. it makes sense in NixOS, but not in nixpkgs) concept, nix packages don't know anything about those at build time; this is the reason NixOS deploys driver-related libraries impurely at EDIT: you can reference "the (nvidia) driver currently in use" in NixOS as |
I forgot to mention this earlier, but I had experienced some issues with using the machine that has this hook set up as a remote builder. IIRC the `show-derivation` call was failing due to the …
An alternative approach is to have the tests provide their own CUDA drivers by defining them as NixOS tests. Then you could use PCI passthrough to provide the VM with a real GPU to run the actual tests. Configuring the QEMU VM for passthrough could be done with a build hook, to ensure only a single test per GPU runs at a time, or, if you formulate the tests as a flake, you could use flake overrides to inject the configuration during CI.
Blender has Python integration; you could list the GPU devices that Blender sees with a Python script: https://blender.stackexchange.com/questions/208135/way-to-detect-if-a-valid-gpu-device-is-available-from-python
I'm thinking these GPU-enabled builds would also be useful for general machine learning tasks/pipelines (reproducible model building etc.), not just for package testing.
It might be an idea to work on creating runnable tests for GPU-enabled applications as a separate issue from actually running those tests. Suppose you define a list of such tests on …; the tests themselves would remain the same regardless of the final solution.
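One possible shape for such a list, as a hypothetical `gpu-tests.nix`, assuming packages expose their GPU tests under `passthru.tests` as proposed earlier (the attribute names are illustrative):

```nix
# Collect GPU tests in one place, independent of how they are eventually run
# (sandbox hook, NixOS test with passthrough, a CI flake, ...).
{ pkgs ? import <nixpkgs> { config.allowUnfree = true; config.cudaSupport = true; } }:

{
  pynvml-init = pkgs.python3Packages.pynvml.tests.testNvmlInit or null;
  torch-cuda-available = pkgs.python3Packages.torch.tests.cudaAvailable or null;
}
```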
@Kiskae have you got an example of how to do the PCI passthrough, or an understanding of how to integrate that with the NixOS test framework?
https://ubuntu.com/server/docs/gpu-virtualization-with-qemu-kvm and https://wiki.archlinux.org/title/PCI_passthrough_via_OVMF cover it pretty well. The required QEMU arguments can be set through https://search.nixos.org/options?channel=23.05&show=virtualisation.qemu.options, and in theory that would be enough to bind the GPU to the VM. However, there is little documentation about other requirements, so it might take some trial and error to get it to work.
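A minimal sketch of that idea, assuming the host has already bound the GPU to `vfio-pci`; the PCI address, the node configuration, and the `nvidia-smi` check are placeholders rather than a tested setup:

```nix
# NixOS VM test that passes a host GPU through to the guest via VFIO and
# checks that the driver can see it. Needs allowUnfree for the nvidia driver
# and, as the next comment notes, access to the /dev/vfio device nodes.
{ pkgs }:

pkgs.nixosTest {
  name = "gpu-passthrough-smoke";
  nodes.machine = {
    virtualisation.qemu.options = [
      "-device vfio-pci,host=01:00.0"  # hypothetical PCI address of the host GPU
    ];
    hardware.opengl.enable = true;
    services.xserver.videoDrivers = [ "nvidia" ];
  };
  testScript = ''
    machine.wait_for_unit("multi-user.target")
    machine.succeed("nvidia-smi")
  '';
}
```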
@SomeoneSerge yeah, it looks like vfio/iommu have their own device nodes that are required: https://docs.kernel.org/driver-api/vfio.html#vfio-usage-example I guess PCI passthrough is usually done by privileged users, so the required system access isn't well documented.
## Description

The plan, in smaller steps:

- Decide on the name for the system feature: `"cuda"`/`"expose-cuda"`.
- Decide on the value for `nix.settings.system-features`: e.g. `cuda-50-60-61-70-75-80-86-89-90` (the set of architectures for which nvidia ships device code in cuda 12.2 for x86_64-linux) and `cuda-53-61-62-70-72-75-80-86-87` (same for linux-aarch64, a.k.a. jetson). This way the only test-able packages would be the ones built to support a wide range of GPU architectures.
- Introduce `cudaPackages.preBuildHook`. Make the last new-line terminator optional. At this point users may manage their pre-build-hook directly as …. The role of `preBuildHook` would be to expose cuda devices to those derivations marked with `requiredSystemFeatures = [ "cuda" ]`, and only to them. 2023-03-26: cf. an example; 2023-07-20: also as a flake. Thanks to @thufschmitt for pointing out that this behaviour may be implemented using pre-build hooks.
- Add an `.override`-able `List str` parameter to the hook's derivation, so that non-NixOS users may specify custom locations for `libcuda.so` different from `addOpenGLRunpath.driverLink`.
- Add a NixOS `pre-build-hook` `bool` option for conditionally exposing cuda devices. The option would enable the hook and extend `system-features`.
- A generic `pre-build-hook` option (not specific to cuda); point out that it doesn't have to come in the same PR.

Old description:

- Test GPU functionality in `passthru.tests`.
- Mark GPU tests with something like `requiredSystemFeatures = [ "cuda" ]`.
- Conditionally expose `/dev` and `/run/opengl-driver/lib` (and/or whatever is required to make GPU tests work) in `extra-sandbox-paths` for derivations marked with `"cuda"` in `requiredSystemFeatures`.
- Ensure normal derivations cannot see these extra paths.
- Set up a PoC CI that would run these tests.
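A sketch of what the `.override`-able parameter mentioned in the plan might look like; the argument names and the `"cuda"` feature string are assumptions, not an existing nixpkgs API:

```nix
# Hypothetical cudaPackages.preBuildHook: a pre-build hook that exposes /dev
# and the driver libraries only to derivations requesting the "cuda" feature.
# Non-NixOS users could override driverPaths to point at their libcuda.so.
{ writeShellScript, lib, nix, jq, addOpenGLRunpath
, driverPaths ? [ addOpenGLRunpath.driverLink ]
}:

writeShellScript "cuda-pre-build-hook" ''
  DRV="$1"
  # Silence jq's own output; stdout carries the hook protocol.
  if ${nix}/bin/nix derivation show "$DRV" \
      | ${jq}/bin/jq --exit-status '.["'"$DRV"'"].env.requiredSystemFeatures | contains("cuda")' > /dev/null
  then
    echo "extra-sandbox-paths"
    echo "/dev=/dev"
    ${lib.concatMapStringsSep "\n" (p: "echo \"${p}=${p}\"") driverPaths}
  fi
''
```

Usage would then be along the lines of `cudaPackages.preBuildHook.override { driverPaths = [ "/usr/lib/x86_64-linux-gnu" ]; }` for a non-NixOS host.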