Looking for the right way to run a local CUDA container #472

nzwulfin · 2024-11-20T02:15:07Z

nzwulfin
Nov 20, 2024

I have a system with the NVIDIA drivers, CUDA, and CDI installed, and a hacked up way that appears to be building
the CUDA container correctly to local storage (see #471). I installed ramalama via the install.sh.

If I run a nvidia-smi test and pass the devices using the local CUDA container, I can see the GPU

podman run --rm --device nvidia.com/gpu=all --security-opt=label=disable quay.io/ramalama/cuda nvidia-smi -L

==========
== CUDA ==
==========

CUDA Version 12.6.2

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

GPU 0: NVIDIA GeForce RTX 3060 (UUID: GPU-03a09822-7bb4-f6eb-5825-74533a766f07)

What I can't seem to figure out is how to get the device passed when using ramalama, it doesn't seem to recognize the GPU or try the CUDA container.

ramalama --gpu run llama3.2
GPU offload was requested but is not available on this system
>

I've tried using the --image flag (if I understood the doc right) but that doesn't see the GPU.

ramalama --image quay.io/ramalama/cuda run llama3.2

==========
== CUDA ==
==========

CUDA Version 12.6.2

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

WARNING: The NVIDIA Driver was not detected.  GPU functionality will not be available.
   Use the NVIDIA Container Toolkit to start this container with GPU support; see
   https://docs.nvidia.com/datacenter/cloud-native/ .

I expect there's a configuration somewhere I missed to connect the CDI to ramalama? How do I get ramalama to use this GPU?

Answered by bmahabirbu

Nov 20, 2024

@nzwulfin Try my fork of ramalama using the nv-simple branch
https://github.com/bmahabirbu/ramalama/tree/nv-simple

Give it a run and see if it works. Youll need the Nvidia container toolkit package for this to work! (Assuming you already have a cuda driver installed as well). I pushed a build that works into docker hub temporarily for testing purposes.

Here is an example of how to run it without installing ramalama

git clone https://github.com/bmahabirbu/ramalama.git
cd ramalama
git checkout nv-simple
./bin/ramalama run llama3.2

in the meantime run ramalama with the debug flag like so

ramalama  --debug run llama3

look for the exec_cmd and copy everything before the /bin/sh command. You …

View full answer

ericcurtin · 2024-11-20T05:35:24Z

ericcurtin
Nov 20, 2024
Maintainer

@bmahabirbu got this running, dunno if you need Nvidia container toolkit... Let's keep making Nvidia boring (I know the lack of publishing sucks though)

0 replies

bmahabirbu · 2024-11-20T06:46:08Z

bmahabirbu
Nov 20, 2024

@nzwulfin Try my fork of ramalama using the nv-simple branch
https://github.com/bmahabirbu/ramalama/tree/nv-simple

Give it a run and see if it works. Youll need the Nvidia container toolkit package for this to work! (Assuming you already have a cuda driver installed as well). I pushed a build that works into docker hub temporarily for testing purposes.

Here is an example of how to run it without installing ramalama

git clone https://github.com/bmahabirbu/ramalama.git
cd ramalama
git checkout nv-simple
./bin/ramalama run llama3.2

in the meantime run ramalama with the debug flag like so

ramalama  --debug run llama3

look for the exec_cmd and copy everything before the /bin/sh command. You can use that command to run the container with proper mounts to the models stored in ramalama. you can change the container name as well to test different ones making debugging easier

This is what mine looks like

podman run --rm -i --label RAMALAMA --security-opt=label=disable --name ramalama_Gh1Wo8JZ6J -t --device /dev/dri --mount=type=bind,src=/home/bmahabir/.local/share/ramalama/models/ollama/llama3:latest,destination=/mnt/models/model.file,rw=false docker.io/brianmahabir/rama-cuda:v1

In general the -device /dev/dri part of the podman command enables the container to see the GPU

once inside the container run llama-simple-chat -m /mnt/models/model.file -ngl 99 -c 2048 and see what you get

There is no additional flag on ramalama to get GPU running on ramalama btw!

1 reply

ericcurtin Nov 20, 2024
Maintainer

-c 2048

nzwulfin · 2024-11-20T13:27:52Z

nzwulfin
Nov 20, 2024
Author

Thanks! Disecting the podman run and llama chat commands really helped. For some reason, --device /dev/dri doesn't work on my system but --device nvidia.com/gpu=all. I think the inclusion of the on-board AMD iGPU and the NVIDIA card confuses the CDI?

podman run --rm -i --label RAMALAMA --security-opt=label=disable --name ramalama_AGKhPBnNMP -t --device  nvidia.com/gpu=all --device /dev/kfd --mount=type=bind,src=/home/mrman/.local/share/ramalama/models/ollama/llama3.2:latest,destination=/mnt/models/model.file,rw=false quay.io/ramalama/cuda:latest

3 replies

bmahabirbu Nov 20, 2024

Interesting I tested --device nvidia.com/gpu=all and am getting this error Error: stat nvidia.com/gpu=all: no such file or directory

I'll switch to using --gpus=all which is a legacy way of doing it. @nzwulfin would you mind confirming if this works instead of using --device nvidia.com/gpu=all?

Also I should mention when gpu support is enabled the chatbox will hang for 10-15 seconds after entering the first prompt as the model is loaded into vram.

nzwulfin Nov 20, 2024
Author

That also worked for me, chatbot spins up and I can see activity if i watch nvidia-smi in another terminal.

I took that device line from the CDI page on Podman so I don't really know what the right or wrong ways to invoke the devices are :) Maybe check the /etc/cdi/nvidia.yaml?

I think that's why it gets confused with /dev/dri including the AMD iGPU, that might be showing up as card0 and the CDI config thinks card0 should be the NVIDIA GPU.

bmahabirbu Nov 20, 2024

Gotcha, doing --gpus=all is the correct way as it mimicks the docker argumenet. In retrospect if you know what card you would like to use you can specify that in --device /dev/dri/card0 and it will only give access to that card.

Seperately llama.cpp has a way of selecting the GPU as well but using the simple chat version I'm not sure if it has the capability. I'd say the main goal is for ramalama to auto select the GPU based on vram first so we don't need to mess with configuration inside the container.

nzwulfin · 2024-11-20T14:54:20Z

nzwulfin
Nov 20, 2024
Author

I also realized from your branch that CUDA_VISIBLE_DEVICES is what ramalama is using to enable GPU support but I don't see where that is actual set. So am I right thinking that at the moment, there's no auto-detection for NVIDIA GPUs (or via ramalama --gpu?

4 replies

rhatdan Nov 20, 2024
Maintainer

Nope gpu detection is the goal, but it is very limited at this stage. PRs welcomed.

nzwulfin Nov 20, 2024
Author

Right on, thanks for confirming!

bmahabirbu Nov 20, 2024

Correct as of right now ramalama isnt detecting the nvidia GPU automatically. My branch assumes that you have the GPU already and makes the necessary configurations to enable support.

ericcurtin Nov 20, 2024
Maintainer

Yeah correct right now you have to manually set that env var for Nvidia, so the logic is right now:

If it's set use it
Otherwise auto-detect GPU, which is not complete for Nvidia

nzwulfin · 2024-11-20T17:14:51Z

nzwulfin
Nov 20, 2024
Author

I'm making a lot of assumptions because I only have access to one system with an NVIDIA card and it already has CUDA installed an working, so I don't know how stable or useful this might be for detection.

The CUDA driver packages are probably a requirement for any of this to work, and the packages in rpm-fusion include the nvidia-smi command. This can be used to query the gpu for details including something called compute_cap:

"compute_cap"
The CUDA Compute Capability, represented as Major DOT Minor.

On my system with a 3060 this is what the command output looks like

/usr/bin/nvidia-smi --query-gpu compute_cap --format=csv,noheader
8.6

So maybe a test for the binary and then a non-zero response to the specific compute_cap query via an os.cmd? I don't know what that query would give on different cards.

>>> import os
>>> cmd = '/usr/bin/nvidia-smi --query-gpu compute_cap --format=csv,noheader'
>>> val = os.popen(cmd).read()
>>> float(val)
8.6
>>> if float(val) > 1.0:
...     print(val)
...     
8.6

Thoughts?

0 replies

ericcurtin · 2024-11-20T18:33:48Z

ericcurtin
Nov 20, 2024
Maintainer

Describing what we are trying to do is auto-select the primary GPU in a system to use (but one may manually pick a GPU also). The simplest metric to do this, is the GPU with the most VRAM. We would like to do this with minimal dependencies as possible and if we need to use dependencies in the container image if possible (if you see the AMD implementation, it's all done in python). Also if all the GPUs have < 1GB VRAM, we probably want to just automatically do CPU inferencing.

3 replies

nzwulfin Nov 20, 2024
Author

Makes sense, I'm guessing the VRAM check is what is happening here, on my system that picks up my AMD iGPU (Ryzen 9 9900x). I looked through /sys/bus/pci for something similar for the NVIDIA discrete GPU and I don't see anything that looks like memory reporting.

Everything I can find to collect information seems to need some sort of NVIDIA proprietary something, even the python bindings for their libraries.

What's the current right way to select a GPU manually? Is it what Brian showed me up thread?

ericcurtin Nov 20, 2024
Maintainer

Yeah you are right that is where we do it for AMD, not sure what's the best way to do that for Nvidia

I think you can manually pick a GPU via those env vars you guys have seen or it might not be implemented. A lot of stuff in RamaLama is partially implemented so feel free to help us out completing this stuff.

Btw I'm currently on vacation so I'm saying a lot of this stuff based on memory rather than reading the code so not everything I say is 100% accurate. Answering on a phone.

nzwulfin Nov 20, 2024
Author

You should definitely be vacationing than responding to any of this, thanks for responding at all!

nzwulfin · 2024-11-21T20:50:56Z

nzwulfin
Nov 21, 2024
Author

@bmahabirbu I'm about to head on vacation for a few days, but I ginned up this detection that works locally, wondering if it works for you. Unlike Eric, I won't be able to see questions on my phone :)

I'm running this on Fedora 41 with a 3060 12G, nvidia-container-toolkit-1.17.2-1, xorg-x11-drv-nvidia-565.57.01-3.fc41, and xorg-x11-drv-nvidia-cuda-565.57.01-3

It uses nvidia-smi which may not be an overall good direction. my very elegant test to fail is just renaming /usr/bin/nvidia-smi. Mostly just looking for feedback, I get the detection direction might be headed in different ways. It's been a little while since I've done a lot of python.

https://github.com/nzwulfin/ramalama/tree/nv-detect

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Looking for the right way to run a local CUDA container #472

{{title}}

Replies: 7 comments 11 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

Select a reply

Looking for the right way to run a local CUDA container #472

nzwulfin Nov 20, 2024

Replies: 7 comments · 11 replies

ericcurtin Nov 20, 2024 Maintainer

bmahabirbu Nov 20, 2024

ericcurtin Nov 20, 2024 Maintainer

nzwulfin Nov 20, 2024 Author

bmahabirbu Nov 20, 2024

nzwulfin Nov 20, 2024 Author

bmahabirbu Nov 20, 2024

nzwulfin Nov 20, 2024 Author

rhatdan Nov 20, 2024 Maintainer

nzwulfin Nov 20, 2024 Author

bmahabirbu Nov 20, 2024

ericcurtin Nov 20, 2024 Maintainer

nzwulfin Nov 20, 2024 Author

ericcurtin Nov 20, 2024 Maintainer

nzwulfin Nov 20, 2024 Author

ericcurtin Nov 20, 2024 Maintainer

nzwulfin Nov 20, 2024 Author

nzwulfin Nov 21, 2024 Author

nzwulfin
Nov 20, 2024

Replies: 7 comments 11 replies

ericcurtin
Nov 20, 2024
Maintainer

bmahabirbu
Nov 20, 2024

ericcurtin Nov 20, 2024
Maintainer

nzwulfin
Nov 20, 2024
Author

nzwulfin Nov 20, 2024
Author

nzwulfin
Nov 20, 2024
Author

rhatdan Nov 20, 2024
Maintainer

nzwulfin Nov 20, 2024
Author

ericcurtin Nov 20, 2024
Maintainer

nzwulfin
Nov 20, 2024
Author

ericcurtin
Nov 20, 2024
Maintainer

nzwulfin Nov 20, 2024
Author

ericcurtin Nov 20, 2024
Maintainer

nzwulfin Nov 20, 2024
Author

nzwulfin
Nov 21, 2024
Author