amd_gpu: add udev rules to bypass the 'render' group
On k8s nodes we need to be able to bypass the access restriction
on GPU-related devices (/dev/kfd, /dev/dri/renderXXXX), which are
owned by root:render; see
ROCm/k8s-device-plugin#39

We no longer need to vary the kfd access policies, so it seems
good to transform the option into something more flexible for
a broader range of use cases.

Bug: T333009
Change-Id: Idab004a1a725b1223d4ee36d2d0d900c329140f9
elukey committed Apr 20, 2023
1 parent 6534b6d commit 4e9fba7
Showing 2 changed files with 26 additions and 13 deletions.
31 changes: 22 additions & 9 deletions modules/amd_rocm/manifests/init.pp
@@ -12,16 +12,17 @@
 # so please check the supported versions before setting it.
 # Default: "42"
 #
-# [*kfd_access_group*]
-# Add a udev rule for the kfd device to allow access to users
-# of a specific group. This is usually not needed since the kfd
-# device should be readable by the 'render' group.
-# Default: undef
+# [*allow_gpu_broader_access*]
+# Add udev custom rules to allow access to the GPU devices (kfd, renderXXXX)
+# by "others" in order to bypass any group restriction (for example, by the render
+# group). This should be enabled only on nodes without shared/multi-user setup
+# (for example, k8s nodes but not stat100x nodes).
+# Default: false
 #
 #
 class amd_rocm (
     String $version = '42',
-    Optional[String] $kfd_access_group = undef,
+    Boolean $allow_gpu_broader_access = false,
 ) {

     $supported_versions = ['42', '431', '45', '54']
@@ -34,13 +35,25 @@
         fail('Please use ROCm 5.4 with Bullseye, other versions are not supported.')
     }

-    if $kfd_access_group {
+    # In most cases, like the stat100x nodes, we are able to control all the users
+    # and add them to the 'render' group, needed to access the various devices
+    # exposed by ROCm to the OS. In cases like k8s, we delegate the GPU
+    # to a device plugin that then exposes the GPU to the Kubelet, and it gets
+    # complicated to respect the 'render' posix group access restriction
+    # (see https://github.com/RadeonOpenCompute/k8s-device-plugin/issues/39 for
+    # more info).
+    if $allow_gpu_broader_access {
         file { '/etc/udev/rules.d/70-kfd.rules':
             group   => 'root',
             owner   => 'root',
             mode    => '0544',
-            content => "SUBSYSTEM==\"kfd\", KERNEL==\"kfd\", TAG+=\"uaccess\", GROUP=\"${kfd_access_group}\"",
-            require => Group[$kfd_access_group],
+            content => "SUBSYSTEM==\"kfd\", KERNEL==\"kfd\", MODE=\"0666\"",
         }
+        file { '/etc/udev/rules.d/70-render.rules':
+            group   => 'root',
+            owner   => 'root',
+            mode    => '0544',
+            content => "SUBSYSTEM==\"drm\", KERNEL==\"renderD*\", MODE=\"0666\"",
+        }
     }

8 changes: 4 additions & 4 deletions modules/profile/manifests/amd_gpu.pp
@@ -2,8 +2,8 @@
 # == Class profile::amd_gpu
 #
 class profile::amd_gpu (
-    $rocm_version = lookup('profile::amd_gpu::rocm_version', { 'default_value' => undef }),
-    $kfd_access_group = lookup('profile::amd_gpu::kfd_access_group', { 'default_value' => undef }),
+    Optional[String] $rocm_version = lookup('profile::amd_gpu::rocm_version', { 'default_value' => undef }),
+    Boolean $allow_gpu_broader_access = lookup('profile::amd_gpu::allow_gpu_broader_access', { 'default_value' => false }),
 ) {

     if $rocm_version {
@@ -15,8 +15,8 @@
         require profile::python38

         class { 'amd_rocm':
-            version          => $rocm_version,
-            kfd_access_group => $kfd_access_group,
+            version                  => $rocm_version,
+            allow_gpu_broader_access => $allow_gpu_broader_access,
         }

         class { 'prometheus::node_amd_rocm':
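
A minimal sketch of how a node could opt in once this change is applied. It assumes a Bullseye host, where the module only accepts ROCm 5.4 ('54'). The version value and the direct class declaration are illustrative assumptions: in this repository the class is declared through profile::amd_gpu, which reads the flag from the Hiera key profile::amd_gpu::allow_gpu_broader_access.

```puppet
# Illustrative only: the version value is an assumption for a Bullseye host.
# On real nodes the flag would normally be set via the Hiera key
# profile::amd_gpu::allow_gpu_broader_access rather than declared directly.
class { 'amd_rocm':
    version                  => '54',
    allow_gpu_broader_access => true,
}
```

With the flag enabled, the two file resources write udev rules of the form `SUBSYSTEM=="kfd", KERNEL=="kfd", MODE="0666"` and `SUBSYSTEM=="drm", KERNEL=="renderD*", MODE="0666"`. These make /dev/kfd and the render nodes readable and writable by every local user, which is why the parameter documentation limits the option to nodes without a shared/multi-user setup.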
