Author: @therc
Date: Apr 2016
Status: Design in progress, early implementation of requirements
Users should be able to request GPU resources for their workloads, as easily as for CPU or memory. Kubernetes should keep an inventory of machines with GPU hardware, schedule containers on appropriate nodes and set up the container environment with all that's necessary to access the GPU. All of this should eventually be supported for clusters on either bare metal or cloud providers.
An increasing number of workloads, such as machine learning and seismic survey processing, benefit from offloading computations to graphics hardware. While not as tuned as traditional, dedicated high performance computing systems such as MPI, a Kubernetes cluster can still be a great environment for organizations that also need to run a variety of "classic" workloads, such as databases and web serving.
GPU support is hard to provide comprehensively and will thus take time to tame completely, because:
- different vendors expose the hardware to users in different ways
- some vendors require fairly tight coupling between the kernel driver controlling the GPU and the libraries/applications that access the hardware
- it adds more resource types (whole GPUs, GPU cores, GPU memory)
- it can introduce new security pitfalls
- for systems with multiple GPUs, affinity matters, similarly to NUMA considerations for CPUs
- running GPU code in containers is still a relatively novel idea
Currently, this document is mostly focused on the basic use case: run GPU code on AWS `g2.2xlarge` EC2 machine instances using Docker. It constitutes a narrow enough scenario that it does not require large amounts of generic code yet. GCE doesn't support GPUs at all; bare metal systems throw a lot of extra variables into the mix.
Later sections will outline future work to support a broader set of hardware, environments and container runtimes.
Before any scheduling can occur, we need to know what's available out there. In v0, the kubelet will report a hardcoded capacity based on a flag, `--experimental-nvidia-gpu`. This will cause the user-defined resource `alpha.kubernetes.io/nvidia-gpu` to be reported for `NodeCapacity` and `NodeAllocatable`, as well as a node label.
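As a rough illustration of the reporting described above, the sketch below builds the data a node might publish. It is not kubelet code: the helper `node_status`, the assumption that the flag carries an integer device count, the label value, and the choice to make allocatable equal to capacity are all hypothetical.

```python
# Hypothetical sketch, not kubelet code. Only the resource name comes from
# this proposal; everything else is an assumption made for illustration.
GPU_RESOURCE = "alpha.kubernetes.io/nvidia-gpu"


def node_status(experimental_nvidia_gpu: int, cpu: str, memory: str) -> dict:
    """Build the capacity, allocatable and label data a node might report."""
    capacity = {"cpu": cpu, "memory": memory}
    labels = {}
    if experimental_nvidia_gpu > 0:
        # v0 trusts the flag instead of probing the hardware.
        capacity[GPU_RESOURCE] = str(experimental_nvidia_gpu)
        # Assumption: the label reuses the resource name and the count.
        labels[GPU_RESOURCE] = str(experimental_nvidia_gpu)
    # Assumption: nothing is reserved, so allocatable mirrors capacity.
    return {
        "capacity": capacity,
        "allocatable": dict(capacity),
        "labels": labels,
    }
```

Calling `node_status(1, "8", "15Gi")` yields a GPU entry in both resource maps and in the labels, while a count of `0` leaves the node looking like any other.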
GPUs will be visible as first-class resources. In v0, we'll only assign whole devices; sharing among multiple pods is left to future implementations. It's probable that GPUs will exacerbate the need for a rescheduler or pod priorities, especially if the nodes in a cluster are not homogeneous. Consider these two cases:
Only half of the machines have a GPU and they're all busy with other workloads. The other half of the cluster is doing very little work. A GPU workload comes, but it can't schedule: the devices are sitting idle on nodes whose other resources are taken by non-GPU work, and the nodes with little load lack the hardware.
Some or all of the machines have two graphics cards each. A number of jobs get scheduled, requesting one device per pod. The scheduler puts them all on different machines, spreading the load, perhaps by design. Then a new job comes in, requiring two devices per pod, but it can't schedule anywhere, because all we can find, at most, is one unused device per node.
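The second case can be reproduced with a toy model. The sketch below is not the Kubernetes scheduler; `schedule` and `run` are hypothetical helpers that track only free GPU counts per node and compare a spreading policy with a packing one.

```python
def schedule(nodes, request, spread=True):
    """Place a pod needing `request` GPUs; return the node index or None.

    `nodes` is a list of free GPU counts and is updated in place on success.
    """
    candidates = [i for i, free in enumerate(nodes) if free >= request]
    if not candidates:
        return None
    # Spreading prefers the node with the most free devices,
    # packing prefers the one with the fewest.
    pick = max if spread else min
    chosen = pick(candidates, key=lambda i: nodes[i])
    nodes[chosen] -= request
    return chosen


def run(spread):
    """Three two-GPU nodes, three one-GPU pods, then one two-GPU pod."""
    nodes = [2, 2, 2]
    for _ in range(3):
        schedule(nodes, 1, spread)
    placement = schedule(nodes, 2, spread)
    return nodes, placement
```

With spreading, `run(True)` leaves one free device on every node and the two-device pod gets no placement. With packing, `run(False)` keeps a whole node free for it. Neither policy is right in general, which is why a rescheduler or pod priorities may become necessary.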
Once we know where to run the container, it's time to set up its environment. At a minimum, we'll need to map the host device(s) into the container. Because each manufacturer exposes different device nodes (`/dev/ati/card0`, `/dev/nvidia0`, but also the required `/dev/nvidiactl` and `/dev/nvidia-uvm`), some of the logic needs to be hardware-specific, mapping from a logical device to a list of device nodes necessary for software to talk to it.
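A minimal sketch of such a mapping follows. The function `device_nodes` and the vendor keys are hypothetical; the paths are the ones named above, and the assumption that card numbering simply follows the logical index is mine.

```python
def device_nodes(vendor: str, index: int) -> list:
    """Map a logical GPU (vendor plus index) to the host device nodes it needs."""
    if vendor == "nvidia":
        # The control and UVM nodes are shared by all cards; only the last
        # entry depends on which card was assigned.
        return ["/dev/nvidiactl", "/dev/nvidia-uvm", f"/dev/nvidia{index}"]
    if vendor == "ati":
        return [f"/dev/ati/card{index}"]
    raise ValueError(f"no device mapping known for vendor {vendor!r}")
```

For example, `device_nodes("nvidia", 0)` returns the three nodes needed to drive the first NVIDIA card.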
Support binaries and libraries are often versioned along with the kernel module, so there should be further hooks to project those under `/bin` and some kind of `/lib` before the application is started. This can be done for Docker with the use of a versioned Docker volume or with upcoming Kubernetes-specific hooks such as init containers and volume containers. In v0, images are expected to bundle everything they need.
The first implementation and testing ground will be for NVIDIA devices, by far the most common setup.
In v0, the `--experimental-nvidia-gpu` flag will also cause the host devices (limited to those required to drive the first card, `nvidia0`) to be mapped into the container by the dockertools library.
This is what happens before and after a user schedules a GPU pod.
- Administrator installs a number of Kubernetes nodes with GPUs. The correct kernel modules and device nodes under `/dev/` are present.
- Administrator makes sure the latest CUDA/driver versions are installed.
- Administrator enables `--experimental-nvidia-gpu` on kubelets.
- Kubelets update node status with information about the GPU device, in addition to cAdvisor's usual data about CPU/memory/disk.
- User creates a Docker image compiling their application for CUDA, bundling the necessary libraries. In v0, any versioning requirements that the image expresses through labels based on NVIDIA's conventions are ignored.
- User creates a pod using the image, requiring `alpha.kubernetes.io/nvidia-gpu: 1`.
- Scheduler picks a node for the pod.
- The kubelet notices the GPU requirement and maps the three devices. In Docker's engine-api, this means it'll add them to the `Resources.Devices` list.
- Docker runs the container to completion.
- The scheduler notices that the device is available again.
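The pod creation and device mapping steps above can be sketched as follows. The pod name, image, and the choice of `limits` rather than `requests` are assumptions; `docker_devices` is a hypothetical helper, not kubelet code. The `PathOnHost`, `PathInContainer` and `CgroupPermissions` keys follow the device mapping entries of the Docker Engine API.

```python
GPU_RESOURCE = "alpha.kubernetes.io/nvidia-gpu"

# The three host devices needed to drive the first card in v0.
V0_DEVICES = ["/dev/nvidiactl", "/dev/nvidia-uvm", "/dev/nvidia0"]

# A pod manifest as a plain dict; name and image are placeholders.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "cuda-job"},
    "spec": {
        "restartPolicy": "Never",
        "containers": [{
            "name": "cuda",
            "image": "example.com/cuda-app:latest",
            # Assumption: the requirement is expressed as a limit.
            "resources": {"limits": {GPU_RESOURCE: 1}},
        }],
    },
}


def docker_devices(container: dict) -> list:
    """Return Docker device mappings for a container, or [] if it needs no GPU."""
    limits = container.get("resources", {}).get("limits", {})
    if int(limits.get(GPU_RESOURCE, 0)) < 1:
        return []
    return [
        {"PathOnHost": path, "PathInContainer": path, "CgroupPermissions": "mrw"}
        for path in V0_DEVICES
    ]
```

Passing `pod["spec"]["containers"][0]` to `docker_devices` gives three mappings, one per device node; a container without the resource gets none and runs exactly as before.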
For v0, we discussed the nvidia-docker plugin at length, but decided to leave it aside initially. The plugin is an officially supported solution, so using it would avoid a lot of new low-level code, as it takes care of functionality such as:
- creating a Docker volume with binaries such as `nvidia-smi` and shared libraries
- providing HTTP endpoints that monitoring tools can use to collect GPU metrics
- abstracting details such as `/dev` entry names for each device, as well as control ones like `nvidiactl`
The `nvidia-docker` wrapper also verifies that the CUDA version required by a given image is supported by the host drivers, through inspection of well-known image labels, if present. We should try to provide equivalent checks, either for CUDA or OpenCL.
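An equivalent check could look roughly like the sketch below. It is not the wrapper's actual logic: `image_supported` is a hypothetical helper, the default label name `com.nvidia.cuda.version` is an assumption about NVIDIA's convention, and treating an unlabeled image as compatible is a design choice made only for this example.

```python
def parse_version(text: str) -> tuple:
    """Turn a dotted version such as '7.5' into a comparable tuple (7, 5)."""
    return tuple(int(part) for part in text.split("."))


def image_supported(labels: dict, host_max_cuda: str,
                    label: str = "com.nvidia.cuda.version") -> bool:
    """Check an image's declared CUDA requirement against the host's maximum.

    `label` is an assumed label name, not confirmed by this proposal.
    """
    required = labels.get(label)
    if required is None:
        # No declared requirement: assume the image is compatible.
        return True
    # Compare numerically so that 7.10 is treated as newer than 7.5.
    return parse_version(required) <= parse_version(host_max_cuda)
```

Comparing parsed tuples rather than raw strings matters once minor versions reach two digits.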
This is current sample output from `nvidia-docker-plugin`, wrapped for readability:
$ curl -s localhost:3476/docker/cli
--device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0
--volume-driver=nvidia-docker
--volume=nvidia_driver_352.68:/usr/local/nvidia:ro
It runs as a daemon listening for HTTP requests on port 3476. The endpoint above returns flags that need to be added to the Docker command line in order to expose GPUs to the containers. There are optional URL arguments to request specific devices if more than one is present on the system, as well as specific versions of the support software. An obvious improvement is an additional endpoint for JSON output.
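Until a JSON endpoint exists, a caller would have to parse the flag string itself. The sketch below does that for the sample output shown above. `parse_cli_flags` is a hypothetical helper, and it handles only the three flags that appear in the sample.

```python
import shlex

# The sample response from above, joined back into a single line.
SAMPLE = (
    "--device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0 "
    "--volume-driver=nvidia-docker "
    "--volume=nvidia_driver_352.68:/usr/local/nvidia:ro"
)


def parse_cli_flags(text: str) -> dict:
    """Split Docker CLI flags into devices, volumes and the volume driver."""
    result = {"devices": [], "volumes": [], "volume_driver": None}
    for token in shlex.split(text):
        flag, _, value = token.partition("=")
        if flag == "--device":
            result["devices"].append(value)
        elif flag == "--volume":
            # Format is name:path[:mode]; the mode defaults to read-write.
            name, path, *mode = value.split(":")
            result["volumes"].append(
                {"name": name, "path": path, "mode": mode[0] if mode else "rw"}
            )
        elif flag == "--volume-driver":
            result["volume_driver"] = value
        # Flags not present in the sample are ignored in this sketch.
    return result
```

`parse_cli_flags(SAMPLE)` yields the three device paths, one read-only volume mounted at `/usr/local/nvidia`, and the `nvidia-docker` volume driver.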
The unresolved question is whether `nvidia-docker-plugin` would run standalone as it does today (called over HTTP, perhaps with endpoints for a new Kubernetes resource API) or whether the relevant code from its `nvidia` package should be linked directly into kubelet. A partial list of tradeoffs:
| | External binary | Linked in |
|---|---|---|
| Use of cgo | Confined to binary | Linked into kubelet, but with lazy binding |
| Expandability | Limited if we run the plugin, increased if library is used to build a Kubernetes-tailored daemon. | Can reuse the nvidia library as we prefer |
| Bloat | None | Larger kubelet, even for systems without GPUs |
| Reliability | Need to handle the binary disappearing at any time | Fewer headaches |
| (Un)Marshalling | Need to talk over JSON | None |
| Administration cost | One more daemon to install, configure and monitor | No extra work required, other than perhaps configuring flags |
| Releases | Potentially on its own schedule | Tied to Kubernetes' |
The first two tracks can progress in parallel.
- Define new resource `alpha.kubernetes.io/nvidia-gpu` in `pkg/api/types.go` and co.
- Plug resource into feasibility checks used by kubelet, scheduler and schedulercache. Maybe gated behind a flag?
- Plug resource into `resource_helpers.go`
- Plug resource into the limitranger
- Add kubelet config parameter to enable the resource
- Make kubelet's `setNodeStatusMachineInfo` report the resource
- Add a Devices list to `container.RunContainerOptions`
- Use it from DockerManager's `runContainer`
- Do the same for rkt (stretch goal)
- When a pod requests a GPU, add the devices to the container options
- Add new resource to `kubectl describe` output. Optional for non-GPU users?
- Administrator documentation, with sample scripts
- User documentation
Above all, we need to collect feedback from real users and use that to set priorities for any of the items below.
- Perform real detection of the installed hardware
- Figure out a standard way to avoid bundling of shared libraries in images
- Support fractional resources so multiple pods can share the same GPU
- Support bare metal setups
- Report resource usage
- Support multiple GPUs with resource hierarchies and affinities
- Support versioning of resources (e.g. "CUDA v7.5+")
- Build resource plugins into the kubelet?
- Support other device vendors
- Support Azure?
- Support rkt?
- Support OpenCL (so images can be device-agnostic)
It makes sense to turn the output of this project (external resource plugins, etc.) into a more generic abstraction at some point.
There should be knobs for the cluster administrator to only allow certain users or roles to schedule GPU workloads. Overcommitting or sharing the same device across different pods is not considered safe. It should be possible to segregate such GPU-sharing pods by user, namespace or a combination thereof.