CUDA Target
The cuda mode targets NVIDIA GPU hardware via NVIDIA's proprietary driver API and C compiler.
The necessary documentation and software can be downloaded from the NVIDIA website.
In particular, you will need the following downloads:
- The CUDA toolkit installer.
- A compatible driver version.
The SDK download is not necessary because it only contains C++ code examples.
Detailed error reporting from GPU code (enabled via optimize debug >= 1) requires a device with CUDA compute capability 1.1 or higher and support for mapping CPU memory into the GPU address space.
All GPU operations require an active CUDA context. It can be created via the following function:
- (cuda-create-context device-id &optional flags)
The device ID is an ordinal index of the device. The flags argument may be used to specify a list of the following possible values:
- :sched-spin – instructs the system to actively spin while waiting for the GPU.
- :sched-yield – instructs the system to yield its CPU slice while waiting.
- :blocking-sync – instructs the system to block the thread.
- :map-host – enables mapping of CPU memory if supported by the device.
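As a sketch (the device ordinal is arbitrary, and passing the flags as a quoted list is an assumption based on the description above), creating a context with host memory mapping enabled might look like this:

```lisp
;; Create a context on the first device (ordinal 0), enabling
;; mapping of CPU memory if the device supports it.
(defparameter *ctx* (cuda-create-context 0 '(:map-host)))
```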
Contexts are bound to the thread that created them and form a stack. The following function returns the current context of the current thread:
- (cuda-current-context)
A context can be destroyed in the following way:
- (cuda-destroy-context (cuda-current-context))
Every CUDA context contains its own instance of a GPU code module. The instance is loaded when the module is first accessed, and automatically reinitialized if it changes.
CUDA kernels accept the following predefined keyword parameters:
- :block-cnt-x, :block-cnt-y – define the block grid.
- :thread-cnt-x, :thread-cnt-y, :thread-cnt-z – define the in-block thread grid.
All of the mentioned parameters default to 1.
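For illustration only — the kernel name, its data argument, and the call-as-function convention are assumptions — a launch with a two-dimensional block grid might look like:

```lisp
;; Hypothetical kernel launch: a 16x16 grid of blocks with
;; 64 threads per block along x (unspecified dimensions default to 1).
(my-kernel data-buffer
           :block-cnt-x 16 :block-cnt-y 16
           :thread-cnt-x 64)
```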
Dynamic arrays must be allocated as CUDA linear memory, or mapped host memory buffers (see the Buffers page for details). For debugging convenience kernel wrappers can handle any buffer type by allocating temporary areas and automatically copying data.
The driver API was designed by NVIDIA for use in languages like C++, where fixing a bug requires recompiling and restarting the program. As a consequence, once a crash is detected in GPU code, the CUDA context becomes completely unusable and must be re-created from scratch.
In order to improve usability in a REPL environment, the library implements in-place reinitialization of the current CUDA context and all associated objects. This operation can be invoked via the following function:
- (cuda-recover)
The operation preserves the state of the Lisp wrapper objects, but replaces the underlying low-level memory pointers and handles. Also, since a failed context doesn't even allow reading device memory, reallocated device memory blocks are zero-filled.
- *cuda-debug*
When this global variable is true, all allocated device memory blocks are mirrored in ordinary memory, which allows their contents to be recovered, at the cost of many additional memory copies; it also makes all kernel calls strictly synchronous. In addition, cuda-create-context automatically includes :map-host in the flag list.
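For example, the debug mode can be enabled before creating a context (a sketch; the device ordinal 0 is arbitrary):

```lisp
;; Mirror all device allocations in ordinary memory for debugging.
(setf *cuda-debug* t)
;; :map-host is now included in the flag list automatically.
(cuda-create-context 0)
```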
For convenience, common operations provide a recover-and-retry restart when a CUDA error condition is signalled.
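The restart can also be invoked programmatically. The following is a sketch: the exact condition type signalled by the library is not specified here, so the handler matches any error, and run-gpu-step stands in for a hypothetical GPU operation.

```lisp
;; Invoke the recover-and-retry restart whenever a CUDA error
;; is signalled during the wrapped GPU operation.
(handler-bind ((error
                 (lambda (c)
                   (let ((restart (find-restart 'recover-and-retry c)))
                     (when restart
                       (invoke-restart restart))))))
  (run-gpu-step))
```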