Lov-e-cuda is a single-file, header-only C++ library providing basic utilities for CUDA programming. It aims to impose very little disruption compared to "traditional" CUDA programming, while providing the comfort of RAII and utilities that help keep your code DRY.
Do you sometimes forget to check for CUDA errors? Are you tired of writing the same error-checking code again and again?
Lov-e-cuda provides a set of simple function-like macros to check the return value of a CUDA API call or the last CUDA error.
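As a taste, here is a minimal sketch using `check_last_cuda_error_sync_device()`, the checking macro that appears in the examples further down; the kernel name and launch configuration are purely illustrative placeholders.

```cpp
// Illustrative kernel launch followed by an error check; 'my_kernel', 'blocks'
// and 'threads' are placeholder names, not part of the library.
my_kernel<<<blocks, threads>>>(/* ... */);
check_last_cuda_error_sync_device();
```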
Is your code filled with `cudaMalloc`s, `cudaFree`s, and `cudaMallocHost`s?
Do they sometimes not match, leading to undefined behavior, crashes, and memory leaks?
Do they mix with host-side `malloc`s and `new`s in an unpleasantly inconsistent way?
Do you sometimes forget the `* sizeof(float)` in the argument of a `malloc` or `cudaMalloc`?
Lov-e-cuda provides consistent utilities to allocate and free memory on the host and on the device, with a homogeneous, drop-in replacement API for all these (de-)allocation functions. All layers are templated on the type of the allocated data, to make your code more DRY.
Similarly, Lov-e-cuda provides homogeneous replacements for the `cudaMemcpy`/`cudaMemset` and `memcpy`/`memset` functions.
Most importantly, it also provides RAII classes to make sure each allocation is matched with a deallocation.
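As a minimal sketch of the RAII approach (the full walk-through is in the user manual below); `width` and `height` are placeholder variables:

```cpp
{
  // Device memory is allocated here...
  Array1D<Device, float> buffer(width * height, uninitialized);
  // ... used here ...
}  // ... and released automatically at the end of the scope: no cudaFree to forget.
```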
At the conceptual level, most parallel computing projects manipulate multi-dimensional arrays.
But in the code, this abstraction is often implemented using explicit computation of single-dimensional indexes like `data[i * width + j]`.
Lov-e-cuda provides template classes to "view" an area of memory as a multi-dimensional array.
This view is accessed like `data[i][j]`.
Thanks to inlining and return-value optimization, the performance cost is negligible when compiled with full optimization.
This increased abstraction level comes with the added benefit of boundary checks on each individual index (which can be deactivated by defining the `NDEBUG` C++ preprocessor macro).
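A minimal sketch of this access pattern, using the `Array2D` class documented in the user manual below; `height`, `width`, `i` and `j` are placeholder variables:

```cpp
Array2D<Host, float> data(height, width, zeroed);
data[i][j] = 3.14f;  // each index is boundary-checked unless NDEBUG is defined
```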
Are you tired of writing `const int x = blockIdx.x * blockDim.x + threadIdx.x`? (And `y`, and `z`?)
Do you sometimes forget that your data size might not be a perfect multiple of the number of threads in your CUDA blocks?
Do you have to check whether blocks or threads come first in the kernel call configuration? (Is it `kernel<<<blocks, threads>>>` or `kernel<<<threads, blocks>>>`?)
Lov-e-cuda provides utilities to avoid repeating computations like `dim3 blocks((width + BLOCKDIM_X - 1) / BLOCKDIM_X)` (correct even when `width % BLOCKDIM_X != 0`), to call kernels, and to retrieve `blockIdx.x * blockDim.x + threadIdx.x` in your kernels in a readable and efficient way.
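The underlying computation is plain ceiling division; the helper below is a hypothetical illustration of that arithmetic, not the library's actual API.

```cpp
// Hypothetical helper, for illustration only: how many blocks of 'block_dim'
// threads are needed to cover 'size' elements.
inline unsigned blocks_needed(unsigned size, unsigned block_dim) {
  return (size + block_dim - 1) / block_dim;  // correct even when size % block_dim != 0
}
```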
Lov-e-cuda was sponsored by Laurent Cabaret from the MICS and written by Vincent Jacques.
Lov-e-cuda is licensed under the quite permissive MIT license.
Whenever appropriate, we kindly ask that you cite Lov-e-cuda in the following way: @todo Add BibTeX entry. This is customary in academia, as noted e.g. in the documentation for matplotlib and GNU parallel.
Lov-e-cuda is a single-file header-only library so getting started is very simple:
- download `lov-e.hpp`
- put it in your include path, for example in `/usr/local/include`, or simply beside your code
- add `#include <lov-e.hpp>` in your files
Then see the "User manual" and "Examples" sections below.
Here are a few examples we provide for you to see how Lov-e-cuda can be applied. For each of them, we provide one version that uses Lov-e-cuda and one that doesn't. We use the same, somewhat pedantic, coding standards in all examples so that they differ only by the introduction of Lov-e-cuda.
Comparing the versions without and with Lov-e-cuda shows how the abstractions it provides simplify the code.
Comparing their runtime performance proves that Lov-e-cuda is neutral on that aspect.
Note that this is true only when compiling with `-DNDEBUG` to deactivate `assert`s.
On the other hand, when `assert`s are activated, code using Lov-e-cuda can be slower. That is because it checks that all indexes are within boundaries, and is therefore safer.
Based on this NVidia blog article by Andy Adinets about dynamic parallelism, we provide two examples that produce a 16384x16384 image of the Mandelbrot set: one with dynamic parallelism and one without.
All performance figures below were measured on code compiled with `-O3 -DNDEBUG`.
| Example | Without Lov-e-cuda | With Lov-e-cuda |
|---|---|---|
| Mandelbrot (static parallelism) | 190 ms i.e. 1411 Mpix/s | 190 ms i.e. 1410 Mpix/s |
| Mandelbrot (dynamic parallelism) | 43 ms i.e. 6174 Mpix/s | 42 ms i.e. 6319 Mpix/s |
Instead of allocating with `cudaMalloc`,

```cpp
float* data;
cudaMalloc(&data, width * height * sizeof(float));
```

... passing pointer and sizes to your kernel,

```cpp
kernel<<<1, 1>>>(data, width, height);
check_last_cuda_error_sync_device();
```

... declared as:

```cpp
__global__ void kernel(float* data, const unsigned width, const unsigned height)
```

... computing explicit linearized indexes (in your kernel),

```cpp
data[y * width + x]
```

... and explicitly freeing the memory with `cudaFree`,

```cpp
cudaFree(data);
```
You can allocate an `Array2D`,

```cpp
Array2D<Device, float> data(height, width, uninitialized);
```

... pass it, alone, to your kernel,

```cpp
kernel<<<1, 1>>>(ref(data));
check_last_cuda_error_sync_device();
```

... declared to accept an `ArrayView2D`:

```cpp
__global__ void kernel(ArrayView2D<Device, float> data)
```

... and use it with logical indexes (in your kernel):

```cpp
data[y][x]
```

Memory will be freed automatically at the end of the current scope.
You can allocate a multidimensional array on the host in a consistent manner:

```cpp
Array3D<Host, int> a(42, 36, 57, zeroed);
```

Note that:
- `Array` classes are provided from `1D` to `5D`
- the first template argument can be either:
  - `Device` to allocate memory using `cudaMalloc`
  - `Host` to allocate memory using `std::malloc`
- the first N constructor arguments are the sizes corresponding to the indexes, in the same order
- the last constructor argument can be either:
  - `uninitialized` to get undetermined values
  - `zeroed` to get only zeroes:

```cpp
EXPECT_EQ(a[0][0][0], 0);
EXPECT_EQ(a[41][35][56], 0);
```
Sizes can be retrieved from the `Array` (or `ArrayView`):

```cpp
EXPECT_EQ(a.s2(), 42);
EXPECT_EQ(a.s1(), 36);
EXPECT_EQ(a.s0(), 57);
```

Note how the left-most size is named `s2`: this is so that taking the size after partial indexing is consistent: `s0` will always be the right-most size:

```cpp
EXPECT_EQ(a[0].s1(), 36);
EXPECT_EQ(a[0].s0(), 57);
EXPECT_EQ(a[0][0].s0(), 57);
```
Because, yes, you can pass a partially indexed `ArrayView` to a function:

```cpp
f(a[0][12]);
```

where `f` expects a lower-dimension `ArrayView`:

```cpp
void f(ArrayView1D<Host, int> a)
```
You can also clone an `Array` (or `ArrayView`) from one memory space to another with a simple call to its `clone_to` method:

```cpp
Array3D<Device, int> b = a.clone_to<Device>();
```

Or `copy` the data between two existing arrays:

```cpp
copy<Host, Device>(a, ref(b));
```
The template parameters can be omitted but we recommend you keep them for explicitness.
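As a sketch of the shorter form that sentence refers to, the `copy` call above can rely on template argument deduction:

```cpp
copy(a, ref(b));  // template parameters deduced from the arguments
```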
(Click the arrows to expand each topic)
How to deal with non-trivial types?
It's best practice to use `std::malloc` and `cudaMalloc` only on trivial types, but if you really want to use an `Array` of non-trivial content type, we've got you covered.
Given the following non-trivial type:

```cpp
struct NonTrivial {
  NonTrivial() {}
};
```
You'll need to define the following template specializations:

```cpp
template<>
NonTrivial* Host::alloc<NonTrivial>(const std::size_t n) {
  return Host::force_alloc<NonTrivial>(n);
}

template<>
void Host::memset<NonTrivial>(const std::size_t n, const char v, NonTrivial* const p) {
  Host::force_memset<NonTrivial>(n, v, p);
}

template<>
NonTrivial* Device::alloc<NonTrivial>(const std::size_t n) {
  return Device::force_alloc<NonTrivial>(n);
}

template<>
void Device::memset<NonTrivial>(const std::size_t n, const char v, NonTrivial* const p) {
  Device::force_memset<NonTrivial>(n, v, p);
}
```
This will let you use it in an `Array`:

```cpp
Array1D<Host, NonTrivial> h(10, zeroed);
Array1D<Device, NonTrivial> d(10, zeroed);
```
The specializations of `memset` are required only for `zeroed`, and the specializations for `Host` (resp. `Device`) are required only to create `Host` (resp. `Device`) `Array`s.
Access to raw pointer
For some legacy uses and/or for integrating with third-party libraries, you may need the actual pointer to the underlying data.
You can get it with `a.data()`.
Don't overuse it; you should be able to avoid that most of the time.
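A hedged sketch of such an integration; `third_party_blur` is a purely hypothetical external function, not part of Lov-e-cuda:

```cpp
// Pass the raw device pointer and the logical sizes to a hypothetical
// third-party function that expects plain pointers.
Array2D<Device, float> img(height, width, zeroed);
third_party_blur(img.data(), img.s1(), img.s0());
```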
Lower-level memory management
If RAII doesn't fit the style of your application (yet?), you can still benefit from the more homogeneous API of this library.
@todo Document
@todo Document
@todo Document
Lov-e-cuda is developed in a controlled environment using Docker, with the Docker image built automatically by `make.sh` from `builder/Dockerfile`.
Contributors only need reasonably recent versions of Python, Bash, and Docker to run `./run-development-cycle.py`, which runs all the automated tests.
There are a few automated tests:
- unit tests using Google Test are in the `tests` directory
- some of these tests use the custom tool `builder/make-non-compilation-tests-deps.py` to test for expected compile-time errors
- the code snippets in the "User manual" section above come from `tests/user-manual.cu` and are copied into this `README.md` file by the `run-development-cycle.py` script