libdivide.h
is a header-only C/C++ library for optimizing integer division,
it has both a C API and a
C++ API. This is a summary of
how to use libdivide's testing tools to develop on libdivide itself.
See http://libdivide.com for more information on libdivide.
libdivide has 2 test tools:
- A verification utility tester used to help ensure that the division algorithm is correct.
- A benchmarking utility benchmark used to measure the speed increase.
The tester and benchmark programs can be built using cmake and a recent C++ compiler
that supports C++11 or later. Optionally libdivide.h
can also be installed to
/usr/local/include
.
cmake .
make -j
sudo make install
You can pass the tester binary one or more of the following arguments: u32
,
s32
, u64
, s64
to test the four cases (signed, unsigned, 32 bit, or 64 bit), or
run it with no arguments to test all four. The tester is multithreaded so it can test multiple
cases simultaneously. The tester will verify the correctness of libdivide via a set of randomly
chosen denominators, by comparing the result of libdivide's division to hardware division. It
may take a long time to run, but it will output as soon as it finds a discrepancy.
You can pass the benchmark binary one or more of the following arguments: u32
,
s32
, u64
, s64
to compare libdivide's speed against hardware division.
benchmark tests a simple function that inputs an array of random numerators and a single
divisor, and returns the sum of their quotients. It tests this using both hardware division, and
the various division approaches supported by libdivide, including vector division.
It will output data like this:
# system scalar scl_bf scl_us vector vec_bf vec_us gener algo
1 5.453 0.654 0.570 0.223 0.565 0.603 0.235 1.282 0
2 5.453 1.045 0.570 0.496 0.568 0.603 0.511 11.215 1
3 5.453 1.534 0.570 0.570 0.587 0.603 0.570 11.887 2
4 5.409 0.654 0.570 0.223 0.565 0.603 0.235 1.282 0
5 5.409 1.045 0.570 0.496 0.568 0.603 0.509 11.215 1
6 5.409 1.534 0.570 0.570 0.587 0.603 0.570 11.887 2
...
It will keep going as long as you let it, so it's best to stop it when you are happy with the denominators tested. These columns have the following significance. All times are in nanoseconds, and lower is better.
#: The divisor that is tested
system: Hardware divide time
scalar: libdivide time, using scalar functions
scl_bf: libdivide time, using branchfree scalar functions
scl_us: libdivide time, using scalar unswitching functions
vector: libdivide time, using vector functions
vec_bf: libdivide time, using branchfree vector functions
vec_us: libdivide time, using vector unswitching
gener: Time taken to generate the divider struct
algo: The algorithm used. See libdivide_*_get_algorithm
The benchmarking utility will also verify that each function returns the same value, so benchmark is valuable for its verification as well.
The first code snippet divides all integers in a vector using integer division. This is slow as integer division is at least one order of magnitude slower than any other integer arithmetic operation on current CPUs.
void divide(std::vector<int64_t>& vect, int64_t divisor)
{
// Slow, uses integer division
for (auto& n : vect)
n /= divisor;
}
The second code snippet runs much faster, it uses libdivide to compute the integer division using multiplication and bit shifts hence avoiding the slow integer divison operation.
void divide(std::vector<int64_t>& vect, int64_t divisor)
{
libdivide::divider<int64_t> fast_d(divisor);
// Fast, computes division using libdivide
for (auto& n : vect)
n /= fast_d;
}
Generally libdivide will give at significant speedup if:
- The divisor is only known at runtime
- The divisor is reused multiple times e.g. in a loop
When using libdivide's C API you first need to generate a libdivide divider using
one of the libdivide_*_gen
functions (*: s32, u32, s64, u64)
which can then be used to compute the actual integer division using the
corresponding libdivide_*_do
or libdivide_*_branchfree_do
functions.
void divide(int64_t *array, size_t count, int64_t divisor)
{
struct libdivide_s64_t fast_d = libdivide_s64_gen(divisor);
// Fast, computes division using libdivide
for (size_t i = 0; i < count; i++)
array[i] = libdivide_s64_do(array[i], &fast_d);
}
For more information please visit the C API documentation on libdivide's website.
The default libdivide divider makes use of branches to compute the integer division. When the same divider is used inside a hot loop as in the C++ example section the CPU will accurately predict the branches and there will be no performance slowdown. Often the compiler is even able to move the branches outside the body of the loop hence completely eliminating the branches, this is called loop-invariant code motion.
If however you are e.g. iterating over an array of dividers the CPU will not accurately predict the branches and this will deteriorate performance. For this use case the branchfree divider type will often run significantly faster, it computes the integer division without use of any branches.
// Use branchfree divider
using branchfree_t = libdivide::branchfree_divider<uint64_t>;
uint64_t divide(uint64_t x, std::vector<branchfree_t>& vect)
{
uint64_t sum = 0;
for (auto& fast_d : vect)
sum += x / fast_d;
return sum;
}
Caveats of branchfree divider:
- Branchfree divider cannot be
-1
,0
,1
- Faster for unsigned types than for signed types
We mentioned in the "Branchfull vs branchfree" section that the default branchfull libdivide divider uses branches. It is possible to get rid of the branches and the preliminary checks when using the default branchfull divider using a technique called unswitching. Unswitching moves out of the body of the loop the preliminary algorithm check so that the computation inside the loop is branchfree.
using namespace libdivide;
void divide(std::vector<int64_t>& vect, int64_t divisor)
{
divider<int64_t> fast_d(divisor);
switch (fast_d.get_algorithm())
{
case 0: for (auto& n : vect) n /= unswitch<0>(fast_d); break;
case 1: for (auto& n : vect) n /= unswitch<1>(fast_d); break;
case 2: for (auto& n : vect) n /= unswitch<2>(fast_d); break;
case 3: for (auto& n : vect) n /= unswitch<3>(fast_d); break;
case 4: for (auto& n : vect) n /= unswitch<4>(fast_d); break;
}
}
For more information please visit the API documentation on libdivide's website.
Before sending in patches to libdivide, please run the tester to completion with all four types, and the benchmark utility for a reasonable period, to ensure that you have not introduced a regression.
Happy hacking!