The HPCG benchmark is a tool for ranking HPC systems that implements a preconditioned conjugate gradient solver. The benchmark generates a regular sparse linear system that is mathematically similar to a finite difference discretization of a three-dimensional heat diffusion equation. The problem is solved using domain decomposition, where each subdomain is preconditioned using a symmetric Gauss-Seidel sweep.
This repository contains an optimized version of HPCG for Arm that makes use of optimized mathematical libraries such as the Arm Performance Libraries, as well as NEON and SVE intrinsics. The main kernels have been modified to enable shared-memory parallelism. Further information about the code can be found in the following publications:
- Arm Community blog
- Presentations
This optimized code implements different parallelization techniques in the symmetric Gauss-Seidel kernel depending on the multigrid level. For the finest level, we implement a task dependency graph. For the coarser levels, we parallelize the Gauss-Seidel kernel using the multi-block colouring technique.
In the task dependency graph, nodes of the grid are processed as soon as all their dependencies are fulfilled. The amount of parallelism increases as nodes are processed until it reaches a peak, then decreases again. Therefore, this technique benefits from larger grids, such as the one at the finest level of the multigrid.
The multi-block colouring technique groups consecutive nodes into blocks. Colours are assigned to blocks so that, given two blocks of the same colour, no node in one block depends on any node in the other. Parallelism is achieved by processing blocks of the same colour at the same time. This technique exposes more parallelism than the task dependency graph and is therefore a better fit for the coarser levels of the multigrid.
Within a block, consecutive nodes depend on each other. To break this dependency and create vectorization opportunities, blocks of the same colour are interleaved at the node level, which makes the Gauss-Seidel kernel easier to vectorize.
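As a rough illustration of the colouring approach, the sketch below shows a forward Gauss-Seidel sweep parallelized over same-coloured blocks with OpenMP. It is not the repository's actual implementation: the ColouredBlocks structure and the CSR-style matrix arrays are assumptions made for this example, and the node-level interleaving described above is omitted for clarity.

```cpp
// Minimal sketch of a multi-block coloured Gauss-Seidel forward sweep.
// The colouring structure and CSR-like arrays are illustrative, not the
// actual HPCG data layout.
#include <cstddef>
#include <vector>

struct ColouredBlocks {
  // blocksByColour[c] holds the blocks of colour c; each block is a list of row indices.
  std::vector<std::vector<std::vector<int>>> blocksByColour;
};

// One forward sweep: x_i <- (r_i - sum_{j != i} a_ij * x_j) / a_ii
void SymGSForwardSketch(const double* values, const int* colIdx, const int* rowPtr,
                        const double* diag, const double* r, double* x,
                        const ColouredBlocks& colouring) {
  for (const auto& blocks : colouring.blocksByColour) {   // colours processed one after another
    #pragma omp parallel for schedule(static)             // blocks of one colour are independent
    for (std::ptrdiff_t b = 0; b < static_cast<std::ptrdiff_t>(blocks.size()); ++b) {
      for (int row : blocks[b]) {                         // nodes inside a block stay sequential
        double sum = r[row];
        for (int k = rowPtr[row]; k < rowPtr[row + 1]; ++k)
          sum -= values[k] * x[colIdx[k]];
        sum += diag[row] * x[row];                        // undo the diagonal contribution
        x[row] = sum / diag[row];
      }
    }
  }
}
```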
The code currently supports both the NEON and SVE vector extensions.
Throughout the benchmark, sparse matrices and vectors are reordered in order to improve data locality. The way this reordering is performed depends on the parallelization technique applied.
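As a simple illustration of such a reordering (the actual permutations used in this code are derived from the task dependency graph or the block colouring), a vector can be permuted so that nodes that are processed together become contiguous in memory; ReorderVector and newToOld are names made up for this sketch.

```cpp
// Illustrative only: permute a vector so that nodes processed together
// (for example, nodes of the same colour) become contiguous in memory.
#include <cstddef>
#include <vector>

// newToOld[i] is the original index of the node placed at position i.
std::vector<double> ReorderVector(const std::vector<double>& v,
                                  const std::vector<std::size_t>& newToOld) {
  std::vector<double> out(v.size());
  for (std::size_t i = 0; i < out.size(); ++i)
    out[i] = v[newToOld[i]];
  return out;
}
```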
Loop unrolling has been applied to the different kernels to reduce loop overheads. Along the same lines, the nested loops in the GenerateProblem routine have been flattened.
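The flattening can be pictured with the minimal sketch below. It is illustrative only; the real GenerateProblem routine performs much more work per grid node.

```cpp
// Illustrative loop flattening for a triple loop over an nx x ny x nz grid.
void VisitGridNested(int nx, int ny, int nz, double* data) {
  for (int iz = 0; iz < nz; ++iz)
    for (int iy = 0; iy < ny; ++iy)
      for (int ix = 0; ix < nx; ++ix)
        data[(iz * ny + iy) * nx + ix] = 0.0;
}

void VisitGridFlattened(int nx, int ny, int nz, double* data) {
  const int total = nx * ny * nz;
  for (int i = 0; i < total; ++i) {   // single loop: less loop overhead
    data[i] = 0.0;
    // If needed, the 3D indices can be recovered:
    // int ix = i % nx, iy = (i / nx) % ny, iz = i / (nx * ny);
  }
}
```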
The code has no mandatory dependencies. However, external packages are required to enable certain features:
- An MPI implementation when enabling the MPI build of HPCG
- A compiler that supports OpenMP syntax when enabling the OpenMP build of HPCG
- A BLAS implementation when enabling the BLAS-enabled build of HPCG
- The Arm Performance Libraries when enabling the use of sparse matrix routines during the SpMV kernel
- A compiler that supports NEON intrinsics when enabling the NEON build of HPCG
- A compiler that supports SVE intrinsics when enabling the SVE build of HPCG
Build configurations are provided in the setup folder. The naming convention for these configuration files is Make.${config}. These configurations are required to build the benchmark, and it is highly recommended to modify the chosen configuration to better suit your platform.
Builds can be performed in-source or out-of-source. For in-source builds, just type the following:
make arch=${config}
If the build is successful, the binary bin/xhpcg will be generated.
For out-of-source builds, type the following commands:
mkdir build && cd build
../configure ${config}
# At this point, you can modify the configuration again by editing
# the build/setup/Make.${config} file
make
If the build is successful, the binary build/bin/xhpcg will be generated.
For more detailed information, refer to the INSTALL file.
Configuration files inside the setup folder set different variables. One of these variables is called HPCG_OPTS. This variable can contain defines that are used at compilation time. The defines that can be set are:
# -DHPCG_NO_MPI Define to disable MPI
# -DHPCG_NO_OPENMP Define to disable OPENMP
# -DHPCG_CONTIGUOUS_ARRAYS Define to have sparse matrix arrays long and contiguous
# -DHPCG_USE_DDOT_ARMPL Define to use Arm Performance Libraries calls in the ComputeDotProduct
# -DHPCG_USE_WAXPBY_ARMPL Define to use Arm Performance Libraries calls in the ComputeWAXPBY
# -DHPCG_USE_ARMPL_SPMV Define to use Arm Performance Libraries calls in the ComputeSPMV kernel. Requires ArmPL >= 19.0
# -DHPCG_USE_NEON Define to use NEON intrinsics in the main kernels
# -DHPCG_USE_SVE Define to use SVE intrinsics in the main kernels
# -DHPCG_USE_FUSED_SYMGS_SPMV Define to fuse SYMGS and SPMV when possible. This makes the run invalid for submission
# -DHPCG_DEBUG Define to enable debugging output
# -DHPCG_DETAILED_DEBUG Define to enable very detailed debugging output
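The snippet below is a simplified illustration of how such defines typically gate code paths at compile time; ComputeWAXPBY_sketch is a made-up name and the body is not the actual HPCG source.

```cpp
// Simplified illustration (not the actual HPCG source) of how the
// HPCG_OPTS defines select code paths at compile time.
void ComputeWAXPBY_sketch(int n, double alpha, const double* x,
                          double beta, const double* y, double* w) {
#ifndef HPCG_NO_OPENMP
  #pragma omp parallel for
#endif
  for (int i = 0; i < n; ++i)
    w[i] = alpha * x[i] + beta * y[i];
  // Analogously, HPCG_USE_NEON / HPCG_USE_SVE would select intrinsics-based
  // loop bodies, and HPCG_USE_WAXPBY_ARMPL would dispatch to an Arm
  // Performance Libraries routine instead of the plain loop.
}
```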
As a quick summary, the benchmark is run with the following commands:
cd bin # or cd build/bin
# Set OpenMP threads when using an OpenMP-enabled build
export OMP_NUM_THREADS=4
# Use mpirun when using an MPI-enabled build
[mpirun -np 4 ...] ./xhpcg [optional in-line flags]
By default, the xhpcg binary reads the hpcg.dat file, if present. This file contains four lines. The first two lines are ignored. The third line contains three numbers that specify the local (per MPI process) dimensions of the problem. The fourth and last line specifies the number of seconds the timed portion of the benchmark should run for.
By default, the file contains the following:
HPCG benchmark input file
Sandia National Laboratories; University of Tennessee, Knoxville
104 104 104
60
This means that the local domain will have dimensions of 104x104x104 and that the timed portion of the benchmark will run for at least 60 seconds.
If the file is not present, you can use the following command line flags:
- --nx=<n> to specify the local dimension in the X-axis
- --ny=<m> to specify the local dimension in the Y-axis
- --nz=<l> to specify the local dimension in the Z-axis
- --rt=<t> to specify the number of seconds the timed portion of the benchmark should run
If the minimum execution time of the timed portion is set to 0, QuickPath mode is enabled. This minimizes the number of steps executed throughout the benchmark and reduces the number of conjugate gradient runs to 1.
Due to some of the optimizations performed in the code, there are constraints on the local domain dimensions that must be considered. The local subdomain must fulfill the following requirements:
- nx, ny and nz must be even numbers at every level of the grid
- nx, ny and nz must be greater than or equal to 32
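Assuming the standard four-level multigrid, staying even through every coarsening step amounts to the finest-level dimensions being divisible by 8 (the default 104 satisfies this). The hypothetical helper below, which is not part of the benchmark, illustrates the check under that assumption.

```cpp
// Hypothetical helper (not part of the benchmark) checking the documented
// constraints, assuming the dimension must stay even through every
// coarsening step of a four-level multigrid (i.e. be divisible by 8)
// and be at least 32.
#include <cstdio>

bool ValidLocalDim(int n) {
  return n >= 32 && n % 8 == 0;
}

int main() {
  std::printf("104: %s\n", ValidLocalDim(104) ? "ok" : "invalid");  // 104 = 8 * 13
  std::printf(" 48: %s\n", ValidLocalDim(48)  ? "ok" : "invalid");
  std::printf(" 30: %s\n", ValidLocalDim(30)  ? "ok" : "invalid");  // too small, not divisible by 8
  return 0;
}
```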
The benchmark generates two different files. One contains general information about parts of the execution, such as the residuals generated during the different tests or the residual obtained after each conjugate gradient iteration. This file is updated throughout the execution.
The second file is generated at the very end of the execution and provides detailed metrics such as the local and global problem dimensions, the number of MPI processes and OpenMP threads used, and the number of floating-point operations. This file also contains the achieved GFLOPS and an estimate of the memory throughput.
Official runs must be at least 1800 seconds (30 minutes) as reported in the output file. A valid run must also execute a problem size that is large enough so that data arrays accessed in the conjugate gradient iteration loop do not fit in the cache of the device. This restriction means that the problem size should be large enough to occupy a significant fraction of main memory, at least 1/4 of the total.
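As a rough way to sanity-check this guideline, the sketch below estimates the memory touched per MPI rank for a given local problem size. The per-row byte counts and the multigrid factor are approximations made for this example, not exact HPCG accounting.

```cpp
// Rough, illustrative estimate of the memory footprint per MPI rank,
// useful for checking the "at least 1/4 of main memory" guideline.
#include <cstdio>

double EstimateGiBPerRank(int nx, int ny, int nz) {
  const double rows = static_cast<double>(nx) * ny * nz;
  const double matrixBytesPerRow = 27.0 * (8.0 + 8.0);  // ~27 values + 27 column indices per row
  const double vectorBytesPerRow = 6.0 * 8.0;           // a handful of work vectors of doubles
  // Coarser multigrid levels add roughly 1/8 + 1/64 + ... of the finest level.
  const double multigridFactor = 1.0 + 1.0 / 8.0 + 1.0 / 64.0 + 1.0 / 512.0;
  const double bytes = rows * (matrixBytesPerRow + vectorBytesPerRow) * multigridFactor;
  return bytes / (1024.0 * 1024.0 * 1024.0);
}

int main() {
  // Example: the default 104^3 local problem.
  std::printf("~%.2f GiB per rank for 104^3 (rough estimate)\n",
              EstimateGiBPerRank(104, 104, 104));
  return 0;
}
```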
This project is licensed under Apache-2.0.
This project includes some third-party code under other open source licenses. For more information, see LICENSE.
Contributions are accepted under Apache-2.0. Only submit contributions where you have authored all of the code. If you do this on work time, make sure you have your employer's approval.