
hpcscanLogo.jpg


Welcome to hpcscan

Version 1.2

Contact: Vincent Etienne / Email: [email protected]

Contributors (chronological order)

  • Vincent Etienne (NEC)
  • Suha Kayum (Saudi Aramco)
  • Marcin Rogowski (King Abdullah University of Science and Technology)
  • Laurent Gatineau (NEC)
  • Philippe Thierry (Intel)
  • Fabrice Dupros (ARM)
  • Hugo Barreiro (University of Reims Champagne-Ardenne)

Overview

Description

hpcscan is a tool for benchmarking algorithms/kernels found in many scientific applications, on various architectures/systems.

It features several categories of test cases that measure memory, computation and communication bandwidths, along with electrical energy consumption.

  • Written in C++
  • Simple code structure based on individual test cases
  • Easy to add new test cases
  • Hybrid OpenMP/MPI parallelism
  • Supports scalar and vector CPUs, GPUs and other accelerators (depending on compiler/architecture)
  • All configuration parameters set on the command line
  • Supports single and double precision computation
  • Compiles with a standard Makefile
  • No external libraries
  • Follows the Google C++ style guide
  • All test cases are validated against embedded reference solutions

Why another benchmark?

Several benchmarks are commonly used in the HPC community. To cite a few, the Stream benchmark, the HPL benchmark and the OSU Micro-Benchmarks measure the memory bandwidth, the computation throughput and the interconnect bandwidth, respectively. In general, these benchmarks target specific characteristics of HPC systems. However, it is not straightforward to transpose these characteristics to the context of a given scientific application.

This is why HPC vendors often present throughputs obtained with open-source scientific codes such as OpenFOAM (Computational Fluid Dynamics) or SPECFEM3D (Seismology). While these results are important to assess the performance of a given architecture on concrete problems, it is again not straightforward to transpose the conclusions to other applications. Moreover, every application is built on technical choices that may favor one system over another. How can these technical biases be overcome?

hpcscan has been designed to address these issues 😃

What hpcscan is

☑️ Lightweight and portable tool that can be easily deployed on a wide range of architectures including CPUs, GPUs and accelerators (see Validated hardware, operating systems and compilers).

☑️ Bridge between HPC architectures and numerical analysis/computational sciences. Beyond providing accurate performance measurements, hpcscan allows one to explore the behavior of numerical kernels and to seek the optimal configuration on a given architecture. An example is shown below, where several key parameters of an algorithm (a wave propagation kernel) are explored to find the optimum (in terms of computation speed vs. accuracy) on the supercomputer Shaheen II at KAUST. See Performance benchmarks for details on this test case as well as scripts to perform the analysis.

hpcscanPropaParamAnalysisShaheen.jpg

Top left: L1 Error between the computed (wavefield) and analytical solutions versus N, the number of grid points along one direction (grid size is NxNxN). Blue: finite-difference with 4th order stencil, pink: 8th order, red: 12th order. Squares are obtained with the standard propagator implementation, while crosses are obtained when the Laplacian operator is computed separately.

Top right: L1 Error between the computed and analytical solutions versus the computation time. The black star points to the configuration with an error below 1% and shortest computation time (i.e. the optimal configuration relative to the target error).

Bottom left: Propagator bandwidth in GPoint/s versus N.

Bottom right: Propagator bandwidth in GByte/s versus N.

☑️ Set of representative kernels used in many scientific applications (see List of test cases). Without being too specific, the embedded kernels provide a way to capture the main traits of HPC architectures and identify their bottlenecks and strengths. With this knowledge, one can redesign or update specific parts of an application accordingly, to take full advantage of the target hardware.

☑️ Set of robust protocols to compare architectures. As suggested in the example above, the optimal configuration to solve a given problem may change from one architecture to another. hpcscan provides a solid framework to compare performance between different systems, where one can analyse results from different perspectives and achieve 'apples to apples' comparisons.

☑️ Customizable to fit a specific hardware (see Customization).

☑️ Multi-purpose initiative with benefits at several levels: from computer science students eager to learn, to seasoned numerical analysts willing to share their findings, to software engineers reusing kernels of interest to upgrade their applications.

☑️ On-going effort aiming to collect contributions to cover the current offer of HPC systems. More options and kernels will be added with time.

What hpcscan is not

A one-number benchmark to rank HPC systems. hpcscan instead provides a way to perform a complete 'scan' of architectures, with the possibility to focus on one characteristic.

A confidential project. Everyone is invited to share results, feedback and, more importantly, contributions for the benefit of the entire HPC community.

Quick start

hpcscan is a self-contained package that can be easily installed and executed on your system. Just follow the steps below: set up the environment, compile, validate the build and run the test cases you are interested in.
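A typical first session looks like this (the environment script below is just one of the provided examples; pick the script matching your system or create your own, as explained in the following sections):

cd env
source ./setEnvNeptuneGccCuda.sh
cd ../build
make
cd ..
./bin/hpcscan -v
cd script
sh runValidationTests.sh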

Versions

v1.0 — Initial version with CPU and Vector Engine support (released Nov 28, 2020)
  • Test cases: Comm, FD_D2, Grid, Memory and Propa
  • FD orders: 2, 4, 8, 12 & 16
  • Test modes: Baseline, CacheBlk and NEC_SCA

v1.1 — GPU support (released May 22, 2021)
  • Added test modes CUDA and HIP
  • Added test mode NEC

v1.2 — Energy consumption (coming soon)
  • Access hardware counters to report energy (Watt) consumption
  • Added FD orders: 6, 10 & 14
  • Added test mode DPC++
  • Added test modes CUDA_Opt, CUDA_Ref and HIP_Opt

    Main features

    Project directories

• bin: created during compilation; contains the hpcscan executable
• build: hpcscan can be compiled from here
• env: scripts to initialize the hpcscan environment
• misc: output samples and studies
• script: scripts for validation and performance benchmarks
• src: all hpcscan source files

    List of test cases

Comm — MPI communications bandwidth

• Uni-directional (half-duplex with MPI_Send): proc1 -> proc2
• Bi-directional (full-duplex with MPI_Sendrecv): proc1 <-> proc2
• Grid halo exchange (MPI_Sendrecv): all procs <-> all procs

Remarks: this test case requires at least 2 MPI processes. Depending on the placement of the MPI processes, intra-node or inter-node bandwidth can be measured. The width of the halos depends on the selected FD stencil order. A sketch of the uni-directional measurement is given below.
    ➡️ Validation against reference grids filled with predefined values
    ➡️ Measures GPoints/s and GBytes/s
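To make the uni-directional measurement concrete, below is a minimal, self-contained MPI sketch of a half-duplex bandwidth test between two processes. It is illustrative only (not hpcscan code); the message size and repetition count are arbitrary.

```cpp
// Half-duplex bandwidth sketch: rank 0 sends, rank 1 receives.
// Run with at least 2 MPI processes.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int nVal = 1 << 24;            // message size in floats (arbitrary)
    const int nTry = 10;                 // number of repetitions
    std::vector<float> buf(nVal, 1.0f);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < nTry; i++) {
        if (rank == 0)      MPI_Send(buf.data(), nVal, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1) MPI_Recv(buf.data(), nVal, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                                     MPI_STATUS_IGNORE);
    }
    MPI_Barrier(MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("Half-duplex bandwidth: %.2f GByte/s\n",
               double(nTry) * nVal * sizeof(float) / (t1 - t0) / 1.e9);

    MPI_Finalize();
    return 0;
}
```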

FD_D2 — Finite-difference (second derivatives in space) computation bandwidth

• Second derivative along axis 1 (for grid dim. 1, 2 or 3)
• Second derivative along axis 2 (for grid dim. 2 or 3)
• Second derivative along axis 3 (for grid dim. 3)
• Laplacian (for grid dim. 2 or 3)

Remarks: available FD stencil orders are 2, 4, 6, 8, 10, 12, 14 and 16. Accuracy is checked against a multi-dimensional sine function; it depends on the selected FD stencil order, the spatial grid sampling and the number of periods in the sine function.
    ➡️ Computes L1 Error against analytical solution
    ➡️ Measures GPoints/s, GBytes/s and GFlop/s

Grid — Grid operations bandwidth

• Fill grid U with a constant value
• Max. difference between grids U and V
• L1 norm between U and V
• Sum of abs(U)
• Sum of abs(U-V)
• Max. of U
• Min. of U
• Complex grid manipulation (wavefield update in the propagator): U = 2 x V - U + C x L
• Boundary condition (free surface) at all edges of U

Remarks: operations on grids involve the manipulation of multi-dimensional indexes and of specific portions of the grids (for instance, excluding halos).
    ➡️ Validation against reference grids filled with predefined values
    ➡️ Measures GPoints/s and GBytes/s

Memory — Memory operations bandwidth

• Fill array A with a constant value
• Copy array: A = B
• Add 2 arrays: A = B + C
• Multiply 2 arrays: A = B * C
• Add 2 arrays and update: A = A + B

Remarks: in contrast to the Grid test case, operations are done on contiguous memory arrays. This test case is similar to the Stream benchmark; a sketch of these kernels is given below.
    ➡️ Validation against reference grids filled with predefined values
    ➡️ Measures GPoints/s and GBytes/s
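As an illustration of what these operations correspond to, here is a minimal C++/OpenMP sketch of the five kernels (not hpcscan's implementation; the array size is arbitrary):

```cpp
// Stream-like kernels measured by the Memory test case (illustrative sketch).
#include <cstddef>
#include <vector>

int main()
{
    const std::size_t n = 1 << 24;                     // arbitrary array size
    const float val = 3.0f;
    std::vector<float> A(n), B(n, 1.0f), C(n, 2.0f);

    #pragma omp parallel for
    for (std::size_t i = 0; i < n; i++) A[i] = val;            // fill
    #pragma omp parallel for
    for (std::size_t i = 0; i < n; i++) A[i] = B[i];           // copy
    #pragma omp parallel for
    for (std::size_t i = 0; i < n; i++) A[i] = B[i] + C[i];    // add
    #pragma omp parallel for
    for (std::size_t i = 0; i < n; i++) A[i] = B[i] * C[i];    // multiply
    #pragma omp parallel for
    for (std::size_t i = 0; i < n; i++) A[i] = A[i] + B[i];    // add and update
    return 0;
}
```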

Modeling — Acoustic wave modeling bandwidth

Same features as for the Propa test case, except:

• The velocity model is read from file
• The source is a Ricker wavelet
• Seismic traces and snapshots are output

Remarks: there is no accuracy checking for this test case.
Propa — Acoustic wave propagator bandwidth

• 2nd order wave equation
• Domain size is 1 m in every dimension
• The velocity c is constant and equal to 1 m/s
• A free surface boundary condition is applied to all edges of the domain
• The wavefield is initialized at t=-dt and t=-2dt with a particular solution

Remarks: accuracy is checked against the multi-dimensional analytical solution (Eigen modes) of the wave equation; a sketch of this solution is given below. The number of modes can be parametrized differently in every dimension. The time step can be set arbitrarily or set to the stability condition. The dimension, grid size and number of time steps can be set arbitrarily. Accuracy depends on the selected FD stencil order, the spatial grid sampling and the number of Eigen modes.
    ➡️ Computes L1 Error against analytical solution
    ➡️ Measures GPoints/s, GBytes/s and GFlop/s
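For reference, here is a sketch of the type of analytical Eigen-mode solution used for this accuracy check, written for the unit cube with c = 1 m/s and a zero wavefield on the edges (the exact conventions used in hpcscan may differ):

```latex
% Second-order acoustic wave equation solved by the Propa test case
\frac{\partial^2 u}{\partial t^2} = c^2 \, \Delta u ,
\qquad u = 0 \ \text{on the edges of the domain (free surface)}

% One Eigen mode on the unit cube, with n_1, n_2, n_3 modes per dimension
u(x_1,x_2,x_3,t) = \sin(n_1 \pi x_1)\,\sin(n_2 \pi x_2)\,\sin(n_3 \pi x_3)\,\cos(\omega t),
\qquad \omega = \pi c \sqrt{n_1^2 + n_2^2 + n_3^2}
```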

Template — Test case template. Used to create new test cases.
Util — Utility tests to check internal functions. Reserved for developers.

    List of test modes

All available test modes are listed below. The activation of each test mode depends on the compilers defined in the hpcscan environment script, see Environment script (mandatory).

| Test mode | Target hardware | Description | Remark |
|-----------|-----------------|-------------|--------|
| Baseline | Generic CPU | Standard implementation without optimization (the reference implementation) | Default test mode. Always enabled |
| CacheBlk | Generic CPU | Optimized with cache blocking techniques | Always enabled |
| CUDA | NVIDIA GPU | Regular CUDA implementation without optimization | Enabled when compiled with nvcc (NVIDIA CUDA compiler) |
| CUDA_Opt | NVIDIA GPU | Optimized CUDA implementation | Enabled when compiled with nvcc (NVIDIA CUDA compiler) |
| CUDA_Ref | NVIDIA GPU | Reference CUDA implementation (for developers) | Enabled when compiled with nvcc (NVIDIA CUDA compiler) |
| DPC++ | Intel CPU/GPU/FPGA | Regular DPC++ implementation without optimization | Enabled when compiled with dpcpp (Intel oneAPI DPC++ compiler) |
| HIP | AMD GPU | Regular HIP implementation without optimization | Enabled when compiled with hipcc (AMD HIP compiler) |
| HIP_Opt | AMD GPU | Optimized HIP implementation | Enabled when compiled with hipcc (AMD HIP compiler) |
| NEC | NEC SX-Aurora | With NEC compiler directives | Enabled when compiled with nc++ (NEC C++ compiler) |
| NEC_SCA | NEC SX-Aurora | With the NEC Stencil Code Accelerator (SCA) library | Enabled when compiled with nc++ (NEC C++ compiler) |
| OpenAcc | NVIDIA GPU | Regular OpenACC implementation without optimization | Enabled when compiled with a C++ compiler that supports OpenACC (not yet operational) |

    Environment set-up

    Basic requirements

    • Linux operating system
    • C++ compiler with OpenMP support
    • MPI library

    Optional requirements

• Python and MATLAB to plot figures
    • NVIDIA CUDA compiler
    • Intel DPC++ compiler
    • AMD HIP compiler
    • NEC C++ compiler
    • C++ compiler with OpenACC support

    Environment script (mandatory)

    In order to compile and run hpcscan, you need to source one of the files in the directory ./env

    cd ./env

    Example to set up the environment for hpcscan with GCC and CUDA compilers:

    source ./setEnvNeptuneGccCuda.sh


🔔 For a new system, you will need to create a script for it (use one of the existing files as a template)

    Compilation

    Makefile

    Go to ./build, and use the command make


The executable is created at ./bin/hpcscan

    🔔 If hpcscan environment has not been set (see Environment script (mandatory)), compilation will abort.

By default, hpcscan is compiled in single precision

    To compile in double precision: make precision=double

    Enabled test modes

    To check the test modes that are enabled in your hpcscan binary, use the command

    ./bin/hpcscan -v


    Validation

    Validation tests

    To check that hpcscan has correctly been built and works fine, go to ./script and launch

    sh runValidationTests.sh


This script runs a set of light test cases and should complete within a few minutes (even on a laptop).

In the output report (displayed on the terminal), you should get:

    • All tests marked as PASSED (661 tests passed for each test mode enabled)
    • No test marked as FAILED

Check the summary at the end of the report for a quick overview.

🔔 These tests are intended for validation purposes only; they are not meant for performance measurements.

    Validated hardware, operating systems and compilers

    hpcscan has been successfully tested on the hardware, operating systems and compilers listed below.

| Operating system | Compiler | MPI | Host | Device | Test modes |
|------------------|----------|-----|------|--------|------------|
| Ubuntu 22.04.1 LTS | g++ (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0 | mpirun (Open MPI) 4.1.2 | Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (Intel Kaby Lake) | - | Baseline, CacheBlk |
| Ubuntu 22.04.1 LTS | Intel icpc (ICC) 2021.7.0 20220726 | Intel MPI Version 2021.7 | Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (Intel Kaby Lake) | - | Baseline, CacheBlk |
| Red Hat 4.8.5-39 | Intel oneAPI DPC++/C++ Compiler 2022.1.0 | Intel MPI Version 2021.6 | Intel(R) Xeon(R) Gold 6240L CPU @ 2.60GHz (Intel Cascade Lake) | - | Baseline, CacheBlk |
| Red Hat 4.8.5-39 | Intel oneAPI DPC++/C++ Compiler 2022.1.0 + NVIDIA nvcc release 11.7 | Intel MPI Version 2021.6 | Intel(R) Xeon(R) Gold 6240L CPU @ 2.60GHz (Intel Cascade Lake) | Tesla V100S-PCI (NVIDIA GPU) | Baseline, CacheBlk, Cuda, Cuda_Opt, Cuda_Ref |
| Red Hat 8.5.0-10 | NEC nc++ (NCC) 4.0.0 | NEC MPI 3.1.0 | Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz (Intel Skylake) | NEC SX-Aurora TSUBASA 20B-P (NEC Vector Engine) | Baseline, CacheBlk, NEC, NEC_SCA |
| Red Hat 8.5.0-10 | Intel oneAPI DPC++/C++ Compiler 2022.1.0 | Intel MPI Version 2021.6 | Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz (Intel Skylake) | - | Baseline, CacheBlk |
| SUSE Linux Enterprise Server 15 | Intel icpc (ICC) 19.0.5.281 20190815 | - | Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz (Intel Haswell) | - | - |
| Red Hat 4.8.5-39 | Intel icpc version 19.1.2.254 | - | Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz (Intel Cascade Lake) | - | - |
| Ubuntu 20.04.1 LTS | gcc 9.3.0 + NVIDIA nvcc release 11.3, V11.3.109 | - | Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz (Intel Ice Lake) | GP108M [GeForce MX330] (NVIDIA GPU) | - |
| CentOS Linux release 7.7.1908 | Intel icpc (ICC) 19.1.0.166 20191121 + NVIDIA nvcc release 11.0, V11.0.167 | - | Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz (Intel Skylake) | GV100GL [Tesla V100 SXM2 32GB] (NVIDIA GPU) | - |
| Ubuntu 20.04.1 LTS | Intel(R) oneAPI DPC++ Compiler 2021.2.0 | - | Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz (Intel Ice Lake) | - | - |
| Ubuntu 20.04.1 LTS | g++ 9.3.0 + AMD hipcc 4.2.21155-37cb3a34 | - | AMD EPYC 7742 64-Core Processor @ 2.25GHz (AMD Rome) | AMD Instinct MI100 (AMD GPU) | - |

    Execution

    Usage

    hpcscan can be launched from a terminal with all configuration parameters within a single line.

    To get help on the parameters

    ./bin/hpcscan -h


Execution with a single MPI process

    mpirun -n 1 ./bin/hpcscan -testCase <TESTCASE> -testMode <TESTMODE>

where TESTCASE is one of the test cases listed in List of test cases and TESTMODE is one of the test modes listed in List of test modes.

    Example

    mpirun -n 1 ./bin/hpcscan -testCase Propa -testMode CacheBlk


🔔 If you omit -testMode <TESTMODE>, the Baseline mode is assumed.

    Example

    mpirun -n 1 ./bin/hpcscan -testCase Propa

    Execution with multiple MPI processes

    mpirun -n <N> ./bin/hpcscan -testCase <TESTCASE> -testMode <TESTMODE> -nsub1 <NSUB1> -nsub2 <NSUB2> -nsub3 <NSUB3>

🔔 When several MPI processes are used, subdomain decomposition is activated. The product NSUB1 x NSUB2 x NSUB3 must be equal to N (the number of MPI processes). You may omit the number of subdomains along an axis if it is 1.

    Example

    mpirun -n 2 ./bin/hpcscan -testCase Comm -nsub1 2
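For instance, a run of the Propa test case on 8 MPI processes with a 2 x 2 x 2 subdomain decomposition (an illustrative configuration, not one of the original examples; note that 2 x 2 x 2 = 8):

mpirun -n 8 ./bin/hpcscan -testCase Propa -testMode CacheBlk -nsub1 2 -nsub2 2 -nsub3 2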

    Configuration of the grid size and dimension

    Simply add on the command line

    -n1 <N1> -n2 <N2> -n3 <N3> -dim <DIM>

where N1, N2 and N3 are the numbers of grid points along axes 1, 2 and 3, and DIM = 1, 2 or 3 (1D, 2D or 3D grids).

    Example

    mpirun -n 1 ../bin/hpcscan -testCase Grid -dim 2 -n1 200 -n2 300

🔔 If you omit -dim <DIM>, a 3D grid is assumed.
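These options can be combined with the subdomain decomposition. For example (an illustrative command, not one of the original examples), a 3D grid of 100 x 100 x 100 points distributed over 4 MPI processes with a 2 x 2 x 1 decomposition:

mpirun -n 4 ./bin/hpcscan -testCase Propa -n1 100 -n2 100 -n3 100 -nsub1 2 -nsub2 2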

    Input and output

    Input

    hpcscan does not require any input file. All data are built internally.

    Output on the terminal

During execution, information regarding result validation and performance is printed to the terminal.

    Output performance log file

For every test case, an ASCII file containing all measurements in a compact form is created. It can be used to plot results with dedicated tools. The name of the log file is as follows:

    hpcscan.perf.<TESTCASE>.log

If hpcscan is launched several times, the results are appended to the log file. This is convenient, for instance, when you want to analyse the effect of a parameter and plot the series of results in a graph.
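For example, running the same test case with increasing grid sizes (an illustrative parameter sweep, not a predefined script) appends one measurement per run to hpcscan.perf.Grid.log:

mpirun -n 1 ./bin/hpcscan -testCase Grid -n1 100 -n2 100 -n3 100
mpirun -n 1 ./bin/hpcscan -testCase Grid -n1 200 -n2 200 -n3 200
mpirun -n 1 ./bin/hpcscan -testCase Grid -n1 400 -n2 400 -n3 400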

    Output grids

By default, the grids manipulated by hpcscan are not written to disk. To output the grids, use the option -writeGrid. When activated, each grid used in a test generates 2 files:

• An ASCII file with the grid dimensions (file name <GRIDNAME>.proc<ID>.grid.info)
• A binary file with the grid data (file name <GRIDNAME>.proc<ID>.grid.bin), where ID is the MPI rank

    Example (this is the command that was used to produce the hpcscan logo on top of this page)

    mpirun -n 1 ../../bin/hpcscan -testCase Propa -writeGrid \
           -tmax 0.2 -snapDt 0.1 \
           -dim 2 -n1 200 -n2 600 \
           -param1 4 -param2 8
    

    Outputs the following files: PropaEigenModeRef.proc0.grid.info, PropaEigenModeRef.proc0.grid.bin, PropaEigenModePrn.proc0.grid.info and PropaEigenModePrn.proc0.grid.bin
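If you want to post-process these grids with your own tools, a minimal C++ reader could look like the sketch below. It assumes the .grid.bin file holds raw values in the precision hpcscan was compiled with (single precision by default) and that the grid dimensions are taken from the companion .grid.info file; check those files on your system before relying on this assumption.

```cpp
// Hypothetical reader sketch for an hpcscan output grid (assumes raw float data).
#include <fstream>
#include <iostream>
#include <vector>

int main()
{
    std::ifstream in("PropaEigenModeRef.proc0.grid.bin", std::ios::binary | std::ios::ate);
    if (!in) { std::cerr << "Cannot open grid file\n"; return 1; }

    const std::streamsize nBytes = in.tellg();           // file size in bytes
    std::vector<float> grid(nBytes / sizeof(float));
    in.seekg(0);
    in.read(reinterpret_cast<char*>(grid.data()), nBytes);

    std::cout << "Read " << grid.size() << " grid values\n";
    return 0;
}
```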

⚠️ Writing grids to disk slows down the execution and shouldn't be combined with performance measurements

⚠️ Grids can be large and can quickly exhaust your available disk space

    Output debug traces

The code is equipped with debug traces that can be activated with the option -debug <LEVEL>, where LEVEL can be set to light, mid or full (minimum, middle and maximum level of verbosity). Activating them when developing/debugging can be useful to understand the behavior of the code. When activated, debug traces are written by each MPI process in an ASCII file named hpcscan.debug.proc<ID>.log, where ID is the MPI rank.

    ⚠️ Debug traces slow down the execution and shouldn't be combined with performance measurements

    Performance benchmarks

⚠️ These benchmarks are intensive tests that need to be run on HPC platforms

    🔔 Maximum memory required per node (device) is 20 GB

    🔔 At maximum, 8 computing nodes (devices) are used

    The benchmarks are independent and can be used as is or configured according to your system if needed.

    Test cases description

| Test case | Objectives | Remarks |
|-----------|------------|---------|
| Memory | Assess memory bandwidth | Scalability analysis on a single node |
| Grid | Assess bandwidth of grid operations | Analyse the effect of the grid size |
| Comm | Assess inter-node communication bandwidth | Analyse the effect of subdomain decomposition |
| FD_D2 | Assess FD spatial derivative computation bandwidth | Analyse the effect of the FD stencil order |
| Propa | Find the optimal configuration for the wave propagator | Explore a range of parameters |
| Propa | Scalability analysis of the wave propagator on multiple nodes | Analyse the effect of the FD stencil order |

    ➡️ Performance measurements and scripts to reproduce results obtained on various architectures are available in ./misc/hpcscanPerfSlides/hpcscanPerfSlides.pdf

    Customization

hpcscan is built on a simple yet very flexible design that relies heavily on C++ inheritance.

    The main class is Grid (see ./src/grid.cpp). This class handles all grid data in hpcscan and all operations performed on grids. It implements the so-called Baseline mode and it is the reference implementation.

    💡 All test cases, at some point, call methods of this class. Indeed, test cases (testCase_xxx.cpp) do not implement kernels.

Now, let us say you would like to specialize the implementation for a given architecture.

To do this, you need to create a new class that derives from Grid. For instance, you would create Grid_ArchXYZ.h and Grid_ArchXYZ.cpp for your new class (you need to add the new source file to the Makefile as well). In this class, you may implement only a few of the functions that are declared as virtual in Grid.

💡 To allow hpcscan to use this new class, you only need to add it to the 'grid factory' (see ./src/grid_Factory.cpp). This is the only location in the code where all grids are referenced.

By doing this, you can switch to your new grid at execution time with the -testMode <TESTMODE> option, where TESTMODE = ArchXYZ. A minimal sketch is given below.
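As a sketch of what such a specialization could look like (the class layout, constructor and method names below are hypothetical, not hpcscan's actual interface; check ./src/grid.cpp and ./src/grid_Factory.cpp for the real declarations):

```cpp
// Grid_ArchXYZ.h -- hypothetical sketch of a specialized grid
#include "grid.h"

class Grid_ArchXYZ : public Grid
{
public:
    using Grid::Grid ;                 // reuse the base-class constructors

    // Override only the virtual methods you want to specialize for ArchXYZ;
    // anything not overridden falls back to the Baseline implementation in Grid.
    void fill(float val) override ;    // illustrative kernel to specialize
};

// In grid_Factory.cpp (pseudo-registration, names are illustrative):
//   if (testMode == "ArchXYZ") return new Grid_ArchXYZ(...);
```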

    💡 You can proceed little by little, implementing one function at a time, with the possibility to check the behavior of your implementation against the Baseline reference solution.

    Check the grids that are already implemented in hpcscan to get some examples.

    Have fun!

    Share feedback

    • Issues encountered
    • Suggestions of new test cases
    • Performance measurements

    Contributing to hpcscan

    ➡️ If you want to contribute to hpcscan, please contact the project coordinator ([email protected]).