Home
Not much to report here for now. Just a collection of references and papers about mixed-precision algorithms and reduced precision functionality on GPUs.
CUTLASS repository: CUTLASS is best described as CUB for GEMMs, a collection of C++ primitives that can be used to generate arbitrary GEMM-like operations. The most up-to-date presentation / slides about CUTLASS are here.
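For a flavor of the API, here is a hedged sketch of a mixed-precision GEMM through the CUTLASS 2.x device-level interface. The configuration below (FP16 inputs, FP32 accumulation, tensor-op math on Volta) is one plausible instantiation, not the only one:

```cpp
#include "cutlass/gemm/device/gemm.h"

// One plausible CUTLASS instantiation: FP16 A/B, FP32 C, FP32 accumulation,
// tensor-op math targeting Volta (Sm70).
using Gemm = cutlass::gemm::device::Gemm<
    cutlass::half_t, cutlass::layout::ColumnMajor,   // A
    cutlass::half_t, cutlass::layout::ColumnMajor,   // B
    float,           cutlass::layout::ColumnMajor,   // C
    float,                                           // accumulator type
    cutlass::arch::OpClassTensorOp,                  // use tensor cores
    cutlass::arch::Sm70>;                            // Volta

cutlass::Status run_gemm(int M, int N, int K,
                         cutlass::half_t const* A, int lda,
                         cutlass::half_t const* B, int ldb,
                         float* C, int ldc) {
  Gemm gemm_op;
  // Computes C = 1.0 * A * B + 0.0 * C on device pointers.
  return gemm_op({{M, N, K}, {A, lda}, {B, ldb}, {C, ldc}, {C, ldc}, {1.0f, 0.0f}});
}
```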
Extreme Signal-Processing Performance Using Tensor Cores: Astronomical Imaging on GPUs: using tensor cores to accelerate radio-astronomy signal processing, which is essentially a batched CHERK operation.
Mixed Precision Methods: overview of mixed precision in dense linear solvers by Dongarra.
Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed up Mixed-Precision Iterative Refinement Solvers: this is UTK's paper regarding the acceleration of LU solvers using tensor cores. See also a recorded presentation here.
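The core loop of iterative refinement is simple enough to sketch. Below is a minimal CPU-only illustration, with FP32 standing in for the paper's FP16 tensor-core factorization and FP64 for the high-precision residual; the test matrix, tolerance, and unpivoted LU are illustrative choices, not the paper's implementation:

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

const int N = 4;

// In-place LU factorization in FP32, no pivoting (demo only; the test
// matrix below is diagonally dominant, so this is safe here).
void lu_factor(std::vector<float>& A) {
  for (int k = 0; k < N; ++k)
    for (int i = k + 1; i < N; ++i) {
      A[i*N+k] /= A[k*N+k];
      for (int j = k + 1; j < N; ++j)
        A[i*N+j] -= A[i*N+k] * A[k*N+j];
    }
}

// Solve LU d = r using the FP32 factors.
void lu_solve(const std::vector<float>& LU, const std::vector<double>& r,
              std::vector<double>& d) {
  std::vector<float> y(N);
  for (int i = 0; i < N; ++i) {            // forward substitution (unit L)
    float s = (float)r[i];
    for (int j = 0; j < i; ++j) s -= LU[i*N+j] * y[j];
    y[i] = s;
  }
  for (int i = N - 1; i >= 0; --i) {       // back substitution
    float s = y[i];
    for (int j = i + 1; j < N; ++j) s -= LU[i*N+j] * (float)d[j];
    d[i] = s / LU[i*N+i];
  }
}

int main() {
  std::vector<double> A = {4,1,0,0, 1,4,1,0, 0,1,4,1, 0,0,1,4};
  std::vector<double> b = {1,2,3,4}, x(N, 0.0), r(N), d(N);
  std::vector<float> LU(A.begin(), A.end());
  lu_factor(LU);                           // factorize once, in low precision
  for (int it = 0; it < 10; ++it) {
    double rnorm = 0.0;
    for (int i = 0; i < N; ++i) {          // FP64 residual r = b - A x
      r[i] = b[i];
      for (int j = 0; j < N; ++j) r[i] -= A[i*N+j] * x[j];
      rnorm += r[i] * r[i];
    }
    printf("iter %d  ||r|| = %.3e\n", it, std::sqrt(rnorm));
    if (std::sqrt(rnorm) < 1e-14) break;
    lu_solve(LU, r, d);                    // low-precision correction solve
    for (int i = 0; i < N; ++i) x[i] += d[i];  // FP64 update
  }
}
```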
Investigating Half Precision Arithmetic to Accelerate Dense Linear System Solvers: this is UTK's earlier work on the same problem, primarily on Pascal, before tensor cores were available.
Mixed precision pipelined CG algorithm by Strzodka and Göddeke: a variation of this algorithm is used in QUDA.
Solving Lattice QCD systems of equations using mixed precision solvers on GPUs: the first QUDA paper, outlining the use of reliable updates to improve mixed-precision solver stability.
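The idea behind reliable updates is compact enough to sketch. Below is a toy CPU-only CG in that spirit: the iterated residual update is truncated to FP32 to mimic a low-precision solver component, and whenever the residual norm has dropped by a factor delta since the last update, the true residual is recomputed in FP64. The tridiagonal test matrix, delta, and tolerance are illustrative choices, not QUDA's:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

const int N = 64;

// y = A x for a simple SPD tridiagonal test matrix.
void apply_A(const std::vector<double>& x, std::vector<double>& y) {
  for (int i = 0; i < N; ++i)
    y[i] = 4*x[i] - (i > 0 ? x[i-1] : 0) - (i < N-1 ? x[i+1] : 0);
}

double dot(const std::vector<double>& a, const std::vector<double>& b) {
  double s = 0; for (int i = 0; i < N; ++i) s += a[i]*b[i]; return s;
}

int main() {
  std::vector<double> b(N, 1.0), x(N, 0.0), r(b), p(r), Ap(N);
  double rr = dot(r, r), r_max = rr;  // r_max: max ||r||^2 since last update
  const double delta = 0.1, tol = 1e-12 * std::sqrt(dot(b, b));
  for (int k = 0; k < 1000 && std::sqrt(rr) > tol; ++k) {
    apply_A(p, Ap);
    double alpha = rr / dot(p, Ap);
    for (int i = 0; i < N; ++i) {
      x[i] += alpha * p[i];
      // Truncate the iterated residual update to FP32 to mimic a
      // low-precision solver component.
      r[i] = (float)(r[i] - alpha * Ap[i]);
    }
    double rr_new = dot(r, r);
    if (rr_new < delta * delta * r_max) {  // reliable update trigger
      apply_A(x, Ap);                      // recompute true residual in FP64
      for (int i = 0; i < N; ++i) r[i] = b[i] - Ap[i];
      rr_new = dot(r, r);
      r_max = rr_new;
    } else {
      r_max = std::max(r_max, rr_new);
    }
    double beta = rr_new / rr;
    rr = rr_new;
    for (int i = 0; i < N; ++i) p[i] = r[i] + beta * p[i];
  }
  printf("final ||r|| = %.3e\n", std::sqrt(rr));
}
```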
Pushing Memory Bandwidth Limitations Through Efficient Implementations of Block-Krylov Space Solvers on GPUs: this is the block CG solver used in QUDA, showing mixed-precision block CG in action. At present a mixed double-single algorithm is stable, but double-half is not.
Lattice QCD with Tensor cores: work done by Jiqun Tu to accelerate the CG solver with a block preconditioner using tensor cores. An example of shifting the balance between communication and computation.
A fast scalable implicit solver for nonlinear time-evolution earthquake city problem on low-ordered unstructured finite elements with artificial intelligence and transprecision computing: a mixed-precision solver combining FP16, FP21, FP32, and FP64 arithmetic, resulting in a large speedup on Summit.
Accelerating 2D FFT: Exploit GPU Tensor Cores through Mixed-Precision: SC18 poster describing the use of tensor cores to accelerate 2-D FFTs.
Accelerating Reduction and Scan Using Tensor Core Units: paper utilizing the tensor core unit to accelerate reduction and scan operations.
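The core trick in that paper is to recast a reduction as a matrix product with an all-ones matrix. A hedged single-tile sketch (the kernel name, layouts, and scope are mine, not the paper's): one warp sums the 16 columns of a 16x16 tile; every accumulator row then holds the 16 column sums, which a conventional pass can combine.

```cpp
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Launch with <<<1, 32>>> (one warp). `data` holds 256 halfs, 16 columns of
// 16 contiguous elements; the caller reads the 16 column sums from row 0 of
// the 16x16 output.
__global__ void tc_column_sums(const half* data, float* colsum) {
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> ones;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> v;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
  wmma::fill_fragment(ones, __float2half(1.0f));  // all-ones matrix
  wmma::fill_fragment(acc, 0.0f);
  wmma::load_matrix_sync(v, data, 16);
  wmma::mma_sync(acc, ones, v, acc);  // acc[i][j] = sum_k v[k][j]
  wmma::store_matrix_sync(colsum, acc, 16, wmma::mem_row_major);
}
```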
Accelerating High-Resolution Weather Models with Deep-Learning Hardware: paper presented at PASC 2019 on whether low-precision tensor cores can be used for accelerating Legendre transforms in weather models.
Volta & Turing: Architecture and Performance optimization: presentation by Guillaume Thomas-Collignon on Volta and Turing optimization. Includes examples of where and how half precision can benefit performance (from a roofline perspective).
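The roofline argument in its simplest form: a kernel like axpy is bandwidth-bound, so FP16 helps by halving the bytes moved, provided loads stay wide via half2. A minimal sketch along the lines of NVIDIA's mixed-precision blog post (the kernel name and the choice to count n in half2 pairs are mine):

```cpp
#include <cuda_fp16.h>

// y = a*x + y over packed half2 pairs; n counts half2 elements, i.e.
// half the number of scalar FP16 values.
__global__ void haxpy2(int n, __half2 a, const __half2* x, __half2* y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = __hfma2(a, x[i], y[i]);  // two FP16 FMAs per thread
}
```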
Programming Tensor Cores: second half of the CUTENSOR talk at GTC, which describes the low-level interface to the tensor cores introduced with CUDA 10.1.
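For reference, the documented CUDA-level entry point to the tensor cores is the WMMA API (the talk also covers the lower-level mma PTX instructions). A minimal single-tile sketch, assuming 16x16x16 fragments and a one-warp launch:

```cpp
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a single 16x16 FP16 GEMM tile with FP32 accumulation.
// Launch with <<<1, 32>>>; A is row-major, B is col-major, both with ld = 16.
__global__ void wmma_16x16x16(const half* A, const half* B, float* C) {
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> c;
  wmma::fill_fragment(c, 0.0f);
  wmma::load_matrix_sync(a, A, 16);
  wmma::load_matrix_sync(b, B, 16);
  wmma::mma_sync(c, a, b, c);  // c = a * b + c on the tensor cores
  wmma::store_matrix_sync(C, c, 16, wmma::mem_row_major);
}
```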
NVIDIA Tensor Core Programmability, Performance & Precision: paper from KTH and ORNL regarding the programmability and precision of tensor cores. Using CUDA 9.0, so a bit out of date.
NVIDIA blog post about programming tensor cores
NVIDIA blog post about using tensor cores for scientific computing
NVIDIA blog post about Volta architecture
NVIDIA white paper about Turing
ADAPT: algorithmic differentiation applied to floating-point precision tuning
Leveraging the bfloat16 Artificial Intelligence Datatype For Higher-Precision Computations
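The idea in that paper is to split each FP32 operand into a sum of bfloat16 terms and recombine the partial products in FP32. A self-contained toy illustration of a two-term split (the paper also considers three terms for full FP32 accuracy; bf16 is emulated here by truncating the low mantissa bits rather than rounding):

```cpp
#include <cstdint>
#include <cstring>
#include <cstdio>

// Truncate a float to bfloat16 precision by keeping the top 16 bits
// (same exponent as FP32, 8-bit mantissa; truncation, not rounding).
float to_bf16(float x) {
  uint32_t u; std::memcpy(&u, &x, 4);
  u &= 0xFFFF0000u;
  std::memcpy(&x, &u, 4);
  return x;
}

int main() {
  float a = 1.2345678f, b = 7.6543210f;
  // Two-term split: a = a_hi + a_lo, each exactly representable in bf16.
  float a_hi = to_bf16(a), a_lo = to_bf16(a - a_hi);
  float b_hi = to_bf16(b), b_lo = to_bf16(b - b_hi);
  // Recombine partial products, accumulating in FP32 as a tensor unit would.
  float split = a_hi*b_hi + a_hi*b_lo + a_lo*b_hi + a_lo*b_lo;
  printf("full fp32 product : %.9g\n", a*b);
  printf("bf16 two-term split: %.9g\n", split);
  printf("bf16 naive (no split): %.9g\n", to_bf16(a)*to_bf16(b));
}
```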