
# Sparsity

Sparsity is one of the promising model compression techniques that can be used to accelerate deep learning inference. Typically, sparsity is classified as either 1) structured sparsity or 2) unstructured sparsity. Structured sparsity means that the zero (or non-zero) values follow an observable structural pattern, while unstructured sparsity means there is no such pattern. In general, structured sparsity yields lower accuracy than unstructured sparsity due to its restrictive pattern; however, it can accelerate model execution significantly with software or hardware sparsity support.

This document describes the sparsity definition, the training flow for sparsity, validated models, and the performance benefits of software sparsity. Note that this document covers only sparse weights (with dense activations) for inference acceleration; sparse activations and sparse embeddings, whether for inference or training acceleration, are out of scope.

Note: training for sparsity with a 2:4 or similar structured pattern is supported; please refer to our new API, the question-answering examples, and the text-classification examples.

## Sparsity Definition

NVIDIA proposed 2:4 sparsity (also known as "2-in-4 sparsity") in the Ampere architecture: for every 4 contiguous elements in a matrix, two of them are zero and the other two are non-zero.
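To make this definition concrete, here is a minimal sketch (not part of any library; the function name `is_2in4_sparse` is ours) that checks whether a PyTorch tensor satisfies the 2:4 pattern along its last dimension:

```python
import torch

def is_2in4_sparse(weight: torch.Tensor) -> bool:
    """Check 2:4 sparsity: every group of 4 contiguous elements along
    the last dimension must contain at least 2 zeros."""
    assert weight.shape[-1] % 4 == 0, "last dimension must be divisible by 4"
    groups = weight.reshape(-1, 4)               # one row per group of 4
    zeros_per_group = (groups == 0).sum(dim=1)   # zeros in each group
    return bool((zeros_per_group >= 2).all())

# Example: every group of 4 contains exactly 2 zeros, so this is 2:4 sparse.
w = torch.tensor([[1., 0., 2., 0., 0., 3., 0., 4.],
                  [0., 5., 6., 0., 7., 0., 0., 8.]])
print(is_2in4_sparse(w))  # True
```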

## Sparsity Pattern

Different from the 2:4 sparsity above, we propose block-wise structured sparsity patterns for which we are able to demonstrate performance benefits on existing Intel hardware even without hardware sparsity support. A block-wise sparsity pattern with block size S means that all S contiguous elements in a block are zero.

For a typical GEMM, the weight dimension is IC x OC, where IC is the number of input channels and OC is the number of output channels. Note that sometimes IC is also called dimension K, and OC is called dimension N. The sparsity dimension is on OC (or N).

For a typical convolution, the weight dimension is OC x IC x KH x KW, where OC is the number of output channels, IC is the number of input channels, and KH and KW are the kernel height and width. The sparsity dimension is also OC.

Here is a figure showing a matrix with IC = 32 and OC = 16, and a block-wise sparsity pattern with block size 4 on the OC dimension.

*(Figure: block-wise sparsity pattern)*
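To illustrate how this pattern can be measured, here is a minimal sketch, assuming a PyTorch `nn.Linear` weight of shape `(OC, IC)` whose OC is divisible by the block size; the helper name `blockwise_sparsity_ratio` is hypothetical:

```python
import torch

def blockwise_sparsity_ratio(weight: torch.Tensor, block_size: int = 4) -> float:
    """Fraction of (block_size x 1) blocks along OC that are entirely zero.

    `weight` has shape (OC, IC), as in torch.nn.Linear.weight.
    """
    oc, ic = weight.shape
    assert oc % block_size == 0, "OC must be divisible by the block size"
    blocks = weight.reshape(oc // block_size, block_size, ic)
    zero_blocks = (blocks == 0).all(dim=1)   # shape (OC // block_size, IC)
    return zero_blocks.float().mean().item()

# Example: a 16x32 weight (OC=16, IC=32) with the first 8 output channels zeroed.
w = torch.randn(16, 32)
w[:8, :] = 0.0
print(blockwise_sparsity_ratio(w, block_size=4))  # 0.5
```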

## Training Flow & Sample Code

The following figure describes the typical flow of training for sparsity. Compared with a normal training flow, training for sparsity requires additional steps (e.g., regularization and pruning) to reach the target sparsity ratio.

*(Figure: sparsity training flow)*

Here is the pseudocode of a modified training loop in PyTorch:

```python
def train(model, dataloader, loss_func, optimizer, lr_scheduler, alpha):
    for x, label in dataloader:
        y = model(x)
        loss = loss_func(y, label)
        optimizer.zero_grad()
        loss.backward()
        prune_gradient_with_magnitude(model)   # prune small gradients by magnitude
        group_lasso_regularize(model, alpha)   # add regularization term to gradients
        optimizer.step()
        lr_scheduler.step()
        prune_weights_with_magnitude(model)    # prune small weights by magnitude
```
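The helpers in the pseudocode are left abstract. For illustration only, below is one possible implementation of two of them for `nn.Linear` layers with a block-wise pattern along OC; the names, signatures, and defaults are our assumptions, not the library's actual API:

```python
import torch

def group_lasso_regularize(model, alpha, block_size=4):
    """Add the group-lasso gradient term alpha * w / ||w_group||_2, where
    groups are contiguous blocks of size `block_size` along OC.
    Assumes OC is divisible by `block_size` and gradients already exist."""
    with torch.no_grad():
        for module in model.modules():
            if isinstance(module, torch.nn.Linear):
                w = module.weight                      # shape (OC, IC)
                oc, ic = w.shape
                groups = w.reshape(oc // block_size, block_size, ic)
                norms = groups.norm(dim=1, keepdim=True).clamp_min(1e-12)
                module.weight.grad += (alpha * groups / norms).reshape(oc, ic)

def prune_weights_with_magnitude(model, block_size=4, sparsity_ratio=0.9):
    """Zero out the `sparsity_ratio` fraction of blocks with the smallest
    L1 magnitude, block-wise along OC. Assumes OC is divisible by
    `block_size`."""
    with torch.no_grad():
        for module in model.modules():
            if isinstance(module, torch.nn.Linear):
                w = module.weight                      # shape (OC, IC)
                oc, ic = w.shape
                blocks = w.reshape(oc // block_size, block_size, ic)
                scores = blocks.abs().sum(dim=1)       # (OC // block_size, IC)
                k = max(1, int(scores.numel() * sparsity_ratio))
                threshold = scores.flatten().kthvalue(k).values
                mask = (scores > threshold).to(w.dtype).unsqueeze(1)
                module.weight.copy_((blocks * mask).reshape(oc, ic))
```

In practice, the sparsity ratio is usually ramped up on a schedule over the course of training rather than applied at full strength from the first step.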

## Validated Models

We validate sparsity on typical models across different domains (CV, NLP, and recommendation systems). The table below shows the sparsity pattern, sparsity ratio, dataset, and the accuracy of the sparse and dense (reference) versions of each model. We also provide a simplified BERT example with only one sparse layer.

| Model | Sparsity Pattern | Sparsity Ratio | Dataset | Accuracy (Sparse Model) | Accuracy (Dense Model) |
|-------|------------------|----------------|---------|-------------------------|------------------------|
| Bert Large | 2x1 | 70% | SQuAD | 90.70% | 91.34% |
| DLRM | 4x16 | 85% | Criteo Terabyte | 80.29% | 80.25% |
| Bert Mini | 4x1 | 90% | MRPC | 87.22% | 87.52% |
| Bert Mini | 4x1 | 90% | SST-2 | 86.92% | 87.61% |
| Bert Mini | 4x1 | 80% | SQuAD | 76.27% | 76.87% |
| Bert Mini | 2 in 4 | 50% | MRPC | 86.95% | 87.52% |
| Bert Mini | 2 in 4 | 50% | SST-2 | 86.93% | 87.61% |
| Bert Mini | 2 in 4 | 50% | SQuAD | 76.85% | 76.87% |
| ResNet50 v1.5 | 2x1 | 78% | ImageNet | 75.3% | 76.13% |
| SSD-ResNet34 | 2x1 | 75% | COCO | 22.85% | 23% |
| ResNext101 | 2x1 | 73% | ImageNet | 79.14% | 79.37% |


## Performance

We explore kernel development with software sparsity and apply it to DLRM, a very popular industrial recommendation model and one of the MLPerf benchmarks. We achieve a 1.6x performance gain with the INT8 sparse model over the INT8 dense model, and a 6.4x total performance gain over the FP32 dense model, in MLPerf inference submissions. We expect further speedup with the support of hardware sparsity.

|             | Dense Model (FP32) | Dense Model (INT8) | Sparse Model (INT8) |
|-------------|--------------------|--------------------|---------------------|
| Accuracy    | 80.25% (100%)      | 80.21% (99.96%)    | 79.91% (99.57%)     |
| Offline QPS | 5732               | 23174 (1.0x)       | 36883 (1.6x)        |
| Online QPS  | NA                 | 20245 (1.0x)       | 30396 (1.5x)        |