Non-Structured Sparsity

Sparsity algorithms zero out weights in Convolutional and Fully-Connected layers in a non-structured way, so that the zero values are randomly distributed inside the tensor rather than grouped into regular blocks. Most sparsity algorithms set the least important weights to zero, but they differ in the criteria used to decide which weights are important. The framework contains several implementations of sparsity methods.

RB-Sparsity

This section describes the Regularization-Based Sparsity (RB-Sparsity) algorithm implemented in this framework. The method is based on L_0-regularization, which drives the parameters of the model toward zero:

\|\theta\|_0 = \sum_{i=0}^{|\theta|} \lbrack \theta_i \neq 0 \rbrack

However, since the L_0-norm is non-differentiable, we relax it by adding multiplicative noise to the model parameters:

\theta_{sparse}^{(i)} = \theta_i \cdot \epsilon_i, \quad \epsilon_i \sim \mathcal{B}(p_i)

Here, \epsilon_i may be interpreted as a binary mask that selects which weights should be zeroed. We therefore add a regularizing term to the objective function that encourages the desired level of sparsity in the model:

L_{sparse} = \mathbb{E}_{\epsilon \sim P_{\epsilon}} \lbrack \frac{\sum_{i=0}^{|\theta|} \epsilon_i}{|\theta|} - level \rbrack ^2

Since we cannot directly optimize the distribution parameters p, we store and optimize them in the logit form:

s = \sigma^{-1}(p) = \log \left(\frac{p}{1 - p}\right)

and reparametrize sampling of \epsilon_i as follows:

\epsilon = \lbrack \sigma(s + \sigma^{-1}(\xi)) > \frac{1}{2} \rbrack, \quad \xi \sim \mathcal{U}(0,1)

With this reparametrization, the probability of keeping a particular weight during the forward pass equals exactly \mathbb{P}(\epsilon_i = 1) = p_i. At test time, only the weights with p_i > \frac{1}{2} are used. To make the objective function differentiable, we treat the threshold function t(x) = \lbrack x > c \rbrack as a straight-through estimator, i.e. \frac{dt}{dx} = 1.
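
To make the reparametrization concrete, the sketch below samples such a mask in NumPy; the function and variable names are illustrative and this is not the framework's actual implementation.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

def sample_mask(s, rng):
    # epsilon_i = [sigmoid(s_i + logit(xi_i)) > 1/2], xi_i ~ U(0, 1),
    # which gives P(epsilon_i = 1) = sigmoid(s_i) = p_i.
    xi = rng.uniform(size=s.shape)
    return (sigmoid(s + logit(xi)) > 0.5).astype(s.dtype)

rng = np.random.default_rng(0)
p = np.full(100000, 0.3)   # keep-probability per weight (illustrative value)
s = logit(p)               # logits that would actually be trained
mask = sample_mask(s, rng)
print(mask.mean())         # empirically close to 0.3

# The forward pass uses theta * mask; on the backward pass the threshold is
# treated as identity (straight-through), so gradients flow to the logits s.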

The method requires a long training schedule in order to minimize the accuracy drop.

NOTE: A known limitation of the method is that the sparsified CNN must include Batch Normalization layers, which make the training process more stable.

RB sparsity configuration file parameters:

{
    "algorithm": "rb_sparsity",
    "params": {
            "schedule": "multistep",  // The type of scheduling to use for adjusting the target sparsity level
            "patience": 3, // A regular patience parameter for the scheduler, as for any other standard scheduler. Specified in units of scheduler steps.
            "sparsity_init": 0.05,// "Initial value of the sparsity level applied to the model
            "sparsity_target": 0.7, // Target value of the sparsity level for the model
            "sparsity_steps": 3, // The default scheduler will do this many proportional target sparsity level adjustments, distributed evenly across 'sparsity_training_steps'.
            "sparsity_training_steps": 50, // The number of steps after which the sparsity mask will be frozen and no longer trained
            "multistep_steps": [10, 20], // A list of scheduler steps at which to transition to the next scheduled sparsity level (multistep scheduler only).
            "multistep_sparsity_levels": [0.2, 0.5] //Levels of sparsity to use at each step of the scheduler as specified in the 'multistep_steps' attribute. The firstsparsity level will be applied immediately, so the length of this list should be larger than the length of the 'steps' by one."
    },

    // A list of model control flow graph node scopes to be ignored for this operation - functions as a 'blacklist'. Optional.
    "ignored_scopes": []

    // A list of model control flow graph node scopes to be considered for this operation - functions as a 'whitelist'. Optional.
    // "target_scopes": []
}
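
For illustration only, the sketch below shows one way a configuration like this could be plugged into a PyTorch training loop so that the RB regularizer is added to the task loss. It assumes the NNCF PyTorch entry points NNCFConfig.from_json and create_compressed_model; the exact import paths, the "sparsity.json" file name, and the toy model and data are assumptions, and the full config also needs model input information in addition to the compression section.

import torch
import torch.nn as nn
from nncf import NNCFConfig
from nncf.torch import create_compressed_model  # older releases: from nncf import create_compressed_model

# A toy model and batch, used only to keep the sketch self-contained.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8), nn.ReLU(),
    nn.Flatten(), nn.Linear(8 * 32 * 32, 10))
images = torch.randn(16, 3, 32, 32)
targets = torch.randint(0, 10, (16,))
criterion = nn.CrossEntropyLoss()

# "sparsity.json" is a hypothetical file: a full NNCF config whose compression
# section is the rb_sparsity block shown above, plus the model "input_info".
nncf_config = NNCFConfig.from_json("sparsity.json")
compression_ctrl, compressed_model = create_compressed_model(model, nncf_config)

optimizer = torch.optim.Adam(compressed_model.parameters(), lr=1e-3)

for epoch in range(5):
    compression_ctrl.scheduler.epoch_step()    # advance the sparsity schedule
    optimizer.zero_grad()
    task_loss = criterion(compressed_model(images), targets)
    loss = task_loss + compression_ctrl.loss() # adds L_sparse for rb_sparsity
    loss.backward()
    optimizer.step()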

NOTE: In all our sparsity experiments, we used the Adam optimizer and an initial learning rate of 0.001 for both the model weights and the sparsity mask.

Magnitude Sparsity

The magnitude sparsity method implements a simple approach based on the assumption that weights with smaller magnitude contribute less to the output, so they can be zeroed. After each training epoch the method calculates a threshold from the current sparsity ratio and zeroes all weights whose magnitude is below this threshold. There are two options for how the threshold is computed (see the sketch after this list):

  • Weights are used as is during the threshold calculation procedure.
  • Weights are normalized before the threshold calculation.
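
As an illustrative sketch (not the framework's actual implementation), the following NumPy function computes such a threshold from a target sparsity level and builds the corresponding binary mask, covering both options above:

import numpy as np

def magnitude_mask(weights, sparsity_level, normalize=False):
    """Zero the smallest-magnitude fraction `sparsity_level` of the weights."""
    values = np.abs(weights)
    if normalize:
        # Option 2: normalize the tensor by its L2 norm before ranking magnitudes,
        # so the threshold is comparable across tensors of different scale.
        values = values / (np.linalg.norm(weights) + 1e-12)
    # The threshold is the k-th smallest (normalized) magnitude.
    k = int(sparsity_level * values.size)
    threshold = np.sort(values, axis=None)[k] if k > 0 else 0.0
    return (values >= threshold).astype(weights.dtype)

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
mask = magnitude_mask(w, sparsity_level=0.7)
print("achieved sparsity:", 1.0 - mask.mean())  # approximately 0.7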

Magnitude sparsity configuration file parameters:

{
    "algorithm": "magnitude_sparsity",
    "params": {
            "schedule": "multistep",  // The type of scheduling to use for adjusting the target sparsity level
            "patience": 3, // A regular patience parameter for the scheduler, as for any other standard scheduler. Specified in units of scheduler steps.
            "sparsity_init": 0.05,// "Initial value of the sparsity level applied to the model
            "sparsity_target": 0.7, // Target value of the sparsity level for the model
            "sparsity_steps": 3, // The default scheduler will do this many proportional target sparsity level adjustments, distributed evenly across 'sparsity_training_steps'.
            "sparsity_training_steps": 50, // The number of steps after which the sparsity mask will be frozen and no longer trained
            "multistep_steps": [10, 20], // A list of scheduler steps at which to transition to the next scheduled sparsity level (multistep scheduler only).
            "multistep_sparsity_levels": [0.2, 0.5] //Levels of sparsity to use at each step of the scheduler as specified in the 'multistep_steps' attribute. The firstsparsity level will be applied immediately, so the length of this list should be larger than the length of the 'steps' by one."
    },

    // A list of model control flow graph node scopes to be ignored for this operation - functions as a 'blacklist'. Optional.
    "ignored_scopes": []

    // A list of model control flow graph node scopes to be considered for this operation - functions as a 'whitelist'. Optional.
    // "target_scopes": []
}

Constant Sparsity

This special algorithm takes no additional parameters and is used when you want to load a checkpoint already trained with another sparsity algorithm and apply other compression methods without changing the sparsity mask.

Constant sparsity configuration file parameters:

{
    "algorithm": "const_sparsity",
    // A list of model control flow graph node scopes to be ignored for this operation - functions as a 'blacklist'. Optional.
    "ignored_scopes": []

    // A list of model control flow graph node scopes to be considered for this operation - functions as a 'whitelist'. Optional.
    // "target_scopes": []
}
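
As a hedged illustration of this use case, the sketch below builds such a configuration in Python, combining const_sparsity with a quantization algorithm. It assumes the top-level NNCF config layout with an "input_info" section, a "compression" list, and the NNCFConfig.from_dict entry point; the exact structure may differ between framework versions.

from nncf import NNCFConfig

# Hypothetical combined configuration: keep the sparsity mask from a previously
# trained checkpoint fixed (const_sparsity) while quantizing the model.
nncf_config = NNCFConfig.from_dict({
    "input_info": {"sample_size": [1, 3, 224, 224]},
    "compression": [
        {"algorithm": "const_sparsity"},
        {"algorithm": "quantization"}
    ]
})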