Detail explains on MXNet SGD weight_decay and tfa SGDW and l2 regularizer #19

leondgarse · 2020-12-09T02:26:47Z

leondgarse
Dec 9, 2020
Maintainer

MXNet SGD and tfa SGDW

AdamW and Super-convergence is now the fastest way to train neural nets
The behavior of weight_decay in mx.optimizer.SGD and tfa.optimizers.SGDW is different.

MXNet SGD multiplies wd with lr.

import mxnet as mx
help(mx.optimizer.SGD)
# weight = weight - lr * (rescale_grad * clip(grad, clip_gradient) + wd * weight)
#        = (1 - lr * wd) * weight - lr * (rescale_grad * clip(grad, clip_gradient))

Test with learning_rate=0.1, weight_decay=5e-4, weight is actually modified by 5e-5.

import mxnet as mx
mm_loss_grad = mx.nd.array([[1., 1], [1, 1]])

mm = mx.nd.array([[1., 1], [1, 1]])
mopt = mx.optimizer.SGD(learning_rate=0.1)
mopt.update(0, mm, mm_loss_grad, None)
print(mm.asnumpy())  # Basic value is `mm - lr * mm_loss = 0.9`
# [[0.9 0.9] [0.9 0.9]]

mm = mx.nd.array([[1., 1], [1, 1]])
mopt = mx.optimizer.SGD(learning_rate=0.1, wd=5e-4)
mopt.update(0, mm, mm_loss_grad, None)
print(mm.asnumpy())  # 0.9 - 0.89995 = 5e-5
# [[0.89995 0.89995]  [0.89995 0.89995]]

tfa SGDW behaves different, it does NOT multiply wd with lr. With learning_rate=0.1, weight_decay=5e-4, weight is actually modified with 5e-4.

# /opt/anaconda3/lib/python3.7/site-packages/tensorflow_addons/optimizers/weight_decay_optimizers.py
# 170     def _decay_weights_op(self, var, apply_state=None):
# 177             return var.assign_sub(coefficients["wd_t"] * var, self._use_locking)

import tensorflow_addons as tfa
ww_loss_grad = tf.convert_to_tensor([[1., 1.], [1., 1.]])
ww = tf.Variable([[1., 1.], [1., 1.]])
opt = tfa.optimizers.SGDW(learning_rate=0.1, weight_decay=5e-4)
opt.apply_gradients(zip([ww_loss_grad], [ww]))
print(ww.numpy()) # 0.9 - 0.8995 = 5e-4
# [[0.8995 0.8995] [0.8995 0.8995]]

So learning_rate=0.1, weight_decay=5e-4 in mx.optimizer.SGD is equal to learning_rate=0.1, weight_decay=5e-5 in tfa.optimizers.SGDW.

weight decay multiplier If we set wd_mult=10 in a MXNet layer, wd will mutiply by 10 in this layer. This means it will be weight_decay == 5e-4 in a keras layer.

# https://github.com/apache/incubator-mxnet/blob/e6cea0d867329131fa6052e5f45dc5f626c00d72/python/mxnet/optimizer/optimizer.py#L482
# 29  class Optimizer(object):
# 482                lrs[i] *= self.param_dict[index].lr_mult

L2 Regularization and Weight Decay

Weight Decay == L2 Regularization?
PDF DECOUPLED WEIGHT DECAY REGULARIZATION

Keras l2 regularization

ww = tf.convert_to_tensor([[1.0, -2.0], [-3.0, 4.0]])

# loss = l2 * reduce_sum(square(x))
aa = keras.regularizers.L2(0.2)
aa(ww)  # tf.reduce_sum(ww ** 2) * 0.2
# 6.0

# output = sum(t ** 2) / 2
tf.nn.l2_loss(ww)
# 15.0
tf.nn.l2_loss(ww) * 0.2
# 3.0

Total loss with l2 regularization will be

total_loss = Loss(w) + λ * R(w)

Keras.optimizers.SGD

help(keras.optimizers.SGD)
# w = w - learning_rate * g
#   = w - learning_rate * g - learning_rate * Grad(l2_loss)

So with keras.regularizers.L2(λ), it should be

wd * weight = Grad(l2_loss)
    --> wd * weight = 2 * λ * weight
    --> λ = wd / 2

Test

ww_loss_grad = tf.convert_to_tensor([[1., 1.], [1., 1.]])
ww = tf.Variable([[1., 1.], [1., 1.]])
opt = keras.optimizers.SGD(0.1)
with tf.GradientTape() as tape:
    # l2_loss = tf.nn.l2_loss(ww) * 5e-4
    l2_loss = keras.regularizers.L2(5e-4 / 2)(ww)  # `tf.nn.l2_loss` divided the loss by 2, `keras.regularizers.L2` not
l2_grad = tape.gradient(l2_loss, ww).numpy()
opt.apply_gradients(zip([ww_loss_grad + l2_grad], [ww]))
print(ww.numpy()) # 0.9 - 0.89995 = 5e-5
# [[0.89995 0.89995] [0.89995 0.89995]]

That means the L2_regulalizer will modify the weights value by l2 * lr == 5e-4 * 0.1 = 5e-5.

If we want the same result as mx.optimizer.SGD(learning_rate=0.1, wd=5e-4) and wd_mult=10 in a MXNet layer, which actually decay this layer's weights with wd * wd_mult * learning_rate == 5e-4, and other layers wd * learning_rate == 5e-5.

Firstlly, the keras optimizer is tfa.optimizers.SGDW(learning_rate=0.1, weight_decay=5e-5).
Then add a keras.regularizers.L2 with l2 == weight_decay / learning_rate * (wd_mult - 1) / 2 to this layer.

ww_loss_grad = tf.convert_to_tensor([[1., 1.], [1., 1.]])
ww = tf.Variable([[1., 1.], [1., 1.]])
opt = tfa.optimizers.SGDW(learning_rate=0.1, weight_decay=5e-5)
with tf.GradientTape() as tape:
    l2_loss = keras.regularizers.L2(5e-5 / 0.1 * (10 - 1) / 2)(ww)
l2_grad = tape.gradient(l2_loss, ww).numpy()
opt.apply_gradients(zip([ww_loss_grad + l2_grad], [ww]))
print(ww.numpy()) # 0.9 - 0.8995 = 5e-4
# [[0.8995 0.8995] [0.8995 0.8995]]

SGD with momentum

MXNet

# incubator-mxnet/python/mxnet/optimizer/sgd.py, incubator-mxnet/src/operator/optimizer_op.cc +109
grad += wd * weight
momentum_stat = momentum * momentum_stat - lr * grad
weight += momentum_stat

Keras SGDW Using wd == lr * wd, weight will be the same with MXNet SGD in the first update, but momentum_stat will be different. Then in the second update, weight will also be different.
```
momentum_stat = momentum * momentum_stat - lr * grad
weight += momentum_stat - wd * weight
```

Keras SGD with l2 regularizer can behave same as MXNet SGD

grad += regularizer_loss
momentum_stat = momentum * momentum_stat - lr * grad
weight += momentum_stat

Keras Model test

import tensorflow_addons as tfa

def test_optimizer_with_model(opt, epochs=3, l2=0):
    kernel_regularizer = None if l2 == 0 else keras.regularizers.L2(l2)
    aa = keras.layers.Dense(1, use_bias=False, kernel_initializer='ones', kernel_regularizer=kernel_regularizer)
    aa.build([1])
    mm = keras.Sequential([aa])
    loss = lambda y_true, y_pred: (y_true - y_pred) ** 2 / 2
    mm.compile(optimizer=opt, loss=loss)
    for ii in range(epochs):
        mm.fit([[1.]], [[0.]], epochs=ii+1, initial_epoch=ii, verbose=0)
        print("Epoch", ii, "- [weight]", aa.weights[0].numpy(), "- [losses]:", mm.history.history['loss'][0], end="")
        if len(opt.weights) > 1:
            print(" - [momentum]:", opt.weights[-1].numpy(), end="")
        print()
    return mm, opt

test_optimizer_with_model(tf.keras.optimizers.SGD(learning_rate=0.1), epochs=3)
# Epoch 0 - [weight] [[0.9]] - [losses]: 0.5
# Epoch 1 - [weight] [[0.81]] - [losses]: 0.4049999713897705
# Epoch 2 - [weight] [[0.729]] - [losses]: 0.32804998755455017
test_optimizer_with_model(tf.keras.optimizers.SGD(learning_rate=0.1), l2=0.01, epochs=3)
# Epoch 0 - [weight] [[0.898]] - [losses]: 0.5099999904632568
# Epoch 1 - [weight] [[0.806404]] - [losses]: 0.411266028881073
# Epoch 2 - [weight] [[0.7241508]] - [losses]: 0.33164656162261963
test_optimizer_with_model(tfa.optimizers.SGDW(learning_rate=0.1, weight_decay=0.002), epochs=3)
# Epoch 0 - [weight] [[0.898]] - [losses]: 0.5
# Epoch 1 - [weight] [[0.806404]] - [losses]: 0.40320199728012085
# Epoch 2 - [weight] [[0.72415084]] - [losses]: 0.3251436948776245
test_optimizer_with_model(tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9), epochs=3)
# Epoch 0 - [weight] [[0.9]] - [losses]: 0.5 - [momentum]: [[-0.1]]
# Epoch 1 - [weight] [[0.71999997]] - [losses]: 0.4049999713897705 - [momentum]: [[-0.17999999]]
# Epoch 2 - [weight] [[0.486]] - [losses]: 0.25919997692108154 - [momentum]: [[-0.23399998]]
test_optimizer_with_model(tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9), l2=0.01, epochs=3)
# Epoch 0 - [weight] [[0.898]] - [losses]: 0.5099999904632568 - [momentum]: [[-0.102]] ==> 0.102 * 0.1
# Epoch 1 - [weight] [[0.714604]] - [losses]: 0.411266028881073 - [momentum]: [[-0.183396]] ==> -0.102 * 0.9 - 0.898 * 1.02 * 0.1
# Epoch 2 - [weight] [[0.47665802]] - [losses]: 0.2604360580444336 - [momentum]: [[-0.237946]]
# ==> momentum_stat_2 == momentum_stat_1 * momentum - weight_1 * (1 + l2 * 2) * learning_rate
test_optimizer_with_model(tfa.optimizers.SGDW(learning_rate=0.1, momentum=0.9, weight_decay=0.002), epochs=3)
# Epoch 0 - [weight] [[0.898]] - [losses]: 0.5 - [momentum]: [[-0.1]]
# Epoch 1 - [weight] [[0.71640396]] - [losses]: 0.40320199728012085 - [momentum]: [[-0.1798]]
# Epoch 2 - [weight] [[0.48151073]] - [losses]: 0.25661730766296387 - [momentum]: [[-0.2334604]]

test_optimizer_with_model(tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9), l2=0.1, epochs=3)
# Epoch 0 - [weight] [[0.88]] - [losses]: 0.6000000238418579 - [momentum]: [[-0.12]]
# Epoch 1 - [weight] [[0.66639996]] - [losses]: 0.4646399915218353 - [momentum]: [[-0.21360001]]
# Epoch 2 - [weight] [[0.39419195]] - [losses]: 0.266453355550766 - [momentum]: [[-0.272208]]

MXNet model test

wd_mult NOT working if just added in mx.symbol.Variable, has to be added by opt.set_wd_mult.

import mxnet as mx
import logging
logging.getLogger().setLevel(logging.ERROR)

def test_optimizer_with_mxnet_model(opt, epochs=3, wd_mult=None):
    xx, yy = np.array([[1.]]), np.array([[0.]])
    xx_input, yy_input = mx.nd.array(xx), mx.nd.array(yy)
    dataiter = mx.io.NDArrayIter(xx, yy)

    data = mx.symbol.Variable("data", shape=(1,))
    label = mx.symbol.Variable("softmax_label", shape=(1,))
    # ww = mx.symbol.Variable("ww", shape=(1, 1), wd_mult=wd_mult, init=mx.init.One())
    ww = mx.symbol.Variable("ww", shape=(1, 1), init=mx.init.One())
    nn = mx.sym.FullyConnected(data=data, weight=ww, no_bias=True, num_hidden=1)

    # loss = mx.symbol.SoftmaxOutput(data=nn, label=label, name='softmax')
    loss = mx.symbol.MakeLoss((label - nn) ** 2 / 2)
    # sss = loss.bind(mx.cpu(), {'data': xx_input, 'softmax_label': yy_input, 'ww': y_pred})
    # print(sss.forward()[0].asnumpy().tolist())
    # [[0.5]]
    if wd_mult is not None:
        opt.set_wd_mult({'ww': wd_mult})
    model = mx.mod.Module(context=mx.cpu(), symbol=loss)
    weight_value = mx.nd.ones([1, 1])
    for ii in range(epochs):
        loss_value = loss.bind(mx.cpu(), {'data': xx_input, 'softmax_label': yy_input, 'ww': weight_value}).forward()[0]
        # model.fit(train_data=dataiter, num_epoch=ii+1, begin_epoch=0, optimizer=opt, force_init=True)
        model.fit(train_data=dataiter, num_epoch=ii+1, begin_epoch=ii, optimizer=opt)
        weight_value = model.get_params()[0]['ww']
        # output = model.get_outputs()[0]
        print("Epoch", ii, "- [weight]", weight_value.asnumpy(), "- [losses]:", loss_value.asnumpy()[0, 0])
        # if len(opt.weights) > 1:
        #     print(" - [momentum]:", opt.weights[-1].numpy(), end="")
        # print()

test_optimizer_with_mxnet_model(mx.optimizer.SGD(learning_rate=0.1, wd=0.02))
# Epoch 0 - [weight] [[0.898]] - [losses]: 0.5
# Epoch 1 - [weight] [[0.806404]] - [losses]: 0.403202
# Epoch 2 - [weight] [[0.7241508]] - [losses]: 0.3251437
test_optimizer_with_mxnet_model(mx.optimizer.SGD(learning_rate=0.1, wd=0.002))
# Epoch 0 - [weight] [[0.8998]] - [losses]: 0.5
# Epoch 1 - [weight] [[0.80964005]] - [losses]: 0.40482002
# Epoch 2 - [weight] [[0.72851413]] - [losses]: 0.3277585
test_optimizer_with_mxnet_model(mx.optimizer.SGD(learning_rate=0.1, momentum=0.9, wd=0.02))
# Epoch 0 - [weight] [[0.898]] - [losses]: 0.5
# Epoch 1 - [weight] [[0.714604]] - [losses]: 0.403202
# Epoch 2 - [weight] [[0.47665802]] - [losses]: 0.25532946
test_optimizer_with_mxnet_model(mx.optimizer.SGD(learning_rate=0.1, momentum=0.9, wd=0.02), wd_mult=10)
# Epoch 0 - [weight] [[0.88]] - [losses]: 0.5
# Epoch 1 - [weight] [[0.66639996]] - [losses]: 0.3872
# Epoch 2 - [weight] [[0.39419195]] - [losses]: 0.22204445
# ==> Equals to keras model `l2 == 0.1`

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Detail explains on MXNet SGD weight_decay and tfa SGDW and l2 regularizer #19

{{title}}

Replies: 0 comments

Select a reply

Detail explains on MXNet SGD weight_decay and tfa SGDW and l2 regularizer #19

leondgarse Dec 9, 2020 Maintainer

MXNet SGD and tfa SGDW

L2 Regularization and Weight Decay

SGD with momentum

Keras Model test

MXNet model test

Replies: 0 comments

leondgarse
Dec 9, 2020
Maintainer