Truncated gradient descent example

Truncated Gradient Descent Example for VW

VW has an efficient (approximate) implementation of the truncated gradient algorithm for online L1 regularization. This page uses the rcv1 data set to illustrate its use. The (exact) online L2 regularization in VW is used in the same way, with the --l1 option below replaced by --l2.
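
As a rough sketch of the idea, the simplest form of truncated gradient takes an ordinary gradient step and then shrinks every weight toward zero, clipping at zero. VW's approximate implementation may differ in detail. The symbols are notation assumed here, not taken from VW: \eta is the learning rate, \lambda plays the role of the --l1 value, and L is the per-example loss.

```latex
v = w_t - \eta \, \nabla L(w_t), \qquad
w_{t+1,j} = \operatorname{sign}(v_j)\,\max\bigl(0,\; |v_j| - \eta\lambda\bigr)
```

For instance, with \eta\lambda = 0.005, a coordinate v_j = 0.003 is set to exactly 0, while v_j = -0.02 becomes -0.015. Small weights are zeroed out, which is why the model size shrinks as the L1 level grows.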

We use the same training and test data as prepared in the RCV1 example; the cache files are cache_train and cache_test. The test label file is needed for classifier evaluation and is obtained with:

zcat rcv1.test.dat.gz | cut -d ' ' -f 1 | sed -e 's/^-1/0/' > test_labels
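
To see what this pipeline does, here is a small sketch on two made-up lines in VW input format (not real RCV1 data). The file names sample.dat and sample_labels are hypothetical. The zcat step is left out because the sample is not compressed.

```shell
# Two made-up examples: a label, then a feature list. Not real RCV1 data.
printf '%s\n' '1 | 13:0.5 27:0.1' '-1 | 4:0.2 99:0.7' > sample.dat

# Keep the first space-separated field (the label) and rewrite a
# leading -1 as 0, so labels become 1 / 0.
cut -d ' ' -f 1 sample.dat | sed -e 's/^-1/0/' > sample_labels

cat sample_labels
```

The resulting file has one label per line, in the same order as the examples.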

The following four steps run (1) training, (2) testing, (3) evaluation of ROC, and (4) measuring model size, respectively:

vw --cache_file cache_train --final_regressor r_temp --passes 3 --readable_model r_temp.txt --l1 lambda1
vw --testonly --initial_regressor r_temp --cache_file cache_test --predictions p_out
perf -ROC -files test_labels p_out
grep -c '^[0-9]' r_temp.txt

where

lambda1 is the L1 regularization level applied during online learning (the value passed to --l1)
r_temp.txt is the human-readable model file, used here to count the number of nonzero weights in the learned regressor
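
The counting step can be tried on a toy file. The sketch below assumes a simplified layout (a few header lines followed by index:weight lines); the real header of a readable model varies with the VW version, and toy_model.txt is a hypothetical name.

```shell
# Hypothetical, simplified stand-in for r_temp.txt:
# header lines first, then one index:weight line per stored weight.
cat > toy_model.txt <<'EOF'
Version x.y.z
bits:18
options:
7:0.25
42:-0.125
1003:0.5
EOF

# Count the lines that begin with a digit, i.e. the weight entries.
# The pattern is quoted so the shell passes it to grep unchanged.
grep -c '^[0-9]' toy_model.txt
```

Here the count is 3, one per weight line; the header lines do not start with a digit and are skipped.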

By varying lambda1, we see the effect of L1 regularization on prediction performance (ROC in particular) and model size (the number of nonzero weights):

lambda1	ROC	Model Size
0	0.98346	41409
5e-8	0.98345	39985
1e-7	0.98345	38822
5e-7	0.98345	31899
1e-6	0.98345	26559
5e-6	0.98319	12564
1e-5	0.98288	7647
5e-5	0.98068	1860
1e-4	0.97804	921
1e-3	0.92469	53
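
A sweep like the one in the table could be scripted roughly as follows. This is only a sketch: the DRY variable and the sweep_l1 function are hypothetical helpers, and by default the script just prints the commands. Setting DRY to the empty string runs them for real, which requires vw, perf, and the cache and label files from above.

```shell
# Dry run by default: each command is printed instead of executed.
# Run with DRY= (empty) to execute; vw and perf must then be on PATH.
DRY=${DRY-echo}

sweep_l1() {
  for lambda1 in 0 5e-8 1e-7 5e-7 1e-6 5e-6 1e-5 5e-5 1e-4 1e-3; do
    echo "# lambda1=$lambda1"
    # (1) train, (2) test, (3) ROC, (4) model size, as in the steps above
    $DRY vw --cache_file cache_train --final_regressor r_temp --passes 3 \
        --readable_model r_temp.txt --l1 "$lambda1"
    $DRY vw --testonly --initial_regressor r_temp --cache_file cache_test \
        --predictions p_out
    $DRY perf -ROC -files test_labels p_out
    $DRY grep -c '^[0-9]' r_temp.txt
  done
}

sweep_l1
```

Each pass overwrites r_temp, r_temp.txt, and p_out, so the ROC and model size must be recorded before the next value of lambda1 is tried.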

L1 and L2 regularization can be used simultaneously in VW, which resembles the elastic net. To see the effect of L2 regularization more clearly, the training data is first subsampled at a 1% rate, yielding roughly 7.8K examples. Let cache_train_small be the cache file for this smaller training set. The previous commands are modified slightly by adding the --l2 option as follows:

vw --cache_file cache_train_small --final_regressor r_temp --passes 1000 --readable_model r_temp.txt --l1 lambda1 --l2 lambda2
vw --testonly --initial_regressor r_temp --cache_file cache_test --predictions p_out
perf -ROC -files test_labels p_out
grep -c '^[0-9]' r_temp.txt

The number of passes is set to 1000 so that the phenomenon of overfitting becomes visible.
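
The page does not say how the 1% subsample was drawn. One simple possibility, shown here as an assumption, is to keep every 100th line of the training file. The sketch uses a toy file of 1000 numbered lines; toy_train.dat and toy_train_small.dat are hypothetical names.

```shell
# Toy stand-in for the decompressed training file: 1000 numbered lines.
seq 1 1000 > toy_train.dat

# Keep lines 1, 101, 201, ...: a deterministic 1% subsample.
# (Assumption: the original subsample may have been drawn at random.)
awk 'NR % 100 == 1' toy_train.dat > toy_train_small.dat

# Number of examples kept.
grep -c '' toy_train_small.dat
```

On the toy file this keeps 10 of the 1000 lines. The same awk filter applied to the real training data would give a file from which cache_train_small can be built.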

The table below reports the ROC metric and model size by varying lambda1 and lambda2:

lambda1	lambda2	ROC	Model Size
0	0	0.96863	20832
0	0.0005	0.97364	21490
1e-7	0.0005	0.97364	21470
1e-6	0.0005	0.97363	21149
1e-5	0.0005	0.97348	14857
5e-5	0.0005	0.97231	4185
1e-4	0.0005	0.97003	2020
