This is a simple project for online logistic regression. We hope it makes learning from big data easier for individuals and small groups.
It started as a course project in Statistical Learning, and was then improved and tested in the 2nd season of the iPinyou RTB Contest.
The structure is simple, with two main parts:

- **Feeder**: A Feeder is responsible for preparing and providing data for learning. Sources are located in the `src/feeder` subdirectory. Currently a feeder for the iPinyou RTB Contest offline data is implemented.
- **Learner**: A Learner takes parsed data and updates the model according to its learning rule. This is usually the main part of the learning algorithm. Sources are located in the `src/learner` subdirectory. Currently only a truncated SGD method (see TrSGD) for logistic regression is implemented.
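For a concrete picture, here is a minimal C++ sketch of the two-part structure and of a truncated-gradient SGD update for logistic regression (truncation in the spirit of Langford, Li & Zhang's truncated gradient). All names here (`SparseExample`, `Feeder`, `TrSgdLearner`, and the parameters `eta`/`gravity`/`theta`) are illustrative assumptions, not the project's actual API:

```cpp
// Sketch of the Feeder/Learner split. All names are illustrative
// assumptions, not the project's actual API.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// One parsed training example: sparse features plus a {0,1} label.
struct SparseExample {
    std::vector<std::pair<std::size_t, double>> features;  // (index, value)
    int label;                                              // 0 or 1
};

// A Feeder prepares and provides data; a concrete feeder (e.g. for the
// iPinyou offline logs) would parse the raw files behind this interface.
struct Feeder {
    virtual bool next(SparseExample& out) = 0;  // false when data is exhausted
    virtual ~Feeder() = default;
};

// A Learner consumes parsed examples and updates the model.
// This one does logistic regression with a truncated-gradient SGD step.
class TrSgdLearner {
public:
    TrSgdLearner(double eta, double gravity, double theta)
        : eta_(eta), gravity_(gravity), theta_(theta) {}

    void update(const SparseExample& ex) {
        // p = sigmoid(w . x)
        double z = 0.0;
        for (const auto& f : ex.features) z += w_[f.first] * f.second;
        double p = 1.0 / (1.0 + std::exp(-z));
        double g = p - ex.label;  // gradient of the log-loss w.r.t. z

        for (const auto& f : ex.features) {
            double& w = w_[f.first];
            w -= eta_ * g * f.second;  // plain SGD step
            // Truncate small weights toward zero to keep the model sparse.
            if (w > 0.0 && w <= theta_)
                w = std::max(0.0, w - eta_ * gravity_);
            else if (w < 0.0 && w >= -theta_)
                w = std::min(0.0, w + eta_ * gravity_);
        }
    }

    double predict(const SparseExample& ex) const {
        double z = 0.0;
        for (const auto& f : ex.features) {
            auto it = w_.find(f.first);
            if (it != w_.end()) z += it->second * f.second;
        }
        return 1.0 / (1.0 + std::exp(-z));
    }

private:
    double eta_, gravity_, theta_;
    std::unordered_map<std::size_t, double> w_;
};
```

Keeping the feeder and the learner behind small interfaces like these is what lets new data sources and new algorithms be added independently of each other.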
We have provided the code we used in the iPinyou RTB Contest as an example; it is located at `examples/rtb2a`.
Suppose you already have the data files. Then:

- Concatenate the clk files into a single file and put it at `data/clk.txt` (assuming you have already changed your current directory to `examples/rtb2a`). Concatenate the conv files and put the result at `data/conv.txt`.
- Edit `imp.list` and replace the filenames with your data files' real locations. (Note that soft links may fail.)
- Edit `param_learner.txt` to adjust the learning parameters.
If you haven't built the binary yet, type `make` to build the main program. Make sure that your C++ compiler supports `-std=c++0x` (or `-std=c++11`).
If everything goes fine, you can then start the learning process:

- Run `./main imp.list 30000000 0` to learn the logistic regression model with 30000000 iterations of truncated SGD. Note that by default the program checks for an existing `model.txt` and auto-loads it to continue training. If you want to learn a new model, delete the old file or move it elsewhere.
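The resume behavior can be pictured roughly as below. The `loadModel`/`saveModel` helpers and the one-`index value`-per-line format are assumptions for illustration, not the program's actual model format:

```cpp
// Sketch of the auto-resume behavior: if model.txt exists, load it and
// continue training; otherwise start from an empty model. The helpers
// and the text format here are assumptions, not the actual format.
#include <cstddef>
#include <fstream>
#include <string>
#include <unordered_map>

using Model = std::unordered_map<std::size_t, double>;

bool loadModel(const std::string& path, Model& model) {
    std::ifstream in(path);
    if (!in) return false;  // no previous model: train from scratch
    std::size_t index;
    double value;
    while (in >> index >> value) model[index] = value;
    return true;
}

void saveModel(const std::string& path, const Model& model) {
    std::ofstream out(path);
    for (const auto& kv : model) out << kv.first << ' ' << kv.second << '\n';
}
```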
- **More Learners**: We hope to implement more algorithms for more models, including but not limited to LSE, SVM, MF, and NN.
- **More Feeders**: Possible support for CSV, gzipped/bzipped text files, etc.
- **Parallel support**: We are planning to add single-machine parallel support (multiple threads) in the future.
- **Why do you use SGD instead of some 2nd-order method?**
  We favor SGD over 2nd-order methods because SGD is more efficient when we have enormous data but limited time and computational resources. A good reference is [L. Bottou, Large-Scale Machine Learning with Stochastic Gradient Descent](http://leon.bottou.org/publications/pdf/compstat-2010.pdf).
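For a rough sense of the cost argument (a sketch; here \(\sigma\) is the logistic function, \(\gamma_t\) the learning rate, and the update actually used in this project is the truncated variant of the SGD step shown):

```latex
% One SGD step for logistic regression touches a single example (x_t, y_t),
% costing O(s) for s nonzero features:
w_{t+1} = w_t - \gamma_t \, (\sigma(w_t^\top x_t) - y_t) \, x_t
% One Newton step needs the full gradient and Hessian over all n examples
% (O(n d^2) to build, O(d^3) to solve), which is infeasible at this scale:
w_{t+1} = w_t - H(w_t)^{-1} \nabla L(w_t)
```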