This is a simple project for online logistic regression. We hope it makes learning from big data easier for individuals and small groups.
It started as a course project in Statistical Learning, and was then improved and tested in the 2nd season of the iPinyou RTB Contest.
The structure is simple, with two main parts:

- **Feeder**: A Feeder is responsible for preparing and providing data for learning. Sources are located in the `src/feeder` subdirectory. Currently a feeder for the iPinyou RTB Contest offline data is implemented.
- **Learner**: A Learner takes parsed data and updates the model according to its learning rule. This is usually the main part of the learning algorithm. Sources are located in the `src/learner` subdirectory. Currently only a truncated SGD method (see TrSGD) for logistic regression is implemented.
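For a concrete picture, here is a minimal C++ sketch of the two-part structure and of a truncated-gradient SGD update for logistic regression (truncation in the spirit of Langford, Li & Zhang's truncated gradient). All names here (`SparseExample`, `Feeder`, `TrSgdLearner`, and the parameters `eta`/`gravity`/`theta`) are illustrative assumptions, not the project's actual API:

```cpp
// Sketch of the Feeder/Learner split. All names are illustrative
// assumptions, not the project's actual API.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// One parsed training example: sparse features plus a {0,1} label.
struct SparseExample {
    std::vector<std::pair<std::size_t, double>> features;  // (index, value)
    int label;                                              // 0 or 1
};

// A Feeder prepares and provides data; a concrete feeder (e.g. for the
// iPinyou offline logs) would parse the raw files behind this interface.
struct Feeder {
    virtual bool next(SparseExample& out) = 0;  // false when data is exhausted
    virtual ~Feeder() = default;
};

// A Learner consumes parsed examples and updates the model.
// This one does logistic regression with a truncated-gradient SGD step.
class TrSgdLearner {
public:
    TrSgdLearner(double eta, double gravity, double theta)
        : eta_(eta), gravity_(gravity), theta_(theta) {}

    void update(const SparseExample& ex) {
        // p = sigmoid(w . x)
        double z = 0.0;
        for (const auto& f : ex.features) z += w_[f.first] * f.second;
        double p = 1.0 / (1.0 + std::exp(-z));
        double g = p - ex.label;  // gradient of the log-loss w.r.t. z

        for (const auto& f : ex.features) {
            double& w = w_[f.first];
            w -= eta_ * g * f.second;  // plain SGD step
            // Truncate small weights toward zero to keep the model sparse.
            if (w > 0.0 && w <= theta_)
                w = std::max(0.0, w - eta_ * gravity_);
            else if (w < 0.0 && w >= -theta_)
                w = std::min(0.0, w + eta_ * gravity_);
        }
    }

    double predict(const SparseExample& ex) const {
        double z = 0.0;
        for (const auto& f : ex.features) {
            auto it = w_.find(f.first);
            if (it != w_.end()) z += it->second * f.second;
        }
        return 1.0 / (1.0 + std::exp(-z));
    }

private:
    double eta_, gravity_, theta_;
    std::unordered_map<std::size_t, double> w_;
};
```

Keeping the feeder and the learner behind small interfaces like these is what lets new data sources and new algorithms be added independently of each other.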
We have provided the code we used in the iPinyou RTB Contest as an example; it is located at `examples/rtb2a`.
Suppose you already have the data files. Then:

- Concatenate the clk files into a single file and put it at `data/clk.txt` (assuming you have already changed your current directory to `examples/rtb2a`). Concatenate the conv files and put the result at `data/conv.txt`.
- Edit `imp.list` and replace the filenames with your data files' real locations. (Note that soft links may fail.)
- Edit `param_learner.txt` to adjust the learning parameters.
If you haven't built the binary yet, type `make` to build the main program. Make sure that your C++ compiler supports `-std=c++0x` (or `-std=c++11`).
If everything goes fine, you can then start the learning process:

- Run `./main imp.list 30000000 0` to learn the logistic regression model with 30000000 iterations of truncated SGD. Note that by default the program checks for an existing `model.txt` and auto-loads it to continue training. If you want to learn a new model, delete the old file or move it elsewhere.
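The resume behavior can be pictured roughly as below. The `loadModel`/`saveModel` helpers and the one-`index value`-per-line format are assumptions for illustration, not the program's actual model format:

```cpp
// Sketch of the auto-resume behavior: if model.txt exists, load it and
// continue training; otherwise start from an empty model. The helpers
// and the text format here are assumptions, not the actual format.
#include <cstddef>
#include <fstream>
#include <string>
#include <unordered_map>

using Model = std::unordered_map<std::size_t, double>;

bool loadModel(const std::string& path, Model& model) {
    std::ifstream in(path);
    if (!in) return false;  // no previous model: train from scratch
    std::size_t index;
    double value;
    while (in >> index >> value) model[index] = value;
    return true;
}

void saveModel(const std::string& path, const Model& model) {
    std::ofstream out(path);
    for (const auto& kv : model) out << kv.first << ' ' << kv.second << '\n';
}
```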
- **More Learners**: We hope to implement more algorithms for more models, including but not limited to LSE, SVM, MF, and NN.
- **More Feeders**: Possible support for CSV, gzipped/bzipped text files, etc.
- **Parallel support**: We are planning to add single-machine parallel support (multiple threads) in the future.
- **Why do you use SGD instead of some 2nd-order method?**
  We favor SGD over 2nd-order methods because SGD is more efficient when we have enormous data but limited time and computational resources. A good reference is [L. Bottou, Large-Scale Machine Learning with Stochastic Gradient Descent](http://leon.bottou.org/publications/pdf/compstat-2010.pdf).
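For a rough sense of the cost argument (a sketch; here \(\sigma\) is the logistic function, \(\gamma_t\) the learning rate, and the update actually used in this project is the truncated variant of the SGD step shown):

```latex
% One SGD step for logistic regression touches a single example (x_t, y_t),
% costing O(s) for s nonzero features:
w_{t+1} = w_t - \gamma_t \, (\sigma(w_t^\top x_t) - y_t) \, x_t
% One Newton step needs the full gradient and Hessian over all n examples
% (O(n d^2) to build, O(d^3) to solve), which is infeasible at this scale:
w_{t+1} = w_t - H(w_t)^{-1} \nabla L(w_t)
```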