LibLR

A C++ implementation of Multi-class Logistic Regression (Softmax Regression), supporting large-scale sparse data and GD, SGD and L-BFGS optimization.


LibLR V0.20

Available at https://github.com/rxiacn/LibLR

Please read the LICENSE file before using LibLR.


Table of Contents
=================
- Introduction
- Installation
- Data Format
- Usage
- Examples
- Additional Information


Introduction
============
LibLR is a C++ implementation of Multi-class Logistic Regression (Softmax
Regression). Maximum likelihood estimation (MLE) is used for parameter
learning; MLE is equivalent to learning under the cross-entropy criterion.
Three optimization methods are supported: 1) gradient descent; 2) stochastic
gradient descent; 3) L-BFGS. LibLR uses a sparse data structure to represent
feature vectors for higher computational speed. Other techniques such as
online model updating and Gaussian prior regularization are also supported.
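
For reference, the model and training criterion described above can be written
as follows. The notation is the standard softmax-regression one and is not
taken from the LibLR source: w_k is the weight vector of class k, x_i and y_i
are the i-th training instance and its label, K is the number of classes and
N the number of training instances.

    P(y = k \mid x) = \frac{\exp(w_k^\top x)}{\sum_{j=0}^{K-1} \exp(w_j^\top x)}

MLE minimizes the negative log-likelihood (the cross-entropy), optionally with
the Gaussian-prior term controlled by lambda (option -r of lr_train):

    L(w) = -\sum_{i=1}^{N} \log P(y_i \mid x_i)
           + \frac{\lambda}{2} \sum_{k=0}^{K-1} \|w_k\|^2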


Installation
============

On a Linux system, type `make' to build the `lr_train' and `lr_classify'
programs. Run them without arguments to print their usage messages.

On a Windows system, refer to `Makefile' to build them, or use the pre-built
binaries (in the directory `windows').


Data Format
===========

The format of training and testing data file is:

<label>	<index1>:<value1> <index2>:<value2> ...
.
.
.

Each line contains one instance and ends with a '\n' character.

<label> is an integer indicating the class id. Class ids range from 0 to the
number of classes minus one. For example, the class ids are 0, 1, 2 and 3 for
a 4-class classification problem.
 
<label> and <index>:<value> are separated by a '\t' character. <index> is a
positive integer denoting the feature id. Feature ids range from 1 to the size
of the feature set. For example, the feature ids are 1, 2, ..., 9 or 10 if the
dimension of the feature set is 10. <value> is a float denoting the value of
the feature.

If a feature value equals 0, the corresponding <index>:<value> pair should be
omitted to save storage space and computation time.
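
For example, the first few lines of a training file for a hypothetical 4-class
problem with 10 features might look like this (zero-valued features are simply
left out; a tab separates the label from the first pair):

2	1:0.5 3:1.2 7:0.8
0	2:1.0 10:0.3
3	1:0.7 4:2.1 5:0.9 9:1.5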

Labels in the testing file are only used to calculate accuracy or error rates.
If they are unknown, just fill the first column with any class label.
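
If you need to read this format from your own code, the following is a minimal
C++ sketch that parses one line into a label and a sparse list of
(feature id, value) pairs. It only illustrates the file format and does not
reflect the internal data structures of LibLR.

#include <cstdio>
#include <cstdlib>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Parse one line "<label>\t<index1>:<value1> <index2>:<value2> ..." into a
// class label and a sparse list of (feature id, value) pairs.
static int parse_line(const std::string& line,
                      std::vector<std::pair<int, double> >& features) {
    std::istringstream iss(line);
    int label = 0;
    iss >> label;                      // leading class id
    std::string token;                 // each remaining token is "index:value"
    while (iss >> token) {
        std::string::size_type colon = token.find(':');
        if (colon == std::string::npos) continue;  // skip malformed tokens
        int index = std::atoi(token.substr(0, colon).c_str());
        double value = std::atof(token.substr(colon + 1).c_str());
        features.push_back(std::make_pair(index, value));
    }
    return label;
}

int main() {
    std::vector<std::pair<int, double> > features;
    int label = parse_line("2\t1:0.5 3:1.2 7:0.8", features);
    std::printf("label=%d, non-zero features=%u\n",
                label, (unsigned)features.size());
    return 0;
}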


Usage
=====

The training module of LibLR

usage: lr_train [options] training_file model_file [pre_model_file]

options: -h         -> help
         -o [0,1,2] -> 0: gradient descent (default)
                    -> 1: stochastic gradient descent
                    -> 2: L-BFGS
         -n int     -> maximal iteration loops (default 200)
         -m double  -> minimal loss value decrease (default 1e-03)
         -l float   -> learning rate (default 1.0)
         -r double  -> regularization parameter lambda of gaussian prior (default 0)         
         -u [0,1]   -> 0: initial training model (default)
                    -> 1: updating model (pre_model_file is needed)
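
For instance, the following command would train with stochastic gradient
descent, a learning rate of 0.1 and a Gaussian prior lambda of 0.5 (the
parameter values here are purely illustrative; see also the Examples section):

> lr_train -o 1 -l 0.1 -r 0.5 data/train.samp data/lr.mod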


The classification module of LibLR

usage: lr_classify [options] testing_file model_file output_file

options: -h        -> help
         -f [0..2] -> 0: only output class label (default)
                   -> 1: output class label with log-likelihood (weighted sum)
                   -> 2: output class label with soft probability


Examples
========

The "data" directory contains a dataset of text classification task. This dataset 
has six class labels and more than 250,000 features. 

To train with the default gradient descent method, with at most 50 iterations
and a minimal loss decrease of 1e-06 (the learning rate keeps its default
value of 1.0):

> lr_train -n 50 -m 1e-06 data/train.samp data/lr.mod

To train with the stochastic gradient descent method:

> lr_train -o 1 -n 50 -m 1e-06 data/train.samp data/lr2.mod

To update a previously trained model online with new training data, use the
-u option and supply the previous model file:

> lr_train -u 1 data/new_train.samp data/lr_new.mod data/lr.mod

To classify a test file and output only the class labels:

> lr_classify data/test.samp data/lr.mod data/lr.out

To classify a test file and output class labels together with log-likelihoods
(the weighted sums):

> lr_classify -f 1 data/test.samp data/lr.mod data/lr_1.out

To classify a test file and output class labels together with soft
probabilities (converted from the log-likelihoods using the sigmoid function):

> lr_classify -f 2 data/test.samp data/lr.mod data/lr_2.out
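
Here the sigmoid refers to the standard logistic function applied to each
weighted sum z:

    \sigma(z) = \frac{1}{1 + e^{-z}}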


Additional Information
======================

For any questions or comments, please email rxiacn@gmail.com.