LibNB

A C++ implementation of naive Bayes model

Table of Contents
=================
- Introduction
- Installation
- Data Format
- Usage
- Examples
- Additional Information


Introduction
============
XIA-NB is a C++ implementation of the Naive Bayes classifier, a well-known generative classification algorithm used in applications such as text classification. The Naive Bayes algorithm, as implemented here, requires the probabilistic distribution to be discrete. By default, XIA-NB represents instances with the multinomial event model and learns parameters by maximum likelihood estimation with Laplace smoothing. Feature vectors are stored in a sparse data structure for higher computational speed.
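
The following is a minimal sketch of multinomial Naive Bayes learning with add-one (Laplace) smoothing over sparse feature vectors. It is for illustration only and is not taken from the XIA-NB sources: the names `SparseVec', `MultinomialNB', `train' and `predict' are hypothetical, and smoothing the class priors as well as the feature probabilities is an assumption.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical sparse vector: only (feature id, count) pairs with count > 0.
using SparseVec = std::vector<std::pair<int, int>>;

// Hypothetical model layout: log-probabilities indexed by (class id - 1).
struct MultinomialNB {
    std::vector<double> log_prior;              // log P(c)
    std::vector<std::vector<double>> log_cond;  // log P(feature j | c)
};

// Maximum likelihood estimation with add-one smoothing.
// Class ids run from 1 to num_classes, feature ids from 1 to num_features.
MultinomialNB train(const std::vector<SparseVec>& docs,
                    const std::vector<int>& labels,
                    int num_classes, int num_features) {
    assert(docs.size() == labels.size() && !docs.empty());
    std::vector<double> class_count(num_classes, 0.0);
    std::vector<double> total_count(num_classes, 0.0);
    std::vector<std::vector<double>> feat_count(
        num_classes, std::vector<double>(num_features, 0.0));

    for (std::size_t i = 0; i < docs.size(); ++i) {
        const int c = labels[i] - 1;
        class_count[c] += 1.0;
        for (const auto& f : docs[i]) {  // only non-zero entries are visited
            feat_count[c][f.first - 1] += f.second;
            total_count[c] += f.second;
        }
    }

    MultinomialNB model;
    model.log_prior.resize(num_classes);
    model.log_cond.assign(num_classes, std::vector<double>(num_features));
    const double n = static_cast<double>(docs.size());
    for (int c = 0; c < num_classes; ++c) {
        // Assumption: the prior is smoothed too, so empty classes stay finite.
        model.log_prior[c] = std::log((class_count[c] + 1.0) / (n + num_classes));
        for (int j = 0; j < num_features; ++j) {
            model.log_cond[c][j] =
                std::log((feat_count[c][j] + 1.0) / (total_count[c] + num_features));
        }
    }
    return model;
}

// Return the class id (1-based) with the highest joint log score.
int predict(const MultinomialNB& model, const SparseVec& doc) {
    int best_class = 1;
    double best_score = -std::numeric_limits<double>::infinity();
    for (std::size_t c = 0; c < model.log_prior.size(); ++c) {
        double score = model.log_prior[c];
        for (const auto& f : doc) {
            score += f.second * model.log_cond[c][f.first - 1];
        }
        if (score > best_score) {
            best_score = score;
            best_class = static_cast<int>(c) + 1;
        }
    }
    return best_class;
}
```

Because both loops iterate only over the stored pairs, the cost per instance depends on the number of non-zero features rather than on the size of the feature set.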


Installation
============

On Linux, type `make' to build the `nb_learn' and `nb_classify' programs. Run either program without arguments to print its usage.

On Windows, refer to `Makefile' to build them, or use the pre-built binaries in the `windows' directory.


Data Format
===========

The format of the training and testing data files is:

<label>	<index1>:<value1> <index2>:<value2> ...
.
.
.

Each line contains one instance and ends with a '\n' character.

<label> is an integer indicating the class id. Class ids should range from 1 to the number of classes. For example, the class ids are 1, 2, 3 and 4 for a 4-class classification problem.
 
<label> and the <index>:<value> pairs are separated by a '\t' character, and consecutive pairs are separated by a space. <index> is a positive integer denoting the feature id. Feature ids should range from 1 to the size of the feature set. For example, the feature ids are 1, 2, ..., 10 if the feature set has 10 dimensions. Indices must be in ASCENDING order. <value> is a float denoting the feature value, but it must hold an INTEGER value, since the Naive Bayes algorithm requires the probabilistic distribution to be discrete.

If a feature value equals 0, omitting its <index>:<value> pair is encouraged, to save storage space and computation time.
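
As an illustration of these rules, here is a hedged sketch of a parser for one line of a data file. It is not the parser used by `nb_learn' or `nb_classify'. `Instance', `parse_line' and `to_dense' are hypothetical names, and rejecting a line that has no tab is an assumption of this sketch.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory form of one line of a data file.
struct Instance {
    int label = 0;
    std::vector<std::pair<int, int>> features;  // (feature id, value), ascending ids
};

// Parse "<label>\t<index>:<value> <index>:<value> ...".
// Throws std::invalid_argument when a rule of the format is violated.
Instance parse_line(const std::string& line) {
    const std::size_t tab = line.find('\t');
    if (tab == std::string::npos)
        throw std::invalid_argument("expected a tab after the label");

    Instance inst;
    inst.label = std::stoi(line.substr(0, tab));
    if (inst.label < 1)
        throw std::invalid_argument("class ids start at 1");

    std::istringstream pairs(line.substr(tab + 1));
    std::string token;
    int prev_index = 0;
    while (pairs >> token) {
        const std::size_t colon = token.find(':');
        if (colon == std::string::npos)
            throw std::invalid_argument("expected <index>:<value>");
        const int index = std::stoi(token.substr(0, colon));
        const double value = std::stod(token.substr(colon + 1));
        if (index <= prev_index)
            throw std::invalid_argument("indices must be positive and ascending");
        if (value != std::floor(value))
            throw std::invalid_argument("values must be integers");
        if (value != 0.0)  // zero entries carry no information, so drop them
            inst.features.emplace_back(index, static_cast<int>(value));
        prev_index = index;
    }
    return inst;
}

// Expand to a dense vector of length num_features; omitted ids are zero.
std::vector<int> to_dense(const Instance& inst, int num_features) {
    std::vector<int> dense(num_features, 0);
    for (const auto& f : inst.features) {
        assert(f.first <= num_features);  // feature ids run from 1 to num_features
        dense[f.first - 1] = f.second;
    }
    return dense;
}
```

For example, `parse_line("2\t1:3 4:1 10:2")' yields label 2 with three stored pairs, while `to_dense' shows that every omitted index is implicitly zero.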

Labels in the testing file are used only to calculate accuracy or error. If they are unknown, fill the first column with any class label.
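
To make the role of the test labels concrete, the sketch below computes accuracy and error rate from two label sequences. It is a generic illustration, not the code `nb_classify' runs. `accuracy' and `error_rate' are hypothetical names, and loading the labels from files is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fraction of instances whose predicted class id equals the gold class id.
double accuracy(const std::vector<int>& gold, const std::vector<int>& predicted) {
    assert(gold.size() == predicted.size() && !gold.empty());
    std::size_t correct = 0;
    for (std::size_t i = 0; i < gold.size(); ++i) {
        if (gold[i] == predicted[i]) ++correct;
    }
    return static_cast<double>(correct) / static_cast<double>(gold.size());
}

// Error rate is the complement of accuracy.
double error_rate(const std::vector<int>& gold, const std::vector<int>& predicted) {
    return 1.0 - accuracy(gold, predicted);
}
```

If the first column of the testing file holds placeholder labels, these numbers are meaningless and only the predicted labels should be used.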


Usage
=====

XIA-NB learning module

usage: nb_learn [options] training_file model_file

options: -h        -> help
         -e [0,1]  -> 0: multi-variate Bernoulli event model
                   -> 1: multinomial event model (default)
         -s [0]    -> Laplace smoothing (default)
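
The two event models selected by `-e' differ in how an instance is scored. The sketch below uses the textbook definitions and is not extracted from XIA-NB. The function names are hypothetical, and the parameter vectors are assumed to lie strictly between 0 and 1, as smoothing would ensure.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sparse vector of (feature id, integer value) pairs.
using SparseVec = std::vector<std::pair<int, int>>;

// Multinomial event model: every occurrence counts, so each stored value
// multiplies the log-probability of its feature. Absent features add nothing.
// (The multinomial coefficient is omitted because it is the same for all classes.)
double multinomial_log_likelihood(const SparseVec& doc,
                                  const std::vector<double>& theta) {
    double ll = 0.0;
    for (const auto& f : doc) {
        assert(f.first >= 1 && static_cast<std::size_t>(f.first) <= theta.size());
        ll += f.second * std::log(theta[f.first - 1]);
    }
    return ll;
}

// Multi-variate Bernoulli event model: only presence or absence matters,
// and absent features contribute log(1 - p) as well.
double bernoulli_log_likelihood(const SparseVec& doc,
                                const std::vector<double>& p) {
    // Start from "every feature absent", then correct the present ones.
    double ll = 0.0;
    for (double pj : p) ll += std::log(1.0 - pj);
    for (const auto& f : doc) {
        assert(f.first >= 1 && static_cast<std::size_t>(f.first) <= p.size());
        if (f.second == 0) continue;
        const double pj = p[f.first - 1];
        ll += std::log(pj) - std::log(1.0 - pj);
    }
    return ll;
}
```

In this textbook form the Bernoulli score ignores how often a feature occurs, which is why a model trained with `-e 0' should also be applied with `-e 0'.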


XIA-NB classification module

usage: nb_classify [options] testing_file model_file output_file

options: -h        -> help
         -e [0,1]  -> 0: multi-variate Bernoulli event model
                   -> 1: multinomial event model (default)		
         -f [0..2] -> 0: only output class label (default)
                   -> 1: output class label with log-likelihood
                   -> 2: output class label with probability
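
The log-likelihood and probability outputs are related by normalization over classes. One common way to turn per-class log scores into probabilities without numerical underflow is the log-sum-exp trick sketched below. Whether `nb_classify' computes its `-f 2' output exactly this way is an assumption, and `normalize_log_scores' is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Convert per-class joint log scores, log P(c) + log P(x | c), into
// posterior probabilities P(c | x) that sum to 1.
std::vector<double> normalize_log_scores(const std::vector<double>& log_scores) {
    assert(!log_scores.empty());
    // Subtracting the maximum keeps std::exp from underflowing to zero
    // when all scores are very negative, as they are for long documents.
    const double max_score =
        *std::max_element(log_scores.begin(), log_scores.end());
    std::vector<double> probs(log_scores.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < log_scores.size(); ++i) {
        probs[i] = std::exp(log_scores[i] - max_score);
        sum += probs[i];
    }
    for (double& p : probs) p /= sum;
    return probs;
}
```

Exponentiating the raw scores directly would fail for values such as -1000, since `std::exp(-1000.0)' is zero in double precision.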


Examples
========

The "data" directory contains a dataset for a text classification task. This dataset
has six class labels and more than 250,000 features.

For learning with the default multinomial event model:

> nb_learn data/train.samp data/nb.mod

For learning with the multi-variate Bernoulli event model:

> nb_learn -e 0 data/train.samp data/nb0.mod

For classifying with the default multinomial event model and the default output format:

> nb_classify data/test.samp data/nb.mod data/nb.out

For classifying with the multi-variate Bernoulli event model and the log-likelihood output:

> nb_classify -e 0 -f 1 data/test.samp data/nb0.mod data/nb0.out


Additional Information
======================

For questions or comments, please email rxia.cn@gmail.com.