/misvm

Multiple-Instance Support Vector Machines

Primary LanguagePythonBSD 3-Clause "New" or "Revised" LicenseBSD-3-Clause

MISVM: Multiple-Instance Support Vector Machines

by Gary Doran (gary.doran@case.edu)

Overview

MISVM contains a Python implementation of numerous support vector machine (SVM) algorithms for the multiple-instance (MI) learning framework. The implementations were created for use in the following publication:

Doran, Gary and Soumya Ray. A theoretical and empirical analysis of support vector machine methods for multiple-instance classification. To appear in Machine Learning Journal. 2013.

Installation

This package can be installed in two ways (the easy way):

# If needed:
# pip install numpy
# pip install scipy
# pip install cvxopt
pip install -e git+https://github.com/garydoranjr/misvm.git#egg=misvm

or by running the setup file manually

git clone [the url for misvm]
cd misvm
python setup.py install

Note the code depends on the numpy, scipy, and cvxopt packages. So have those installed first. The build will likely fail if it can't find them. For more information, see:

  • NumPy: Library for efficient matrix math in Python
  • SciPy: Library for more MATLAB-like functionality
  • CVXOPT: Efficient convex (including quadratic program) optimization

Contents

The MISVM package currently implements the following algorithms:

SIL

Single-Instance Learning (SIL) is a "naive" approach that assigns each instance the label of its bag, creating a supervised learning problem but mislabeling negative instances in positive bags. It works surprisingly well for many problems.

Ray, Soumya, and Mark Craven. Supervised versus multiple instance learning: an empirical comparison. Proceedings of the 22nd International Conference on Machine Learning. 2005.

MI-SVM and mi-SVM

These approaches modify the standard SVM formulation so that the constraints on instance labels correspond to the MI assumption that at least one instance in each bag is positive. For more information, see:

Andrews, Stuart, Ioannis Tsochantaridis, and Thomas Hofmann. Support vector machines for multiple-instance learning. Advances in Neural Information Processing Systems. 2002.

NSK and STK

The normalized set kernel (NSK) and statistics kernel (STK) approaches use kernels to map entire bags into a features, then use the standard SVM formulation to find bag classifiers:

Gärtner, Thomas, Peter A. Flach, Adam Kowalczyk, and Alex J. Smola. Multi-instance kernels. Proceedings of the 19th International Conference on Machine Learning. 2002.

MissSVM

MissSVM uses a semi-supervised learning approach, treating the instances in positive bags as unlabeled data:

Zhou, Zhi-Hua, and Jun-Ming Xu. On the relation between multi-instance learning and semi-supervised learning. Proceedings of the 24th International Conference on Machine Learning. 2007.

MICA

The "multiple-instance classification algorithm" (MICA) represents each bag using a convex combinations of its instances. The optimization program is then solved by iteratively solving a series of linear programs. In our formulation, we use L2 regularization, so we solve alternating linear and quadratic programs. For more information on the original algorithm, see:

Mangasarian, Olvi L., and Edward W. Wild. Multiple instance classification via successive linear programming. Journal of Optimization Theory and Applications 137.3 (2008): 555-568.

sMIL, stMIL, and sbMIL

This family of approaches intentionally bias SVM formulations to handle the assumption that there are very few positive instances in each positive bag. In the case of sbMIL, prior knowledge on the "sparsity" of positive bags can be specified or found via cross-validation:

Bunescu, Razvan C., and Raymond J. Mooney. Multiple instance learning for sparse positive bags. Proceedings of the 24th International Conference on Machine Learning. 2007.

How to Use

The classifier implementations are loosely based on those found in the scikit-learn library. First, construct a classifier with the desired parameters:

>>> import misvm
>>> classifier = misvm.MISVM(kernel='linear', C=1.0, max_iters=50)

Use Python's help functionality as in help(misvm.MISVM) or read the documentation in the code to see which arguments each classifier takes. Then, call the fit function with some data:

>>> classifier.fit(bags, labels)

Here, the bags argument is a list of "array-like" (could be NumPy arrays, or a list of lists) objects representing each bag. Each (array-like) bag has m rows and f columns, which correspond to m instances, each with f features. Of course, m can be different across bags, but f must be the same. Then labels is an array-like object containing a label corresponding to each bag. Each label must be either +1 or -1. You will likely get strange results if you try using 0/1-valued labels. After training the classifier, you can call the predict function as:

>>> labels = classifier.predict(bags)

Here bags has the same format as for fit, and the function returns an array of real-valued predictions (use numpy.sign(labels) to get -1/+1 class predictions).

In order to get instance-level predictions from a classifier, use the instancePrediction flag, as in:

>>> bag_labels, instance_labels = classifier.predict(bags, instancePrediction=True)

The instancePrediction flag is not available for bag-level classifiers such as the NSK. However, you can always predict the labels of "singleton" bags containing a single instance to assign a label to that instance. In this case, one should use caution in interpreting the label of an instance produced by a bag-level classifier, since these classifiers are designed to make predictions based on properties of an entire bag.

An example script is included that trains classifiers on the musk1 dataset; see:

Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Install the package or add the misvm directory to the PYTHONPATH environment variable before attempting to run the example using python example.py within the example directory.

Questions and Issues

If you find any bugs or have any questions about this code, please create an issue on GitHub, or contact Gary Doran at gary.doran@case.edu. Of course, I cannot guarantee any support for this software.