This is a refactored version of the popular libSVM library.
The original code is a literate port of C code to Java, and uses more memory than necessary. This makes the Java version noticeably slower, and harder to extend.
This version is heavily refactored (and still a work in progress), and in particular makes it easier to use custom data types (e.g. non numeric data) as well as custom kernel functions efficiently.
These results are from a single run of 20news with linear kernel and otherwise default options, as I'm still in progress of testing the codebase for bugs, not for performance.
All accuracy evaluation was done with the C version.
C-SVM | libSVM (C) | libSVM (Java) | libModernSVM |
---|---|---|---|
Training time | 89.32s | 138.78s | 116.43s |
Accuracy | 84.022% | 84.022% | 84.022% |
Accuracy (abs) | 3355/3993 | 3355/3993 | 3355/3993 |
Total nSV | 12771 | 12771 | 12771 |
--------------- | ------------ | --------------- | -------------- |
nu-SVM | libSVM (C) | libSVM (Java) | libModernSVM |
--------------- | ------------ | --------------- | -------------- |
Training time | 163.40s | 264.00s | 221.17s |
Accuracy | 81.5928% | 81.5928% | 81.5928% |
Accuracy (abs) | 3258/3993 | 3258/3993 | 3258/3993 |
Total nSV | 14015 | 14015 | 14015 |
As desired, our runtime improved substantially (15-20% faster) over the original Java version, but we do not yet achieve the performance of the C version. As is, the results appear to be identical, when the parameters are chosen the same way (note that the current version does not have a command line version).
The original implementation used a very C style data type: each vector was
represented as an svm_node[]
array. For Java, this is a rather costly choice.
C uses structs, and svm_node[]
then will be a continuous block of memory,
interleaving int
and double
(with the worst case that the values get padded
to 8 byte boundaries).
Java on the other hand does not have efficient structs. Many small objects can hurt your performance, in particular when the garbage collection kicks in. This representation of sparse vectors yields a kind of ragged array: each array entries points to an object (which may reside at a different memory location) and these tiny objects, carrying 12 bytes of payload but using 24 bytes of memory. This memory layout therefore has an overhead of about 100%.
My replacement vector representation - consisting of an int[]
for the indexes
and a double[]
array - only adds 36-40 bytes of total overhead per vector.
Plus, I assume, this memory layout is easier to optimize for Java and easier
to perform garbage collection. I found these vector representations to work
well when working on ELKI Data Mining. Many of
the changes done to this version of libSVM are meant to make it easier to
integrate in ELKI...
First benchmarking results showed that most of the runtime is indeed used to compute the kernel functions. Benefits in computing dot products can therefore be expected to directly translate into overall performance gains.