/python-ssk

Fast String Subsequence Kernel (SSK) implementation in Python (Cython)

Primary LanguagePythonCreative Commons Zero v1.0 UniversalCC0-1.0

Fast String Kernel (SSK) implementation for python

Implementation of string kernel as described in Lodhi et al (2002) (aka. SSK).

The main function is written Cython (Python with C type-level annotations) and is the second fastest implementation of the SSK Kernel that I know of. The fastest SSK implementation I found is Shogun's SubsequenceStringKernel.cpp. I copied a small trick from Shogun's implementation that speeds the computation substantially.

Requisites

  • Python
  • Cython
  • Numpy

How to use

Take a look at main.py. That hopefully should suffice.

As an example, the following is a simple wrapper around string_kernel() to use SSK in scikit-learn:

def get_ssk_kernel_for_scikit(max_substring, lambda_decay):
    def strker(il, ir):
        #print("Shape of gramm matrix to create ({},{})".format(len(il), len(ir)))
        # assuming that `il` and `ir` are lists of strings.
        # `len(il)` may fail to give you the size real size if you're using `np.array`s
        # the idea is to reshape your data to be `np.array`s of sizes (n, 1) and (m, 1)
        l = np.array(il).reshape((len(il), 1))
        r = np.array(ir).reshape((len(ir), 1))
        return string_kernel(l, r, max_substring, lambda_decay)
    return strker

max_substring = 5
lambda_decay = .8

my_ssk_kernel = get_ssk_kernel_for_scikit(max_substring, lambda_decay)

TODO

  • string_kernel should accept arbitrary lists not only python strings and numpy arrays

How does ssk() actually works?

If you're interested on how the recursive functions from the paper got converted into the very confusing, three-looped Cython function, I recommend you to check the changes in the code from the first to the fourth commits.

License

I wanted to use WTFPL to license the code, but CC0 is recommended over it as CC0 is more "legal" ;)