Implementation of the string subsequence kernel (SSK) described in Lodhi et al. (2002).
The main function is written in Cython (Python with C-level type annotations) and is the second-fastest SSK implementation that I know of. The fastest SSK implementation I have found is Shogun's SubsequenceStringKernel.cpp; I copied a small trick from Shogun's implementation that speeds up the computation substantially.
- Python
- Cython
- NumPy
Take a look at `main.py`. That should hopefully suffice.
As an example, the following is a simple wrapper around `string_kernel()` that makes SSK usable as a scikit-learn kernel:
```python
import numpy as np
# `string_kernel` is the Cython function provided by this repo.

def get_ssk_kernel_for_scikit(max_substring, lambda_decay):
    def strker(il, ir):
        # Assuming that `il` and `ir` are lists of strings.
        # `len(il)` may fail to give you the real size if you're using `np.array`s;
        # the idea is to reshape your data into `np.array`s of shapes (n, 1) and (m, 1).
        l = np.array(il).reshape((len(il), 1))
        r = np.array(ir).reshape((len(ir), 1))
        return string_kernel(l, r, max_substring, lambda_decay)
    return strker

max_substring = 5
lambda_decay = .8
my_ssk_kernel = get_ssk_kernel_for_scikit(max_substring, lambda_decay)
```
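Note that Lodhi et al. (2002) report results with the *normalized* kernel, K̂(s, t) = K(s, t) / √(K(s, s) · K(t, t)), so documents of different lengths are comparable. If the Gram matrix you build is unnormalized (check your `string_kernel` build), a minimal NumPy sketch of normalizing a possibly rectangular Gram matrix looks like this (`normalize_gram` is a name I made up, not part of this repo):

```python
import numpy as np

def normalize_gram(G, k_ll, k_rr):
    # G has shape (n, m); k_ll[i] = K(l_i, l_i) and k_rr[j] = K(r_j, r_j)
    # are the self-similarities of the left and right string lists.
    return G / np.sqrt(np.outer(k_ll, k_rr))

# Toy example with a square Gram matrix, so the self-similarities
# are just its diagonal; the normalized diagonal becomes all ones.
G = np.array([[4.0, 1.0],
              [1.0, 1.0]])
print(normalize_gram(G, np.diag(G), np.diag(G)))
# -> [[1.  0.5]
#     [0.5 1. ]]
```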
`string_kernel` should accept arbitrary lists, not only Python strings and NumPy arrays.
If you're interested in how the recursive functions from the paper were converted into the rather confusing, triple-looped Cython function, I recommend checking how the code changes from the first to the fourth commit.
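For reference, here is a direct (and slow) pure-Python transcription of the paper's recursion, useful as a sanity check against the Cython version. The names `k_prime` and `ssk` are mine, not identifiers from this repo:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def k_prime(i, s, t, lam):
    # K'_i from Lodhi et al. (2002): scores length-i subsequence prefixes,
    # penalizing the span they occupy with the decay factor `lam`.
    if i == 0:
        return 1.0
    if min(len(s), len(t)) < i:
        return 0.0
    x, s_head = s[-1], s[:-1]
    total = lam * k_prime(i, s_head, t, lam)
    for j in range(len(t)):
        if t[j] == x:
            total += k_prime(i - 1, s_head, t[:j], lam) * lam ** (len(t) - j + 1)
    return total

def ssk(s, t, n, lam):
    # K_n: similarity over all common subsequences of length exactly n.
    if min(len(s), len(t)) < n:
        return 0.0
    x, s_head = s[-1], s[:-1]
    total = ssk(s_head, t, n, lam)
    for j in range(len(t)):
        if t[j] == x:
            total += k_prime(n - 1, s_head, t[:j], lam) * lam ** 2
    return total

# With lam = 1.0 the decay vanishes and ssk simply counts matching
# index pairs: "ab" occurs once in each string, so the score is 1.
print(ssk("ab", "ab", 2, 1.0))  # -> 1.0
```

The memoization only tames the exponential blow-up of the naive recursion; the Cython version instead uses the O(n·|s|·|t|) dynamic program, which is why it ends up as three nested loops.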
I wanted to license the code under the WTFPL, but CC0 is recommended over it, as CC0 is more "legal" ;)