JuliaAI/MLJ.jl

Failed to use TunedModel with precomputed-SVM

KeishiS opened this issue · 2 comments

First of all, thank you for the great work you're doing in maintaining this project. I encountered what seems to be a bug when attempting to use a support vector classifier with a precomputed Gram matrix while performing hyperparameter tuning with TunedModel. I would like to submit a pull request to address the issue, but I'm unsure which part of the codebase needs modification. Any advice would be greatly appreciated.

Describe the bug
When performing parameter search with TunedModel on an SVM with a precomputed kernel, the data splitting is not carried out properly.

To Reproduce

#%%
using MLJ, MLJBase
using MLJScikitLearnInterface
using LinearAlgebra
SVMClassifier = @load SVMClassifier pkg = MLJScikitLearnInterface

#%% Create toy data
using Random, Distributions
θ₀ = rand(Uniform(0, 2π), 100)
X₀ = 0.5 .* [cos.(θ₀) sin.(θ₀)] .+ (randn(100, 2) .* 0.12)
y₀ = zeros(Int, 100)

θ₁ = rand(Uniform(0, 2π), 100)
X₁ = [cos.(θ₁) sin.(θ₁)] .+ (randn(100, 2) .* 0.12)
y₁ = ones(Int, 100)

n = 200
X = vcat(X₀, X₁)
y = MLJBase.categorical(vcat(y₀, y₁))
gmat = [
    exp(-norm(X[i, :] - X[j, :]) * 0.1)
    for i in 1:n, j in 1:n
]

#%%
model = SVMClassifier(kernel="precomputed")
tuning_model = TunedModel(
    model=model,
    range=range(model, :C; lower=0.01, upper=1000, scale=:log),
    measure=accuracy
)
mach = machine(tuning_model, gmat, y)
fit!(mach)

Expected behavior

During the search for the best parameters, the Gram matrix gmat is split into training and test data. We expect gmat[train_idx, train_idx] and gmat[test_idx, train_idx] to be created. However, the current code splits it into gmat[train_idx, :] and gmat[test_idx, :]. This operation is executed in the fit_and_extract_on_fold function in MLJBase.jl/src/resampling.jl.
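To make the expected slicing concrete, here is a minimal self-contained sketch (plain Julia, no MLJ dependency; the fold indices are hypothetical):

```julia
using LinearAlgebra, Random

# Toy features and the same RBF-style kernel as the reproduction above.
Random.seed!(0)
X = randn(10, 2)
k(a, b) = exp(-norm(a - b) * 0.1)
gmat = [k(X[i, :], X[j, :]) for i in 1:10, j in 1:10]

train, test = 1:7, 8:10   # hypothetical fold indices

# What the resampler currently produces (rows only):
current_train = gmat[train, :]       # 7×10 — columns for test points leak in

# What a precomputed-kernel SVM actually needs:
correct_train = gmat[train, train]   # 7×7 train-vs-train Gram block
correct_test  = gmat[test, train]    # 3×7 test-vs-train cross-kernel block
```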

Versions

  • julia 1.10.5
  • MLJ v0.20.0
  • MLJBase v1.7.0
  • MLJScikitLearnInterface v0.7.0

I would be grateful for any advice on how to approach solving this issue. Thank you for taking the time to read and consider this matter!

Thanks @KeishiS for the positive feedback and for posting.

I'm afraid that when MLJTuning (or evaluate!) resamples, it has no way of knowing that it is supposed to also apply the resampling to some hyperparameter.

It looks like you may have better luck with the LIBSVM version of the model (for which an MLJ interface is also provided). In that case you can pass a kernel function rather than an explicit matrix, which won't suffer from this issue, right? Would this suit your purpose?
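To sketch why a kernel function sidesteps the problem (plain Julia, no MLJ or LIBSVM dependency; the kernel and indices are just the toy ones from above): each fold computes its Gram blocks from the raw feature rows, so only ordinary row resampling is ever needed:

```julia
using LinearAlgebra, Random

# Same toy kernel as the reproduction above.
Random.seed!(1)
k(a, b) = exp(-norm(a - b) * 0.1)
X = randn(10, 2)
train, test = 1:7, 8:10   # hypothetical fold indices

# When the model holds a kernel *function*, resampling only touches the
# raw feature rows; each fold then builds exactly the blocks it needs:
K_train = [k(X[i, :], X[j, :]) for i in train, j in train]  # 7×7
K_test  = [k(X[i, :], X[j, :]) for i in test,  j in train]  # 3×7
```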


For the record, it is theoretically possible to fix the sk-learn API. The proper interface point for "metadata" that needs to be resampled is to pass it along with the data. So, a corrected workflow would look something like

mach = machine(SVC(), X, y, kernel)  
evaluate!(mach, resampling=...)

To implement this would require also adding a "data front end" to the MLJ interface, to articulate exactly how the resampling is to be done, because the default resampling of arrays (just resample the rows) doesn't work in this case.
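Purely as a hedged sketch of the idea (the function names below are hypothetical stand-ins, not the actual MLJModelInterface methods), such a data front end would bundle the Gram matrix once and then slice both axes on selection:

```julia
# Self-contained sketch, no MLJ dependency. In a real fix these would be
# methods of the data-front-end extension points for the sk-learn SVC;
# PrecomputedSVC and the *_sketch names are made up for illustration.
struct PrecomputedSVC end

# "reformat": bundle the full Gram matrix once, before any resampling.
reformat_sketch(::PrecomputedSVC, gmat, y) = ((gmat,), y)

# Training-time row selection: restrict *both* axes to the fold indices.
selectrows_train(::PrecomputedSVC, I, (gmat,), y) = (gmat[I, I], y[I])

# Prediction-time selection: rows are the evaluation points, columns are
# the training points fixed at fit time.
selectrows_predict(::PrecomputedSVC, I, train_idx, (gmat,)) = gmat[I, train_idx]
```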

Unfortunately, the MLJ sk-learn interfaces are created with a lot of metaprogramming and are therefore difficult to customise. So a fix here would be complicated.

cc @tylerjthomas9

Thank you for your reply! 😄

I wasn't familiar with the concept of a "data front end", so I'll take some time to study the information at the link you provided.

While the example code creates a Gram matrix from simple toy data, I'm actually considering a graph kernel, where processing multiple graphs in parallel would be more efficient. That's why I was hoping to use a precomputed kernel if possible. I appreciate your suggestion of LIBSVM; I'll try it.

Based on the information you've provided, I'll think about whether there might be a good alternative approach. For now, I'll close this issue. Thank you very much for taking the time to address my concerns.