nignatiadis/SmoothingSplines.jl

Add cross validation for parameter selection

Opened this issue · 12 comments

I want to add automatic parameter selection for λ. Here is a draft:

using SmoothingSplines, Statistics

function cross_validate_spline(x, y, smoothing_parameter_grid, nfolds = length(x) - 1)
    # Initialize an array to store the cross-validated errors for each smoothing parameter
    cv_errors = zeros(length(smoothing_parameter_grid))

    # Split the data into contiguous folds for cross-validation
    n = length(x)
    fold_sizes = fill(n ÷ nfolds, nfolds)
    fold_sizes[1:(n % nfolds)] .+= 1  # spread the remainder over the first folds
    folds = cumsum(fold_sizes)        # index of the last observation in each fold
    # Loop over each smoothing parameter
    for (i, smoothing_parameter) in enumerate(smoothing_parameter_grid)
        # Initialize an array to store the mean squared error for each fold
        fold_errors = zeros(nfolds)
        
        # Loop over each fold
        start = 1
        for (j, stop) in enumerate(folds)
            # Split the data into a training set and a validation set
            x_train = vcat(x[1:start-1], x[stop+1:end])
            y_train = vcat(y[1:start-1], y[stop+1:end])
            x_val = x[start:stop]
            y_val = y[start:stop]
            
            # Fit a smoothing spline to the training data
            spline = fit(SmoothingSpline, x_train, y_train, smoothing_parameter)
            
            # Predict the validation data
            y_pred = predict(spline, x_val)
            
            # Calculate the mean squared error for this fold
            fold_errors[j] = mean((y_val - y_pred).^2)
            
            start = stop + 1
        end
        
        # Calculate the cross-validated error for this smoothing parameter
        cv_errors[i] = mean(fold_errors)
    end
    
    # Return the smoothing parameter that resulted in the lowest cross-validated error
    return smoothing_parameter_grid[argmin(cv_errors)]
end

Example:

X = 10 .* rand(100)
Y = sin.(X) .+ 0.5 * rand(100)

λ = cross_validate_spline(X, Y, collect(range(0.001, stop=0.999, step=0.001)))

using RCall
R"""
RSmoothingSplineLamda <- function(x, y)
{
    m <- smooth.spline(x = x, y = y, cv = TRUE)  # cv = TRUE requests leave-one-out CV
    return(m$lambda)
}
"""
λR = convert(Float64, rcall(:RSmoothingSplineLamda, x = X, y = Y))
# smooth.spline reports lambda for x rescaled to [0, 1], hence the cubic rescaling below
λJ = λR * ((maximum(X) - minimum(X))^3)

println(isapprox(λJ, λ, atol=1e-1))  # Check that the R and Julia answers roughly agree.

I am hesitant to rely on my own GridSearchCV function, but I could not find another native Julia implementation. Should we ask for a general GridSearchCV function in MLJ.jl, or should we just implement it in this package?
I could use MLBase.jl, but it isn't maintained.

The reason I am responding after a year is that I first tried to use Optim.jl to minimise the cross-validation error, and I didn't realise it was getting stuck in a local minimum.

Thanks @ayushpatnaikgit. I'd be very happy to include an interface to MLJ through https://github.com/JuliaAI/MLJModelInterface.jl. A pull request for that would be welcome. I don't have time to check this right now, but I am quite positive that MLJ provides something akin to what you call GridSearchCV.
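In case it helps with such a PR, here is a rough sketch of what an MLJModelInterface wrapper could look like. This is only a guess at a minimal design; the model name SmoothingSplineRegressor, the single-feature assumption, and the default λ are mine, not a settled API:

import MLJModelInterface
const MMI = MLJModelInterface
using SmoothingSplines

MMI.@mlj_model mutable struct SmoothingSplineRegressor <: MMI.Deterministic
    λ::Float64 = 1.0::(_ ≥ 0)  # roughness penalty
end

function MMI.fit(model::SmoothingSplineRegressor, verbosity, X, y)
    x = MMI.matrix(X)[:, 1]  # assume a single continuous feature
    fitresult = fit(SmoothingSpline, x, y, model.λ)
    return fitresult, nothing, NamedTuple()  # (fitresult, cache, report)
end

MMI.predict(::SmoothingSplineRegressor, fitresult, Xnew) =
    predict(fitresult, MMI.matrix(Xnew)[:, 1])

With a wrapper like this, MLJ's TunedModel with a Grid strategy and CV resampling should provide exactly the GridSearchCV-style selection of λ.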

@codetalker7 let's do this once you are available. It'll help us understand MLJ.

I attempted Bayesian optimisation for this, and at least for some problems, it turned out to be a bit better.

using SmoothingSplines

RSS(y, ŷ) = sum((y .- ŷ).^2)  # residual sum of squares

function smoothcv(λ)  # λ arrives as a length-1 vector from BayesOpt (see below)
    n = length(X)
    k = n  # leave-one-out cross-validation
    avgrss = 0.0
    for i in 1:k
        xtrain = copy(X)
        ytrain = copy(Y)
        xtest = xtrain[i]
        ytest = ytrain[i]
        deleteat!(xtrain, i)
        deleteat!(ytrain, i)
        spl = fit(SmoothingSpline, xtrain, ytrain, λ[1])
        Ypred = predict(spl, [xtest])
        avgrss += RSS([ytest], Ypred)
    end
    return avgrss / k
end

using BayesOpt
config = ConfigParameters()         # calls initialize_parameters_to_default of the C API
config.verbose_level = 0
set_kernel!(config, "kMaternARD5")  # calls set_kernel of the C API
config.sc_type = SC_MAP
lowerbound = [0.0]; upperbound = [0.999]
optimizer, optimum = bayes_optimization(smoothcv, lowerbound, upperbound, config)
λ = optimizer

@nignatiadis @sourish-cmi, do you have any suggestions on selecting the upper bound? Or, more generally, on an optimisation method for this problem?

In smoothcv, is the argument λ a scalar or a vector? From your code I think it is a scalar; however, in the following line λ is treated as a vector:

 spl = fit(SmoothingSpline, xtrain, ytrain, λ[1])

So I am a bit confused.

What do you actually mean by "a bit better"?

In smoothcv, is the argument λ a scalar or a vector? From your code I think it is a scalar; however, in the following line λ is treated as a vector:

 spl = fit(SmoothingSpline, xtrain, ytrain, λ[1])

So I am a bit confused.

λ is a scalar, but BayesOpt.jl passes the parameters to the objective as a vector of scalars, to handle functions with multiple parameters. In this case the vector has length 1, so I use λ[1].

This example should make it clear:

f(x) = sum(x .^ 2)
lowerbound = [-2.0, -2.0]; upperbound = [2.0, 2.0]
optimizer, optimum = bayes_optimization(f, lowerbound, upperbound, config)

What do you actually mean by "a bit better"?

Hmm, yes, "better" is vague. In my specific example it was faster than searching the grid range(0.001, stop=0.999, step=0.001), and the objective function reached a slightly lower value. Searching a coarser grid, such as range(0.1, stop=0.9, step=0.1), would of course be faster.

Perhaps we can add both optimisation methods, grid search and Bayesian optimisation.


Okay, this makes sense.

Okay. I think you will really enjoy the time gain on very large data. You can try an exercise: consider the diamonds dataset, which has more than 50 thousand rows and is available in RDatasets.jl, and take price as y and carat as x. Because you have more than 50k samples, smoothcv is going to be extremely computationally expensive. In such a case grid search will suffer, while Bayesian optimisation will work very nicely.
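For reference, something like this should load it (assuming RDatasets' usual capitalised column names):

using RDatasets
diamonds = dataset("ggplot2", "diamonds")  # 53,940 rows
X = Float64.(diamonds.Carat)
Y = Float64.(diamonds.Price)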

I took a sample of 1000 rows. I ran @time rather than @btime, so these are single runs rather than averages over several runs:

julia> @time bayes_optimization(smoothcv, lowerbound, upperbound, config)
 76.765974 seconds (816.28 M allocations: 81.132 GiB, 7.96% gc time)
(optimizer = [0.7110305214105249], optimum = 0.12187759391564412)
julia> @time cross_validate_spline(X, Y, collect(range(0.001, stop=0.999, step=0.001)))
 390.340428 seconds (4.05 G allocations: 416.585 GiB, 7.87% gc time)
0.585

The CV score at λ = 0.585 is 0.12188375156979468, which is greater than 0.12187759391564412.

I can make the method a keyword argument, so the user can choose the optimisation method.

Hi both and thanks! Let me just mention the following: All of the above timings are quite slow because you use leave-one-out cross-validation (LOO-CV). If instead, you used e.g., 5-fold cross-validation, a grid search would be almost instant even for reasonably sized datasets.
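With the cross_validate_spline function posted at the top of this thread, that is just a matter of passing the fourth argument:

cross_validate_spline(X, Y, collect(range(0.001, stop=0.999, step=0.001)), 5)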

What is interesting however is that for this specific model (smoothing splines), the LOO-CV error can be computed essentially for free (i.e., in the same runtime required for a single fit---you do not need to rerun the algorithm each time you leave one sample out). This computation is not currently implemented, but is described in Chapter 3.2.1 of the textbook "Nonparametric regression and generalized linear models: a roughness penalty approach" by Green and Silverman (1994). Implementing that algorithm and running grid-search on top of that should be the preferred solution for LOO-CV.
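For reference, the identity behind this trick is the standard one for linear smoothers. Writing $A(\lambda)$ for the matrix mapping the observations to the fitted values $\hat g(t_i)$,

$$\mathrm{CV}(\lambda) = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{Y_i - \hat g(t_i)}{1 - A_{ii}(\lambda)}\right)^{2},$$

so a single fit plus the diagonal of $A(\lambda)$ suffices; if I recall correctly, the book obtains that diagonal in $O(n)$ time from the banded factors produced by the Reinsch algorithm.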

The purpose of "Bayesian Optimisation" is to speed up the optimisation when your objective function is insanely expensive, such as LOO-CV. Such cases are the problem where one should try BayesOpt.

I would be happy to hear the method where LOO-CV error can be computed for free using k-fold cross-validation.

The purpose of "Bayesian Optimisation" is to speed up the optimisation when your objective function is insanely expensive, such as LOO-CV.

Agreed. My point above was that for this problem the LOO-CV error actually is very cheap to compute (see the textbook chapter I referenced above), and as such I would not endorse including BayesOpt in this package.

The purpose of "Bayesian Optimisation" is to speed up the optimisation when your objective function is insanely expensive, such as LOO-CV.

Agreed. My point above was that for this problem LOO-CV actually is very cheap to compute (see the textbook chapter I referenced above) and as such, I would not endorse including BayesOpt in this package.

Sorry, our library does not have the specific book you are recommending, so if you can write out the specific algorithm here, that would be really great. I hope it would not be very difficult.

Thanking you in advance.
Sourish

Hi Sourish, I found the book. It's a really neat trick.

When you build the model, you can get the LOO-CV score from the intermediate calculations. That is why LOO-CV is used rather than some other k: in this case, LOO-CV is actually cheap to compute.
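For anyone following along, here is a naive sketch of the identity itself (not the book's O(n) algorithm): it rebuilds the smoother matrix with n extra fits, so it is only useful for checking correctness, and loocv_score is a name I made up:

using SmoothingSplines, LinearAlgebra, Statistics

# The smoothing spline fit is linear in y, so ŷ = A(λ)*y; fitting each standard
# basis vector recovers one column of the smoother matrix A(λ).
function loocv_score(x, y, λ)
    n = length(x)
    A = zeros(n, n)
    for j in 1:n
        e = zeros(n)
        e[j] = 1.0
        A[:, j] = predict(fit(SmoothingSpline, x, e, λ), x)
    end
    ŷ = A * y
    # LOO-CV identity: CV(λ) = (1/n) * Σᵢ ((yᵢ - ŷᵢ) / (1 - Aᵢᵢ))²
    return mean(((y .- ŷ) ./ (1 .- diag(A))).^2)
end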

I can't promise when, but I will implement this; I'll have to modify the model-building code itself.