BlasBenito/spatialRF

Prediction function for spatial models?

Closed this issue · 6 comments

Hi,

First of all, I like the package you are building here very much. I found a few small bugs in the functions "rf_interactions" and "rf_spatial", which I have fixed locally for the time being so that these functions work with my data. If you want, I can send you my changes.

But one more question. Do you have a prediction function for the spatial model? This would be very important for my purposes.

best regards Frank

Dear Frank, thanks a lot for testing the package!
Please let me know what changes you had to make to get these functions working, and I will implement them ASAP.

About the prediction, it is the generic predict() function, the same one you would use with ranger(), which is what runs under the hood in spatialRF:

predicted <- stats::predict(
  object = your_model,
  data = your_data,
  type = "response"
)$predictions

Cheers,
Blas

Also, I forgot to tell you that the models have a "predictions" slot with the values of the predictions for each record in the training data.
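As a minimal sketch of the two points above, here is the same pattern with a plain ranger model. The mtcars data, the train/new split and the formula are assumptions chosen only for illustration, and a model fitted with spatialRF may organize its "predictions" slot differently. In a plain ranger model, the "predictions" element holds the out-of-bag predictions for the training records.

```r
library(ranger)

set.seed(1)

# Illustrative split: first 24 rows for training, the rest as "new" data
train <- mtcars[1:24, ]
new_data <- mtcars[25:32, ]

# A plain (non-spatial) ranger regression model
model <- ranger(mpg ~ wt + hp + disp, data = train, num.trees = 100)

# Out-of-bag predictions for the training records, stored in the model
head(model$predictions)

# Predictions for new records; new_data must contain every predictor
# used to fit the model (wt, hp and disp here)
predicted <- predict(
  object = model,
  data = new_data,
  type = "response"
)$predictions

predicted
```

Note that predict() only works if the new data contains all the predictors the model was trained on.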

Hi Blas,

Thank you for your answer. Sorry for the bad English. I'm not that good at languages, so I get a little help from Google Translate.

I noticed several points.

  1. Errors:
    If you do not set the n.cores parameter in the functions rf_tuning, rf_repeat, rf_spatial, rf_evaluate and rf_interactions, the following error message appears: Error in file(con, "w") : all connections are in use. I have not tested whether the error also occurs with other functions.

    The errors or changes in the functions rf_interactions and rf_spatial are as follows:

    • rf_interactions (lines 111 - 118): add "stringsAsFactors = F" inside as.data.frame()
    • rf_interactions (lines 258 - 264): add "stringsAsFactors = F" inside data.frame()
    • rf_interactions (lines 271 - 272): merge message() and stop() into a single call, stop("message")
    • rf_interactions (lines 284 - 285): merge message() and stop() into a single call, stop("message")
    • rf_spatial (lines 401 and 438): the output must be converted into a character vector (example: spatial.predictors.selected <- as.character(spatial.predictors.selection$best.spatial.predictors))
  2. Suggestion for improvement:
    What is not so nice is that the various functions always initialize a new cluster when they are called in parallel mode. It would be better to initialize one cluster globally at the beginning and have all the functions use it. The prerequisite for this is that the package automatically recognizes whether a local cluster is already present or not. That would save a lot of time.

  3. Question of understanding:
    Do I understand correctly that the rf function in your package is a normal ranger rf model, for which you represent the spatial autocorrelation of the residuals as a function of distance in the form of a plot? Is the goal of your package to reduce this autocorrelation?

  4. Prediction:
    No, that's not what I meant. I wanted to know whether you are working on a prediction function for the spatial model, rather than the non-spatial model, or whether one already exists. The difficulty lies in the spatial predictors used by the spatial model. These currently only exist within the spatial model and are therefore missing at prediction time (error message from the prediction: Error in predict.ranger.forest(forest, data, predict.all, num.trees, type: Error: One or more independent variables not found in data.).

  5. General:
    I'm currently working on a similar spatial approach myself using rf models, which works a little differently. I use the caret package as the basic framework, as it offers a very large selection of different models and different approaches to capturing the spatial structure. If you are interested, I can go into more detail later. Here are a few papers that have served me as a basis, among other things:

    Basically, everything works for me (including the spatial prediction with spatial predictors) but it currently only exists locally and is not yet as nicely structured as yours, including the possibilities with the plots and analysis options.

    What I am wondering now is whether you would be interested in combining the different ideas into one package based on yours. If not, I'm thinking about forking your package and working on it myself, but of course it's faster and more fun together.
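The suggestion in point 2 could be sketched roughly as follows, using only the base "parallel" package. The helper run_parallel() and its arguments are hypothetical and are not part of spatialRF: the function reuses a cluster when one is passed in, and otherwise creates a temporary one that it stops on exit.

```r
library(parallel)

# Hypothetical helper (not a spatialRF function): reuse an existing
# cluster if one is supplied, otherwise create and clean up a local one
run_parallel <- function(x, fun, cluster = NULL, n.cores = 2) {
  if (is.null(cluster)) {
    cluster <- makeCluster(n.cores)
    # Only stop the cluster if this function created it
    on.exit(stopCluster(cluster), add = TRUE)
  }
  parLapply(cluster, x, fun)
}

# One cluster initialized at the beginning and shared by several calls
cl <- makeCluster(2)
a <- run_parallel(1:4, function(i) i^2, cluster = cl)
b <- run_parallel(1:4, sqrt, cluster = cl)
stopCluster(cl)

# Without a cluster argument, the helper falls back to its own cluster
d <- run_parallel(1:4, function(i) i + 1)
```

Stopping a cluster also closes the connections to its workers, so a function that creates its own cluster should always stop it before returning.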

best regards
Frank

Dear Frank, I appreciate your feedback a lot, thank you, really. And please do not worry about your English at all. I will answer your questions as best I can.

  1. I don't seem to be able to replicate the issue you found with n.cores, and I have never seen the message "Error in file(con, "w") : all connections are in use" before. What operating system are you using? If it is Windows, I will check that as soon as I can, because I don't currently have access to a Windows system. I have fixed all these little things with the messages and stringsAsFactors. I didn't notice the latter because I am working with R version 4, where stringsAsFactors defaults to FALSE.

  2. You are totally right about that: the functions that select spatial predictors do initiate a cluster on each selection run. That's because they were designed as stand-alone functions at the beginning of the package development, and I never had the time to change that. I have it on my to-do list now, but I don't have a lot of bandwidth for development at the moment, so I am not sure about the timeline for such a change.

  3. Yes, rf() is just, as its help file says, "A convenient wrapper for ranger that completes its output by providing the Moran's I of the residuals for different distance thresholds...". The goal of the package is to help understand the influence of the spatial organization of the data on the response variable (that's why the importance scores of the spatial predictors are reported) or, in other words, to provide honest importance scores for the predictors once the model takes the spatial structure of the data into account. Adding spatial predictors to the model also helps reduce spatial autocorrelation, so the functions that select the spatial predictors have that in mind, and try to minimize the spatial autocorrelation of the residuals while choosing which spatial predictors to add to the model.

  4. The spatial predictors are generated from a PCA of the distance matrix of the records used in the model. The values of this PCA do not exist outside that distance matrix: they are only defined for the training data, and are therefore not available as the raster files you would need to make a spatial prediction from the model. There are workarounds that I am not planning to implement in my package (since I focus on explanatory rather than predictive models), such as this one by Hengl et al.: https://github.com/thengl/GeoMLA. Another one is to compute the spatial predictors on the distance matrix of the raster cells, convert the results to raster files, extract the values of these rasters at the training records, and pass the resulting columns to select_spatial_predictors_sequential to select the spatial predictors. You can then train a model on the non-spatial and spatial predictors, and later predict it onto the rasters of both types of predictors. This has a limitation, though: distance matrices grow very quickly in memory, so that approach is constrained by the available RAM.

  5. It is nice that you are working on a similar approach! However, as I said before, I don't have a lot of bandwidth for package development right now, as other things at work are requiring most of my attention. Furthermore, spatialRF is reaching its release form, and once a few little things are fixed (including the one about the cluster you mentioned), I will submit it to CRAN and work on the paper. In any case, if you think that the package structure and code help you in any way, please, feel free to fork it and work on an alternative version.
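Point 4 above can be illustrated with a minimal base R sketch. It uses stats::prcomp() as a stand-in for the PCA step, with made-up random coordinates and an arbitrary choice of five components; these are assumptions for illustration, and the implementation inside spatialRF may differ. The result has exactly one row per training record, which is why these columns are not available for other locations.

```r
set.seed(1)

# Made-up coordinates of 30 training records
xy <- data.frame(x = runif(30), y = runif(30))

# Pairwise distance matrix among the training records (30 x 30)
distance.matrix <- as.matrix(dist(xy))

# PCA of the distance matrix; each component is a candidate spatial predictor
pca <- prcomp(distance.matrix, scale. = TRUE)

# Keep the first five components (arbitrary number, for illustration only)
spatial.predictors <- as.data.frame(pca$x[, 1:5])
colnames(spatial.predictors) <- paste0("spatial_predictor_", 1:5)

# One row of scores per training record, and nothing for any other location
dim(spatial.predictors)

# Memory of a dense n x n matrix of doubles grows with n^2:
# for example, 50,000 raster cells need about 20 GB
50000^2 * 8 / 1e9
```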

I hope I answered all your questions, I really appreciate the time you have put into those. I will keep the issue open for now.

Best wishes,

Blas

Hi,

I don't have much time right now, but you forgot to change line 438 in the rf_spatial function. ;-)

Hi Blas,

Just a follow-up question regarding prediction with the spatial model. I am considering splitting the data into a training set and a test set, and then fitting the model using the training data and the distance matrix derived from the observations in the training data. For testing, would it make sense to create a distance matrix for the test data, add it as columns to the test data, and use it to evaluate the predictions generated by the fitted model?