jasp-stats/jasp-issues

[Bug]: The prediction module does not load the prediction table

vladimirsim opened this issue · 8 comments

JASP Version

0.19.1

Commit ID

84d54b934fa27731bb9eec44a4aa5f7ab0744dfd

JASP Module

Machine Learning

What analysis are you seeing the problem on?

Machine Learning > Prediction

What OS are you seeing the problem on?

Windows 11

Bug Description

This bug report is probably related to bug report #2978, which I submitted 3 weeks ago and which is now closed.
I am using JASP 0.19.2.0, which is a nightly build.
I open a training dataset (CSV file) and train a Random Forest model. Then I save the trained model. Then I open a test dataset which has exactly the same format as the training dataset (I know this for sure because both were part of the same worksheet, which I split into a training and a test dataset). Then I load the trained model and try to get predictions for the test dataset, but the Prediction table does not load. The error says that a predictor in the test dataset has a different format than in the training dataset. But this is simply not true.

image

Expected Behaviour

The Prediction table should load.

Steps to Reproduce

  1. Open JASP and load the IndivLoansTrainingSample.csv file (cannot attach it because it is confidential)

  2. Review the uploaded data file. JASP automatically assigns a type to each variable. It assigns Ordinal type to some variables, but it seems that Ordinal is not acceptable in the Random forest model, is it? I think this is a bug.

  3. Optional: Change the type of some variables. For example, from Ordinal to Nominal or from Nominal to Scale. If JASP erroneously considers a Scale variable as Nominal, it will have a huge effect on the Random Forest model, will it not?

  4. Open the Machine Learning module and train a Random forest model. (I am attaching the trained model)

  5. Open another instance of JASP and load the IndivLoansTestSample.csv file (cannot attach it because it is confidential)

  6. Review the uploaded data. Make sure that all variables in the Test dataset are exactly the same format as in the Training dataset. Change data types if needed.

  7. Load the Machine Learning>Prediction>Prediction module

  8. Load the trained model

  9. Build the prediction table by picking the right predictors from the trained model. When you add all of the required predictors from the trained model (not all variables in the dataset were used to train the model), you will get a message that a variable is in a different format: stop('Type of predictors in new data do not match that of the training data.')

  10. The bug is still there even if I do not alter the type of any variable in the Training and Test sets. Even if I pick only a few of the variables which were correctly recognized by JASP, I still get the same error. This is absurd, because the predictors' types were automatically recognized by JASP and they were the same in both sets. I checked this many times!

  11. I have a suggestion: in the Machine Learning>Prediction>Prediction module, when I load the trained model, the prediction table tells me which predictors it expects me to load. Why don't you also show the variable type which JASP expects for each predictor? For example, Length (Scale), TypeOfBondage (Nominal), etc. This would save a lot of frustration!

  12. Also, if Ordinal variables are not acceptable in the Random Forest algorithm, why don't you prevent their usage?
    ...

Log (if any)

-------- Application Info --------
JASP Version: JASP 0.19.2
Build Branch: HEAD
Build Date: Nov 26 2024 18:09:03 (Netherlands)
Last Commit: 84d54b934fa27731bb9eec44a4aa5f7ab0744dfd

-------- Basic Info --------
Operating System: Windows 11 Version 23H2
Product Version: 11
Kernel Type: winnt
Kernel Version: 10.0.22631
Architecture: x86_64
Install Path: D:/Program Files/JASP
Platfotm Name: windows
System Local: bg_BG

-------- Extra Info --------
Current code page
Active code page: 437
Active code page: 65001

Host Name: SHOSHOCI
OS Name: Microsoft Windows 11 Pro
OS Version: 10.0.22631 N/A Build 22631
OS Manufacturer: Microsoft Corporation
OS Configuration: Standalone Workstation
OS Build Type: Multiprocessor Free
Registered Owner: 359898893538
Registered Organization:
Product ID: 00330-52813-47920-AAOEM
Original Install Date: 31.1.2023 г., 12:22:04
System Boot Time: 27.11.2024 г., 9:44:36
System Manufacturer: LENOVO
System Model: 82LM
System Type: x64-based PC
Processor(s): 1 Processor(s) Installed.
[01]: AMD64 Family 23 Model 104 Stepping 1 AuthenticAMD ~2100 Mhz
BIOS Version: LENOVO G5CN64WW(V2.10), 6.10.2022 г.
Windows Directory: C:\Windows
System Directory: C:\Windows\system32
Boot Device: \Device\HarddiskVolume1
System Locale: en-us;English (United States)
Input Locale: en-us;English (United States)
Time Zone: (UTC+02:00) Helsinki, Kyiv, Riga, Sofia, Tallinn, Vilnius
Total Physical Memory: 15 706 MB
Available Physical Memory: 8 855 MB
Virtual Memory: Max Size: 16 730 MB
Virtual Memory: Available: 7 726 MB
Virtual Memory: In Use: 9 004 MB
Page File Location(s): C:\pagefile.sys
Domain: WORKGROUP
Logon Server: \SHOSHOCI
Hotfix(s): 5 Hotfix(s) Installed.
[01]: KB5045935
[02]: KB5012170
[03]: KB5027397
[04]: KB5046633
[05]: KB5044620
Network Card(s): 2 NIC(s) Installed.
[01]: Realtek 8822CE Wireless LAN 802.11ac PCI-E NIC
Connection Name: Wi-Fi
Status: Media disconnected
[02]: Realtek USB GbE Family Controller
Connection Name: Ethernet
DHCP Enabled: Yes
DHCP Server: 192.168.1.1
IP address(es)
[01]: 192.168.1.14
[02]: fe80::f73:9fc3:5374:b2df
[03]: fda9:de81:d862:0:bdaa:acda:e64e:528a
[04]: fda9:de81:d862:0:d3ed:28cb:8c0e:2133
Hyper-V Requirements: A hypervisor has been detected. Features required for Hyper-V will not be displayed.

JASP 2024-11-27 14_21_05 Desktop.log
JASP 2024-11-27 14_21_05 Engine 1.log

More Debug Information

This is the error message which I get when the Prediction table fails to load:

This analysis terminated unexpectedly.

Error in randomForest:::predict.randomForest(model, newdata = dataset): Type of predictors in new data do not match that of the training data.

Stack trace
analysis(jaspResults = jaspResults, dataset = dataset, options = options)

.mlPredictionsTable(model, dataset, options, jaspResults, ready, position = 2)

.mlPredictionsState(model, dataset, options, jaspResults, ready)

createJaspState(.mlPredictionGetPredictions(model, dataset))

jaspStateR$new(object = object, dependencies = dependencies)

initialize(...)

.mlPredictionGetPredictions(model, dataset)

.mlPredictionGetPredictions.randomForest(model, dataset)

randomForest:::predict.randomForest(model, newdata = dataset)

stop('Type of predictors in new data do not match that of the training data.')

To receive assistance with this problem, please report the message above at: https://jasp-stats.org/bug-reports

Final Checklist

  • I have included a screenshot showcasing the issue, if possible.
  • I have included a JASP file (zipped) or data file that causes the crash/bug, if applicable.
  • I have accurately described the bug, and steps to reproduce it.

And here is the trained model. Sorry, I forgot to attach it to my original post!
27112024RFIndivLoansFincaJor.zip

I'm afraid that without the dataset we cannot do a deep dive into this problem. Can you change the values in the data so that they are unrecognisable and then attach it?

I do see that the error comes from randomForest:::predict.randomForest, particularly from the line

if (!all(object$forest$ncat == cat.new)) 
      stop("Type of predictors in new data do not match that of the training data.")

which checks whether the number of categories of each nominal predictor variable in the training set (object$forest$ncat) equals the number of categories of the same variable in the prediction data (cat.new).
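
To make the comparison concrete, here is a minimal base-R sketch of that kind of check. It is not the randomForest source: the helper name `count_categories` and the toy data are assumptions made purely for illustration.

```r
# Hypothetical helper (not part of randomForest): number of levels for each
# unordered factor column, and 1 for every other column.
count_categories <- function(df) {
  sapply(df, function(col) {
    if (is.factor(col) && !is.ordered(col)) nlevels(col) else 1L
  })
}

# Toy training data: 'city' has three levels (A, B, C).
train <- data.frame(
  amount = c(100, 250, 300, 120),
  city   = factor(c("A", "B", "C", "B"))
)

# Toy prediction data: 'city' only contains B and C, so it has two levels.
new_data <- data.frame(
  amount = c(180, 90),
  city   = factor(c("B", "C"))
)

ncat_train <- count_categories(train)     # stands in for object$forest$ncat
cat_new    <- count_categories(new_data)  # stands in for cat.new

# The counts differ for 'city' (3 vs 2), which is the situation that
# triggers the "Type of predictors ..." error.
counts_match <- all(ncat_train == cat_new)
print(counts_match)
```

In this sketch the variable has the same type in both data sets; only the number of levels differs, which is enough for the comparison to fail.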

Do you have categories of the nominal variables in the training data that do not occur in the prediction data, or vice versa?

> Do you have categories of the nominal variables in the training data that do not occur in the prediction data, or vice versa?

Yes, I do. The training set is 'broader' than the prediction set. I have checked it variable by variable, and made sure that all categories in the prediction set already appear in the training set. For example, for the nominal variable 'city', in the training set I might have loans from cities A, B and C, while in the prediction set I might have loans only from cities B and C. But I don't expect that this is a problem. I suppose a problem would arise if it was the other way around - if I wanted to get a prediction for a category on which the model was not trained, right?
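
A variable-by-variable check like this can be sketched in base R. The helper name `compare_categories` and the toy values are hypothetical and only mirror the 'city' example.

```r
# Hypothetical helper: for one nominal variable, list the categories that
# appear in only one of the two data sets.
compare_categories <- function(train_values, new_values) {
  train_cats <- unique(as.character(train_values))
  new_cats   <- unique(as.character(new_values))
  list(
    only_in_training   = setdiff(train_cats, new_cats),
    only_in_prediction = setdiff(new_cats, train_cats)
  )
}

train_city <- c("A", "B", "C", "B")  # cities seen during training
new_city   <- c("B", "C")            # cities in the prediction set

diff_city <- compare_categories(train_city, new_city)
print(diff_city)
```

An empty `only_in_prediction` means the model has seen every category it is asked to predict for, while a non-empty `only_in_training` marks the variables whose level counts differ between the two sets.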

I wouldn’t expect this to be a problem either, but it is in the randomForest code ;) Could you verify whether this is the problem by adding some rows with those missing levels of the nominal variables to the prediction data set?

You were right, @koenderks! In the prediction set, I removed 2 nominal variables which had fewer categories than in the training set, and the prediction table loaded. The prediction module worked well and I was able to generate and export predictions to a csv file. But I still believe that this is a bug.
I came upon another failure while working with the prediction set: when I checked the Explain predictions checkbox, I got the following error: no applicable method for 'predict' applied to an object of class 'randomForest'. Please see the screenshot.

image

I do not understand what this error is due to. I am attaching the log files, too.

JASP Log files vladimirsim.zip

I think we resolved the last bug that you reported in jasp-stats/jaspMachineLearning#393.

But I agree, the check in randomForest for equal levels seems like overkill to me. I guess we could fix it by manually assigning the factors in the prediction data the same levels as those in the training data, even if some of those levels do not occur in the prediction data. @vandenman Do you see any problems with this?
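
A minimal sketch of that idea in base R, assuming the training levels are available as a named list. `align_factor_levels` and `train_levels` are hypothetical names for illustration, not existing JASP or randomForest functions.

```r
# Hypothetical helper: give every factor column in `new_data` the levels
# recorded from the training data. `train_levels` is a named list mapping
# column names to character vectors of levels (an assumed storage format).
align_factor_levels <- function(new_data, train_levels) {
  for (name in intersect(names(train_levels), names(new_data))) {
    # Values absent from the training levels become NA, so unseen
    # categories stay visible instead of being silently recoded.
    new_data[[name]] <- factor(as.character(new_data[[name]]),
                               levels = train_levels[[name]])
  }
  new_data
}

train_levels <- list(city = c("A", "B", "C"))
new_data <- data.frame(amount = c(180, 90), city = factor(c("B", "C")))

aligned <- align_factor_levels(new_data, train_levels)
print(levels(aligned$city))        # levels now follow the training data
print(as.character(aligned$city))  # the observed values are unchanged
```

After alignment the factor carries all three training levels although only two occur, so a level-count comparison like the one in randomForest would see matching counts.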

Here is a reproducible example:

Random forest regression model trained on the iris dataset to predict Sepal.Length based on Sepal.Width, Petal.Length, Petal.Width and Species.
model.jaspML.zip

Prediction dataset where Species has only 2 factor levels:
prediction_data.csv

image

This pull request should fix the issue! I confirmed that the explain predictions table also works in the latest version of JASP (0.19.2, coming out soon).

image