jasp-stats/jasp-issues

[Bug]: Logistic/Multinomial Regression classification table does not load

vladimirsim opened this issue · 14 comments

JASP Version

0.19.1

Commit ID

84d54b934fa27731bb9eec44a4aa5f7ab0744dfd

JASP Module

Machine Learning

What analysis are you seeing the problem on?

Classification>Logistic/Multinomial

What OS are you seeing the problem on?

Windows 11

Bug Description

It is me again, sorry! :) I am using the latest Nightly build from yesterday evening.
I am trying to train a Logistic/Multinomial Regression classification model, but I get a 'factor xxx has new levels' error. I got this error for several different factors. What is very strange is that when I add factors one by one, the problematic factor might load and the logistic regression classification will work; however, when I then add the next one or two variables (one by one or together), the error appears for a previously loaded factor.

Expected Behaviour

After the factor (variable) is added to the 'Features' box, the Logistic/Multinomial Regression classification table should load.

Steps to Reproduce

  1. Train a model for logistic regression classification.
  2. Once you start adding factors in the model, the error appears.

...

Log (if any)

Logs:
JASP 2024-11-28 17_03_34 Engine 0.zip

More Debug Information

Screenshot:
[attached image]

Final Checklist

  • I have included a screenshot showcasing the issue, if possible.
  • I have included a JASP file (zipped) or data file that causes the crash/bug, if applicable.
  • I have accurately described the bug, and steps to reproduce it.

Welcome back ;) It's good that you are testing this very thoroughly. Unfortunately, I cannot reproduce this with another dataset... I did find this discussion of the issue: https://stackoverflow.com/questions/22315394/factor-has-new-levels-error-for-variable-im-not-using, and I think it is caused by the test or training set having variables with different factor levels. What if you fix the seed (the default is random) in the advanced options, so that you get the same training set every time, and then check whether you can get it to work consistently?
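The intermittent behaviour described above is consistent with a random train/test split: a level that occurs only once in the data has roughly a test-set-fraction chance of missing the training set on any given split. A small Python sketch (illustrative only, not JASP code; the sizes and fractions are arbitrary) of that probability:

```python
import random

def singleton_misses_training(n_rows=100, train_frac=0.8, trials=10_000, seed=0):
    """Estimate how often the one row carrying a singleton factor level
    falls outside a random training set of size n_rows * train_frac."""
    rng = random.Random(seed)
    n_train = int(n_rows * train_frac)
    misses = sum(
        0 not in rng.sample(range(n_rows), n_train)  # row 0 holds the singleton level
        for _ in range(trials)
    )
    return misses / trials

print(singleton_misses_training())  # roughly 0.20, the test-set fraction
```

With a random seed, each re-split rolls these dice again, which would explain why the error comes and goes as variables are added.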

I tried fixing the seed, but it didn't work: the same 'factor xxx has new level z' error appears. However, I discovered an interesting pattern. The error appears for probably 6 or 7 factors, and for all but one of them it appears for a level that has only one instance. For example, I get 'factor KnowingCode has new level 7', and when I check the training set it turns out that only one loan has a KnowingCode value of 7. Another example: 'factor Education has new level Master'; it turns out that only one loan has Education equal to Master in the training set. The same holds for 4-5 other factors. There is only one factor that causes this error while having more than one instance for each of its levels.

Do you think you can make an example dataset with which I can reproduce this error? It is very difficult to debug this in my head without a concrete example against which I can also test potential solutions.

OK, I will provide a dataset that will replicate the problem, but I would like to send it via email for safety reasons. Can you please provide an email address?

@tomtomme can you give me access to this dataset then?

I do not have access to that mail account. @EJWagenmakers?

I don't have access to it either

OK, the problem is that nominal variables such as LoanReason, Education, ApprovedCycle, and KnowingCode have some levels that occur only once or twice in the dataset (for example, there is only one instance of Education: "Master"). When these rows appear in the test (or validation) set, the trained model does not know how to make predictions for them, since their levels did not appear in the training set. I'm not sure why other algorithms allow this, but they probably shouldn't. Best practice would be to ensure that these rows appear in the training set (for instance, using the test set indicator) and that they do not appear in the test set.
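The best practice described above can be sketched in Python (illustrative only; the Education levels are taken from the issue, the function name and splitting logic are hypothetical and not JASP's actual implementation): force the first occurrence of every level into the training set, then split the remaining rows at random.

```python
import random

def make_test_indicator(levels, test_frac=0.2, seed=1):
    """Return a 0/1 list (1 = test row, 0 = training row) in which the
    first occurrence of every factor level is always a training row."""
    rng = random.Random(seed)
    seen = set()
    indicator = []
    for lev in levels:
        if lev not in seen:
            seen.add(lev)
            indicator.append(0)  # first occurrence -> force into training
        else:
            indicator.append(1 if rng.random() < test_frac else 0)
    return indicator

education = ["Bachelor", "PhD", "Master", "Bachelor", "PhD", "Bachelor"]
ind = make_test_indicator(education)
train_levels = {lev for lev, i in zip(education, ind) if i == 0}
assert train_levels == set(education)  # every level is represented in training
```

The single "Master" row is guaranteed to stay in the training set, so the fitted model has a coefficient for that level when predictions are made.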

I have implemented an error message that suggests this in jasp-stats/jaspMachineLearning#394. See below (this is your emailed dataset with set seed: 1).

[attached screenshot]

Unfortunately, I disagree! Please take a second look at my email message. The problem I report appears in the training set, while trying to train a model, and specifically when training the Logistic/Multinomial regression classifier. What you describe in your solution is a valid problem, of course, and it needs to be fixed, but it is a different problem from the one I reported.

I know, but part of the model-fitting procedure in JASP is making predictions for the test set. That is where the error comes from. I am 100% sure of this because the error comes from predict.glm(), which is not called until after the model is trained and predictions are made for the test set using the trained model. You can try this yourself by making a test set indicator that ensures all rows with unique (or few) factor levels are in the training set (test set indicator 0). Let me know what you think. The error should not occur then.
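To see mechanically why the error surfaces at prediction time rather than during fitting, here is a toy Python mock (illustrative only, not R's predict.glm): the fitted model only has parameters for the levels seen in training, so an unseen level has nothing to map to.

```python
# Toy mock (not R's predict.glm): a fitted model only knows the factor
# levels that were present in its training data.
def fit(levels):
    # level -> coefficient index, for levels present in the training data
    return {lev: i for i, lev in enumerate(sorted(set(levels)))}

def predict(model, levels):
    try:
        return [model[lev] for lev in levels]
    except KeyError as err:
        raise ValueError(f"factor has new level {err.args[0]}") from None

model = fit(["Bachelor", "PhD", "Bachelor"])   # "Master" never seen
print(predict(model, ["PhD"]))                 # fitting and this call succeed
# predict(model, ["Master"]) raises: factor has new level Master
```

Fitting succeeds no matter what; the failure only occurs once predict() is handed a row whose level was absent from training, which matches the predict.glm() trace described above.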

But I used the same training set (with unique levels for some factors) to train models using KNN, Neural Networks, Random Forest, Naive Bayes, and so on, and never got any error with them, neither while training nor while predicting. I had made sure that the entry with the unique level was in the set used to train the model, though I don't know whether it ended up in the training, validation, or test part of that set. I had not fixed the seed while training, and I trained on the same dataset many times with each algorithm. If your logic is correct, shouldn't I have gotten the same error with these algorithms?

No, I tried this as well with KNN: other algorithms apparently do allow this (hence my remark about other algorithms two comments ago), which I don't think they should.

You can actually make the multinomial/logistic algorithm work (using your emailed dataset) with this test set indicator:
TrainingSampleForJASPDevelopers_testSetIndicator.xlsx. Just paste it into the Excel file that you emailed. By selecting that indicator under 'Data Split Preferences' --> 'Test set indicator' --> 'testSet', I ensured that every level of each factor is represented in the training set.