charmlab/mace

Small issues with ordinal attributes & forests

vidalt opened this issue · 4 comments

Hello Amir, thanks for the amazing work!
Installing the package went like a breeze, yet I am facing a few errors when running some of the suggested examples:

  • Running 'python batchTest.py -d credit -m mlp -n one_norm -a MACE_eps_1e-3 -b 0 -s 1'
    In this case, I got an assertion error due to insufficient accuracy. This does not seem like a big issue; perhaps it was just bad luck during training.

  • Running 'python batchTest.py -d credit -m forest -n one_norm -a MACE_eps_1e-3 -b 0 -s 1'
    In this case, I get the following error: "ValueError: Number of features of the model must match the input. Model n_features is 20 and input n_features is 14". After some analysis, I have the impression that this error occurs due to two "ordinal" attributes in the dataset that seem to trigger a one-hot encoding, and later cause inconsistencies between different data structures.

  • Running 'python batchTest.py -d credit -m forest -n one_norm -a FT -b 0 -s 1'
    This seems to lead to an error similar to the one in the previous test.
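For reference, a toy sketch (not MACE's actual code) of how one-hot encoding a categorical attribute widens the feature vector and produces exactly this kind of width mismatch between the trained model and a raw input row:

```python
# Toy illustration: one-hot encoding a categorical attribute widens the
# feature vector, so a model trained on the encoded matrix rejects raw
# rows of the original width.
def one_hot(value, categories):
    """Expand one categorical value into len(categories) binary features."""
    return [1 if value == c else 0 for c in categories]

categories = ["none", "highschool", "graduate", "phd"]
raw_row = [35, "graduate", 1200]                                  # 3 raw features
encoded_row = [raw_row[0]] + one_hot(raw_row[1], categories) + [raw_row[2]]
print(len(raw_row), len(encoded_row))                             # 3 vs 6
```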

Am I using the code correctly, and is there a way to fix these issues?
Thanks a lot!
--Thibaut

Dear Vidal,

Thank you for raising these issues. It seems that you are testing the code with the built-in model classes and datasets, and upon investigation I was (fortunately/unfortunately) able to reproduce the bug. It has now been fixed, and a hot-fix has been pushed to master that should resolve issues 2 and 3. Could you please try running the commands again?

For issue 1: indeed, we include an arbitrary check that the underlying model is somewhat useful (i.e., above 70% training accuracy). You can disable this in loadModel.py, where you will see assert accuracy_score(y_train, model_trained.predict(X_train)) > 0.70.
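A minimal sketch of that accuracy gate, assuming you want to relax rather than delete it; the constant and helper name below are illustrative, not MACE's exact code:

```python
# Hypothetical sketch of the accuracy check in loadModel.py.
ACCURACY_THRESHOLD = 0.70  # lower this (or set to 0.0) to accept weaker models

def passes_accuracy_gate(y_true, y_pred, threshold=ACCURACY_THRESHOLD):
    # Equivalent to sklearn's accuracy_score(y_true, y_pred) > threshold.
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true) > threshold

# e.g. 3 of 4 predictions correct -> 0.75 accuracy, which passes the 0.70 gate
print(passes_accuracy_gate([1, 0, 1, 1], [1, 0, 0, 1]))  # True
```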

Please feel free to close this issue if your problem has been resolved.

Best,
Amir

Hello Amir,
Thanks a lot, that fixed the problem perfectly! I will therefore close this issue.
One last quick question: you mentioned that I used "built-in model classes and datasets". I searched quite a bit for how to feed an external dataset to the algorithm, but noticed that some configurations (actionability, bounds, data types) are defined directly in the loading functions associated with each dataset. To include a new dataset, should I create similar loading functions for each case, or is there a default behavior planned in the code?
Thanks again,
--Thibaut

Thanks for the follow-up. Because there is no general way to automatically recognize the semantics of each variable, we delegate this to the user. Thus, to add a new dataset, please update loadData.py and create a corresponding file under _data_main/, as you suggested.
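As a rough starting point, a hypothetical skeleton for such a loader file; the function name and the attribute-metadata keys ("type", "actionable", "bounds") are illustrative placeholders, not MACE's exact schema, so please mirror the structure of an existing file under _data_main/ instead:

```python
# Hypothetical skeleton for a new dataset loader (e.g. under _data_main/).
import csv
import io

def load_my_dataset(csv_text):
    """Parse rows and attach the per-attribute semantics that MACE
    cannot infer automatically (data type, actionability, bounds)."""
    attributes = {
        "age":    {"type": "numeric", "actionable": False, "bounds": (18, 90)},
        "income": {"type": "numeric", "actionable": True,  "bounds": (0, 1e6)},
        "grade":  {"type": "ordinal", "actionable": True,  "bounds": (1, 5)},
    }
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return rows, attributes

rows, attrs = load_my_dataset("age,income,grade\n42,50000,3\n30,20000,5\n")
print(len(rows), attrs["grade"]["type"])  # 2 ordinal
```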

Hope this helps!

Sounds good, thanks a lot!