ModelEvaluatorBuilder restriction on input fields count
The build method of the ModelEvaluatorBuilder class invokes a checkSchema routine that throws an exception when the input field count exceeds 1000:
```java
if((inputFields.size() + groupFields.size()) > 1000){
	throw new InvalidElementException("Model has too many input fields", miningSchema);
}
```
What is the reason for this check? What could be wrong about models with more inputs, and why is the threshold hardcoded to 1000?
I know of lots of models with thousands of input fields that work great.
Could this condition at least be made configurable, or removed completely?
Closing as exact duplicate of #44 and #95
What is the reason for this check?
If you let people do stupid things, they will do stupid things.
What could be wrong about models with more inputs, and why is the threshold hardcoded to 1000?
That's a clear indication of a poorly encoded model.
For example, suppose you're working with a dataset that contains a categorical feature with 1000 category levels. Some ML frameworks (such as Scikit-Learn and Apache Spark) require you to one-hot-encode this feature, thereby expanding this single categorical feature into 1000 binary features.
A good PMML converter would undo the one-hot-encoding, and give you a PMML document that specifies 1 input field in its schema. In contrast, a bad PMML converter cannot or will not undo it, and gives you a PMML document that specifies 1000 input fields instead.
When you try to evaluate this poorly encoded PMML document, you'll experience a 100-1000x performance hit. And then you come back here and raise a bug report complaining that the JPMML-Evaluator library is not good enough.
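To make the difference concrete, here is a minimal sketch (the field name "color" and its category levels are hypothetical) contrasting the arguments a caller would have to assemble for a properly encoded schema versus a one-hot expanded one:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class EncodingComparison {

	public static void main(String[] args){
		// Properly encoded model: a single categorical input field
		Map<String, Object> goodArguments = new LinkedHashMap<>();
		goodArguments.put("color", "red");

		// Poorly encoded model: one binary input field per category level
		Map<String, Object> badArguments = new LinkedHashMap<>();
		badArguments.put("color=red", 1);
		badArguments.put("color=blue", 0);
		// ... and so on, one entry for each remaining category level

		System.out.println("Properly encoded schema: " + goodArguments.size() + " input field");
		System.out.println("One-hot expanded schema: up to 1000 input fields");
	}
}
```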
I know of lots of models with thousands of input fields that work great.
I bet these field counts could be reduced significantly by (re-)encoding these models properly.
Could this condition at least be made configurable, or removed completely?
Go ahead and disable this check in your own application code.
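One possible way to do that, sketched below, is to subclass the builder and neutralize the check. This assumes checkSchema is a protected, overridable instance method of ModelEvaluatorBuilder (as the issue description suggests); the constructor and method signatures should be verified against the JPMML-Evaluator version in use, and the subclass name is hypothetical.

```java
import org.dmg.pmml.PMML;
import org.jpmml.evaluator.ModelEvaluator;
import org.jpmml.evaluator.ModelEvaluatorBuilder;

/**
 * Hypothetical builder subclass that skips the input field count check.
 * Verify the exact signatures against your JPMML-Evaluator version.
 */
public class UncheckedModelEvaluatorBuilder extends ModelEvaluatorBuilder {

	public UncheckedModelEvaluatorBuilder(PMML pmml){
		super(pmml);
	}

	@Override
	protected void checkSchema(ModelEvaluator<?> modelEvaluator){
		// Deliberately a no-op: the "Model has too many input fields" check is not enforced
	}
}
```

Usage would then look roughly like `ModelEvaluator<?> evaluator = new UncheckedModelEvaluatorBuilder(pmml).build();`, keeping the override confined to your own application code.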