jpmml/jpmml-evaluator

Not happy with XGBoost evaluation performance

paranjapeved15 opened this issue · 3 comments

Our Setup
We are running inference on a few thousand samples at a time in a Kotlin microservice with JPMML.
There are about 7-8 input features of numeric type.
Target variable is a probability value.
Multi-threading: we have multiple threads, each calling ModelEvaluator#evaluate.
JPMML version: 1.5.15

Model details
Our model is an XGBClassifier with learning rate 0.02, ~315 estimators, max depth = 30, and monotone constraints on 2 of the features.
We have created a DataFrameMapper through which we are doing some simple manipulations (e.g. value imputation/clipping of features).
We have converted the xgb model to a PMML format using the sklearn2pmml package.
Previously we were using a logistic regression classifier.

Size Comparison
Logistic regression PMML size: 25 kB
Tree-based model PMML size: 16 MB

Inference Performance
Logistic regression models on average performed much better: each sample took about 100 microseconds (0.1 ms) to score.
The new tree-based models on average take about 2 milliseconds, i.e. about 20 times longer.

Questions
Is it just the sheer size of the pmml which is deteriorating performance?
Is it possible to improve the inference time of the tree based models in some way?
I read on one of your issues that vector processing is not possible in a java environment. Is there any other way we can improve parallel processing of the models?

Is it just the sheer size of the pmml which is deteriorating performance?

Based on your report:

  • Logistic regression - file size 25 kB, avg. evaluation time 0.1 millis.
  • XGBoost - file size 16 MB, avg. evaluation time 2 millis.

In other words, the complexity of the model grew 640 X, whereas the evaluation time grew 20 X. Not a bad trade-off, I would say.
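As a quick arithmetic check of those growth factors (using decimal units for the file sizes):

```java
// Sanity-check the growth factors quoted above.
public class GrowthFactors {

    // Model size ratio: 16 MB vs. 25 kB (decimal units)
    public static double sizeRatio() {
        double lrBytes = 25_000d;       // 25 kB
        double xgbBytes = 16_000_000d;  // 16 MB
        return xgbBytes / lrBytes;
    }

    // Evaluation time ratio: 2 ms vs. 0.1 ms
    public static double timeRatio() {
        return 2.0 / 0.1;
    }

    public static void main(String[] args) {
        System.out.println("Size grew " + sizeRatio() + " X, time grew " + timeRatio() + " X");
    }
}
```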

Is it possible to improve the inference time of the tree based models in some way?

First, while exporting your Python model using SkLearn2PMML, it is possible to choose between different model representations. Some representations optimize for readability (human-friendliness), others for evaluation performance.

For example, XGBoost models can represent splits in different ways, plus XGBoost decision trees themselves can be rearranged to make them flatter and more compact.

Second, for performance-critical applications, use the JPMML-Transpiler library (on top of the JPMML-Evaluator library).

I read on one of your issues that vector processing is not possible in a java environment

You can vectorize linear algebra operations (e.g. logistic regression), but you can't vectorize conditional operations (e.g. decision tree ensembles such as XGBoost).

It doesn't matter what the front-end API is (native XGBoost on GPU vs. JPMML on CPU): the evaluation of XGBoost models always happens one data record at a time.
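A toy sketch of why this is so (this is not JPMML's internal tree representation, just an illustration): each split comparison decides which child node to visit next, so the control flow and memory access pattern depend on the individual record, which is what defeats SIMD-style vectorization.

```java
// Toy binary decision tree: each split is a data-dependent branch,
// so consecutive records can take entirely different paths.
public class ToyTree {

    public static final class Node {
        final int feature;       // index of the feature to test
        final double threshold;  // split threshold
        final Node left, right;  // children (null for leaves)
        final double value;      // leaf value (prediction)

        Node(int feature, double threshold, Node left, Node right) {
            this.feature = feature; this.threshold = threshold;
            this.left = left; this.right = right; this.value = 0d;
        }

        Node(double value) { // leaf
            this.feature = -1; this.threshold = 0d;
            this.left = null; this.right = null; this.value = value;
        }
    }

    // One record at a time: the branch taken at each level depends on
    // that record's feature values.
    public static double evaluate(Node node, double[] record) {
        while (node.left != null) {
            node = (record[node.feature] < node.threshold) ? node.left : node.right;
        }
        return node.value;
    }

    // Small demo tree: split on feature 0, then feature 1.
    public static double score(double[] record) {
        Node tree = new Node(0, 0.5,
            new Node(1, 0.5, new Node(0.1), new Node(0.2)),
            new Node(0.9));
        return evaluate(tree, record);
    }

    public static void main(String[] args) {
        System.out.println(score(new double[]{0.3, 0.7})); // left, then right
        System.out.println(score(new double[]{0.8, 0.0})); // right away
    }
}
```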

Is there any other way we can improve parallel processing of the models?

  1. Review your Python data science workflow.
  2. When exporting Python objects to PMML using the SkLearn2PMML package, choose a performance-oriented model representation.
  3. Use JPMML-Transpiler when the model is transpile-able (XGBoost falls nicely into this category).
  4. Buy my professional consultation services.

Thanks @vruusmann for your recommendations.
Question: Does jpmml library have provision to execute each tree in the ensemble in parallel?

Question: Does jpmml library have provision to execute each tree in the ensemble in parallel?

Currently, NO.

The thinking is that it would consume more time to coordinate work between threads than it takes to do the work in one thread.

In your example, it takes 2 millis to evaluate 315 elementary trees. That is ~0.00635 millis (6.35 micros) per tree. What is your estimate of how much time it would take to split/join this work between 315 threads? Multi-threading won't make a single tree evaluate any faster.
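The per-tree arithmetic, as a small helper (2 ms per record and 315 trees, as reported above):

```java
// Average per-tree cost, derived from the reported total evaluation time.
public class PerTreeCost {

    // Converts total evaluation time (millis) into micros per tree.
    public static double microsPerTree(double totalMillis, int treeCount) {
        return (totalMillis * 1000.0) / treeCount;
    }

    public static void main(String[] args) {
        // 2 ms for 315 trees => ~6.35 micros per tree
        System.out.printf("%.2f micros per tree%n", microsPerTree(2.0, 315));
    }
}
```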

Sure, perhaps there's a reasonable trade-off by splitting the work between 3 threads (not 315), so that each thread does ~105 trees.

But since your application scenario is about batch scoring 1000 data records, you would be better off figuring out a mechanism for dividing those records between the right number of threads (while treating the evaluation of each data record as an atomic operation).
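A minimal sketch of that record-level partitioning, using only java.util.concurrent. The scoreRecord method is a hypothetical stand-in for the actual ModelEvaluator#evaluate call; the sketch assumes the evaluator instance can be safely shared between worker threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: batch-score records by dividing the batch between a fixed
// number of worker threads, keeping each record's evaluation atomic.
public class BatchScoring {

    // Hypothetical per-record scorer; in the real service this would be
    // a call into a shared JPMML model evaluator.
    public static double scoreRecord(double[] record) {
        double sum = 0d;
        for (double v : record) {
            sum += v;
        }
        return sum;
    }

    public static double[] scoreBatch(double[][] records, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per record; each task is atomic.
            List<Future<Double>> futures = new ArrayList<>();
            for (double[] record : records) {
                futures.add(pool.submit(() -> scoreRecord(record)));
            }
            // Collect results in the original record order.
            double[] results = new double[records.length];
            for (int i = 0; i < results.length; i++) {
                results[i] = futures.get(i).get();
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        double[][] batch = new double[1000][];
        for (int i = 0; i < batch.length; i++) {
            batch[i] = new double[]{i, 0.5};
        }
        double[] results = scoreBatch(batch, 4);
        System.out.println("Scored " + results.length + " records");
    }
}
```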