StatMixedML/XGBoostLSS

Serialization issue

Closed this issue · 8 comments

edgBR commented

Dear Alexander,

We are now comparing MAPIE confidence intervals to a custom methodology we have built on top of XGBoostLSS.

Basically, we use the expectiles that XGBoostLSS provides to fit a CDF that approximates a Tweedie CDF. (This is a bit hacky, but we are achieving the theoretical coverage.)

The problem is that we split this procedure into two parts, and we use the joblib-serialized XGBoostLSS model as the input to the CDF approximation step.

I have not seen any example of how to serialize XGBoostLSS objects, so we tried joblib:

        X_train_ci = processor_reg.transform(regression_df.drop(columns=[arguments.clf_targets] +
                                                                                [arguments.reg_targets])) 
        y_train_ci = regression_df[arguments.reg_targets]
        n_cpu = multiprocessing.cpu_count()
        dtrain = xgb.DMatrix(X_train_ci, label=y_train_ci, nthread=n_cpu)

        logging.info("Training XGBoost")
        xgboostlss_model_expectile = xgboostlss.train(hyperparameters,
                                                      dtrain,
                                                      dist=distribution_expectile,
                                                      num_boost_round=hyperparameters["opt_rounds"],
                                                      verbose_eval=True)
        
        if partial:
            logging.info("Serializing partial fit")
            joblib.dump(xgboostlss_model_expectile, 'outputs/models/zip_conf_partial_fit.joblib')
        else:
            logging.info("Serializing full fit")
            joblib.dump(xgboostlss_model_expectile, 'outputs/models/zip_conf.joblib')

We are getting the following error:

[Image: screenshot of the error]

Here, distribution_expectile is defined as:

        distribution_expectile = Expectile                                   
        distribution_expectile.expectiles =  arguments.exp_list
        distribution_expectile.stabilize = "MAD" 

Are we choosing the wrong serialization method? Do we need to save XGBoostLSS in a different way?

BR
Edgar

@edgBR Thanks for raising the issue.

It seems like the problem is with the starting values. Does the problem occur for the predict function only or also during model training?

Also, please be aware that expectile bands are usually a little narrower than the corresponding quantile bands. I am not sure whether that affects your problem at hand.
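To make the remark about band widths concrete, here is a minimal sketch in plain Python (not XGBoostLSS code). It compares the 0.05/0.95 expectile band with the 0.05/0.95 quantile band on simulated Gaussian data. The helper names `sample_expectile` and `sample_quantile` are made up for this illustration, and the expectile is computed with a simple iteratively reweighted mean, which is only one of several ways to do it.

```python
import random


def sample_expectile(values, tau, n_iter=200, tol=1e-10):
    """Sample expectile via iteratively reweighted means (illustrative helper)."""
    e = sum(values) / len(values)  # the mean is the 0.5 expectile
    for _ in range(n_iter):
        # Points above the current estimate get weight tau, the rest 1 - tau.
        weights = [tau if v > e else 1.0 - tau for v in values]
        e_new = sum(w * v for w, v in zip(weights, values)) / sum(weights)
        converged = abs(e_new - e) < tol
        e = e_new
        if converged:
            break
    return e


def sample_quantile(values, tau):
    """Empirical quantile using the nearest order statistic (illustrative helper)."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, int(tau * len(ordered)))
    return ordered[idx]


rng = random.Random(123)
y = [rng.gauss(10.0, 1.0) for _ in range(20000)]  # simulated Gaussian target

expectile_band = (sample_expectile(y, 0.05), sample_expectile(y, 0.95))
quantile_band = (sample_quantile(y, 0.05), sample_quantile(y, 0.95))

expectile_width = expectile_band[1] - expectile_band[0]
quantile_width = quantile_band[1] - quantile_band[0]
print(f"expectile band width: {expectile_width:.2f}")
print(f"quantile band width:  {quantile_width:.2f}")
```

On this kind of data the expectile band should come out noticeably narrower than the quantile band at the same nominal levels.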

If you prefer Ray over joblib, you can give it a try. It seems, though, that you would need to reformulate the function definitions of the custom loss and evaluation functions. If needed, I can try to find some time to provide an example, but unfortunately that might not happen soon...

edgBR commented

Hi @StatMixedML

This only happens when I save the model with joblib and load it again to make predictions.

When training and predicting within the same session/notebook, everything works as expected.

Unfortunately, I do not see how Ray can help here. Do you have an example where you save the model and load it again to make predictions?

PS: We know that expectile bands are narrower, but precisely because of this, they fit our use case perfectly.

BR
Edgar

@edgBR Thanks for clarifying. I have never really used XGBoostLSS in combination with joblib. Is there a chance you can provide a reproducible code example?

Thanks

edgBR commented

Ey,

Yes, please find attached:

!pip install git+https://github.com/StatMixedML/XGBoostLSS.git
import numpy as np
import pandas as pd
import pkg_resources
import itertools
import shap 
import math
import multiprocessing
from scipy.stats import norm
import matplotlib.pyplot as plt
import plotnine
from plotnine import *
plotnine.options.figure_size = (20, 10)

from xgboostlss.model import *
from xgboostlss.distributions.Expectile import Expectile
from xgboostlss.datasets.data_loader import load_simulated_data

# The data is a simulated Gaussian as follows, where x is the only true feature and all others are noise variables
    # loc = 10
    # scale = 1 + 4*((0.3 < x) & (x < 0.5)) + 2*(x > 0.7)

train, test = load_simulated_data()
n_cpu = multiprocessing.cpu_count()

X_train, y_train = train.iloc[:,1:],train.iloc[:,0]
X_test, y_test = test.iloc[:,1:],test.iloc[:,0]

dtrain = xgb.DMatrix(X_train, label=y_train, nthread=n_cpu)
dtest = xgb.DMatrix(X_test, nthread=n_cpu)


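np.random.seed(123)

distribution = Expectile
distribution.expectiles = [0.05, 0.95]     # Expectiles to be estimated: needs to be a list of at least two expectiles.
distribution.stabilize = "MAD"

# opt_params holds the optimized hyper-parameters; the optimization step is omitted here
n_rounds = opt_params["opt_rounds"]
del opt_params["opt_rounds"]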

# Train Model with optimized hyper-parameters
xgboostlss_model = xgboostlss.train(opt_params,
                                    dtrain,
                                    dist=distribution,
                                    num_boost_round=n_rounds)

import joblib
joblib.dump(xgboostlss_model, 'my_model.joblib')

Now we close the session, start a new one, and run:

import numpy as np
import pandas as pd
import pkg_resources
import itertools
import shap 
import math
import multiprocessing
from scipy.stats import norm
import matplotlib.pyplot as plt
import plotnine
from plotnine import *
plotnine.options.figure_size = (20, 10)

from xgboostlss.model import *
from xgboostlss.distributions.Expectile import Expectile
from xgboostlss.datasets.data_loader import load_simulated_data

# The data is a simulated Gaussian as follows, where x is the only true feature and all others are noise variables
    # loc = 10
    # scale = 1 + 4*((0.3 < x) & (x < 0.5)) + 2*(x > 0.7)

train, test = load_simulated_data()
n_cpu = multiprocessing.cpu_count()

X_train, y_train = train.iloc[:,1:],train.iloc[:,0]
X_test, y_test = test.iloc[:,1:],test.iloc[:,0]

dtrain = xgb.DMatrix(X_train, label=y_train, nthread=n_cpu)
dtest = xgb.DMatrix(X_test, nthread=n_cpu)

distribution = Expectile  
distribution.expectiles = [0.05, 0.95]     # Expectiles to be estimated: needs to be a list of at least two expectiles.
distribution.stabilize = "MAD"       

import joblib
xgboostlss_model = joblib.load('my_model.joblib')  # same name as used in predict() below

predictions = xgboostlss.predict(xgboostlss_model, 
                                    dtest, 
                                    dist=distribution,
                                    pred_type="expectiles")
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-10-61103a19a665> in <module>()
      2                                     dtest,
      3                                     dist=distribution,
----> 4                                     pred_type="expectiles")

/usr/local/lib/python3.7/dist-packages/xgboostlss/model.py in predict(booster, dtest, dist, pred_type, n_samples, quantiles, seed)
    330 
    331         # Set base_margin as starting point for each distributional parameter. Requires base_score=0 in parameters.
--> 332         base_margin = (np.ones(shape=(dtest.num_row(), 1))) * dist.start_values
    333         dtest.set_base_margin(base_margin.flatten())
    334 
AttributeError: type object 'Expectile' has no attribute 'start_values'

Looking at the code, we are now trying to initialize the start values as:

distribution.start_values = [np.mean(target_train), np.mean(target_train)]

But we are wondering if this is the right way.

@edgBR The following fixes the problem:

import numpy as np
import pandas as pd
import pkg_resources
import itertools
import shap 
import math
import multiprocessing
from scipy.stats import norm
import matplotlib.pyplot as plt
import joblib
import plotnine
from plotnine import *
plotnine.options.figure_size = (20, 10)

from xgboostlss.model import *
from xgboostlss.distributions.Expectile import Expectile
from xgboostlss.datasets.data_loader import load_simulated_data

# The data is a simulated Gaussian as follows, where x is the only true feature and all others are noise variables
    # loc = 10
    # scale = 1 + 4*((0.3 < x) & (x < 0.5)) + 2*(x > 0.7)

train, test = load_simulated_data()
n_cpu = multiprocessing.cpu_count()

X_train, y_train = train.iloc[:,1:],train.iloc[:,0]
X_test, y_test = test.iloc[:,1:],test.iloc[:,0]

dtrain = xgb.DMatrix(X_train, label=y_train, nthread=n_cpu)
dtest = xgb.DMatrix(X_test, nthread=n_cpu)

distribution = Expectile  
distribution.expectiles = [0.05, 0.95]     # Expectiles to be estimated: needs to be a list of at least two expectiles.
distribution.stabilize = "MAD"   
distribution.start_values = distribution.initialize(dtrain.get_label())

xgboostlss_model= joblib.load('my_model.joblib')

predictions = xgboostlss.predict(xgboostlss_model, 
                                 dtest, 
                                 dist=distribution,
                                 pred_type="expectiles")
predictions.head()

The only thing I changed, compared to your suggestion, is distribution.start_values = distribution.initialize(dtrain.get_label()). The issue stems from the fact that starting values are initialized during model training only, so they are not available after you reload the model. Because training was completed in the previous session, dist.start_values is never set in the new one, and predict fails.

Hope this solves the problem to some extent. If you have any suggestions on how to better pass the starting values to the model, let me know.
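One possible interim option, sketched here under assumptions rather than tested against XGBoostLSS, is to store the start values in a small sidecar file at training time and restore them before predicting. That would avoid needing the training labels in the prediction session. The `Expectile` class below is only a stand-in for the real one, the helper names and the numeric values are hypothetical, and the code assumes the start values can be converted to plain floats.

```python
import json
import os
import tempfile


class Expectile:
    """Stand-in for xgboostlss.distributions.Expectile (illustration only)."""


def save_start_values(dist, path):
    # Convert to plain floats so the values are JSON-serializable
    # (this also works for NumPy scalars).
    with open(path, "w") as fh:
        json.dump([float(v) for v in dist.start_values], fh)


def load_start_values(dist, path):
    # Restore the attribute that predict() expects to find on the distribution.
    with open(path) as fh:
        dist.start_values = json.load(fh)


# Training session: the start values exist on the class after training.
Expectile.start_values = [9.2, 10.8]  # hypothetical values
sidecar = os.path.join(tempfile.mkdtemp(), "start_values.json")
save_start_values(Expectile, sidecar)

# Simulate a fresh session, where the attribute is missing again.
del Expectile.start_values
load_start_values(Expectile, sidecar)
print(Expectile.start_values)
```

In a real pipeline the sidecar file would be written next to my_model.joblib and loaded right after joblib.load.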

edgBR commented

Hi,

This works, but it is very hacky imho. My suggestion would be to not return a booster object from xgboostlss.train. Instead, it could return a custom class that holds both the booster and the distribution object as attributes.

What do you think?
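A rough sketch of what such a container could look like is below. `LSSModel` and its methods are hypothetical and not part of the XGBoostLSS API. The sketch assumes the booster is picklable (as the joblib.dump call above suggests) and uses a plain dict as a stand-in for it, with made-up start values.

```python
import pickle
from dataclasses import dataclass
from typing import Any, List


@dataclass
class LSSModel:
    """Hypothetical container bundling a booster with its distribution settings."""
    booster: Any               # trained booster; assumed to be picklable
    expectiles: List[float]
    stabilize: str
    start_values: List[float]

    def dumps(self) -> bytes:
        # Serialize the instance fields as a plain dict, so that loading
        # does not rely on attributes set on a class elsewhere.
        return pickle.dumps(dict(vars(self)))

    @classmethod
    def loads(cls, payload: bytes) -> "LSSModel":
        return cls(**pickle.loads(payload))


# Stand-in booster and made-up start values, for illustration only.
model = LSSModel(
    booster={"n_trees": 3},
    expectiles=[0.05, 0.95],
    stabilize="MAD",
    start_values=[9.2, 10.8],
)

payload = model.dumps()
restored = LSSModel.loads(payload)
print(restored.start_values)
```

With something like this, predict could read the expectiles and start values from the restored object instead of from a separately configured distribution.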

@edgBR Having a dedicated XGBoostLSS class is also my preferred option. Yet, I haven't had time to do that...

Closing the issue for now, using the interim solution.