dmitryikh/leaves

Completely incorrect predictions when a model trained with Python xgboost is loaded and used for prediction with leaves

luowencai opened this issue · 2 comments

We use Spark to generate a libsvm file, load it with Python's sklearn, train and save a model with xgboost, and finally load the model with leaves to predict.
The predictions from the Python demo and from Go are completely different.
We would like to know whether leaves does not support this case or whether we are using leaves incorrectly.
The Python code looks like this:

from sklearn.datasets import load_svmlight_file
from xgboost import XGBClassifier

my_workpath = 'D:\\project\\py\\train_demo\\'
X_train, y_train = load_svmlight_file(my_workpath + 'train')
X_test, y_test = load_svmlight_file(my_workpath + 'validation')
bst = XGBClassifier()
bst.fit(X_train, y_train)
bst.save_model(my_workpath + "train_model")
# Keep only the probability of the positive class (column 1).
train_preds = [x[1] for x in bst.predict_proba(X_train)]
test_preds = [x[1] for x in bst.predict_proba(X_test)]

The Go code looks like this:

// Load the XGBoost model together with its output transformation.
model, e := leaves.XGEnsembleFromFile(model_path, true)
if e != nil {
	log.Fatal(e)
}
if model.Transformation().Type() != transformation.Logistic {
	log.Fatalf("expected TransformType = Logistic (got %s)", model.Transformation().Name())
}
// Read the validation set from the libsvm file.
csr, err := mat.CSRMatFromLibsvmFile(validate_path, 0, true)
if err != nil {
	log.Fatal(err)
}
predictions := make([]float64, csr.Rows()*model.NOutputGroups())
e = model.PredictCSR(csr.RowHeaders, csr.ColIndexes, csr.Values, predictions, 50, 5)
if e != nil {
	log.Fatal(e)
}
fmt.Printf("Prediction for %v\n", predictions)
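
For context on the `transformation.Logistic` check above: with a `binary:logistic` objective, the raw score (the summed leaf values, often called the margin) is mapped to a probability by the logistic function. If one side reports raw margins and the other reports probabilities, the numbers will not match even when the trees agree. Below is a minimal sketch of that mapping, written in Python so it can be set against `predict_proba`. The function name `sigmoid` and the sample margins are illustrative assumptions, not part of leaves or xgboost.

```python
import math


def sigmoid(margin: float) -> float:
    """Map a raw margin to a probability between 0 and 1."""
    # Branch on the sign so that exp() never receives a large positive argument.
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    z = math.exp(margin)
    return z / (1.0 + z)


# Hypothetical raw margins, for illustration only.
raw_margins = [-2.0, 0.0, 2.0]
probabilities = [sigmoid(m) for m in raw_margins]
```

A margin of 0 maps to a probability of 0.5, and margins of opposite sign map to probabilities that sum to 1.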

Hello! Thanks for your report.

e = model.PredictCSR(csr.RowHeaders, csr.ColIndexes, csr.Values, predictions, 50, 5)

Why do you use only 50 trees to predict? Try using all trees in the ensemble, as in the Python script.

Also, if you can provide your train and test files, I can check the case precisely.

Sorry for the mistaken Python code. Here is the correct Python code that we actually use:

from sklearn.datasets import load_svmlight_file
from xgboost import XGBClassifier


class train_classifier:
    bst = XGBClassifier(max_depth=8, n_estimators=50, learning_rate=0.1, silent=False, objective='binary:logistic',
                        min_child_weight=3, gamma=0, scale_pos_weight=45.1193405554875, subsample=0.9,
                        colsample_bytree=0.6, reg_alpha=3, reg_lambda=3, verbose=False)
    my_workpath = 'D:\\project\\py\\train_demo\\'

    def __init__(self):
        self.bst.load_model(self.my_workpath + "train_model")

    def train(self, train_path='train'):
        X_train, y_train = load_svmlight_file(self.my_workpath + train_path)
        self.bst.fit(X_train, y_train)
        self.bst.save_model(self.my_workpath + "train_model")

    def test_predict(self, test_file='validation'):
        X_test, y_test = load_svmlight_file(self.my_workpath + test_file)
        return [x[1] for x in self.bst.predict_proba(X_test)]

Here are the prediction results from running the Python predict and the Go PredictCSR:
predict_result.zip
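
One way to quantify a mismatch like this is to compare the two outputs element by element. The following is a minimal sketch, assuming both prediction vectors have already been read into plain Python lists. The function name `compare_predictions`, the tolerance, and the sample values are hypothetical and are not taken from the attached archive.

```python
import math


def compare_predictions(py_preds, go_preds, tol=1e-6):
    """Return the largest absolute difference and the indexes that differ by more than tol."""
    if len(py_preds) != len(go_preds):
        raise ValueError(
            f"length mismatch: {len(py_preds)} Python vs {len(go_preds)} Go predictions"
        )
    pairs = list(zip(py_preds, go_preds))
    # Largest absolute gap between the two outputs (0.0 for empty input).
    max_diff = max((abs(p - g) for p, g in pairs), default=0.0)
    # Indexes whose values differ by more than the absolute tolerance.
    mismatches = [
        i for i, (p, g) in enumerate(pairs)
        if not math.isclose(p, g, rel_tol=0.0, abs_tol=tol)
    ]
    return max_diff, mismatches


# Hypothetical values, for illustration only.
py_preds = [0.12, 0.85, 0.40]
go_preds = [0.12, 0.85, 0.55]
max_diff, mismatches = compare_predictions(py_preds, go_preds)
```

A small `max_diff` across all rows would point to floating-point noise, while large gaps on most rows would suggest a structural difference such as the number of trees used, the feature index base, or the output transformation.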