arogozhnikov/hep_ml

Speedup with XGboost classifier runtime error

adendek opened this issue · 3 comments

I have tried to use speedup.LocukpClassifier with XGboost as a base_estimator but I failed. This may be a bug in LockupClassifier implementation.
I executed following python code:

train_X, test_X, train_Y, test_Y = train_test_split(new_features, target, random_state=42,train_size=0.5 )              

base_classifier = xgb.XGBClassifier(n_estimators=400, learning_rate=0.07 ,scale_pos_weight=ratio_ghost_to_good)
classifier = LookupClassifier(base_estimator=base_classifier, keep_trained_estimator=False)
classifier.fit(train_X, train_Y)

And obtained following error code

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-afeadcdc082e> in <module>()
      3 base_classifier = xgb.XGBClassifier(n_estimators=400, learning_rate=0.07 ,scale_pos_weight=ratio_ghost_to_good)
      4 classifier = LookupClassifier(base_estimator=base_classifier, keep_trained_estimator=False)
----> 5 classifier.fit(train_X, train_Y)

/afs/cern.ch/user/a/adendek/.local/lib/python2.7/site-packages/hep_ml/speedup.pyc in fit(self, X, y, sample_weight)
     91         all_lookup_indices = numpy.arange(int(n_parameter_combinations))
     92         all_combinations = self.convert_lookup_index_to_bins(all_lookup_indices)
---> 93         self._lookup_table = trained_estimator.predict_proba(all_combinations)
     94 
     95         if self.keep_trained_estimator:

/afs/cern.ch/user/a/adendek/.local/lib/python2.7/site-packages/xgboost/sklearn.pyc in predict_proba(self, data, output_margin, ntree_limit)
    475         class_probs = self.booster().predict(test_dmatrix,
    476                                              output_margin=output_margin,
--> 477                                              ntree_limit=ntree_limit)
    478         if self.objective == "multi:softprob":
    479             return class_probs

/afs/cern.ch/user/a/adendek/.local/lib/python2.7/site-packages/xgboost/core.pyc in predict(self, data, output_margin, ntree_limit, pred_leaf)
    937             option_mask |= 0x02
    938 
--> 939         self._validate_features(data)
    940 
    941         length = ctypes.c_ulong()

/afs/cern.ch/user/a/adendek/.local/lib/python2.7/site-packages/xgboost/core.pyc in _validate_features(self, data)
   1177 
   1178                 raise ValueError(msg.format(self.feature_names,
-> 1179                                             data.feature_names))
   1180 
   1181     def get_split_value_histogram(self, feature, fmap='', bins=None, as_pandas=True):

ValueError: feature_names mismatch: [u'seed_chi2PerDoF', u'seed_p', u'seed_pt', u'seed_nLHCbIDs', u'seed_nbIT', u'seed_nLayers', u'seed_x', u'seed_y', u'seed_tx', u'seed_ty', u'abs_seed_x', u'abs_seed_y', u'abs_seed_tx', u'abs_seed_ty', u'seed_r', u'pseudo_rapidity'] ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15']
expected seed_nbIT, abs_seed_y, abs_seed_x, seed_tx, seed_pt, seed_nLayers, seed_x, seed_y, seed_ty, pseudo_rapidity, seed_p, seed_r, abs_seed_tx, abs_seed_ty, seed_nLHCbIDs, seed_chi2PerDoF in input data
training data did not have the following fields: f0, f1, f2, f3, f4, f5, f6, f7, f8, f9, f12, f13, f10, f11, f14, f15

@adendek thanks for reporting, I forgot about this scenario

I've fixed this in the develop branch, try it now:

pip uninstall hep_ml
pip install https://github.com/arogozhnikov/hep_ml/archive/develop.zip

Great! It works!