dmlc/xgboost

Add Python Interface: XGBRanker and XGBFeature

bigdong89 opened this issue · 4 comments

XGBRanker and XGBFeature are available in my repository XGBoostExtension.

I will request a code merge if necessary.
Currently there is only a python-package. If you are familiar with R, please help me write the R-package.

Why do we need XGBRanker?

  1. There is no Python ranking example.
  2. The original ranking example is too complex to understand and not easy to call; many people on StackOverflow don't know how to use XGBoost for ranking. A minimal version of the low-level workflow a wrapper would hide is sketched below.
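For reference, here is a minimal sketch (with made-up data) of the low-level ranking workflow that a scikit-learn style XGBRanker would wrap behind fit/predict:

import numpy as np
import xgboost as xgb

X = np.random.rand(9, 4)                   # 9 documents, 4 features
y = np.array([2, 1, 0, 1, 0, 3, 2, 1, 0])  # relevance labels
group = [3, 2, 4]                          # rows 0-2, 3-4, 5-8 form three queries

dtrain = xgb.DMatrix(X, label=y)
dtrain.set_group(group)

params = {'objective': 'rank:pairwise', 'eta': 0.1, 'max_depth': 4}
bst = xgb.train(params, dtrain, num_boost_round=50)
scores = bst.predict(dtrain)               # per-document ranking scores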

Why do we need XGBFeature?

  1. With Facebook's GBDT+LR approach to improving CTR, we need the leaf predicted by every tree as a feature.
  2. Currently, we use the function 'apply' to get the per-tree leaf predictions. However, it cannot be used with Scikit-Learn directly, because Scikit-Learn only uses the 'fit/transform/predict' interfaces. See the sketch after this list.
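The idea is roughly the following (a minimal sketch using pred_leaf=True from the core Booster API, not the XGBoostExtension code; X_train, y_train, X_test are assumed to exist):

import xgboost as xgb
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

dtrain = xgb.DMatrix(X_train, label=y_train)
bst = xgb.train({'objective': 'binary:logistic', 'max_depth': 3}, dtrain, num_boost_round=50)

# leaf index of every tree for every sample -> shape (n_samples, n_trees)
leaves_train = bst.predict(dtrain, pred_leaf=True)
leaves_test = bst.predict(xgb.DMatrix(X_test), pred_leaf=True)

# one-hot encode the leaf indices and feed them to a logistic regression (GBDT+LR)
enc = OneHotEncoder()
lr = LogisticRegression()
lr.fit(enc.fit_transform(leaves_train), y_train)
ctr_scores = lr.predict_proba(enc.transform(leaves_test))[:, 1]

Presumably XGBFeature wraps this leaf-extraction step behind a transform() method so it can sit inside a scikit-learn Pipeline.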

In conclusion, we need a simple wrapper to help users with ranking and per-tree predicted values.

Now I'm working on it. I can request a code merge if necessary.

Hi,
I've got a problem when launching cross-validation with the 'rank:pairwise' objective.

After setting up the DMatrix and calling the set_group() method (I passed a numpy.array to it), I ran into a problem during cross-validation.
Here is my Python source code:

xgdmat = xgb.DMatrix(X_training, y_training) # Create our DMatrix to make XGBoost more efficient
xgdmat.set_group(group=groups_query_id) # Set the query_id values to DMatrix data structure

model_parameters = {'objective': 'rank:pairwise', 'seed': 0, 'booster' : ['gbtree', 'gblinear', 'dart'],
'eta': [0.1, 0.2, 0.3, 0.4, 0.5], 'gamma' : [0, 1],
'subsample': [0.5, 0.75, 0.9],
'max_depth': [3, 5], 'min_child_weight': 1, 'max_delta_step' : 0,
'colsample_bytree': [0.5, 0.75, 0.9], 'colsample_bylevel' : [0.5, 0.75, 0.9],
'lambda' : 1, 'alpha' : 0, 'tree_method' : ['auto', 'exact', 'approx', 'hist']}

cv_xgb = xgb.cv(params=model_parameters, dtrain=xgdmat, num_boost_round=1000, nfold=10, metrics=['auc', 'ndcg', 'map'], early_stopping_rounds=100) #THE PROBLEMS OCCUR HERE!!!

print cv_xgb.tail(5)

final_gb = xgb.train(model_parameters, xgdmat, num_boost_round=500)
When I launch this program, I get the following error:

[15:43:58] dmlc-core/include/dmlc/logging.h:235: [15:43:58] src/c_api/c_api.cc:342: Check failed: (src.info.group_ptr.size()) == (0) slice does not support group structure
Traceback (most recent call last):
  File "/Users/edoardo/PycharmProjects/MasterThesisProject/extra/Prova.py", line 225, in <module>
    metodo3()
  File "/Users/edoardo/PycharmProjects/MasterThesisProject/extra/Prova.py", line 164, in metodo3
    metrics=['auc', 'ndcg', 'map'], early_stopping_rounds=100)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/xgboost/training.py", line 371, in cv
    cvfolds = mknfold(dtrain, nfold, params, seed, metrics, fpreproc, stratified, folds)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/xgboost/training.py", line 248, in mknfold
    dtrain = dall.slice(np.concatenate([idset[i] for i in range(nfold) if k != i]))
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/xgboost/core.py", line 531, in slice
    ctypes.byref(res.handle)))
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/xgboost/core.py", line 127, in _check_call
    raise XGBoostError(_LIB.XGBGetLastError())
xgboost.core.XGBoostError: [15:43:58] src/c_api/c_api.cc:342: Check failed: (src.info.group_ptr.size()) == (0) slice does not support group structure

How can I solve this problem?

@bigdong89 XGBRanker would be very, very useful! I hope to use it very soon! ;)
You're right: nobody can work with the 'rank:pairwise' setting without running into problems!

@TheEdoardo93 XGBRanker and XGBFeature are available in my repository XGBoostExtension.
Please try it and tell me whether it is easy enough.

@TheEdoardo93

" xgdmat.set_group(group=groups_query_id) # Set the query_id values to DMatrix data structure "

It is wrong to pass groups_query_id (the per-row query ids)! You should set the group to the number of rows belonging to each query, like:
group = [query_id0_num, query_id1_num, query_id2_num, ...]  # how many rows belong to query 0, query 1, query 2, ...
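For example, if each row carries a query id and rows of the same query are stored contiguously, the group sizes can be derived like this (an illustrative snippet reusing the variable names from the code above):

from itertools import groupby

# groups_query_id holds one query id per row; rows of the same query are contiguous
groups_query_id = [7, 7, 7, 13, 13, 42, 42, 42, 42]

# set_group() expects the number of rows per query, in order
group_sizes = [len(list(run)) for _, run in groupby(groups_query_id)]
# group_sizes == [3, 2, 4]
xgdmat.set_group(group_sizes)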