Document is not clear on 'topk' and 'mean' for lambdarank_pair_method for lambda rank pair construction
nsh-bay opened this issue · 1 comments
nsh-bay commented
Hi team,
I have a few questions about the learning-to-rank tutorial at https://xgboost.readthedocs.io/en/stable/tutorials/learning_to_rank.html:
- I couldn't find how to run an exhaustive pair construction with `lambdarank_pair_method='mean'` or `'topk'`. This is my ultimate goal. Note that the number of documents per query varies.
- What is the default `k` (`lambdarank_num_pair_per_sample`) for the `topk` and `mean` methods? When I left it at the default, model.json shows that `lambdarank_num_pair_per_sample` is a full 32-bit number (screenshot). Is it a bug?

- I assume that choosing `topk` and setting `k` via `lambdarank_num_pair_per_sample` to a very large number (e.g., -1 or 1000) can help me achieve the goal in question 1, but I am not sure how it behaves if `lambdarank_num_pair_per_sample` is set to a number higher than the number of documents in every query.
- The example with the `mean` method is a bit tricky to me: if we have 3 documents, we typically need at most C(3,2) = 3 pairs, but the example shows that we can generate `lambdarank_num_pair_per_sample` * `#documents` = 2*3 = 6 pairs.
  - a. Does that mean there are duplicate pairs in this case? If I set the method to `mean` and `lambdarank_num_pair_per_sample` is very large, does it affect the training time significantly because of those duplicates?
  - b. How do I set it to achieve the goal in question 1 above?
- Here is the example quoted from the document:
> For the mean strategy, XGBoost samples lambdarank_num_pair_per_sample pairs for each document in a query list. For example, given a list of 3 documents and lambdarank_num_pair_per_sample is set to 2, XGBoost will randomly sample 6 pairs, assuming the labels for these documents are different. On the other hand, if the pair method is set to topk, XGBoost constructs about k × |query| number of pairs with |query| pairs for each sample at the top k = lambdarank_num_pair_per_sample position. The number of pairs counted here is an approximation since we skip pairs that have the same label.
- If I select the `topk` method with `lambdarank_num_pair_per_sample=2` and my query has 4 documents, say ranked d1-d4:
  - a. What pairs will be constructed? (d1-d2), (d1-d3), (d1-d4), (d2-d3), (d2-d4)?
  - b. The document says it will construct `k` * `|query|` pairs, so it should be 2*4=8. How will they be constructed?
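For reference, the counts in this last question can be checked with a small sketch. It only does the combinatorics for the d1-d4 example and assumes all four labels differ. The "top-k" filter below encodes the guess in question a, not confirmed XGBoost behavior.

```python
from itertools import combinations

docs = ["d1", "d2", "d3", "d4"]  # ranked list from the example
k = 2  # lambdarank_num_pair_per_sample in the example

# Every unordered pair: the exhaustive case, C(4, 2) pairs.
all_pairs = list(combinations(docs, 2))

# Pairs with at least one member in the top k positions (the guess in question a).
top_k = set(docs[:k])
guessed_pairs = [p for p in all_pairs if p[0] in top_k or p[1] in top_k]

print(len(all_pairs), len(guessed_pairs), k * len(docs))  # prints: 6 5 8
```

The guessed list has 5 pairs, the exhaustive list has 6 (it adds d3-d4), and `k` * `|query|` gives 8, which the documentation describes as an approximate count.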
Here is one of my GBM settings and my environment:
- xgb.version: 2.1.2 (CPU only)
- Labels are floating-point values
'ndcg_exp_gain': False,
'objective': 'rank:ndcg',
'lambdarank_pair_method':'topk',
'lambdarank_num_pair_per_sample':10000,
'verbosity': 1,
'grow_policy': 'lossguide',
'learning_rate': 0.3,
'max_depth': 6,
'min_child_weight': 0.0,
'subsample': 0.5,
'tree_method': 'approx',
'max_bin': 256,
'gamma': 0,
'reg_lambda': 1.0,
'reg_alpha':0.0,
'max_leaves': 32,
'random_state': 999,
'n_jobs': -1
Thank you very much.
trivialfis commented
> I couldn't find how can I run an exhaustive pairs

For now, set it to a number larger than the size of any existing group?
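A minimal sketch of that suggestion, assuming the query ids are available as one integer per document. The `qid` array here is made up for illustration, and whether this setting yields fully exhaustive pairs is the tentative suggestion above rather than something verified here.

```python
import numpy as np

# Hypothetical query ids, one entry per document; group sizes vary.
qid = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])

# Count documents per query and find the largest group.
_, group_sizes = np.unique(qid, return_counts=True)
max_group_size = int(group_sizes.max())

params = {
    "objective": "rank:ndcg",
    "lambdarank_pair_method": "topk",
    # Any value above the largest group size, per the suggestion above.
    "lambdarank_num_pair_per_sample": max_group_size + 1,
}
print(max_group_size, params["lambdarank_num_pair_per_sample"])
```

Computing the value from the data avoids hard-coding a guess such as 1000 or 10000 when group sizes change between datasets.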
> What is the default k

1 for random sampling (`mean`), 32 for `topk`.
> Is it a bug?

It's an internal indicator for "not set".
> The example with the mean method is a bit tricky
Randomly select k documents, and pair them with all other existing documents in the group.
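On the duplicates question, a toy count may help. The sketch below is not XGBoost's sampler. It follows the documentation's wording quoted earlier in the thread (each document draws `k` partners) and assumes draws are made with replacement from the other documents, purely to show that 6 sampled pairs over 3 documents must repeat some unordered pair.

```python
import random
from itertools import combinations

random.seed(0)  # fixed seed so the toy run is repeatable

docs = ["d1", "d2", "d3"]
k = 2  # lambdarank_num_pair_per_sample in the documentation example

# Toy sampler (assumption): each document draws k partners, with replacement,
# from the other documents in the group.
sampled = []
for doc in docs:
    others = [d for d in docs if d != doc]
    for _ in range(k):
        sampled.append(tuple(sorted((doc, random.choice(others)))))

distinct = set(sampled)                # unordered pairs actually drawn
possible = set(combinations(docs, 2))  # all C(3, 2) unordered pairs
print(len(sampled), len(distinct), len(possible))
```

Whatever the sampling details, only 3 distinct unordered pairs exist for 3 documents, so a count of 6 necessarily includes repeats.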