zzd1992/GBDTMO

Extended experiments on multi-output benchmark datasets

Closed this issue · 8 comments

xuyxu commented

Hi, I recently found this work on arXiv. It solves an important problem in traditional tree boosting systems, and it is pretty solid.

Here I would like to ask whether the proposed method has been evaluated on benchmark datasets for multi-output regression (such as those at http://mulan.sourceforge.net/datasets-mtr.html)?

I attempted to do this myself, but found tuning the hyper-parameters a bit tricky... Thanks :)

Thanks for your interest. I am sorry, but our method has not been evaluated on those datasets.

Based on my experiments, the following tips might be useful for regression tasks:

  1. The minimum number of samples per leaf should be small.

  2. If the dataset is small (i.e., has few samples), it is better to use the exact split-finding algorithm instead of the histogram approximation. Unfortunately, this is not supported in the current version of GBDTMO, but I will add it as soon as possible.

  3. Preprocessing the outputs is necessary. For example, the dynamic ranges of the different outputs should be (nearly) the same; otherwise, the loss is unbalanced across outputs.
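Tip 3 can be sketched as a simple per-output standardization step before training, so that no single output dimension dominates the joint loss. This is an illustrative preprocessing sketch only (the function names are my own, and the GBDTMO training call itself is not shown):

```python
import numpy as np

def standardize_outputs(y):
    """Scale each output column to zero mean and unit variance.

    y: array of shape (n_samples, n_outputs).
    Returns the scaled targets plus the statistics needed to
    invert the transform on predictions.
    """
    mean = y.mean(axis=0)
    std = y.std(axis=0)
    std[std == 0] = 1.0  # guard against constant output columns
    return (y - mean) / std, mean, std

def inverse_transform(y_scaled, mean, std):
    """Map model predictions back to the original output scale."""
    return y_scaled * std + mean

# Example: two outputs with very different dynamic ranges.
y = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 2000.0]])
y_scaled, mean, std = standardize_outputs(y)
# After scaling, both columns contribute comparably to the loss.
```

After training on `y_scaled`, apply `inverse_transform` to the model's predictions to report metrics on the original scale.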

Hi, thanks for your great work on GBDT.
Will you implement this idea in LightGBM (or XGBoost)? That would be useful.

@stanpcf
I would like to treat this as an independent project that extends the functionality of GBDT. Multi-output support is only the first step. Now I am trying to develop novel regularizations and multi-layer training of GBDT. After all of these are finished, I will consider integrating this project into LightGBM or XGBoost.

However, developing these new algorithms may take a long time, so I am afraid I cannot integrate this project into LightGBM or XGBoost in the coming months.

xuyxu commented

@stanpcf
Take a look at this repo (https://github.com/GBDT-PL/GBDT-PL), which might be useful.

@zzd1992 There really is a gap between research and development. Thanks for your reply.

@AaronX121 It is a good idea to have GBDT fit the output directly. I dived into GBDT-PL for months and developed some algorithms on it, such as learning-to-rank (LTR) and DART, but I eventually abandoned that repo because it had too many bugs to fix. I ended up developing those features on another GBDT library instead. I have also found research that fits tree nodes with other algorithms, such as SVMs.

xuyxu commented

@stanpcf Gap between research and development :)

@xuyxu
Hi, recently I found a mistake in the updating of the multi-output histograms. After correcting it, I found that both the convergence speed and the performance are improved.

If your previous results on the MTR datasets were not satisfactory, you may want to try again with the corrected version to see what happens.

xuyxu commented

Thanks for your kind reminder @zzd1992, I will take a look.