UC-MACSS/persp-model-econ_W20

Questions about tree tuning

luxin-tian opened this issue · 3 comments

Dear Dr. @rickecon, @keertanavc,

  1. Parameter distributions
    In the problem set, we are required to tune the parameters of a Decision Tree and a Random Forest regression model. As specified, the distributions of the parameters are set as follows:
  • In 1.(b)
from scipy.stats import randint as sp_randint
param_dist1 = {'max_depth': [3, 10], # a list?
               'min_samples_split': sp_randint(2, 20), 
               'min_samples_leaf': sp_randint(2, 20)}
  • In 1.(c)
param_dist2 = {'n_estimators': [10, 200], # a list?
               'max_depth': [3, 10],  # a list?
               'min_samples_split': sp_randint(2, 20), 
               'min_samples_leaf': sp_randint(2, 20), 
               'max_features': sp_randint(1, 5)}

While sp_randint is used for the other parameters, the distributions of max_depth and n_estimators are specified only as lists, which, according to the documentation of RandomizedSearchCV and GridSearchCV, means that only those two values will be tried.

I wonder whether this is intended, or whether a more reasonable specification would be sp_randint(int, int).
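For example, I imagine a distribution-based version of the 1.(b) specification would look something like the sketch below (the bounds are just my guess, mirroring the list endpoints; sp_randint(a, b) draws integers from a up to b - 1):

from scipy.stats import randint as sp_randint

# A sketch with assumed bounds, not the problem set's specification
param_dist1_alt = {'max_depth': sp_randint(3, 11),        # integers 3..10
                   'min_samples_split': sp_randint(2, 20),
                   'min_samples_leaf': sp_randint(2, 20)}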

  2. Test MSE

In 1.(b) and 1.(c), the problem description states that

scoring='neg_mean_squared_error' will allow you to compare the MSE of the optimized tree (it will output the negative MSE) to the MSE calculated in part (a).

However, in 1.(a) we calculate the test MSE on the test set, whereas when we tune the trees in (b) and (c), the best_score_ attribute returns the cross-validated MSE computed on the training set. I wonder whether it is the case that we cannot evaluate the performance of the tuning by comparing MSEs calculated on two different subsets of the data.
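For example, to put both numbers on the same test set, I could compute something like the following (a rough sketch; tree_search, X_test, and y_test are placeholder names for the fitted RandomizedSearchCV object and the test split):

from sklearn.metrics import mean_squared_error

# Cross-validated MSE reported by the search (computed on training folds)
cv_mse = -tree_search.best_score_
# MSE of the refitted best estimator on the held-out test set
test_mse = mean_squared_error(y_test, tree_search.best_estimator_.predict(X_test))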

Thank you!

Hi @luxin-tian

  1. For the randomized search CV, you can give either a list or a distribution as the parameter domain to search in. If it is a list, sampling is done without replacement; if it is a distribution, sampling is done with replacement. In my understanding, using a distribution is most effective when you have continuous variables and you don't want to specify the entire list of possible values, and/or when you know the probability distribution of the parameters. The difference between sp_randint and a list is that every value of the list will be tried out, whereas all possible values from sp_randint may not be tried (see the sketch at the end of this comment). At the end of the day, either of the two methods can help you find the optimum, and which one you use is up to your discretion.

  2. I agree that the MSEs in the two cases are computed on different data sets (one is cross-validated and the other is on the test set). But the comparison should be okay because they are in the same units. @rickecon : any thoughts?
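As a sketch, you can inspect what RandomizedSearchCV would actually try via ParameterSampler, the helper it uses internally to draw candidate settings (the values below are made up, not the problem set's):

from scipy.stats import randint as sp_randint
from sklearn.model_selection import ParameterSampler

list_spec = {'max_depth': [3, 10]}            # list: the only candidates are 3 and 10
dist_spec = {'max_depth': sp_randint(3, 11)}  # distribution: integers 3 through 10

# Print the candidate settings each specification would feed to the search
print(list(ParameterSampler(list_spec, n_iter=2, random_state=0)))
print(list(ParameterSampler(dist_spec, n_iter=5, random_state=0)))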

@luxin-tian @keertanavc . Yes, @keertanavc is right about how the parameter option settings work. And yes, the MSEs in 1b and 1c are cross-validated, while the one in 1a is not. That is the point: to compare those values.

Dear Keertana @keertanavc , I am curious about your statement that 'If it is a list, sampling is done without replacement and if it is a distribution sampling is done with replacement.' Suppose we have a list with a length of 101 and a parameter of n_iter = 100. In this case, how can randomized search CV ensure that every value of the list is tried? Besides, if the sampling is without replacement for a list, I think it will leave some combinations untried. How do you resolve this problem? Thanks!