[IDEA] Chapter 2, Add code demonstrating HalvingRandomSearchCV

eranr opened this issue a year ago · 0 comments

Notebook name: 02_end_to_end_machine_learning_project
Section 5.2 “Randomized Search”
Cell 152

This is the first cell in the section, and it contains only the HalvingRandomSearchCV import. It seems like the cell is out of order and should contain actual code using the HalvingRandomSearchCV class.
How about adding the following two cells after the RandomizedSearchCV cells:

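# HalvingRandomSearchCV is experimental in scikit-learn, so enable it first
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingRandomSearchCV
from scipy.stats import randint

# Distributions to sample each hyperparameter from
param_distribs = {'preprocessing__geo__n_clusters': randint(low=3, high=50),
                  'random_forest__max_features': randint(low=2, high=20)}

# full_pipeline, housing and housing_labels come from earlier notebook cells
h_rnd_search = HalvingRandomSearchCV(
    full_pipeline, param_distributions=param_distribs, cv=3,
    scoring='neg_root_mean_squared_error', random_state=42)

h_rnd_search.fit(housing, housing_labels)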

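# Drop candidates whose fit failed (their scores are NaN)
cv_res = pd.DataFrame(h_rnd_search.cv_results_).dropna()
cv_res.sort_values(by="mean_test_score", ascending=False, inplace=True)
cv_res = cv_res[["param_preprocessing__geo__n_clusters",
                 "param_random_forest__max_features", "split0_test_score",
                 "split1_test_score", "split2_test_score", "mean_test_score"]]
# score_cols is not defined here; it is assumed to carry over from the
# earlier cells that display search results
cv_res.columns = ["n_clusters", "max_features"] + score_cols
# Scores are negative RMSEs: flip the sign and show them as integers
cv_res[score_cols] = -cv_res[score_cols].round().astype(np.int64)
cv_res.head()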

A couple of notes:

Running the first cell generates a lot of warnings. Halving search saves resources by fitting candidates on a reduced training set, and that reduced set can be too small for the candidate being tested. One example I ran into was inside the KMeans fit function, where the number of clusters exceeded the number of data points in the training set.
The second cell is identical to the previous cells that display search results, except for the “dropna()” added to the first line. Whenever fitting a candidate fails as described above, its scores in the results are “nan”, which makes the attempt to round the results and cast them to integers fail.
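
To illustrate the first note, here is a rough, pure-Python sketch of the successive-halving arithmetic. It is not scikit-learn's implementation. It assumes the search starts from a budget of 2 samples per CV fold (6 samples for cv=3), samples as many candidates as that budget allows, and multiplies the budget by a factor of 3 each round. I believe these match the library defaults, but please check them against your installed version. The names `halving_schedule` and `train_fold_size`, the 16,000-sample training set, and the example value of 40 clusters are all made up for illustration.

```python
import math


def halving_schedule(n_samples, min_resources, factor, n_candidates=None):
    """Return one (n_candidates, n_resources) pair per halving iteration.

    Simplified arithmetic: each round keeps about 1/factor of the candidates
    and gives the survivors `factor` times more samples.
    """
    if n_candidates is None:
        # Assumption: start with as many candidates as the smallest budget allows
        n_candidates = n_samples // min_resources
    schedule = []
    n_resources = min_resources
    while n_resources <= n_samples and n_candidates >= 1:
        schedule.append((n_candidates, n_resources))
        if n_candidates == 1:
            break  # a single survivor is left, nothing more to eliminate
        n_candidates = math.ceil(n_candidates / factor)
        n_resources *= factor
    return schedule


def train_fold_size(n_resources, cv):
    """Approximate number of samples used for training in one k-fold split."""
    return n_resources * (cv - 1) // cv


if __name__ == "__main__":
    N_SAMPLES = 16_000        # hypothetical training-set size
    CV = 3                    # same number of folds as in the issue's cell
    MIN_RESOURCES = 2 * CV    # assumed starting budget (2 samples per fold)
    FACTOR = 3                # assumed growth factor between rounds
    N_CLUSTERS = 40           # one value that randint(low=3, high=50) can produce

    schedule = halving_schedule(N_SAMPLES, MIN_RESOURCES, FACTOR)
    for iteration, (n_cand, n_res) in enumerate(schedule):
        fold = train_fold_size(n_res, CV)
        # KMeans cannot be fit with more clusters than training samples
        status = "fit fails" if fold < N_CLUSTERS else "ok"
        print(f"iter {iteration}: {n_cand} candidates, {n_res} samples, "
              f"train fold of {fold} -> n_clusters={N_CLUSTERS}: {status}")
```

Under these assumptions most candidates are evaluated in the earliest rounds, where each training fold holds only a handful of samples, so any candidate whose n_clusters exceeds the fold size fails to fit and gets a “nan” score. If that reading is right, passing a larger `min_resources` to HalvingRandomSearchCV would be one way to cut down the failures.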