The `main.ipynb` notebook performs a grid search to find the best hyperparameters for a Random Forest classifier. Here's a step-by-step explanation:
- **Define the parameter grid for the Random Forest:** A dictionary `param_grid` is defined with candidate values for the hyperparameters `min_samples_split`, `min_samples_leaf`, and `n_estimators`. These hyperparameters control the behavior of the Random Forest classifier.
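A minimal sketch of such a grid; the actual candidate values are not shown in the notebook excerpt, so the numbers below are placeholders:

```python
# Hypothetical candidate values -- the notebook's actual grid is not shown.
param_grid = {
    "n_estimators": [100, 200, 500],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# The grid search will try every combination of these values.
n_candidates = 1
for values in param_grid.values():
    n_candidates *= len(values)
print(n_candidates)  # → 27
```

The number of fits grows multiplicatively with each hyperparameter, which is why grids are usually kept small.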
- **Initialize the Random Forest classifier:** A `RandomForestClassifier` object `rf_classifier` is created with `n_jobs=20`, which means the classifier will use 20 CPU cores for parallel computation.
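This step amounts to a single constructor call:

```python
from sklearn.ensemble import RandomForestClassifier

# n_jobs=20 parallelizes tree construction across 20 CPU cores;
# n_jobs=-1 would use all available cores instead.
rf_classifier = RandomForestClassifier(n_jobs=20)
```

Note that `n_jobs` here parallelizes the forest itself; `GridSearchCV` has its own `n_jobs` for parallelizing across candidates, and combining the two can oversubscribe the machine.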
- **Initialize the grid search with 3-fold cross-validation:** A `GridSearchCV` object `grid_search` is created with the Random Forest classifier, the parameter grid, and 3-fold cross-validation. The scoring metric is `f1_micro`, which is the F1 score computed globally by counting the total true positives, false negatives, and false positives. The `verbose=3` parameter makes the grid search print detailed information about its progress.
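Putting the pieces together, the construction likely looks as follows (the grid values are again placeholders):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical grid values; the notebook's actual candidates are not shown.
param_grid = {
    "n_estimators": [100, 200],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}

grid_search = GridSearchCV(
    RandomForestClassifier(n_jobs=20),
    param_grid,
    cv=3,                # 3-fold cross-validation
    scoring="f1_micro",  # F1 from global TP/FP/FN counts
    verbose=3,           # print progress for each fit
)
```

For single-label multiclass problems, `f1_micro` is numerically equal to accuracy; it differs from `f1_macro`, which averages per-class F1 scores and weights rare classes more heavily.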
- **Fit the grid search to the data:** The `grid_search` object is fitted to the data `X` and the target `y`. This performs the grid search, training a Random Forest classifier for each combination of hyperparameters in the grid and evaluating its performance using 3-fold cross-validation.
- **Print the best parameters found:** The best combination of hyperparameters found by the grid search is printed.
- **Print the best mean cross-validated score found:** The mean cross-validated score of the best estimator found by the grid search is printed. This is the average `f1_micro` score over the 3 folds of the cross-validation for the best estimator.
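The fit-and-report steps above can be sketched end to end. The data here is a small synthetic stand-in for the notebook's `X` and `y`, and the grid is deliberately tiny (4 candidates × 3 folds = 12 fits) so the example runs quickly:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic stand-in for the notebook's X and y.
X, y = make_classification(n_samples=90, n_features=8, random_state=0)

# Deliberately tiny grid so the example finishes in seconds.
param_grid = {"n_estimators": [10, 20], "min_samples_leaf": [1, 2]}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="f1_micro",
)
grid_search.fit(X, y)

# Best hyperparameter combination and its mean CV score.
print(grid_search.best_params_)
print(f"best f1_micro: {grid_search.best_score_:.4f}")
```

After fitting, `best_params_` holds the winning combination and `best_score_` its mean cross-validated `f1_micro` score; `cv_results_` holds the full per-candidate breakdown if more detail is needed.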
- **Write the best parameters and score to a log file:** The best parameters and score are written to a log file, `logs/logs.txt`. This is useful for keeping a record of the results of the grid search.
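A sketch of the logging step; the exact format the notebook writes is not shown, so the two lines below are an assumption, and the stand-in values would come from the fitted `grid_search`:

```python
import os

# Stand-in results; in the notebook these come from
# grid_search.best_params_ and grid_search.best_score_.
best_params = {"n_estimators": 200, "min_samples_split": 2, "min_samples_leaf": 1}
best_score = 0.91  # hypothetical value

# Appending (mode "a") preserves results from earlier runs.
os.makedirs("logs", exist_ok=True)
with open("logs/logs.txt", "a") as log_file:
    log_file.write(f"Best parameters: {best_params}\n")
    log_file.write(f"Best f1_micro score: {best_score:.4f}\n")
```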