The Basics of Gridsearch

Jeff Spagnola
5 min read · Jan 23, 2021


Using GridSearchCV for Hyperparameter Tuning

Hyperparameter tuning of machine learning models can be the difference between an excellent performing model and…well…a model that just doesn’t work. While it can sometimes be “fun” to experiment with different combinations of hyperparameters, there must be a way to test many combinations of parameters at once, right? Right!

Time to use GridSearchCV from scikit-learn.

What is GridSearchCV?

GridSearchCV is a tool that can be found within the scikit-learn model_selection package. The scikit-learn documentation defines it as an “exhaustive search over specified parameter values for an estimator.” In simpler terms, this means that you can perform a gridsearch to test every combination of the hyperparameter values you specify in order to find the combination that performs the best according to whichever metric you choose to measure performance. If you’re still confused by this, don’t worry…I’ll be explaining the process in more detail when we jump into the code. For now, just think of it as a tool that automates the process of choosing hyperparameters for a machine learning model.
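To make “exhaustive search” a bit more concrete, here’s a rough sketch of the loop that a gridsearch automates. This is not how GridSearchCV is implemented internally; it’s just the same idea written by hand. The synthetic dataset and the tiny toy_grid are made up purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid, cross_val_score

# Synthetic data, used only for illustration (not from the article)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical, deliberately tiny grid so the loop finishes quickly
toy_grid = {"max_depth": [2, 5], "n_estimators": [10, 30]}

best_score, best_params = -1.0, None
for params in ParameterGrid(toy_grid):
    # Build a fresh model for this combination of hyperparameters
    model = RandomForestClassifier(random_state=0, **params)
    # Mean cross-validated accuracy for this combination
    score = cross_val_score(model, X, y, cv=3, scoring="accuracy").mean()
    if score > best_score:
        best_score, best_params = score, params

print(best_params, round(best_score, 3))
```

Every combination in the grid gets cross-validated, and the one with the best mean score wins.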

How do I use GridSearchCV?

Okay! Time for the fun part. So the idea of automating hyperparameter choices for your model sounds awesome, right? Well, it’s also pretty easy to do. For the following examples, I’ll be coding in Python and using models from scikit-learn. Let’s jump in. First, let’s import the packages we need.

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

Once that’s all squared away, let’s take a look at what makes up the GridSearchCV tool. There’s a bunch of parameters within GridSearchCV, but I’ll be focusing on the ones that tend to be the most important.

Estimator

A quick look through the documentation shows a number of parameters within GridSearchCV, and the first of those is the estimator parameter. The estimator is simply an instance of the model we want to tune. In this example, I’ll be using the RandomForestClassifier, so below is what the parameter would look like.

estimator = RandomForestClassifier()

Param_Grid

The real meat and potatoes of GridSearchCV is the param_grid parameter. This is where we specify the parameters that we want to experiment with as well as the values for those parameters. The param_grid takes a dictionary with each parameter name as a key and a list of the values that you want to try as its value. Truthfully, there may still be some trial and error here when trying to come up with the range of values, but this definitely beats running your model over and over again and manually messing with the parameter combinations. Let’s see an example of a param_grid.

param_grid = {'n_estimators': [10, 100, 250, 500],
              'criterion': ['gini', 'entropy'],
              'max_depth': [1, 5, 10, 25],
              'min_samples_leaf': [1, 5, 10],
              'min_samples_split': [2, 5, 10]
             }
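It’s worth pausing on how quickly a grid like this grows. The sketch below counts the combinations in the param_grid above using only the standard library. The cv_folds value of 3 is an assumption for the sake of the arithmetic.

```python
from itertools import product
from math import prod

param_grid = {'n_estimators': [10, 100, 250, 500],
              'criterion': ['gini', 'entropy'],
              'max_depth': [1, 5, 10, 25],
              'min_samples_leaf': [1, 5, 10],
              'min_samples_split': [2, 5, 10]}

# One candidate per combination of values: 4 * 2 * 4 * 3 * 3
n_combinations = prod(len(values) for values in param_grid.values())

# Assumed number of cross-validation folds; each candidate is fit once per fold
cv_folds = 3
total_fits = n_combinations * cv_folds

# Peek at the first few candidates as dictionaries
all_candidates = [dict(zip(param_grid, combo))
                  for combo in product(*param_grid.values())]
first_three = all_candidates[:3]

print(n_combinations, total_fits)
```

That’s 288 candidates and 864 model fits with 3 folds, so adding even one more value to a list has a real cost in run time.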

Scoring

This is an interesting one that spawns some debate across the corners of the internet. Scoring refers to the strategy used to evaluate the performance of the cross-validated model. There are several ways to set the scoring parameter, but I tend to pass a string naming the metric I want the model to be optimized for. For a complete list of accepted values, see the scoring parameter section of the scikit-learn documentation.

scoring = 'accuracy'
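A string is the simplest option, but scoring will also accept a scorer object. As a small sketch of one of the “several ways” mentioned above, here is a macro-averaged F1 scorer built with make_scorer. The synthetic data, the tiny grid, and the choice of F1 are all assumptions for illustration, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Wrap a metric function so GridSearchCV can use it for scoring
macro_f1 = make_scorer(f1_score, average="macro")

search = GridSearchCV(estimator=RandomForestClassifier(n_estimators=20, random_state=0),
                      param_grid={"max_depth": [2, 5]},  # hypothetical tiny grid
                      scoring=macro_f1,
                      cv=3)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```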

N_Jobs

The number of jobs to run in parallel. Setting n_jobs to -1 tells the gridsearch to use all of the processors on your machine.

n_jobs = -1

CV

The cv parameter stands for cross validation and determines how many folds are used to validate the results of each parameter combination. The default for this parameter is cv = 5, but we want this to run a bit faster so we’ll set ours to 3.

cv = 3
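If you’re wondering what those folds actually look like, here’s a small sketch. When cv is an integer and the estimator is a classifier, scikit-learn uses stratified folds, so the example below uses StratifiedKFold directly on a made-up 12-row dataset to show the split sizes.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up data: 12 samples, 6 per class
X = np.arange(24).reshape(12, 2)
y = np.array([0] * 6 + [1] * 6)

cv = StratifiedKFold(n_splits=3)

fold_sizes = []
positives_per_fold = []
for train_idx, val_idx in cv.split(X, y):
    # Each fold trains on 2/3 of the rows and validates on the remaining 1/3
    fold_sizes.append((len(train_idx), len(val_idx)))
    # Stratification keeps the class balance the same in every validation fold
    positives_per_fold.append(int(y[val_idx].sum()))

print(fold_sizes)
print(positives_per_fold)
```

Each parameter combination gets trained and scored once per fold, and the scores are averaged.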

Putting It All Together

Now that we have an idea of the main GridSearchCV parameters, let’s put it all together and see what the code looks like in real life: how to fit the GridSearchCV, how to see our results, and how to programmatically use those results to fit our model.

# Set the param_grid params
params = {'n_estimators': [10, 100, 250, 500],
          'criterion': ['gini', 'entropy'],
          'max_depth': [1, 5, 10, 25],
          'min_samples_split': [2, 5, 10],
          'min_samples_leaf': [1, 5, 10]
         }
# Set the gridsearch model
grid = GridSearchCV(estimator = RandomForestClassifier(),
                    param_grid = params,
                    scoring = 'accuracy',
                    cv = 3,
                    n_jobs = -1)
# Fit the gridsearch (X_train and y_train are your training features and labels)
grid.fit(X_train, y_train)

Once we fit the GridSearchCV, we can find our best parameters by calling the get_params() method on the best_estimator_ attribute.

print(grid.best_estimator_.get_params())
# Output from this:
{'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': 'balanced', 'criterion': 'gini', 'max_depth': 25, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 250, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}
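The best estimator isn’t the only thing a fitted gridsearch gives you. Here’s a small sketch of a few other attributes: best_params_, best_score_, and cv_results_, which holds the scores for every combination tried. The synthetic data and the tiny grid are assumptions to keep the example quick to run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

search = GridSearchCV(estimator=RandomForestClassifier(random_state=0),
                      param_grid={"max_depth": [2, 5],
                                  "n_estimators": [10, 30]},  # hypothetical tiny grid
                      scoring="accuracy",
                      cv=3)
search.fit(X, y)

# Only the parameters we searched over, and their mean cross-validated score
print(search.best_params_, round(search.best_score_, 3))

# One row per combination: its parameters, mean score, spread, and rank
results = search.cv_results_
for params, mean, std, rank in zip(results["params"],
                                   results["mean_test_score"],
                                   results["std_test_score"],
                                   results["rank_test_score"]):
    print(rank, params, round(mean, 3), round(std, 3))
```

Note that best_params_ only contains the parameters from the grid, while get_params() on the best estimator returns every parameter of the model, defaults included.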

Now from here, you can set the parameters of your model by just typing in the best parameters listed in the output above, but there’s a more programmatic way to fit the model.

best_rf_params = grid.best_params_
best_rf_model = RandomForestClassifier(**best_rf_params)
best_rf_model.fit(X_train, y_train)
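As an aside, GridSearchCV has a refit parameter that defaults to True, which means the best combination is already refit on the full training set once the search finishes. So an alternative, sketched below, is to use best_estimator_ directly, or even call predict on the gridsearch object itself. The synthetic data and tiny grid are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# refit=True is the default, written out here to make the point explicit
search = GridSearchCV(estimator=RandomForestClassifier(random_state=0),
                      param_grid={"max_depth": [2, 5]},  # hypothetical tiny grid
                      cv=3,
                      refit=True)
search.fit(X_train, y_train)

# Already fitted on all of X_train with the best parameters
tuned_model = search.best_estimator_

# Calling predict on the search delegates to the refit best estimator
same_predictions = bool((search.predict(X_test) == tuned_model.predict(X_test)).all())
print(same_predictions)
```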

Now let’s evaluate the model and see how we did.

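import matplotlib.pyplot as plt
from sklearn.metrics import (ConfusionMatrixDisplay, RocCurveDisplay,
                             classification_report)

def evaluate_model(model, X_test, y_test):
    # Predictions
    y_hat_test = model.predict(X_test)

    # Classification Report
    print(' Classification Report')
    print('-------------------------------------------------------')
    print(classification_report(y_test, y_hat_test))

    # Confusion Matrix (each row normalized by the true class)
    # plot_confusion_matrix and plot_roc_curve were removed in scikit-learn 1.2,
    # so this uses the from_estimator display methods that replaced them
    fig, axes = plt.subplots(figsize = (12,6), ncols = 2)
    ConfusionMatrixDisplay.from_estimator(model, X_test, y_test, normalize = 'true',
                                          cmap = 'Blues', ax = axes[0])
    axes[0].set_title('Confusion Matrix')

    # ROC-AUC Curve (binary classification), with a dotted chance line
    RocCurveDisplay.from_estimator(model, X_test, y_test, ax = axes[1])
    axes[1].legend()
    axes[1].plot([0,1], [0,1], ls = ':')
    axes[1].grid()
    axes[1].set_title('ROC-AUC Plot')
    fig.tight_layout()
    plt.show()

# Evaluate model
evaluate_model(best_rf_model, X_test, y_test)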

Now let’s compare this to just a base RandomForestClassifier model.

# Fit the base model
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
# Evaluate the model
evaluate_model(rf, X_test, y_test)
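If you’d rather compare the two models with a single number, here’s a minimal sketch using accuracy_score on a held-out test set. Everything in it (the synthetic data, the small grid, the fixed random_state values) is an assumption for illustration, and a tuned model isn’t guaranteed to beat the defaults on every dataset, so treat the printed numbers as something to check rather than a foregone conclusion.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Base model with default hyperparameters (random_state fixed for repeatability)
baseline = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Tuned model from a hypothetical small grid
search = GridSearchCV(estimator=RandomForestClassifier(random_state=0),
                      param_grid={"max_depth": [2, 5, 10],
                                  "min_samples_leaf": [1, 5]},
                      scoring="accuracy",
                      cv=3).fit(X_train, y_train)

# Score both models on the same held-out data
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
tuned_acc = accuracy_score(y_test, search.predict(X_test))

print(f"baseline: {baseline_acc:.3f}  tuned: {tuned_acc:.3f}")
```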

As we can see, the model using GridSearchCV performed much better in terms of the overall accuracy. This was just a basic overview of GridSearchCV and really just scratches the surface of what you’re able to do. Feel free to take a deeper dive into the documentation to find out more.
