Neural Network Starter Kit: Part 2
Let’s dig a little deeper into tuning hyperparameters!
Last week, we did a beginner-level overview of how to construct a neural network with TensorFlow. This week, we’re going to go over various methods of optimizing the neural network with some good old-fashioned hyperparameter tuning. If you need a quick refresher on last week’s ‘Neural Network Starter Kit, Part 1’, click here.
What is hyperparameter tuning?
Hyperparameter tuning can be defined as the process of choosing the optimal set of hyperparameters for a particular machine learning algorithm. In the case of deep learning and deep neural networks, it becomes a way of trying to achieve that delicate balance between overfitting and underfitting by coming up with the right combination of activation functions, optimizers, neurons, etc. Hyperparameter tuning is an EXTREMELY iterative process, and you can expect a great deal of trial and error before you arrive at the optimal architecture.
Let’s start off with a bit of a vocabulary lesson before jumping into the code. The following terms will be the hyperparameters that we’ll be focusing on today.
Units: In TensorFlow/Keras, units is the argument that sets the number of neurons (or nodes) in a given layer.
Activation Function: The simplest way of defining an activation function is that it’s a function applied to the output of a layer that decides what information gets passed on to the neurons of the next layer (a quick numeric example follows this vocabulary list). Popular activation functions include tanh, relu, leaky relu, and sigmoid…though many others exist. Check out the TensorFlow documentation for a full list.
Optimizer: Optimizers are specified in the compile step of a neural network and are pretty much exactly what they sound like. An optimizer’s role is to adjust things like the weights and learning rate in order to minimize loss. Popular optimizers include Adam, RMSProp, and Adagrad. Check out the TensorFlow documentation for a full list.
Loss Function: If you’re familiar with loss (or cost) functions from statistics or optimization, then you’re good to go here. For those of us who are lacking a bit in the math department, a loss function measures how far the model’s predictions are from the true values; it’s the quantity the optimizer tries to minimize, and it more or less tells you how well the model is performing. Popular loss functions are binary_crossentropy, categorical_crossentropy, mse (mean squared error) and poisson. Check out the TensorFlow documentation for a full list.
Batch Size: The batch size determines the number of training samples the model processes before updating its weights. This hyperparameter can greatly affect the performance of the model.
Epochs: An epoch is one full pass through the entire training set; the epochs hyperparameter sets how many of those passes the model makes during training.
If any of these words still don’t make sense to you, don’t worry…they will soon.
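To make the activation function idea a little more concrete, here’s a tiny sketch of what relu, sigmoid, and tanh actually do to a handful of values. The numbers are just illustrative and aren’t tied to the model we’ll build below.
import tensorflow as tf
# A few example pre-activation values coming out of a layer
x = tf.constant([-2.0, -0.5, 0.0, 0.5, 2.0])
# relu zeroes out the negatives and passes positives through: [0., 0., 0., 0.5, 2.]
print(tf.keras.activations.relu(x).numpy())
# sigmoid squashes everything into the range (0, 1)
print(tf.keras.activations.sigmoid(x).numpy())
# tanh squashes everything into the range (-1, 1)
print(tf.keras.activations.tanh(x).numpy())
Each layer’s output gets run through its activation function before being handed to the next layer, which is what lets the network model non-linear relationships.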
Hyperparameter Tuning in Action
We’ll work our way through the example model from last week and make decisions on the hyperparameters as we go along. First, let’s get our imports going.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
Next, we can instantiate our model.
model = Sequential()
Now, let’s add the input layer. Within this layer, we’ll be establishing units, input_dim (the number of features in our training data), and an activation function. The structure goes like this:
model.add(Dense(UNITS, input_dim = NUMBER OF FEATURES, activation = ACTIVATION FUNCTION))
As I mentioned earlier, this is an extremely iterative process and we can make a few educated guesses here, but will most likely not arrive at the right combination on the first try. Let’s start off by giving the input layer 64 units and a tanh activation function.
model.add(Dense(units = 64, input_dim = X_train.shape[1], activation = 'tanh'))
Next, we can add a few hidden layers to the model. The structure is similar, with the main difference being that we no longer have to specify an input_dim.
model.add(Dense(UNITS, activation = ACTIVATION))
Typically, I like to start the architecture by making the units of the first hidden layer half as many as the input layer and then step down incrementally. For the hidden layers, I also like to start off by using the relu activation function.
model.add(Dense(units = 32, activation = 'relu'))
model.add(Dense(units = 16, activation = 'relu'))
model.add(Dense(units = 8, activation = 'relu'))
The next step is to build the output layer. Our example is a binary classification problem, so the units should be set to 1 and the activation function should be sigmoid.
model.add(Dense(units = 1, activation = 'sigmoid'))
Next, we build the compile step. The hyperparameters here are optimizer, loss, and metrics. Metrics are literally just how we want the performance of the model to be measured. We can also pass in a list of several metrics if we want the model scored by more than one. The structure of this step is:
model.compile(OPTIMIZER, LOSS, METRICS)
For this particular model, I’m going to start with the Adam optimizer. The optimizer greatly affects the performance of the model, and this is a hyperparameter with which you should do quite a bit of experimentation. Since this is a binary classification model, we’ll use binary_crossentropy as the loss, and for fun we’ll go with accuracy as our metric.
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
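As a quick aside, since metrics accepts a list, you can track more than one at once. Here’s a sketch; the AUC metric is my own addition for illustration, not something we settled on above.
model.compile(optimizer = 'adam',
              loss = 'binary_crossentropy',
              # track both accuracy and AUC every epoch
              metrics = ['accuracy', tf.keras.metrics.AUC()])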
Coming up next is fitting the model. We typically assign the fit to a variable, and it consists of X_train, y_train, batch_size, and epochs. Batch size is another hyperparameter that can greatly affect performance and should also be experimented with.
history = model.fit(X_train, y_train, batch_size = 16, epochs = 25)
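One thing worth knowing here: fit() returns a History object, which is why we stash it in a variable. Its history attribute is just a dictionary keyed by 'loss' and whatever metrics we asked for, so we can check how training went across those 25 epochs. A minimal sketch:
# loss and accuracy per epoch, as plain Python lists
print(history.history['loss'])
print(history.history['accuracy'])
# accuracy after the final epoch
print(history.history['accuracy'][-1])
As a side note on batch size: with batch_size = 16, each of those 25 epochs performs roughly len(X_train) / 16 weight updates, so smaller batches mean more (and noisier) updates per epoch.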
Okay, now let’s put that all together and see what we have!
# Create base model
model = Sequential()

# Input layer
model.add(Dense(64, input_dim = X_train.shape[1], activation = 'tanh'))

# Hidden layers
model.add(Dense(32, activation = 'relu'))
model.add(Dense(16, activation = 'relu'))
model.add(Dense(8, activation = 'relu'))

# Output layer
model.add(Dense(1, activation = 'sigmoid'))

# Compile
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

# Fit the model
history = model.fit(X_train, y_train, batch_size = 16, epochs = 25)
Awesome! Looking good so far. Now, chances are, if we run this model as-is, our results won’t be perfect. As I’ve said probably too many times at this point, this is an iterative process that requires much experimentation. Feel free to experiment with different combinations of the hyperparameters we discussed here today and see what the optimal architecture is for your data. Beyond just the hyperparameters, you can also experiment with adding or removing layers.
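If you want to be a little more systematic about that experimentation, one common approach is to wrap the model in a small builder function and loop over a handful of candidate values. The sketch below is just one way to do it; build_model, the candidate lists, and the X_val / y_val validation split are my own placeholders, not something from the walkthrough above.
def build_model(units, activation, optimizer):
    # same overall shape as our base model, but with the tunable pieces passed in
    model = Sequential()
    model.add(Dense(units, input_dim = X_train.shape[1], activation = activation))
    model.add(Dense(units // 2, activation = 'relu'))
    model.add(Dense(1, activation = 'sigmoid'))
    model.compile(optimizer = optimizer, loss = 'binary_crossentropy', metrics = ['accuracy'])
    return model

best_acc, best_combo = 0, None
for units in [32, 64, 128]:
    for activation in ['tanh', 'relu']:
        for optimizer in ['adam', 'rmsprop']:
            model = build_model(units, activation, optimizer)
            model.fit(X_train, y_train, batch_size = 16, epochs = 25, verbose = 0)
            # score each candidate on held-out data, not the training set
            loss, acc = model.evaluate(X_val, y_val, verbose = 0)
            if acc > best_acc:
                best_acc, best_combo = acc, (units, activation, optimizer)

print(best_combo, best_acc)
Keep in mind this trains twelve models, so it can take a while; tools like KerasTuner exist for exactly this kind of search, but a plain loop is a perfectly reasonable place to start.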
In Our Next Edition…
We’ve journeyed a little deeper into neural networks this week, but we’re still just scratching the surface of what we can do. In part 3 of this series, we’ll dive deeper into deep learning, get into some more in-depth methods of model optimization, and introduce methods of fighting against overfitting. Stay tuned!