Neural Networks: Activation Functions and Weight Initialization

In this post, we will make choices for the components of our deep neural network architecture: the activation functions, and how the weights of each layer are initialized to ease the optimization process.


A neural network is composed of interconnected layers, with every neuron in one layer connecting to every neuron in the next layer. Such a fully connected neural network is often called a Multi-Layer Perceptron (MLP). Let's dive right into defining our deep neural network architecture.

\begin{align*}
\text{Linear function: } & f = W x \\
\text{2-Layer Neural Network: } & f = W_2 \hspace{2mm} z(W_1 x) \\
\text{3-Layer Neural Network: } & f = W_3 \hspace{2mm} z_2(W_2 \hspace{2mm} z_1(W_1 x))
\end{align*}

Each weight matrix $W$ has the bias vector appended as its last column, and a 1 is appended to the input (called the bias trick). A typical deep learning model has many layers. The number of learnable weight matrices equals the number of layers in the DNN (each layer has an associated weight matrix), and each layer except the last is followed by an activation function $z$.
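
As a concrete illustration, here is a minimal sketch of the 2-layer forward pass above in PyTorch, with the bias trick applied at each layer; the dimensions and the ReLU activation are just illustrative choices.

  import torch

  # Sketch of f = W2 z(W1 x) with the bias trick; dimensions are illustrative.
  D_in, H, D_out = 4, 8, 3

  x = torch.randn(D_in)
  x_aug = torch.cat([x, torch.ones(1)])    # append a 1 to the input for the bias
  W1 = torch.randn(H, D_in + 1)            # bias folded in as the last column of W1
  W2 = torch.randn(D_out, H + 1)

  h = torch.relu(W1 @ x_aug)               # z(W1 x): hidden activations
  h_aug = torch.cat([h, torch.ones(1)])    # bias trick again for the second layer
  f = W2 @ h_aug                           # f = W2 z(W1 x)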

Activation functions are critical to a neural network as they introduce non-linearity into the model. Without them, stacking linear layers on top of one another would collapse into one big linear layer ($ y = W_2 W_1 x = W x $).

A neural network with a single hidden layer is a universal approximator: given enough hidden units, it can approximate essentially any function. This is because the activation function can make complex data linearly separable in the transformed (feature) space. The more layers (and therefore the more activation functions) we stack, the more complicated the decision boundary in the input space can become, up to a model that fits the training data perfectly.



Activation functions

How should one select an activation function? Here are some commonly used activation functions that one can choose from, but each one has its pros and cons,

Sigmoid - It has a nice probabilistic interpretation, squashing its input into the range [0,1], i.e., its output can be read as the probability of a particular feature being present. However, the output of the sigmoid saturates (the curve becomes parallel to the x-axis) for large positive or large negative inputs, so the gradient in these regions is almost zero (dw $\approx$ 0), killing the learning process.

This phenomenon is called the vanishing gradient problem: the gradients of the parameters with respect to the loss function become very small as the number of layers in the network increases. This makes it difficult for the network to learn, because the gradients drive the parameter updates during training, and tiny gradients mean tiny updates and slow progress.

Furthermore, computing the exponential function is expensive, slowing our training process.
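
A quick sketch of this saturation (the input values are chosen only to illustrate the effect): the sigmoid's gradient is largest near zero and practically vanishes for large-magnitude inputs.

  import torch

  # Gradient of sigmoid at a few inputs: ~0.25 near 0, nearly 0 at |x| = 10.
  x = torch.tensor([-10.0, -1.0, 0.0, 1.0, 10.0], requires_grad=True)
  torch.sigmoid(x).sum().backward()
  print(x.grad)    # roughly [4.5e-05, 0.197, 0.25, 0.197, 4.5e-05]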


Tanh - It is a scaled and shifted version of the sigmoid that outputs in the range [-1,1]. It is slightly better than the sigmoid as it is zero-centered, so it can produce both positive and negative outputs, but it shares all the other problems of the sigmoid (saturation and the cost of the exponential).


ReLU - The Rectified Linear Unit (ReLU) is the most commonly used activation function and is often chosen to address the vanishing gradient problem because it has a non-saturating gradient. The gradient of the ReLU is 1 for positive inputs and 0 for negative inputs, so it does not shrink as the input grows, which allows the network to learn more effectively. Training is also significantly faster with ReLU, since it is much cheaper to compute than the exponentials in sigmoid/tanh.

However, ReLU does have a potential issue known as the "dying ReLU" problem, where the unit gets stuck outputting 0. This can occur if the weights are initialized poorly or the learning rate is too high, pushing the weights into a regime where the pre-activation is negative for every input; from then on the neuron receives zero gradient and is never updated again. Such "dead" neurons stop contributing to the output of the network, which can degrade performance.
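
A minimal sketch of a dead ReLU unit (the weight value is made up and deliberately badly scaled): once the pre-activation is negative for every input, the weight receives zero gradient and stops updating.

  import torch

  w = torch.tensor([-5.0], requires_grad=True)   # badly scaled weight (illustrative)
  x = torch.tensor([[1.0], [2.0], [3.0]])        # every input gives a negative pre-activation
  torch.relu(x * w).sum().backward()
  print(w.grad)                                  # tensor([0.]) -> no learning signal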


Leaky ReLU - An improved version of ReLU designed to solve the dying ReLU problem: it has a small positive slope in the negative region, so even when the pre-activation is negative, the neuron still receives a small gradient and can keep learning. It does not saturate in any region and is also computationally efficient. The negative slope (0.1 here) is a hyperparameter that we need to set.
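
For example, PyTorch's built-in LeakyReLU exposes this slope as negative_slope; a small sketch:

  import torch
  import torch.nn as nn

  leaky = nn.LeakyReLU(negative_slope=0.1)       # slope 0.1 for x < 0
  x = torch.tensor([-2.0, 0.5], requires_grad=True)
  leaky(x).sum().backward()
  print(x.grad)                                  # tensor([0.1000, 1.0000])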


Other advanced ReLU options include Parametric ReLU (PReLU), Exponential Linear Unit (ELU), Scaled ELU (SELU), and many more! To summarize,

  • Don't think too hard; just use ReLU!
  • Don't use sigmoid/tanh.
  • Try out advanced ReLU options to push the accuracy by 0.1%.


Weight Initialization

Now that we have made decisions about the architecture of our network, let's see how to initialize the weights of each of its layers. Every different initialization gives the optimization process a different starting point, and can therefore lead to a different final set of weights with different performance characteristics.

What would happen if we initialized the weights to a constant value, say zero? Then the outputs would all be zero regardless of the input, the gradients would be zero for all training examples, and the network wouldn't train.
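
A small sketch of this failure mode (the layer sizes are arbitrary): with all-zero weights the output ignores the input, and no gradient reaches either layer.

  import torch

  W1 = torch.zeros(8, 4, requires_grad=True)
  W2 = torch.zeros(1, 8, requires_grad=True)
  x = torch.randn(16, 4)

  out = torch.relu(x @ W1.t()) @ W2.t()            # all zeros, independent of x
  out.sum().backward()
  print(W1.grad.abs().sum(), W2.grad.abs().sum())  # both tensor(0.)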

One approach is to initialize the weights with small random numbers (Gaussian with zero mean, std = 0.01). This might work for small networks, but multiplying by small weights ($< 1$) over and over in a deep network produces ever-smaller activations and hence very small gradients. Our optimization algorithm would take tiny steps and need a very long time to train (or converge).

If we instead initialize the weights with large values (std = 0.05) and apply an activation function like sigmoid/tanh, the pre-activations land near the ends of the curve, where it saturates, again giving small gradients (the vanishing gradients problem). With an activation function like ReLU, large weights instead give rise to the exploding gradients problem. We need a way of initializing weights that is neither too small nor too big (a sweet spot).
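
The effect is easy to reproduce with a quick sketch (the depth, width, and the two std values below are only illustrative): too small an init collapses the activations towards zero, while too large an init saturates tanh at ±1.

  import torch

  def final_activation_std(std, depth=10, dim=512):
      h = torch.randn(1000, dim)
      for _ in range(depth):
          W = torch.randn(dim, dim) * std        # Gaussian init with the given std
          h = torch.tanh(h @ W)
      return h.std().item()

  print(final_activation_std(0.01))   # ~0: activations shrink layer by layer
  print(final_activation_std(0.50))   # ~1: tanh saturates at +/-1, gradients vanish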


Xavier Initialization - If we can make sure that the variance of the output of a layer is the same as the variance of its input (so that the scale of the activations doesn't change from layer to layer), we can avoid vanishing/exploding gradients. Consider a linear layer, assuming the $x_i$ and $w_i$ are independent and zero-mean,

\begin{align*}
y &= \sum_{i=1}^{D_{in}} x_i w_i \\
Var(y) &= D_{in} \cdot Var(x_i w_i) \\
&= D_{in} \cdot Var(x_i) \cdot Var(w_i) \\
Var(y) = Var(x_i) & \Rightarrow Var(w_i) = 1/D_{in}
\end{align*}

Instead of treating the standard deviation as a hyperparameter, we set it to the inverse square root of the input dimension of the layer. The weights of a layer $l$ are initialized as,

 w_l = torch.randn(dim_l, dim_prev) / dim_prev ** 0.5   # zero-mean Gaussian, Var(w) = 1/D_in (dim_prev = dim of layer l-1)

This derivation depends on the activation function that follows the linear layer. It works with tanh, which is symmetric around zero and approximately preserves the variance coming out of the linear layer. With ReLU, however, negative activations are zeroed out, so the outputs are smaller than assumed and the variance is no longer preserved. Xavier initialization assumes a zero-centered activation function and won't work well with ReLU.
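
A quick check of this claim (again with illustrative depth and width): Xavier-scaled weights keep the activation scale roughly steady under tanh, but under ReLU it shrinks layer by layer because half the variance is thrown away.

  import torch

  def xavier_activation_std(act, depth=10, dim=512):
      h = torch.randn(1000, dim)
      for _ in range(depth):
          W = torch.randn(dim, dim) / dim ** 0.5   # Xavier: Var(w) = 1 / D_in
          h = act(h @ W)
      return h.std().item()

  print(xavier_activation_std(torch.tanh))   # stays at a healthy scale (~0.5)
  print(xavier_activation_std(torch.relu))   # decays towards 0 as depth grows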


Kaiming/He Initialization - Since ReLU zeroes out the activations that are less than zero, i.e., half of the expected outputs (the pre-activations are symmetrically distributed around zero), all we have to do is double the variance of the weights to keep the variance constant through a layer. Therefore, we set $Var(w_i) = 2/D_{in}$.

 w_l = torch.randn(dim_l, dim_prev) * (2.0 / dim_prev) ** 0.5   # zero-mean Gaussian, Var(w) = 2/D_in

While weight initialization is an active research area, the above schemes work well in practice.
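
In practice, rather than hand-rolling these formulas, one can also use PyTorch's built-in initializers in torch.nn.init (the layer sizes below are arbitrary; note that PyTorch's Xavier variant averages fan_in and fan_out rather than using fan_in alone):

  import torch.nn as nn

  layer = nn.Linear(256, 128)

  # For tanh-like activations: Var(w) = 2 / (fan_in + fan_out) in PyTorch's Xavier variant.
  nn.init.xavier_normal_(layer.weight)

  # For ReLU activations: Var(w) = 2 / fan_in. Overwrites the line above; pick one.
  nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')
  nn.init.zeros_(layer.bias)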
