Loss functions in Deep Learning
While there are a ton of concepts related to deep learning scattered all around the internet, I thought: why not have just one place where you can find all the fundamental concepts needed to set up your own Deep Neural Network (DNN) architecture? This series can be viewed as a reference guide that you can come back to whenever you need to brush up. In this first part, I discuss one of the most essential elements of deep learning - the loss function! I call it the "Oxygen of Deep Learning" because, without a loss function, a neural network cannot be trained (so it would just be dead).
A loss function, also called an objective function or a cost function, tells us how bad our neural network's predictions are - it quantifies our unhappiness with the scores (another word for predictions) across the training data. So the lower the loss, the better our model. An abstract formulation for an image classification task is as follows - given an image $x_i$ from the training data and its label (or true class) $y_i$, we feed $x_i$ into our model to obtain a score for each class. In the picture below we feed an image of a cat, a car, and a frog into our model to generate scores as shown.
The scores of the true class are in bold. The class with the maximum score is the predicted label (the class to which the image belongs according to our model). Based on these scores, a loss is computed for each input image. The final loss is the average of the losses over all input images. We will talk about two of the most commonly used loss functions in deep learning - the SVM (hinge) loss and the cross-entropy (softmax) loss.
SVM Loss
The intuition behind the Support Vector Machine (SVM) loss is that the score of the correct class should be higher than all the other scores by some threshold (the margin). This seems reasonable, as we want our classifier to assign a high score to the right category and low scores to all the wrong categories. The SVM loss has the form, \begin{align} L_i = \sum_{j \neq y_i} \text{max}(0, \underbrace{s_j}_{\text{score of jth class}} - \underbrace{s_{y_i}}_{\text{score of true class}} + \underbrace{1}_{\text{margin}}) \end{align} Note that the sum skips $j = y_i$, so the true class itself contributes zero loss.
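To make this concrete, here is a minimal NumPy sketch of the per-example SVM loss (the function name `svm_loss` and the default margin of 1 are my own choices for illustration, not taken from any particular library):

```python
import numpy as np

def svm_loss(scores, true_idx, margin=1.0):
    """Multiclass SVM (hinge) loss for a single example.

    scores   : 1-D array of raw class scores
    true_idx : index of the correct class
    """
    margins = np.maximum(0, scores - scores[true_idx] + margin)
    margins[true_idx] = 0  # the sum excludes j = y_i, so the true class adds no loss
    return margins.sum()
```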
The SVM loss is also called the Hinge Loss because its graph is shaped like a door hinge. If the score of the correct class exceeds every other score by at least the margin, the loss is zero. Otherwise, the loss increases linearly.
Let's derive the SVM loss for our example. For the cat image, the true class score is $s_{y_1} = 3.2$, and the loss is computed over the car and frog classes: \begin{equation*} L_1 = \text{max}(0, 5.1 - 3.2 + 1) + \text{max}(0, -1.7 - 3.2 + 1) = 2.9 + 0 = 2.9 \end{equation*} Similarly, for the other two training examples (note that the frog's true score is $-3.1$, so subtracting it adds $3.1$), \begin{align*} L_2 &= \text{max}(0, 1.3 - 4.9 + 1) + \text{max}(0, 2.0 - 4.9 + 1) = 0 + 0 = 0 \\ L_3 &= \text{max}(0, 2.2 - (-3.1) + 1) + \text{max}(0, 2.5 - (-3.1) + 1) = 6.3 + 6.6 = 12.9 \end{align*} The final loss of the model is the average of all the losses: $L_{\text{svm}} = (2.9 + 0 + 12.9)/3 \approx 5.27$.
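Continuing the `svm_loss` sketch from above, we can reproduce these numbers (the example scores are hard-coded here just for this illustration):

```python
# Rows: scores for the cat, car, and frog images over
# the classes [cat, car, frog] from the example above.
scores = np.array([[ 3.2,  5.1, -1.7],
                   [ 1.3,  4.9,  2.0],
                   [ 2.2,  2.5, -3.1]])
labels = [0, 1, 2]  # index of the true class for each image

losses = [svm_loss(s, y) for s, y in zip(scores, labels)]
print(losses)           # approximately [2.9, 0.0, 12.9]
print(np.mean(losses))  # ~5.27
```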
Cross-Entropy Loss
Cross-Entropy loss, also called Softmax loss (or Logistic loss in the binary case), interprets the raw classifier scores as probabilities. We first run the raw scores through the exponential function, which makes sure all the values are positive. We then normalize these values so they sum to one, obtaining a probability distribution over the categories. This transformation is called the softmax function. Let's do this on our first cat image example.
The cross-entropy between a "true" distribution $p$ and an estimated distribution $q$ is defined as: \begin{equation*} H(p,q) = - \sum_x p(x) \text{log}(q(x)) \end{equation*} The softmax function gives our estimated distribution $q$. The true or desired distribution puts all probability mass on the correct class (i.e. $p = [0, \dots, 1, \dots, 0]$ contains a single 1 at the $y_i$th position). Hence the cross-entropy reduces to the negative log of the probability of the true class. For the above cat image, the loss is $L_1 = -\text{log}(0.13) \approx 2.04$. The cross-entropy loss has the form, \begin{equation} L_i = -\text{log} \underbrace{\left( \frac{e^{s_{y_i}}}{\sum_j e^{s_j}} \right)}_{\text{softmax function}} \end{equation}
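As a sanity check, here is a small NumPy sketch of the softmax followed by the cross-entropy loss for the cat image (the function name `cross_entropy_loss` is again just illustrative):

```python
import numpy as np

def cross_entropy_loss(scores, true_idx):
    """Softmax + cross-entropy loss for a single example.

    Subtracting the max score before exponentiating keeps the
    computation numerically stable without changing the result.
    """
    shifted = scores - scores.max()
    probs = np.exp(shifted) / np.exp(shifted).sum()  # softmax probabilities
    return -np.log(probs[true_idx])

cat_scores = np.array([3.2, 5.1, -1.7])  # scores for the cat image
print(cross_entropy_loss(cat_scores, true_idx=0))  # ~2.04
```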
While these are the two most commonly used loss functions, a more complete list can be found here - Loss Functions