Implementing Backpropagation
While we have talked about how optimization algorithms use the negative gradient of the loss function with respect to the weights to update the parameters of our model, we will now focus on how to actually compute these gradients for an arbitrary loss function. We use backpropagation, a technique that builds a computation graph and performs a forward and a backward pass over the model. In the forward pass, we compute the output of each layer of our neural network sequentially. In the backward pass, we apply the chain rule layer by layer, propagating derivatives backward through the graph.
Consider the example of a computation graph shown below.
First we perform a forward pass with the input values $x = -2, y = 5, z = -4$, \begin{align*} q &= x + y = -2 + 5 = 3\\ f & = z * q = -4 * 3 = -12 \end{align*} Now let's compute the backward pass, \begin{align*} & \frac{df}{df} = 1, \hspace{5mm} \frac{df}{dq} = z = -4, \hspace{5mm} \frac{df}{dz} = q = 3 \\ & \frac{df}{dx} = \frac{df}{dq} \frac{dq}{dx} = -4 * 1 = -4, \hspace{5mm} \frac{df}{dy} = \frac{df}{dq} \frac{dq}{dy} = -4 * 1 = -4 \\ \end{align*} This way, we have computed the derivative of the output with respect to each input. The gradients involved in the chain rule have names, \begin{align*} \underbrace{\frac{df}{dy}}_{\text{downstream gradient}} = \underbrace{\frac{df}{dq}}_{\text{upstream gradient}} \hspace{2mm} \underbrace{\frac{dq}{dy}}_{\text{local gradient}} \end{align*}
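As a quick sanity check (a minimal sketch, not part of the original example), Pytorch's autograd can reproduce exactly these numbers for the same graph:

import torch

# Leaf tensors; requires_grad=True tells autograd to track operations on them
x = torch.tensor(-2.0, requires_grad=True)
y = torch.tensor(5.0, requires_grad=True)
z = torch.tensor(-4.0, requires_grad=True)

q = x + y      # forward: q = 3
f = z * q      # forward: f = -12

f.backward()   # backward pass through the recorded graph
print(x.grad, y.grad, z.grad)  # tensor(-4.), tensor(-4.), tensor(3.)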
With a much more complicated neural network model, we follow the same procedure. But don't worry, you don't have to compute the chain rule manually; Pytorch comes to the rescue!
Consider this simple neural network model on which we will apply backpropagation. The task is image classification, where each image of shape (32 x 32 x 3 = 3072) is flattened into the input vector. The two-layer model outputs scores for 10 classes as shown below. [Linear -> ReLU -> Linear]
Pytorch uses the autograd package to build computational graphs out of Tensors and automatically compute gradients. Initializing weights as tensors with the requires_grad = True flag enables autograd. Once we compute the loss, the gradient of this loss with respect to the weights can be computed automatically by running loss.backward(). This function implements backpropagation on all the inputs that have requires_grad set and accumulates the gradients in their .grad attribute. The torch.no_grad() context manager makes sure a computation graph is not created for the operations within it. It is crucial to zero out the current gradients after each update (with w.grad.zero_() on a tensor, or .zero_grad() on a module or optimizer), since backward() accumulates gradients rather than overwriting them, and we want fresh gradients in the next step. A simple mini-batch gradient descent algorithm looks as follows,
import torch

w1 = torch.randn(3072, 100, requires_grad=True)
w2 = torch.randn(100, 10, requires_grad=True)

mini_batches = split(data, batch_size)  # Split data into mini-batches

for t in range(num_steps):
    for X_batch, y_batch in mini_batches:
        y_pred = X_batch.mm(w1).clamp(min=0).mm(w2)  # Forward Pass: Linear -> ReLU -> Linear
        loss = loss_fn(y_pred, y_batch)
        loss.backward()                              # Compute gradients wrt loss
        with torch.no_grad():                        # Tells Pytorch not to build a graph
            w1 -= learning_rate * w1.grad
            w2 -= learning_rate * w2.grad
            w1.grad.zero_()                          # Set gradients to zero
            w2.grad.zero_()
X_batch is the mini-batch of inputs to our model and y_batch is the corresponding vector of labels. y_pred contains the scores that our model outputs. The loss function returns the average loss over a mini-batch, which is then used to compute the gradients. Pytorch provides the DataLoader primitive to create mini-batches.
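For concreteness, here is a minimal sketch of how the missing pieces above might be filled in, assuming the images have already been flattened into a tensor X of shape (N, 3072) with integer class labels y (X and y are placeholder names, not from the original code):

import torch
from torch.utils.data import TensorDataset, DataLoader

X = torch.randn(50000, 3072)          # hypothetical flattened images
y = torch.randint(0, 10, (50000,))    # hypothetical integer class labels

data = TensorDataset(X, y)            # yields (X[i], y[i]) pairs
mini_batches = DataLoader(data, batch_size=64, shuffle=True)

loss_fn = torch.nn.CrossEntropyLoss() # averages the loss over the mini-batch by default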
It was easy to define the model, initialize weights, and write the forward pass. But what if we had a neural network with 50 layers? Explicitly writing code for all this would be very cumbersome. Pytorch provides the nn.Sequential API to define models easily. The weights have requires_grad set to True implicitly and are initialized using the Kaiming initialization we discussed in the last post.
model = torch.nn.Sequential(torch.nn.Linear(3072, 100),
                            torch.nn.ReLU(),
                            torch.nn.Linear(100, 10))

mini_batches = torch.utils.data.DataLoader(data, batch_size=batch_size)

for t in range(num_steps):
    for X_batch, y_batch in mini_batches:
        y_pred = model(X_batch)              # Forward Pass
        loss = loss_fn(y_pred, y_batch)
        loss.backward()                      # Compute gradients wrt loss
        with torch.no_grad():                # Tells Pytorch not to build a graph
            for param in model.parameters():
                param -= learning_rate * param.grad
        model.zero_grad()                    # Set gradients to zero
It's annoying to write the weight update code every time. Pytorch provides the optim package (makes life easier once again!), which implements the standard gradient descent update rules that we have discussed. In our training loop, after computing gradients, we use this optimizer to make the gradient step for us (one update step) and to zero out the gradients. This is what a typical training block looks like in practice,
model = torch.nn.Sequential(torch.nn.Linear(3072, 100),
                            torch.nn.ReLU(),
                            torch.nn.Linear(100, 10))

optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
mini_batches = torch.utils.data.DataLoader(data, batch_size=batch_size)

for t in range(num_steps):
    for X_batch, y_batch in mini_batches:
        y_pred = model(X_batch)          # Forward Pass
        loss = loss_fn(y_pred, y_batch)
        loss.backward()                  # Compute gradients wrt loss
        optimizer.step()                 # One parameter update step
        optimizer.zero_grad()            # Set gradients to zero
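Swapping in a different update rule is then just a matter of which optimizer we construct; for example (an illustrative sketch with placeholder hyperparameters, not values from this post):

optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9)  # SGD with momentum
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)                        # Adam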