VGGNet: Very Deep Convolutional Networks

After the advent of AlexNet, the submissions to the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) switched over to using convolutional neural networks. The winner of the 2013 challenge was ZFNet, a modified version of AlexNet with better accuracy. It was also an 8-layer network; it tweaked some of AlexNet's layer configurations by trial and error.

ZFNet used 7 $\times$ 7 filters with a stride of 2 in the first layer, instead of AlexNet's 11 $\times$ 11 filters with a stride of 4. The intuition is that the large filters and stride were aggressively downsampling the input and throwing away pixel information, which can be retained by using smaller filter sizes and smaller strides. The padding is also removed from the first two conv layers (to match the shapes of the subsequent conv layers of AlexNet). With just these two changes, they were able to achieve a reasonably large increase in performance over AlexNet.
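To see the difference in downsampling, here is a quick sketch using the standard convolution output-size formula; the input sizes (227 $\times$ 227 for AlexNet, 224 $\times$ 224 for ZFNet) and the zero padding are assumptions for illustration:

```python
def conv_out(size, kernel, stride, pad=0):
    """Spatial output size of a convolution: floor((W - F + 2P) / S) + 1."""
    return (size - kernel + 2 * pad) // stride + 1

# AlexNet conv1: 11x11 filters, stride 4 -> ~4x downsampling in one layer
print(conv_out(227, kernel=11, stride=4))  # 55
# ZFNet conv1: 7x7 filters, stride 2 -> gentler ~2x downsampling
print(conv_out(224, kernel=7, stride=2))   # 109
```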

AlexNet and ZFNet were designed in a somewhat ad-hoc manner: the number of convolution and pooling layers was somewhat arbitrary, and the configuration of each layer was set by trial and error. This makes such networks very hard to scale.

In 2014, the 2nd place winner of this challenge was VGGNet, one of the first architectures with a principled design throughout to guide the overall configuration of the network. With this approach, the authors were able to create deeper networks (16-19 layers!) and achieve a significant improvement over the prior configurations.


Architecture: VGGNet follows very clean and simple design principles, where the configuration of each layer is fixed as follows:

  • All convolution layers are fixed to have a kernel size of 3 $\times$ 3 with a stride 1 and pad 1.
  • All max-pooling layers have a size of 2 $\times$ 2 with stride 2.
  • After each pooling layer, double the number of channels.

While AlexNet had 5 convolutional layers, VGGNet has 5 stages:

Stage 1: conv-conv-pool
Stage 2: conv-conv-pool
Stage 3: conv-conv-conv-[conv]-pool
Stage 4: conv-conv-conv-[conv]-pool
Stage 5: conv-conv-conv-[conv]-pool

The authors presented two variants, VGG-16 and VGG-19, with 16 and 19 weight layers respectively (counting the 3 fully-connected layers at the end). VGG-19 has the fourth, bracketed convolution in stages 3, 4, and 5.

Inspirations from AlexNet: VGGNet uses ReLU non-linearities; the stack of convolution layers is followed by 3 fully-connected layers, with a softmax at the end; dropout layers are added after the first two fully-connected layers (dropout ratio set to 0.5).
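Putting the stage structure and the AlexNet-style head together, here is a minimal PyTorch sketch of VGG-16 (the layer counts follow configuration D of the paper; the helper names are my own):

```python
import torch
import torch.nn as nn

def vgg_stage(in_ch, out_ch, n_convs):
    """One VGG stage: n_convs 3x3 convs (stride 1, pad 1) + ReLU, then a 2x2 max-pool."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                             kernel_size=3, stride=1, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

class VGG16(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        # Channels double after each pool until 512; conv counts per stage
        # are 2, 2, 3, 3, 3 (VGG-19 uses 2, 2, 4, 4, 4 instead).
        self.features = nn.Sequential(
            vgg_stage(3, 64, 2),
            vgg_stage(64, 128, 2),
            vgg_stage(128, 256, 3),
            vgg_stage(256, 512, 3),
            vgg_stage(512, 512, 3),
        )
        # AlexNet-style head: two dropout-regularized FC layers, then the classifier.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, num_classes),  # softmax is folded into the loss
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# A 224x224 input passes through five 2x2 pools, leaving a 7x7 grid before the FC layers.
print(VGG16()(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
```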


A 3 $\times$ 3 convolution is used as it is the smallest size that can capture the notion of left/right, up/down, and center. A stack of two such 3 $\times$ 3 conv layers has an effective receptive field of 5 $\times$ 5, and a stack of three has a 7 $\times$ 7 effective receptive field. So what have we gained by using, for instance, a stack of three 3 $\times$ 3 conv layers instead of a single 7 $\times$ 7 layer?

  1. Three non-linearities instead of one, which makes the decision function more discriminative.
  2. A decrease in the number of parameters: for C input and output channels, three 3 $\times$ 3 layers have $3(3 \times 3 \times C^2) = 27C^2$ parameters, whereas a single 7 $\times$ 7 layer would have $7 \times 7 \times C^2 = 49C^2$ parameters.
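A quick sanity check of both claims (biases are ignored to match the formulas above; PyTorch is used only to count weights):

```python
import torch.nn as nn

C = 64  # any channel count works

def n_weights(layers):
    return sum(m.weight.numel() for m in layers)

stack_3x3 = [nn.Conv2d(C, C, kernel_size=3, bias=False) for _ in range(3)]
single_7x7 = [nn.Conv2d(C, C, kernel_size=7, bias=False)]

print(n_weights(stack_3x3), 27 * C**2)   # 110592 110592
print(n_weights(single_7x7), 49 * C**2)  # 200704 200704

# Effective receptive field of n stacked stride-1 k x k convs: 1 + n*(k - 1)
print(1 + 3 * (3 - 1))  # 7 -> same as a single 7x7 layer
```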

A padding of 1 is used so that the spatial dimension is preserved after convolution ('same' padding): for a 3 $\times$ 3 kernel with stride 1, the output width is $\frac{W - 3 + 2 \cdot 1}{1} + 1 = W$.

No normalization is used in the network: the authors found that (local response) normalization did not improve performance on the dataset, while increasing memory consumption and computation time.



Training: The training hyperparameters and choices follow AlexNet. The model was trained with the cross-entropy loss function using mini-batch gradient descent with a batch size of 256 and momentum of 0.9. The training was regularized with an L2 weight decay of 0.0005.

The learning rate was initialized at 0.01 and then decreased by a factor of 10 whenever the validation set accuracy stopped improving. In total, the learning rate was decreased 3 times, and training was stopped after 74 epochs.
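These hyperparameters map directly onto a standard PyTorch optimizer and scheduler. A minimal sketch (not the authors' original pipeline; `train_loader` and the `val_accuracy` helper are assumed to exist):

```python
import torch

model = VGG16()  # from the sketch above
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)
# Divide the learning rate by 10 when validation accuracy plateaus.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='max', factor=0.1)

for epoch in range(74):
    for images, labels in train_loader:  # batch size 256; loader assumed
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step(val_accuracy(model))  # hypothetical validation helper
```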

In spite of the larger number of parameters and the greater depth compared to AlexNet, VGGNet required fewer epochs to converge due to (a) the implicit regularisation imposed by greater depth and smaller conv filter sizes, and (b) better initialization of the weights and biases: the deeper configurations were initialized from a shallower network trained first, and the authors later noted that Xavier (Glorot) random initialization works just as well without pre-training.
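Xavier initialization is available directly in PyTorch; a minimal sketch of applying it to every conv and fully-connected layer of the model above:

```python
import torch.nn as nn

def init_xavier(module):
    # Xavier/Glorot initialization for conv and FC weights; biases set to zero.
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.xavier_uniform_(module.weight)
        nn.init.zeros_(module.bias)

model = VGG16()  # from the sketch above
model.apply(init_xavier)
```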


Overall, VGGNet highlighted the importance of depth in achieving better performance on the image classification task. The important design choices of the paper are:

  • 5 Stages: Each conv 3 $\times$ 3, s=1, p=1; each max-pool 2 $\times$ 2, s=2
  • Number of channels: 64, 128, 256, 512, 512 (doubling after each pool, capped at 512)
  • No Normalization used
  • Activation Function: ReLU
  • Data pre-processing: subtract per-channel mean (mean RGB value from each pixel)
  • Weight Initialization: from a shallower pre-trained network (Xavier also works)
  • Regularization: L2 Weight decay ($\lambda$ = 5e-4), Dropout (p=0.5)
  • Learning rate: 0.01 and reduced (divided by 10) when val accuracy plateaus
  • Optimization Method: SGD + Momentum (m=0.9) with batch size of 256
  • Loss function: Cross-Entropy loss

Links to the papers: Visualizing and Understanding Convolutional Networks, Very Deep Convolutional Networks for Large-Scale Image Recognition


Trivia: VGGNet is named after the Visual Geometry Group at Oxford, where a grad student (Karen Simonyan) and a faculty member (Andrew Zisserman) came up with this idea and managed to come very close to the performance of GoogLeNet, which was built by a whole team at Google with access to lots of resources!
