Calculate Receptive Field for VGG16

Calculate the receptive field for VGG16 and potentially other neural networks. Python code is provided along with a clear graph explaining the VGG16 structure.

vgg16

What is a Receptive Field and How do we calculate it?

Background quote (feel free to skip it if you don't want to read it):

To understand how components in a CNN depend on components in the layers before it, and in particular on components of the input. Since CNNs can incorporate blocks that perform complex operations …, this information is generally available only at “run time” and cannot be uniquely determined given only the structure of the network. Furthermore, blocks can implement complex operations that are difficult to characterise in simple terms. Therefore, the analysis will be necessarily limited in scope.

We consider blocks such as convolutions for which one can deterministically establish
dependency chains between network components. We also assume that all the inputs $x$ and outputs $y$ are in the usual form of spatial maps, and therefore indexed as $x_{i,j,d,k}$ where i, j are spatial coordinates.

Consider a layer y = f(x). We are interested in establishing which components of x
influence which components of y. We also assume that this relation can be expressed in terms of a sliding rectangular window field, called receptive field. This means that the output component $y_{i'',j''}$ depends only on the input components $x_{i,j}$ where $(i, j) \in \Omega(i'', j'')$ (note that feature channels are implicitly coalesced in this discussion). The set $\Omega(i'', j'')$ is a rectangle defined as follows:

$$ i \in \alpha_h (i'' - 1) + \beta_h + \left[-\frac{\Delta_h - 1}{2}, \frac{\Delta_h - 1}{2}\right] $$

$$ j \in \alpha_v (j'' - 1) + \beta_v + \left[-\frac{\Delta_v - 1}{2}, \frac{\Delta_v - 1}{2}\right] $$

where $(\alpha_h, \alpha_v)$ is the stride, $(\beta_h, \beta_v)$ the offset, and $(\Delta_h, \Delta_v)$ the receptive field size.

—— Matconvnet Manual 5.1

From my understanding, the receptive field is a way of measuring the dependency between network components, i.e. how many blocks of a previous layer determine one block of the current map.

To calculate this, we need a set of parameters for each layer: filter size $k$, stride $s$, and offset (padding) $p$. We then calculate the compound parameters when layers are stacked together. (Note that I use different variable names from the manual.)

For a simple filter, the filter size $k$, stride $s$, and left padding $p$ are straightforward.

Many blocks in a neural network (e.g. max pooling, LRN, ReLU, most loss functions) have a filter-like receptive field geometry:

  • ReLU can be considered a 1 × 1 filter, so $k = 1$, $s = 1$ and $p = 0$.
  • 2 × 2 max pooling can be considered a 2 × 2 filter with $k = s = 2$. The padding strategy varies between libraries, though: even when padding is zero, some libraries read past the edge of the input to make sure every block is counted. Luckily, this is not a concern for VGG16.
  • Flatten and fully connected layers (FCL) are not filters. For computational simplicity, however, an FCL can be treated as a filter whose size equals the size of its input. Since the FCL is usually the last step, this definition causes no problems later on, but it is only valid for receptive field calculation.
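As a rough sketch of the bullets above, the mapping from layer type to filter parameters could be written as a small lookup. The function name `filter_params` and the layer labels are hypothetical, chosen only for illustration, and zero padding for pooling is an assumption that matches the VGG16 case discussed here.

```python
def filter_params(layer, input_size=None):
    """Return (k, s, p) for a few filter-like layers.

    `layer` is one of the illustrative labels "relu", "maxpool2x2" or "fc".
    `input_size` is the spatial size of the map fed into a fully connected layer.
    """
    if layer == "relu":
        # ReLU acts element-wise, like a 1 x 1 filter.
        return (1, 1, 0)
    if layer == "maxpool2x2":
        # 2 x 2 max pooling with stride 2; padding is assumed to be zero.
        return (2, 2, 0)
    if layer == "fc":
        # A fully connected layer sees its whole input map at once.
        if input_size is None:
            raise ValueError("input_size is required for a fully connected layer")
        return (input_size, 1, 0)
    raise ValueError("unknown layer type: {}".format(layer))


print(filter_params("relu"))
print(filter_params("fc", input_size=7))
```

For VGG16 the last pooling layer produces a 7 × 7 map, which is why the first fully connected layer is treated as a 7 × 7 filter later on.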

We can proceed to work on the math equations for composition of layers. Some details were left out for simplicity.

Math

Along one dimension, given filter size $k$, stride $s$, left padding $p^-$, and a sample output point $i_o$, the range of points in the input field that affect $i_o$ is:

$i_{in} \in s*i_o - p^- + [0, k-1]$

$= s*i_o + (\frac{k-1}{2} - p^-) + [-\frac{k-1}{2}, \frac{k-1}{2}]$

$= s*i_o + p + [-\frac{k-1}{2}, \frac{k-1}{2}]$

where we define $p = \frac{k-1}{2} - p^-$ as the offset, and the range is inclusive. The calculation is the same for each dimension.
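To make the two forms of this range concrete, here is a minimal sketch with 0-based indices. The function names are my own, and negative indices simply mean positions inside the left padding.

```python
def input_range(i_o, k, s, p_minus):
    """Inclusive (first, last) input indices that affect output index i_o."""
    first = s * i_o - p_minus
    last = first + k - 1
    return first, last


def input_range_centered(i_o, k, s, p_minus):
    """Same range, written as center plus/minus half the filter size."""
    offset = (k - 1) / 2 - p_minus   # the offset p defined in the text
    center = s * i_o + offset
    half = (k - 1) / 2
    return center - half, center + half


# 3-wide filter, stride 1, left padding 1: output 0 sees inputs -1..1
print(input_range(0, k=3, s=1, p_minus=1))
# 2-wide pooling, stride 2, no padding: output 5 sees inputs 10..11
print(input_range(5, k=2, s=2, p_minus=0))
print(input_range_centered(5, k=2, s=2, p_minus=0))
```

Both functions describe the same interval; the centered form is the one that composes nicely when layers are stacked.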

Side Note – Dimension

To get the dimension of the next map for an input map of length $n$, we look for the largest $h$ such that the range this equation reaches in the input map still fits within the padded input map. We can write this as index_max - index_min + 1 <= length_of_inmap:

$$s*(h-1) - p^- + k-1 - (0 - p^- + 0) + 1 \leq n + p^- + p^+$$

If padding is symmetrical, $P = p^- = p^+$, then we get:

$$h \leq \frac{n+2P-k}{s} + 1$$

Since $h$ must be an integer, the largest valid value is $h = \lfloor \frac{n+2P-k}{s} \rfloor + 1$.
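The bound above translates directly into integer division. The sketch below is only an illustration: the helper names are mine, and the VGG16 settings used in the walk-through (a 224-pixel input, 3 × 3 convolutions with padding 1, 2 × 2 pooling without padding) are assumptions that are not derived in this post.

```python
def output_size(n, k, s, pad):
    """Largest h with s*(h-1) + k <= n + 2*pad, for symmetric padding."""
    return (n + 2 * pad - k) // s + 1


def vgg16_pool5_size(n=224):
    """Spatial size after the five conv/pool blocks, under the assumptions above."""
    size = n
    for convs_in_block in (2, 2, 3, 3, 3):
        for _ in range(convs_in_block):
            size = output_size(size, k=3, s=1, pad=1)   # size is preserved
        size = output_size(size, k=2, s=2, pad=0)       # size is halved
    return size


print(vgg16_pool5_size())  # 7
```

The resulting 7 × 7 map is consistent with treating the first fully connected layer as a 7 × 7 filter.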

Note: The manual is written for MATLAB, so coordinates start at 1 in its equations. For convenience, I rewrote the equations above so that coordinates start at 0. This does not affect the calculation of receptive fields.

Composing receptive fields

To calculate the combination of two layers with parameters $(k_0, s_0, p_0)$ and $(k_1, s_1, p_1)$, we can use (keeping the manual's 1-based indices):

$$i_0 = s_0 ( i_1 - 1) + p_0 \pm \frac{k_0 - 1}{2}$$

$$i_1 = s_1 (i_2 - 1) + p_1 \pm \frac{k_1-1}{2}$$

Replacing $i_1$ with the second equation, we get:

$i_0 = s_0 ( s_1 (i_2 - 1) + p_1 \pm \frac{k_1-1}{2} - 1)+p_0 \pm \frac{k_0-1}{2}$

$= s_0 s_1(i_2-1)+(s_0(p_1-1) + p_0) \pm\frac{(s_0(k_1-1) + k_0)-1}{2}$

From the result we can deduce that the compound $k'$, $s'$ and $p'$ are:

$s' = s_0s_1$

$k' = s_0(k_1-1)+k_0$

$p' = s_0(p_1-1)+p_0$
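The composition rule can be tried out on a few layers with a short sketch like the one below. The function name `compose` is my own, and the offsets used in the example (1 for a 3 × 3 convolution with padding 1, 1.5 for 2 × 2 pooling without padding) are assumed values in the manual's 1-based convention.

```python
from functools import reduce


def compose(inner, outer):
    """Compose two layers given as (k, s, p); `inner` is closer to the input."""
    k0, s0, p0 = inner
    k1, s1, p1 = outer
    k = s0 * (k1 - 1) + k0
    s = s0 * s1
    p = s0 * (p1 - 1) + p0
    return (k, s, p)


conv3 = (3, 1, 1)      # 3x3 convolution, stride 1, assumed offset 1
pool2 = (2, 2, 1.5)    # 2x2 pooling, stride 2, assumed offset 1.5

# Two stacked 3x3 convolutions behave like one 5x5 filter.
print(compose(conv3, conv3))

# First VGG16 block: conv, conv, pool.
print(reduce(compose, [conv3, conv3, pool2]))
```

Folding the layers from input to output with `reduce` mirrors stacking them one at a time, which is also how the automated calculation below proceeds.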

Overlaying receptive fields

To calculate overlaying receptive fields, such as the ones in GoogLeNet, see the details in Chapter 5.6 of the manual.

Automation (Code)

Calculating all 16 layers by hand is tedious. Since we already have a straightforward equation, the best way is to write some code for it:

```python
# Each entry is [filter size k, stride s].
vgg_16 = [
    # 1
    [3, 1], [3, 1], [2, 2],
    # 2
    [3, 1], [3, 1], [2, 2],
    # 3
    [3, 1], [3, 1], [3, 1], [2, 2],
    # 4
    [3, 1], [3, 1], [3, 1], [2, 2],
    # 5
    [3, 1], [3, 1], [3, 1], [2, 2],
    # fc6, fake convolutional layer
    [7, 1]
]
vgg16_layers = [
    "3x3 conv 64", "3x3 conv 64", "pool1",
    "3x3 conv 128", "3x3 conv 128", "pool2",
    "3x3 conv 256", "3x3 conv 256", "3x3 conv 256", "pool3",
    "3x3 conv 512", "3x3 conv 512", "3x3 conv 512", "pool4",
    "3x3 conv 512", "3x3 conv 512", "3x3 conv 512", "pool5",
    "7x7 fc"
]


def cal_receptive_field(kspairs, layers=None):
    # K: composed kernel, also the receptive field
    # S: composed stride
    K, S = 1, 1
    # H = 224
    if not layers:
        layers = range(len(kspairs))
    for layer, kspair in zip(layers, kspairs):
        k, s = kspair
        K = (k - 1) * S + K
        S = S * s
        # H = H // s
        # print('image size {0}'.format(H))

        print('layer {:<15}: {} [{:3},{:2}]'.format(layer, kspair, K, S))


cal_receptive_field(vgg_16, vgg16_layers)
```

Note that we omit offset p for simplicity.

Output:

```
layer 3x3 conv 64    : [3, 1] [  3, 1]
layer 3x3 conv 64    : [3, 1] [  5, 1]
layer pool1          : [2, 2] [  6, 2]
layer 3x3 conv 128   : [3, 1] [ 10, 2]
layer 3x3 conv 128   : [3, 1] [ 14, 2]
layer pool2          : [2, 2] [ 16, 4]
layer 3x3 conv 256   : [3, 1] [ 24, 4]
layer 3x3 conv 256   : [3, 1] [ 32, 4]
layer 3x3 conv 256   : [3, 1] [ 40, 4]
layer pool3          : [2, 2] [ 44, 8]
layer 3x3 conv 512   : [3, 1] [ 60, 8]
layer 3x3 conv 512   : [3, 1] [ 76, 8]
layer 3x3 conv 512   : [3, 1] [ 92, 8]
layer pool4          : [2, 2] [100,16]
layer 3x3 conv 512   : [3, 1] [132,16]
layer 3x3 conv 512   : [3, 1] [164,16]
layer 3x3 conv 512   : [3, 1] [196,16]
layer pool5          : [2, 2] [212,32]
layer 7x7 fc         : [7, 1] [404,32]
```
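The composed kernel sizes in this output can be cross-checked independently by walking backward from a single output position and widening an index interval layer by layer. This is only a sanity-check sketch: the function name `receptive_interval` is my own, and padding is ignored, just as in the calculation above.

```python
def receptive_interval(kspairs, out_index=0):
    """Inclusive input interval (lo, hi) that feeds one output index.

    `kspairs` lists (k, s) from the input side to the output side.
    With no padding, output index i of a layer reads inputs s*i .. s*i + k - 1.
    """
    lo = hi = out_index
    for k, s in reversed(kspairs):
        lo = s * lo
        hi = s * hi + k - 1
    return lo, hi


two_conv_block = [(3, 1), (3, 1), (2, 2)]
three_conv_block = [(3, 1), (3, 1), (3, 1), (2, 2)]
vgg_16 = two_conv_block * 2 + three_conv_block * 3 + [(7, 1)]

lo, hi = receptive_interval(vgg_16[:-1])   # up to pool5
print(hi - lo + 1)  # 212

lo, hi = receptive_interval(vgg_16)        # including the 7x7 fc layer
print(hi - lo + 1)  # 404
```

Both interval widths agree with the last two rows of the output, which is a useful check that the forward recurrence on K and S was applied correctly.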

Thanks for the brilliant inspiration from ih4cku!

Results

vgg16

References:

Reddit Post

MatConvNet manual, Chapter 5