Convolutional Neural Networks (CNNs)

What is a CNN?
A Convolutional Neural Network (CNN) is a specialized neural network architecture designed primarily for processing grid-like data such as images. CNNs use mathematical operations called convolutions instead of general matrix multiplication in at least one of their layers.
CNN Architecture Explained
Layer Type | Purpose | Operation |
---|---|---|
Convolutional Layer | Feature extraction from input | Applies filters to detect patterns like edges, textures, etc. |
Activation Layer (ReLU) | Introduces non-linearity | $f(x) = \max(0, x)$ |
Pooling Layer | Reduces spatial dimensions | Summarizes features (max or average) in a region |
Fully Connected Layer | Classification based on features | Standard neural network connections |
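As a concrete illustration of how these layer types fit together, here is a minimal sketch assuming PyTorch; the channel counts, kernel sizes, and 10-class output are arbitrary example choices, not something specified above.

```python
import torch
import torch.nn as nn

# Illustrative stack of the layer types from the table above:
# convolution -> ReLU -> pooling, repeated, then a fully connected classifier.
model = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1),  # feature extraction
    nn.ReLU(),                                                            # non-linearity
    nn.MaxPool2d(kernel_size=2, stride=2),                                # spatial downsampling
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2, 2),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),  # fully connected classifier for 32x32 inputs
)

x = torch.randn(1, 3, 32, 32)   # one RGB image of size 32x32
print(model(x).shape)           # torch.Size([1, 10])
```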
Key Advantages of CNNs:
- Parameter Sharing: Same filter is applied across the entire image
- Sparse Connectivity: Each output value depends only on a small local region
- Translation Invariance: Features can be detected regardless of their position (strictly, the convolution itself is translation equivariant, as proved below; pooling adds a degree of invariance)
- Hierarchical Feature Learning: Early layers learn simple features, deeper layers learn complex ones
The Convolution Operation Explained
The 2D Convolution:
$$ (I * K)(i, j) = \sum_{m}\sum_{n} I(i+m, j+n) \cdot K(m, n) $$ where:
- $I$ is the input (e.g., image)
- $K$ is the kernel/filter
- $i,j$ are the coordinates in the output
- $m,n$ are the coordinates in the kernel
How Convolution Works:
- A small filter (kernel) slides over the input
- At each position, element-wise multiplication and sum are computed
- The result forms one element in the output feature map
- Different filters detect different features (horizontal edges, vertical edges, etc.)
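The sliding-window procedure can be written in a few lines. The following NumPy sketch (the function name `conv2d_valid` is just illustrative) implements the formula above with no padding and stride 1.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map.
    Matches (I * K)(i, j) = sum_m sum_n I(i+m, j+n) * K(m, n) from above."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # element-wise multiply the local patch with the kernel and sum
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
vertical_edge = np.array([[1., 0., -1.]] * 3)   # simple vertical-edge detector
print(conv2d_valid(image, vertical_edge))        # 3x3 feature map
```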
Mathematical Details of CNN Components
1. Convolutional Layer
For a layer with input $X$ and kernel $W$, the output feature map $Y$ is:
$$ Y_{i,j,k} = \sum_{m=0}^{F_h-1} \sum_{n=0}^{F_w-1} \sum_{c=0}^{C_{in}-1} X_{i+m,j+n,c} \cdot W_{m,n,c,k} + b_k $$
where:
- $F_h, F_w$ are filter height and width
- $C_{in}$ is the number of input channels
- $b_k$ is the bias for the $k$-th filter
- $i,j$ are spatial coordinates in the output
- $k$ is the index of the output channel
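A direct, unoptimized NumPy translation of this formula (stride 1, no padding; all shapes below are illustrative) may help make the indexing concrete.

```python
import numpy as np

def conv_layer_forward(X, W, b):
    """X: (H, W, C_in) input, W: (F_h, F_w, C_in, K) filters, b: (K,) biases.
    Returns Y with Y[i, j, k] = sum_{m,n,c} X[i+m, j+n, c] * W[m, n, c, k] + b[k]."""
    H, Wd, C_in = X.shape
    F_h, F_w, _, K = W.shape
    out_h, out_w = H - F_h + 1, Wd - F_w + 1
    Y = np.zeros((out_h, out_w, K))
    for i in range(out_h):
        for j in range(out_w):
            patch = X[i:i + F_h, j:j + F_w, :]        # local (F_h, F_w, C_in) patch
            for k in range(K):
                Y[i, j, k] = np.sum(patch * W[:, :, :, k]) + b[k]
    return Y

X = np.random.randn(8, 8, 3)       # 8x8 input with 3 channels
W = np.random.randn(3, 3, 3, 4)    # four 3x3 filters
b = np.zeros(4)
print(conv_layer_forward(X, W, b).shape)   # (6, 6, 4)
```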
2. Pooling Layer
Max Pooling with a $2 \times 2$ window and stride 2:
$$ Y_{i,j,k} = \max_{0 \leq m,n < 2} X_{2i+m, 2j+n, k} $$
Average Pooling with a $2 \times 2$ window and stride 2:
$$ Y_{i,j,k} = \frac{1}{4} \sum_{m=0}^{1} \sum_{n=0}^{1} X_{2i+m, 2j+n, k} $$
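Both pooling formulas can be implemented by grouping the input into non-overlapping 2×2 blocks; the NumPy sketch below assumes even spatial dimensions and a channels-last layout.

```python
import numpy as np

def pool2x2(X, mode="max"):
    """X: (H, W, C) with even H and W. Returns the (H//2, W//2, C) pooled map."""
    H, W, C = X.shape
    # reshape into non-overlapping 2x2 blocks: (H//2, 2, W//2, 2, C)
    blocks = X.reshape(H // 2, 2, W // 2, 2, C)
    if mode == "max":
        return blocks.max(axis=(1, 3))    # Y[i,j,k] = max over the 2x2 window
    return blocks.mean(axis=(1, 3))       # Y[i,j,k] = average over the 2x2 window

X = np.arange(4 * 4 * 2, dtype=float).reshape(4, 4, 2)
print(pool2x2(X, "max").shape)   # (2, 2, 2)
print(pool2x2(X, "avg").shape)   # (2, 2, 2)
```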
3. Output Size Calculation
For a convolution with input size $I$, kernel size $K$, padding $P$, and stride $S$:
$$ \text{Output Size} = \left\lfloor \frac{I - K + 2P}{S} + 1 \right\rfloor $$
where $\lfloor \cdot \rfloor$ denotes the floor operation.
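As a quick sanity check, the formula can be wrapped in a small helper (the function name is illustrative):

```python
def conv_output_size(I, K, P=0, S=1):
    """Output spatial size for input size I, kernel size K, padding P, stride S."""
    return (I - K + 2 * P) // S + 1

# Example: a 224x224 input, 7x7 kernel, padding 3, stride 2 (a ResNet-style stem)
print(conv_output_size(224, 7, P=3, S=2))   # 112
```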
Why CNNs Work: The Theory
Key Theoretical Principles:
- Sparse Connectivity: Each neuron connects to only a small region, reducing parameters
- Parameter Sharing: Same weights used across different positions, enforcing translation equivariance
- Equivariance to Translation: If input shifts, output shifts by the same amount
Mathematical Proof of Translation Equivariance
Let's define a translation operator $T_u$ that shifts an input by $u$ pixels:
$$ [T_u(I)](x) = I(x-u) $$
For the convolution operation $f(I) = I * K$, written here with the flipped-kernel definition $[I * K](x) = \sum_y I(y) \cdot K(x-y)$, we can prove:
$$ \begin{align} f(T_u(I))(x) &= [T_u(I) * K](x) \\ &= \sum_y T_u(I)(y) \cdot K(x-y) \\ &= \sum_y I(y-u) \cdot K(x-y) \end{align} $$
With substitution $z = y-u$:
$$ \begin{align} f(T_u(I))(x) &= \sum_z I(z) \cdot K(x-(z+u)) \\ &= \sum_z I(z) \cdot K((x-u)-z) \\ &= [I * K](x-u) \\ &= T_u(f(I))(x) \end{align} $$
This proves that $f(T_u(I)) = T_u(f(I))$, demonstrating translation equivariance.
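This equality can also be verified numerically. The sketch below assumes SciPy is available and uses circular shifts with wrap-around boundaries so that boundary effects do not break the exact equality.

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
I = rng.standard_normal((16, 16))
K = rng.standard_normal((3, 3))
shift = (2, 5)   # the translation u, applied as a circular shift

# f(T_u(I)): shift the input, then convolve (wrap boundary keeps the check exact)
shift_then_conv = convolve2d(np.roll(I, shift, axis=(0, 1)), K, mode="same", boundary="wrap")
# T_u(f(I)): convolve, then shift the output
conv_then_shift = np.roll(convolve2d(I, K, mode="same", boundary="wrap"), shift, axis=(0, 1))

print(np.allclose(shift_then_conv, conv_then_shift))   # True
```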
Famous CNN Architectures
Architecture | Year | Key Innovation | Performance |
---|---|---|---|
LeNet-5 | 1998 | First successful CNN for digit recognition | ~99% on MNIST |
AlexNet | 2012 | ReLU, Dropout, GPU implementation | Top-5 error: 15.3% on ImageNet |
VGGNet | 2014 | Small filters (3×3), deeper network | Top-5 error: 7.3% on ImageNet |
ResNet | 2015 | Skip connections to solve vanishing gradient | Top-5 error: 3.57% on ImageNet |
ResNet's Skip Connection:
One of the most important innovations in CNN architecture was the residual block in ResNet, which can be expressed as:
$$ y = F(x, \{W_i\}) + x $$
where $F$ is a residual mapping to be learned and $x$ is the identity mapping (skip connection).
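A minimal residual block sketch, assuming PyTorch; two 3×3 convolutions with batch normalization stand in for $F$, and the identity path assumes the input and output shapes already match (real ResNets use a projection when they do not).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """y = F(x, {W_i}) + x, with F realized as two 3x3 convolutions plus batch norm."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        residual = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        return F.relu(residual + x)   # skip connection: add the identity mapping

block = ResidualBlock(64)
x = torch.randn(1, 64, 32, 32)
print(block(x).shape)   # torch.Size([1, 64, 32, 32])
```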
Backpropagation in CNNs
Gradient Computation for Convolutional Layers
For a convolutional layer with weights $W$ and input $X$, the gradient of the loss $L$ with respect to $W$ is:
$$ \frac{\partial L}{\partial W_{m,n,c,k}} = \sum_{i,j} \frac{\partial L}{\partial Y_{i,j,k}} \cdot X_{i+m,j+n,c} $$
And the gradient with respect to the input is:
$$ \frac{\partial L}{\partial X_{i,j,c}} = \sum_{k} \sum_{m,n} \frac{\partial L}{\partial Y_{i-m,j-n,k}} \cdot W_{m,n,c,k} $$
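The weight-gradient formula can be made concrete with a small NumPy sketch; the single-channel, single-filter, stride-1, no-padding case below is illustrative, and `dL_dY` stands for the upstream gradient flowing back from the loss.

```python
import numpy as np

def conv_weight_grad(X, dL_dY, F_h, F_w):
    """dL/dW[m, n] = sum_{i,j} dL/dY[i, j] * X[i+m, j+n]  (single channel, single filter)."""
    dW = np.zeros((F_h, F_w))
    out_h, out_w = dL_dY.shape
    for m in range(F_h):
        for n in range(F_w):
            # the (m, n)-shifted window of X lines up with the whole output gradient
            dW[m, n] = np.sum(dL_dY * X[m:m + out_h, n:n + out_w])
    return dW

X = np.random.randn(6, 6)
dL_dY = np.random.randn(4, 4)      # upstream gradient for a 3x3 filter, stride 1, no padding
print(conv_weight_grad(X, dL_dY, 3, 3).shape)   # (3, 3)
```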
Why CNNs Excel at Image Processing
Biological Inspiration:
CNNs are inspired by the visual cortex in animals. The receptive fields of neurons in the visual cortex are localized, and different neurons respond to visual stimuli in restricted regions. Similarly, CNN filters respond to local patterns in their receptive fields.
Hierarchical Feature Learning:
- Early layers: Detect simple features like edges and corners
- Middle layers: Combine simple features into more complex patterns like textures
- Deep layers: Recognize high-level concepts like objects and faces
This hierarchical structure allows CNNs to build complex representations from simple building blocks.
Practical Considerations
Common Challenges:
- Overfitting: Especially with limited data
- Computational Cost: Training deep CNNs requires significant resources
- Vanishing/Exploding Gradients: Can make training very deep networks difficult
Best Practices:
- Data Augmentation: Artificially increase training data by applying transformations
- Batch Normalization: Normalize activations to stabilize training
- Transfer Learning: Use pre-trained models and fine-tune on specific tasks
- Regularization: Apply dropout or L2 regularization to prevent overfitting
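A sketch combining several of these practices, assuming PyTorch and torchvision: random transforms for data augmentation, batch normalization, dropout, and L2 regularization via the optimizer's weight-decay term. The specific transforms and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn
from torchvision import transforms

# Data augmentation: random transforms applied to each training image.
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])

# Regularization: batch norm in the feature extractor, dropout in the classifier head.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Dropout(p=0.5),
    nn.Linear(16 * 16 * 16, 10),   # assumes 32x32 inputs
)

# L2 regularization via the optimizer's weight_decay term.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)
```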
References
- LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.
- Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. NeurIPS.
- He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. CVPR.
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.