
Convolutional Neural Networks

Convolutional neural networks (CNNs) are a type of deep learning model that is particularly useful for image classification and recognition, such as detecting different objects and faces in photos. A CNN receives an image as input, recognizes the objects in it, and classifies them into categories such as humans, animals, vehicles, or buildings. A computer reads a digital image as an array of pixels, where each pixel holds a specific color value. In a CNN, the input image passes through several layers: convolution layers, pooling layers, and fully connected layers.
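To illustrate how a computer sees an image as an array of pixels, here is a minimal NumPy sketch; the image size and the random values stand in for a real photo and are only for illustration.

```python
import numpy as np

# A computer stores an RGB image as a 3-D array of pixel values:
# height x width x color channels, each value typically between 0 and 255.
image = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)

print(image.shape)   # (224, 224, 3)
print(image[0, 0])   # color value of the top-left pixel, e.g. [R G B]
```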

Convolution Layers

In the convolution layer, an image is divided into smaller pieces, and features are extracted from each piece. The convolution layer preserves the spatial relationship between pixels while learning features from those smaller pieces. It filters the smaller chunks of pixel information using kernels (filters). The image is represented as a matrix, and the filter is also a matrix; the filter matrix has fewer rows and columns than the image matrix. The filter matrix is multiplied element-wise with the image matrix at many different locations, and how far the filter moves between multiplications is determined by the stride. If the stride is one, the filter moves 1 pixel after each multiplication; if the stride is two, it moves 2 pixels, and so on. If the filter does not fit evenly within the pixel matrix, extra zeros are added around the pixel matrix; this is called zero padding. Another way to make the filter fit is to drop the nonessential border of the image that the filter does not cover. Sliding the filter over the image produces a feature map whose size depends on the image size, the filter size, the stride, and the padding. Different filter matrices produce different results, for example sharpening or blurring an image, or detecting edges.
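A minimal NumPy sketch of this sliding-filter operation follows; the function name, the toy 6x6 image, and the edge-detection filter values are chosen just for illustration.

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Slide a small filter over an image and return the feature map."""
    if padding > 0:
        image = np.pad(image, padding, mode="constant")  # zero padding

    ih, iw = image.shape
    kh, kw = kernel.shape
    oh = (ih - kh) // stride + 1   # output height
    ow = (iw - kw) // stride + 1   # output width

    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i * stride:i * stride + kh,
                          j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)  # element-wise multiply, then sum
    return out

# A 3x3 edge-detection filter applied to a toy 6x6 "image"
image = np.arange(36, dtype=float).reshape(6, 6)
edge_filter = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]], dtype=float)
print(conv2d(image, edge_filter, stride=1, padding=1).shape)  # (6, 6)
```

With a stride of one and one pixel of zero padding, the feature map here keeps the same 6x6 size as the input; a larger stride or no padding would shrink it.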

ReLU Function

The Rectified Linear Unit (ReLU) is an activation function. Its output equals the input if the input is positive, and zero if the input is negative. ReLU adds nonlinearity to CNNs and prevents the feature-map values from being negative. The sigmoid function can also be used for this purpose, but ReLU is usually preferred because of its better performance.
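A one-line NumPy sketch of ReLU applied to a small, made-up feature map:

```python
import numpy as np

def relu(x):
    # ReLU: keep positive values, replace negative values with zero
    return np.maximum(0, x)

feature_map = np.array([[-2.0, 1.5],
                        [ 3.0, -0.5]])
print(relu(feature_map))
# [[0.  1.5]
#  [3.  0. ]]
```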

Pooling Layer

Pooling layers reduce the size of the feature maps produced by the convolution layers, which decreases the number of parameters; this is also called down-sampling. In this process, the dimensions of the feature map are reduced while the important feature information is retained. There are several types of pooling: max pooling selects the largest element in each pooling window, average pooling takes the average of the elements in each window, and sum pooling takes their sum.
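The following NumPy sketch shows the three pooling types over non-overlapping 2x2 windows; the window size and the toy feature map are assumptions made for the example.

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Down-sample a feature map with non-overlapping pooling windows."""
    h, w = feature_map.shape
    out = np.zeros((h // size, w // size))
    for i in range(h // size):
        for j in range(w // size):
            window = feature_map[i * size:(i + 1) * size,
                                 j * size:(j + 1) * size]
            if mode == "max":
                out[i, j] = window.max()    # max pooling
            elif mode == "avg":
                out[i, j] = window.mean()   # average pooling
            else:
                out[i, j] = window.sum()    # sum pooling
    return out

fm = np.array([[1., 3., 2., 4.],
               [5., 6., 1., 2.],
               [7., 2., 9., 0.],
               [4., 3., 1., 8.]])
print(pool2d(fm, mode="max"))
# [[6. 4.]
#  [7. 9.]]
```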

Fully Connected Layer

After the pooling layer, the feature maps are flattened into a vector. The vector is then fed to a fully connected neural network. An activation function such as softmax or sigmoid is used to classify the detected objects by assigning each category a probability between 0 and 1. For example, if a cat's image is fed to a CNN and the CNN detects the cat successfully, it assigns the cat category a probability close to 1 and the dog category a probability close to 0.
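A small NumPy sketch of the softmax step; the raw class scores for the hypothetical cat and dog categories are made up for the example.

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability, then normalize
    # the exponentials so the outputs sum to 1.
    exp = np.exp(scores - np.max(scores))
    return exp / exp.sum()

# Hypothetical raw scores from the last fully connected layer
# for the classes [cat, dog].
scores = np.array([4.2, 0.7])
probs = softmax(scores)
print(probs)         # ~[0.97 0.03]  -> "cat" gets a probability close to 1
print(probs.sum())   # 1.0
```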
