Moore’s Law is a famous observation in computing that the number of transistors in integrated circuits doubles roughly every two years, symbolising the rapid growth of the power of computers. Although industry experts predict that Moore’s Law will cease to apply in the near future, in the 55 years that it has existed it has acted as the pacemaker of the entire technological world, the driving force behind revolutionary developments, and the fertile foundation for a fruitful future. The field of computing this observation has influenced most is Artificial Intelligence, and more specifically Machine Learning. Ever since Alan Turing’s introduction of the Turing Test, there has been breakthrough after breakthrough, and the seedling of AI has now germinated and bloomed into an entire field of flora. One especially radiant flower is the Convolutional Neural Network, or CNN for short. CNNs, after being watered by copious amounts of data, have come to dominate the sub-fields of image classification, recommender systems and natural language processing, and have provided a strong basis for the next generation of architectures.
Neural networks, as the name suggests, mimic the neurons inside the brain, giving machines the ability to “learn”. When a neuron receives particular signals, it starts firing, and it is the interconnectedness and collaboration of the neurons that ultimately processes the information in the inputs and produces the desired outputs. Yet the sheer complexity and mysterious intricacies of the human brain mean that we cannot simply replicate it in a computer program; instead, we build a simplified model of it. The simplest of these models is the Feedforward Neural Network (FFNN). FFNNs consist of layers upon layers of neurons. Each neuron combines its inputs using trainable variables, called weights and biases, and passes the result through a function called an activation function. The output of each layer then becomes the input of the next layer (the data is “fed forwards”), and the cycle continues until a result is produced in the output layer. It can be helpful to imagine each neuron being responsible for finding particular features; the deeper the network, the more specific the features. Figure 1 is a model of an FFNN with one input layer, one output layer, and one hidden layer (a hidden layer being any layer in between).
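This layered forward pass can be sketched in a few lines of NumPy. The layer sizes and random weights below are purely illustrative (they are not taken from Figure 1), and an untrained network like this one would of course produce meaningless outputs:

```python
import numpy as np

def relu(x):
    # ReLU activation: keeps positive values, zeroes out negatives
    return np.maximum(0, x)

def dense(x, W, b):
    # One fully connected layer: a weighted sum of the inputs plus a bias,
    # passed through the activation function
    return relu(W @ x + b)

rng = np.random.default_rng(0)

# A tiny FFNN: 4 inputs -> 3 hidden neurons -> 2 outputs
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)

x = np.array([0.5, -1.0, 2.0, 0.1])  # input layer
h = dense(x, W1, b1)                 # hidden layer ("fed forwards")
y = dense(h, W2, b2)                 # output layer
print(y.shape)  # (2,)
```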
If you were to simply initialise the weights and biases with random values and never train the network, you should expect random results: an appalling accuracy and a very high loss. The loss of a neural network is the key metric for judging how successful it is, as it condenses into a single number how far the network’s predictions are from the correct answers.
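As a concrete illustration, categorical cross-entropy (a loss commonly used for classification tasks, though the text above does not name a specific one) gives a low value when the network assigns high probability to the correct class. The probabilities below are made up:

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    # y_true: one-hot target vector, y_pred: predicted class probabilities.
    # The loss is small when the probability of the true class is close to 1.
    return -np.sum(y_true * np.log(y_pred + eps))

target = np.array([0.0, 1.0, 0.0])         # the true class is class 1

confident = np.array([0.05, 0.9, 0.05])    # a well-trained prediction
random_ish = np.array([0.34, 0.33, 0.33])  # a roughly random prediction

print(categorical_cross_entropy(target, confident))   # ≈ 0.105
print(categorical_cross_entropy(target, random_ish))  # ≈ 1.109
```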
This is where training becomes necessary. Training a neural network consists of repeatedly altering its weights and biases so that the network becomes as successful as possible on some sample data, known as the training data. One run through the training data is known as an epoch, and the training data is split into small batches (mini-batches) to reduce the burden on the computer’s memory. Making the neural network successful means decreasing the loss until it reaches a minimum. This may seem like systematic trial-and-error, but there are numerous techniques that can – with mathematical justification – decrease the final loss. Examples include using different activation functions, inserting different types of neurons, and even changing the entire neural network architecture.
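The epoch-and-batch structure of training can be sketched on a toy problem. The NumPy snippet below is a deliberately minimal illustration rather than a real neural network: it fits a single weight and bias with mini-batch gradient descent, using a made-up dataset and mean squared error as the loss:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy training data: learn y = 3x + 2 from noisy samples
X = rng.uniform(-1, 1, size=200)
Y = 3 * X + 2 + rng.normal(scale=0.1, size=200)

w, b = 0.0, 0.0     # the trainable weight and bias
lr = 0.1            # learning rate: how big each alteration is
batch_size = 20

for epoch in range(50):                 # one run through the data = one epoch
    perm = rng.permutation(len(X))      # shuffle, then split into mini-batches
    for i in range(0, len(X), batch_size):
        xb = X[perm[i:i + batch_size]]
        yb = Y[perm[i:i + batch_size]]
        err = w * xb + b - yb
        # Nudge w and b downhill along the gradient of the mean squared error
        w -= lr * 2 * np.mean(err * xb)
        b -= lr * 2 * np.mean(err)

loss = np.mean((w * X + b - Y) ** 2)
print(round(w, 2), round(b, 2))  # close to 3 and 2, so the loss is near its minimum
```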
A convolutional neural network acts like an FFNN but for multi-dimensional inputs. Rather than dealing with numbers in a single row, it can deal with greyscale images, colour images, audio files, videos and much more, without having to flatten all the data into one dimension. Arguably one of the most common uses of CNNs today is object recognition. Object recognition, sometimes called object classification, is determining what an object in an image is. For example, if you feed a CNN (trained to recognise fruits) an image of an orange, it should tell you that it is an orange. For us humans, this is one of the simplest tasks the brain can accomplish: we don’t even need to think consciously and we’ve already extracted information from the plethora of objects in our surroundings. Yet for CNNs, it is a task that requires arduous effort. Slight changes in position, colour and shade – even slight changes in individual pixels – can completely change a CNN’s decision. The fatal flaw for CNNs, though, is undoubtedly the training data: without tens of thousands of objects placed in various scenarios, under different lighting conditions and at different sizes, and with no hidden bias whatsoever, a convolutional neural network cannot work perfectly. Below are some examples of incorrectly predicted images from a plain, unoptimised CNN trained with the machine learning library Keras in Python. It was trained on the German Traffic Sign Recognition Benchmark (GTSRB) with the goal of recognising traffic signs.
However, the reason that CNNs are still used heavily today (not only as a foundation) is that with a large and varied enough training dataset, and with the right optimisations and hyper-parameters, they can reach remarkable accuracies. After introducing many more training examples and optimising the traffic sign CNN, I managed to increase its accuracy from 95% to almost 98%. This may not seem like a lot, but in the realm of autonomous driving it is actually quite substantial [1]. Additionally, CNNs are (approximately) shift invariant: it doesn’t matter where in the image the object is, the CNN will find it and try to recognise it.
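This shift-invariance property can be demonstrated directly with the sliding-filter operation at the heart of a convolutional layer: a filter’s response peaks wherever its pattern appears, regardless of position. The NumPy sketch below uses a made-up “corner” pattern as both the image content and the filter:

```python
import numpy as np

def convolve2d_valid(image, kernel):
    # Slide the kernel over the image (stride 1); at each stop, multiply
    # element-by-element with the window underneath and sum the products
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A 3x3 "corner" pattern placed at two different positions
pattern = np.array([[1., 1., 1.],
                    [1., 0., 0.],
                    [1., 0., 0.]])

img_a = np.zeros((8, 8)); img_a[1:4, 1:4] = pattern
img_b = np.zeros((8, 8)); img_b[4:7, 3:6] = pattern

# Using the pattern itself as the filter: the response is strongest
# wherever the pattern appears, no matter where that happens to be
resp_a = convolve2d_valid(img_a, pattern)
resp_b = convolve2d_valid(img_b, pattern)
print(np.unravel_index(resp_a.argmax(), resp_a.shape))  # (1, 1)
print(np.unravel_index(resp_b.argmax(), resp_b.shape))  # (4, 3)
```

The peak location simply moves with the object, which is why a CNN does not need to be retrained for every possible object position.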
Convolutional neural networks were first popularised in 2012, when Alex Krizhevsky overwhelmingly won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) with his CNN, AlexNet. The ImageNet database is one of the largest open image databases in the world, currently holding over 14 million images across more than 20,000 categories. The ILSVRC was a competition that tested the limits of object recognition neural networks by introducing hundreds, and eventually thousands, of image classes to recognise. By 2017, 29 out of 38 teams were able to exceed 95% accuracy, all taking inspiration from the revolutionary AlexNet.
The main structures within CNNs are, as the name suggests, the numerous convolutional layers. However, without other layers, such as downsampling and regularisation layers, there would be no way for a CNN to achieve peak accuracy. CNNs became the best of their kind because they analyse data in sections, exploiting the fact that (especially in image recognition) the data values around a point are most probably related to it. This enables feature spotting within local areas, as well as shift invariance. Figure 2 presents a diagram of how a 2D convolutional layer works.
The input data for each convolutional layer is a multi-dimensional array. To account for the higher dimensions, rather than using one weight per input like an FFNN layer, a smaller filter array (or kernel) is passed over the input array from left to right and top to bottom. At each position, the filter and the area of the input underneath it are multiplied element-by-element and summed (a dot product), and the result is written to a destination array. The filter then takes a stride (literally) of a predetermined number of steps, and repeats. Once this filter has covered the whole input, it is replaced with a different one and the entire process runs again. The output of this layer is a stack of smaller arrays, known as feature maps, giving the CNN numerous ways of extracting features – numerous perspectives of the same data. The same filtering process is then repeated on the feature maps themselves. After doing this a few times, a CNN often makes its final decision using an FFNN, because at that point the feature maps are small enough to be flattened down [2].
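A minimal sketch of this filtering process in plain NumPy is shown below; the input values, the two filters and the stride are all made up for illustration:

```python
import numpy as np

def conv_layer(image, filters, stride=1):
    # Slide each filter over the image left-to-right, top-to-bottom,
    # moving `stride` steps at a time; at each stop, take the sum of
    # the element-wise products (a dot product) with the window below.
    kh, kw = filters[0].shape
    H, W = image.shape
    out_h = (H - kh) // stride + 1
    out_w = (W - kw) // stride + 1
    feature_maps = np.zeros((len(filters), out_h, out_w))
    for f, kernel in enumerate(filters):
        for i in range(out_h):
            for j in range(out_w):
                window = image[i * stride:i * stride + kh,
                               j * stride:j * stride + kw]
                feature_maps[f, i, j] = np.sum(window * kernel)
    return feature_maps

image = np.arange(36, dtype=float).reshape(6, 6)
filters = [
    np.array([[1., 0.], [0., -1.]]),          # a diagonal-difference filter
    np.array([[0.25, 0.25], [0.25, 0.25]]),   # a 2x2 averaging filter
]

maps = conv_layer(image, filters, stride=2)
print(maps.shape)  # (2, 3, 3) — one smaller array per filter
```

Each filter yields its own smaller output array, and stacking them gives the feature maps described above.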
I decided to train the plain GTSRB CNN for 80 epochs. Figure 3 shows how the loss changed over time and Figure 4 shows how the accuracy changed over time. In both, progress was drastic in the first two or three epochs; afterwards, both metrics slowly but surely kept improving.
Convolutional Neural Networks have no doubt revolutionised machine learning and AI in general. Mirroring the different functions of cells in the brain, CNNs provide a novel method of recognising images and processing human language. As the development of neural networks has continued non-stop, new architectures have blossomed from the CNN, including ResNet, RetinaNet and SqueezeNet. The future of machine learning also looks as promising as ever, with Geoffrey Hinton, Yann LeCun and Yoshua Bengio (the “Founding Fathers” of AI and winners of the 2018 Turing Award) predicting that neural networks with billions of parameters can be created with specialised hardware. Yet all of these breakthroughs rely on the innate power of the convolutional layer and the foundations laid by the Convolutional Neural Network, which, with the promise of Big Data, is able to produce Big Rewards.
[1] Using this CNN, and a different deep learning architecture for object detection (determining an object’s location within an image), I created an app, called Speed Bot, that detects and recognises traffic signs in your live camera feed. Users are given the option to send the analysed frames and the recognised signs to an online database, where they will help improve the accuracy of my neural networks as well as other autonomous driving algorithms. CNNs require large amounts of data, so please give the app a try – every little helps. https://play.google.com/store/apps/details?id=org.uk.speedbot
[2] A really neat online simulation of a CNN was created by Adam Harley of Carnegie Mellon University. His neural network was trained to recognise handwritten digits: https://www.cs.ryerson.ca/~aharley/vis/conv/.