Image Classification of CIFAR-10 Dataset with Convolutional Neural Network, Part 1: Literature Review

Literature review of Convolutional Neural Networks (CNN) and their application in image classification tasks of the CIFAR-10 Dataset

Published on 20 August 2023

In this first article of a three-part series, I investigate some of the state-of-the-art Convolutional Neural Networks (at the time of writing in 2022) and lay the groundwork for exploring regularisation techniques that address overfitting and underfitting. In the final article, I will demonstrate several experiments that test those techniques to produce an architecture that takes the least time to train while keeping overfitting and underfitting to a minimum. My goal is to propose an artificial neural network that takes no more than an hour to train and classifies images in the CIFAR-10 dataset with an accuracy higher than 70%. To start with, however, let's look at Computer Vision.

Computer Vision

Computer Vision (CV) relies on algorithms that must perform multi-class classification of images quickly and accurately, often on mobile and embedded systems, without sacrificing performance. Studies such as those conducted by Recht et al. (2018) investigated the field of deep learning and Convolutional Neural Networks. The goal of their study was to create a model that generalises well to unseen data, i.e., a model that can take a set of images it has never seen and classify them accurately. This matters because CV algorithms are often used in industries that rely on quick and accurate responses from the systems responsible for image classification, for instance security (e.g., facial recognition), automotive (e.g., self-driving cars), logistics (e.g., drones), or remote sensing satellites in Low Earth Orbit (Khadhim and Abed, 2019). Let us pause for a moment and think about the dangers of slow and inaccurate image classification in these industries. Below are just a few examples, but can you think of anything else?

  • Security: facial recognition takes a long time to identify you, or grants access to your smartphone to a person who looks like you.
  • Automotive: a self-driving car fails to respond quickly to a road hazard, causing harm, injury or death.
  • Logistics: a delivery of goods to the wrong address, or a collision with an airborne animal, physical obstacle, or another drone.
  • Remote sensing: skewed data causing inaccurate interpretation of atmospheric activity, the spread of wildfires, or the disappearance of rainforests.

According to Çalik and Demirci (2018), deeper and more complex networks perform better. However, according to Gavrilov et al. (2018), such networks often experience overfitting and underfitting issues that affect their ability to generalise to data they have never seen before. To understand Computer Vision, we therefore need to take a step back and look at what a Convolutional Neural Network is.

Convolutional Neural Network

In a nutshell, a CNN is an artificial neural network that falls into a branch of deep learning designed to accomplish discriminative tasks in Computer Vision, such as the aforementioned image classification, but also object detection and image segmentation (Shorten and Khoshgoftaar, 2019). A simple CNN architecture is a feed-forward neural network made of an input layer followed by a combination of multiple hidden convolutional or densely connected layers that compute and output the probability of what is shown in the image (Gavrilov et al., 2018).
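To make the two building blocks mentioned above more concrete, here is a minimal sketch in plain Python of a single convolution and of the softmax function that turns raw class scores into probabilities. The toy image, the kernel values, and the function names `conv2d_valid` and `softmax` are my own illustrative assumptions, not taken from any of the cited papers, and a real CNN would use a library such as TensorFlow rather than nested loops.

```python
import math


def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0.0
            # Multiply the kernel with the image patch underneath it and sum up.
            for a in range(kh):
                for b in range(kw):
                    total += image[i + a][j + b] * kernel[a][b]
            row.append(total)
        feature_map.append(row)
    return feature_map


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    largest = max(scores)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 4x4 "image" with a vertical edge, and a hypothetical edge-detecting kernel.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = conv2d_valid(image, kernel)  # strongest response where the edge is
probabilities = softmax([2.0, 1.0, 0.1])   # made-up scores for three classes
```

The feature map responds only in the column where the pixel values change, which is the intuition behind convolutional layers learning to detect local patterns.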

The literature review shows that there are numerous state-of-the-art networks, each with its own architecture, parameter configuration, and performance, e.g., Deep Belief Networks, Residual Attention Networks, MobileNets, or ShuffleNets. Furthermore, the literature shows that in order to develop a network that performs classification and recognition tasks with high accuracy, it is good practice to reduce overfitting and underfitting (Gavrilov et al., 2018; Shorten and Khoshgoftaar, 2019). Chollet (2019) defines these as a product of the tension between optimisation, i.e., getting the best performance on the training data, and generalisation, i.e., how well the model performs on unseen data. This tension can be managed with regularisation.

Regularisation Techniques

My further study of the literature revealed that there are numerous regularisation techniques, and choosing the appropriate one can be difficult. However, there are tools that can help you choose. For example, Krizhevsky and Hinton (2010) performed a grid search over multiple parameters, e.g., initial learning rate, dropout, and weight decay, in order to find the hyperparameter configuration that reduces the overfitting of their Deep Belief Network. Furthermore, the authors claimed that the centre of the image is likely to have more features. Think of the focal point in photography, where the main subject draws immediate attention, or the grid of nine in your smartphone camera, where the middle square guides you to place the main subject in the centre. Nevertheless, according to Chollet (2019), the simplest way to prevent overfitting is to get more training data, and the author proposed an image augmentation technique. For example, Abouelnaga et al. (2016) take the existing images and apply image transformations such as scaling, rotation, position, and background alteration, effectively increasing the number of samples available to train the model.
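As a rough illustration of how augmentation multiplies the number of training samples, the sketch below applies two simple transformations to images stored as nested lists. The horizontal flip and the one-pixel shift are my own stand-ins chosen for brevity, not the exact transformations used by Abouelnaga et al. (2016), and the helper names are hypothetical.

```python
def horizontal_flip(image):
    """Mirror every row of the image left to right."""
    return [list(reversed(row)) for row in image]


def shift_right(image, pixels, fill=0):
    """Move the image content to the right, padding the left edge with `fill`."""
    width = len(image[0])
    pixels = min(pixels, width)
    return [[fill] * pixels + row[: width - pixels] for row in image]


def augment(dataset):
    """Return the original samples plus two transformed copies of each one."""
    augmented = []
    for image, label in dataset:
        augmented.append((image, label))
        augmented.append((horizontal_flip(image), label))
        augmented.append((shift_right(image, 1), label))
    return augmented


# Two tiny made-up "images" with their class labels.
dataset = [
    ([[1, 2, 3], [4, 5, 6]], "cat"),
    ([[7, 8, 9], [1, 1, 1]], "ship"),
]
bigger_dataset = augment(dataset)  # three times as many samples, same labels
```

The labels are copied unchanged because a flipped or slightly shifted cat is still a cat, which is the assumption every label-preserving augmentation relies on.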

According to Shorten and Khoshgoftaar (2019), techniques like this help to create a model that exhibits a decreasing loss value, reduce overfitting, and improve generalisation (classification of images never seen before). Nevertheless, Chollet (2019) proposed alternative approaches such as reducing the model size, adding L1 and L2 regularisation, and adding a dropout layer with a rate ranging from 0.2 for smaller networks to 0.5 for larger networks. In another study, Xu et al. (2019) proposed adding a pooling layer, which according to Chollet (2019) is an effective way to downsample the data and keep the number of features to a minimum. Both authors claimed that the pooling layer regularises the model and reduces overfitting. Further studies also show that a good strategy is to apply dropout and Batch Normalisation to train a deep network whilst avoiding overfitting.
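Dropout and pooling are easier to picture with a small example. The sketch below is a plain-Python approximation under my own assumptions: it uses "inverted" dropout, where the surviving activations are scaled up so that their expected sum stays the same, and a 2x2 max pooling window with a stride of 2. The function names and input values are made up for illustration.

```python
import random


def dropout(activations, rate, rng):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in the range [0, 1)")
    keep_scale = 1.0 / (1.0 - rate)
    return [a * keep_scale if rng.random() >= rate else 0.0 for a in activations]


def max_pool_2x2(feature_map):
    """Keep the largest value of every non-overlapping 2x2 window."""
    pooled = []
    for i in range(0, len(feature_map) - 1, 2):
        row = []
        for j in range(0, len(feature_map[0]) - 1, 2):
            row.append(max(
                feature_map[i][j], feature_map[i][j + 1],
                feature_map[i + 1][j], feature_map[i + 1][j + 1],
            ))
        pooled.append(row)
    return pooled


rng = random.Random(0)  # fixed seed so the example is repeatable
dropped = dropout([1.0] * 10, rate=0.5, rng=rng)  # each value is now 0.0 or 2.0

pooled = max_pool_2x2([
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
])  # a 4x4 feature map shrinks to 2x2
```

Dropout is only applied during training. At prediction time the layer passes its input through unchanged, which is why the rescaling is done up front.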

Optimisation Techniques

The deeper I dug, the more techniques I found, and I also came across optimisation techniques. In short, optimisation is about choosing the right optimiser to maximise the performance of the model we want to train. As shown by Buduma (2017) and Géron (2019), the optimiser takes an initial learning rate value, which can be adjusted during training through a callback function called the learning rate scheduler available in, for example, TensorFlow (2022), and automatically minimises the error rate after each epoch (Buduma, 2017). Chollet (2019) asserted that optimisers specify how the gradient of the loss function will be used to update the model's parameters. The author also said that there are several optimisers available, such as Stochastic Gradient Descent (SGD), Root Mean Squared Propagation (RMSProp), and the Adam optimiser, which, according to Brownlee (2019), is a combination of the Adaptive Gradient algorithm (AdaGrad) and RMSProp.
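To show what "using the gradient to update the parameters" looks like in practice, here is a minimal sketch of the commonly published Adam update rule applied to a single parameter. The toy loss (w - 3)^2, the hyperparameter values, and the function name `adam_minimise` are assumptions made for illustration, and this is not how TensorFlow implements its optimiser internally.

```python
import math


def adam_minimise(gradient, start, learning_rate=0.1, beta1=0.9,
                  beta2=0.999, epsilon=1e-8, steps=500):
    """Minimise a one-parameter function given its gradient, using Adam updates."""
    w = start
    m = 0.0  # running average of gradients (momentum-like term)
    v = 0.0  # running average of squared gradients (RMSProp-like term)
    for t in range(1, steps + 1):
        g = gradient(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        # Bias correction compensates for m and v starting at zero.
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        w -= learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return w


# Toy loss (w - 3)^2 has gradient 2 * (w - 3) and its minimum at w = 3.
best_w = adam_minimise(lambda w: 2 * (w - 3.0), start=0.0)
```

Plain SGD would replace the last line of the loop with `w -= learning_rate * g`. Adam differs in that it scales each step by the recent history of the gradient.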

Further observations by Saleem et al. (2020) revealed that optimisers have a considerable impact on the performance of CNN architectures, as measured by metrics such as the F1-score, which is generally used as a primary metric of the quality of trained models. The authors concluded that the Adam optimiser is the most successful. Bera and Shrivastava (2020) observed similar results in their analysis, and both studies concluded that Adam and RMSProp improve the accuracy and reduce the computation time required to train a CNN. Zhang et al. (2018) went further and proposed a linear decay that improves the accuracy by gradually decreasing the learning rate during training. The authors also argued that it is good practice to use validation data while training and to keep the test data for evaluating the model, and they claimed that around 20% of the training data should be reserved for validation. A similar technique has been proposed by Chansung (2018).
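The two ideas from the paragraph above, a linearly decaying learning rate and a 20% validation split, can be sketched in a few lines of plain Python. The initial rate of 0.01, the 10 epochs, the fixed seed, and the function names are placeholder assumptions of mine. The 50,000 samples correspond to the size of the CIFAR-10 training set.

```python
import random


def linear_decay(initial_rate, epoch, total_epochs):
    """Learning rate that falls in a straight line from `initial_rate` to zero."""
    fraction_done = min(epoch, total_epochs) / total_epochs
    return initial_rate * (1.0 - fraction_done)


def train_validation_split(samples, validation_fraction=0.2, seed=42):
    """Shuffle the samples and hold back a fraction of them for validation."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(round(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


# Learning rate for each of 10 epochs, starting from a hypothetical 0.01.
schedule = [linear_decay(0.01, epoch, total_epochs=10) for epoch in range(10)]

# Indices of the 50,000 CIFAR-10 training images, split 80/20.
train, validation = train_validation_split(range(50000))
```

In a Keras workflow, a function like `linear_decay` would typically be passed to the learning rate scheduler callback mentioned earlier, while the test set stays untouched until the final evaluation.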

Network Architectures

In the final piece of the literature review, I explored a few network architectures and found several studies particularly interesting. In one of them, Wang et al. (2017) put forward the idea that feed-forward convolutional networks mimic the workings of the human cortex. An example of such an architecture is VGG, which involves adding depth to the CNN. Simonyan and Zisserman (2014) demonstrated a significant improvement in the accuracy of VGG architectures. According to Hammad and El-Sankary (2018), the number of parameters in the VGG architecture can exceed 100 million. However, the 'blocky' architecture means that this number can be reduced by eliminating some of the blocks, e.g., by using only the first three blocks, as demonstrated by Brownlee (2019).
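To get a feel for how the number of blocks drives the parameter count, the sketch below counts the weights and biases of the convolutional layers in a VGG-style stack. The choices of two 3x3 convolutions per block, 32, 64 and 128 filters, and a 3-channel input are my own illustrative assumptions rather than a configuration confirmed by the cited sources, and the dense layers that normally follow are left out.

```python
def conv_parameters(kernel_size, in_channels, out_channels):
    """Weights plus one bias per filter for a single convolutional layer."""
    return (kernel_size * kernel_size * in_channels + 1) * out_channels


def vgg_style_conv_parameters(block_filters, input_channels=3,
                              convs_per_block=2, kernel_size=3):
    """Total convolutional parameters for a stack of VGG-style blocks."""
    total = 0
    channels = input_channels
    for filters in block_filters:
        for _ in range(convs_per_block):
            total += conv_parameters(kernel_size, channels, filters)
            channels = filters  # the next layer sees this layer's output channels
    return total


one_block = vgg_style_conv_parameters([32])
three_blocks = vgg_style_conv_parameters([32, 64, 128])
```

Each extra block adds far more parameters than the one before it because the filter count doubles, which is why dropping the later blocks shrinks the model so effectively.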

At the start of this article, I mentioned other networks such as MobileNets. According to Howard et al. (2017) and Sandler et al. (2019), these architectures are designed to match the limits of smaller, embedded systems. Howard et al. demonstrated a technique that introduces width and resolution multipliers. The authors concluded that the width multiplier makes the network thinner, whereas the resolution multiplier reduces the computational cost when applied to the input image, making it a good choice for embedded systems. Moreover, Sandler et al. observed that inverted residual 'bottleneck' layers have the potential to improve the efficiency of CNNs. Both techniques have been found to reduce memory requirements whilst providing reasonable accuracy and a reduction in time.
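The effect of the two multipliers can be estimated with the cost formulas given in the MobileNets paper, where a depthwise separable convolution replaces a standard one. The sketch below counts multiply-add operations for a single layer. The layer dimensions (3x3 kernel, 32 input channels, 64 output channels, a 16x16 feature map) are made-up values, and rounding the scaled sizes down with `int` is my own simplification.

```python
def standard_conv_cost(kernel, in_channels, out_channels, feature_size):
    """Multiply-adds of a standard convolution on a square feature map."""
    return kernel * kernel * in_channels * out_channels * feature_size * feature_size


def separable_conv_cost(kernel, in_channels, out_channels, feature_size,
                        alpha=1.0, rho=1.0):
    """Multiply-adds of a depthwise separable convolution.

    `alpha` is the width multiplier (thins the channels) and `rho` is the
    resolution multiplier (shrinks the feature map).
    """
    m = int(alpha * in_channels)
    n = int(alpha * out_channels)
    f = int(rho * feature_size)
    depthwise = kernel * kernel * m * f * f  # one filter per input channel
    pointwise = m * n * f * f                # 1x1 convolution mixing the channels
    return depthwise + pointwise


standard = standard_conv_cost(3, 32, 64, 16)
separable = separable_conv_cost(3, 32, 64, 16)
thinner = separable_conv_cost(3, 32, 64, 16, alpha=0.5)
smaller_input = separable_conv_cost(3, 32, 64, 16, rho=0.5)
```

For this layer the separable version needs roughly an eighth of the operations of the standard convolution, and halving either multiplier cuts the cost to between a quarter and a third of the separable baseline.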

Final Thoughts

With that in mind, I think I now understand the fundamentals of Computer Vision, especially the associated algorithms and network architectures. However, it feels like there is no single 'recipe' for training a model in less than an hour that also achieves an overall accuracy above 70%. Although the literature review provided a guideline and helped to identify what other authors did to train their models, I think the next steps will involve some experimentation to achieve the desired results. In the next article, I will focus on methodology and experiments, starting with defining a baseline model.


Abouelnaga, Y., Ali, O.S., Rady, H. and Moustafa, M., 2016. CIFAR-10: KNN-based Ensemble of Classifiers. In 2016 International Conference on Computational Science and Computational Intelligence (CSCI), pp. 1192-1195. IEEE.

Bera, S. and Shrivastava, V.K., 2020. Analysis of various optimizers on deep convolutional neural network model in the application of hyperspectral remote sensing image classification. International Journal of Remote Sensing, 41(7), pp. 2664-2683.

Brownlee, J. (2019) How to Develop a CNN From Scratch for CIFAR-10 Photo Classification [Online] Available at: scratch-for-cifar-10-photo-classification/ [Accessed: 16 April 2022].

Çalik, R.C., and Demirci, M.F., 2018. Cifar-10 Image Classification with Convolutional Neural Networks for Embedded Systems. In 2018 IEEE/ACS 15th International Conference on Computer Systems and Applications (AICCSA), pp. 1-2, doi:10.1109/AICCSA.2018.8612873.

Chansung, P. (2018) CIFAR-10 Image Classification in TensorFlow. [Online] Available at: https:// [Accessed: 18 April 2022].

Gavrilov, A.D., Jordache, A., Vasdani, M. and Deng, J., 2018. Preventing model overfitting and underfitting in convolutional neural networks. International Journal of Software Science and Computational Intelligence (IJSSCI), 10(4), pp. 19-28.

Géron, A. 2019. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O’Reilly Media, Inc.

Howard, A.G., Zhu, M., Chen, B., et al., 2017. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv preprint: 1704.04861v1.

Krizhevsky, A. and Hinton, G., 2010. Convolutional deep belief networks on cifar-10. Unpublished Manuscript, 40(7), pp.1-9.

Recht, B., Roelofs, R., Schmidt, L., and Shankar, V., 2018. Do CIFAR-10 classifiers generalize to CIFAR-10? arXiv preprint: 1806.00451.

Sandler, M., Howard, A., Zhu, M., et al., 2019. MobileNetV2: Inverted Residual and Linear Bottlenecks. arXiv preprint: 1801.04381v4.

Saleem, M.H., Potgieter, J. and Arif, K.M., 2020. Plant disease classification: A comparative evaluation of convolutional neural networks and deep learning optimizers. Plants, 9(10), p.1319.

Shorten, C. and Khoshgoftaar, T.M., 2019. A survey on Image Data Augmentation for Deep Learning. Journal of Big Data, 6(60), doi:

Simonyan, K. and Zisserman, A., 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint arXiv:1409.1556.

TensorFlow (2022) TensorFlow Core v2.8.0. [Online] Available at: api_docs [Accessed: 24 April 2022].

Wang, F., Jiang, M., Qian, C., et al., 2017. Residual Attention Network for Image Classification. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3156-3164).

Xu, Q., Zhang, M., Gu, Z. and Pan, G., 2019. Overfitting remedy by sparsifying regularization on fully-connected layers of CNNs. Neurocomputing, 328, pp. 69-74.

Zhang, X., Zhou, X., Lin, M. and Sun, J., 2018. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6848-6856.
