Article Review: Multi-Category Classification with CNN

Classification of Multi-Category Images Using Deep Learning: A Convolutional Neural Network Model

In this article, the article ‘Classifying multi-category images using Deep Learning: A Convolutional Neural Network Model’ presented in India in 2017 by Ardhendu Bandhu, Sanjiban Sekhar Roy is being reviewed. An image classification model using a convolutional neural network is presented with TensorFlow. TensorFlow is a popular open-source library for machine learning and deep neural networks. A multi-category image dataset was considered for classification. Traditional back propagation neural network; has an input layer, hidden layer, and an output. A convolutional neural network has a convolutional layer and a maximum pooling layer. We train this proposed classifier to calculate the decision boundary of the image dataset. Real-world data is mostly untagged and unstructured. This unstructured data can be an image, audio, and text data. Useful information cannot be easily derived from neural networks that are shallow, meaning they are those with fewer hidden layers. A deep neural network-based CNN classifier is proposed, which has many hidden layers and can obtain meaningful information from images.

Keywords: Image, Classification, Convolutional Neural Network, TensorFlow, Deep Neural Network.

First of all, let’s examine what classification is so that we can understand the steps laid out in the project. Image Classification refers to the function of classifying images from a multi-class set of images. To classify an image dataset into multiple classes or categories, there must be a good understanding between the dataset and classes.

In this article;

  1. Convolutional Neural Network (CNN) based on deep learning is proposed to classify images.
  2. The proposed model achieves high accuracy after repeating 10,000 times within the dataset containing 20,000 images of dogs and cats, which takes about 300 minutes to train and validate the dataset.

In this project, a convolutional neural network consisting of a convolutional layer, RELU function, a pooling layer, and a fully connected layer is used. A convolutional neural network is an automatic choice when it comes to image recognition using deep learning.

Convolutional Neural Network

For classification purposes, it has the architecture as the convolutional network [INPUT-CONV-RELU-POOL-FC].

INPUT- Raw pixel values as images.

CONV- Contents output in the first cluster of neurons.

RELU- It applies the activation function.

POOL- Performs downsampling.

FC- Calculates the class score.

In this publication, a multi-level deep learning system for picture characterization is planned and implemented. Especially the proposed structure;

1) The picture shows how to find nearby neurons that are discriminatory and non-instructive for grouping problem.

2) Given these areas, it is shown how to view the level classifier.


A data set containing 20,000 dog and cat images from the Kaggle dataset was used. The Kaggle database has a total of 25000 images available. Images are divided into training and test sets. 12,000 images are entered in the training set and 8,000 images in the test set. Split dataset of the training set and test set helps cross-validation of data and provides a check over errors; Cross-validation checks whether the proposed classifier classifies cat or dog images correctly.

The following experimental setup is done on Spyder, a scientific Python development environment.

  1. First of all, Scipy, Numpy, and Tensorflow should be used as necessary.
  2. A start time, training path, and test track must be constant. Image height and image width were provided as 64 pixels. The image dataset containing 20,000 images is then loaded. Due to a large number of dimensions, it is resized and iterated. This period takes approximately 5-10 minutes.
  3. This data is fed by TensorFlow. In TensorFlow, all data is passed between operations in a calculation chart. Properties and labels must be in the form of a matrix for the tensors to easily interpret this situation.
  4. Tensorflow Prediction: To call data within the model, we start the session with an additional argument where the name of all placeholders with the corresponding data is placed. Because the data in TensorFlow is passed as a variable, it must be initialized before a graph can be run in a session. To update the value of a variable, we define an update function that can then run.
  5. After the variables are initialized, we print the initial value of the state variable and run the update process. After that, the rotation of the activation function, the choice of the activation function has a great influence on the behavior of the network. The activation function for a specific node is an input or output of the specific node provided a set of inputs.
  6. Next, we define the hyperparameters we will need to train our login features. In more complex neural networks, we encounter more hyperparameters. Some of our hyperparameters can be like the learning rate.
    Another hyperparameter is the number of iterations we train our data. The next hyperparameter is the batch size, which chooses the size of the set of images to send for classification at one time.
  7. Finally, after all that, we start the TensorFlow session, which makes TensorFlow work, because without starting a session a TensorFlow won’t work. After that, our model will start the training process.


🖇 As deep architecture, we used a convolutional neural network and also implemented the TensorFlow deep learning library. The experimental results below were done on Spyder, a scientific Python development environment. 20,000 images were used and the batch is fixed at 100 for you.

🖇 It is essential to examine the accuracy of the models in terms of test data rather than training data. To run the convolutional neural network using TensorFlow, the Windows 10 machine was used, the hardware was specified to have an 8GB of RAM with the CPU version of TensorFlow.

📌 As the number of iterations increases, training accuracy increases, but so does our training time. Table 1 shows the number line with the accuracy we got.

Number of iterations vs AccuracyThe graph has become almost constant after several thousand iterations. Different batch size values can lead to different results. We set a batch size value of 100 for images.

✨ In this article, a high accuracy rate was obtained in classifying images with the proposed method. The CNN neural network was implemented using TensorFlow. It was observed that the classifier performed well in terms of accuracy. However, a CPU based system was used. So the experiment took extra training time, if a GPU-based system was used, training time would be shortened. The CNN model can be applied in solving the complex image classification problem related to medical imaging and other fields.


  2. Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. science, 313(5786), 504- 507.
  3. Yoshua Bengio, “Learning Deep Architectures for AI”, Dept. IRO, Universite de Montreal C.P. 6128, Montreal, Qc, H3C 3J7, Canada, Technical Report 1312.
  4. Yann LeCun, Yoshua Bengio & Geoffrey Hinton, “Deep learning “, NATURE |vol 521 | 28 May 2015
  5. Yicong Zhou and Yantao Wei, “Learning Hierarchical Spectral–Spatial Features for Hyperspectral Image Classification”, IEEE Transactions on cybernetics, Vol. 46, No.7, July 2016.

SSD(Single Shot Multibox Detector) model from A to Z

In this article, we will learn the SSD MultiBox object detection technique from A to Z with all its descriptions. Because the SSD model works much faster than the RCNN or even Faster R-CNN architecture, it is sometimes used when it comes to object detection.
This model, introduced by Liu and his colleagues in 2016, detects an object using background information [2]. Single Shot Multibox Detector i.e. single shot multibox detection (SSD) with fast and easy modeling will be done. And what can be mentioned by one shot? As you can understand from the name, it offers us the ability to detect objects at once.

I’ve collated a lot of documents, videos to give you accurate information, and I’m starting to tell you the whole alphabet of the job. In RCNN networks, regions that are likely to be objects were primarily identified, and then these regions were classified with Fully Connected layers. Object detection is performed in 2 separate stages with the RCNN network, while SSD performs these operations in one step.
As a first step, let’s examine the SSD architecture closely. If the image sounds a little small, you can zoom in and see the contents and dimensions of the convolution layers.

An image is given as input to the architecture as usual. This image is then passed through convolutional neural networks. If you have noticed, the dimensions of convolutional neural networks are different. In this way, different feature maps are extracted in the model. This is a desirable situation. A certain amount of limiting rectangles is obtained using a 3×3 convolutional filter on property maps.
Because these created rectangles are on the activation map, they are extremely good at detecting objects of different sizes. In the first image I gave, an image of 300×300 was sent as input. If you notice, the image sizes have been reduced as you progress. In the most recent convolutional nerve model, the size was reduced to 1. Comparisons are made between the limits set during the training process and the estimates realized as a result of the test. A 50% method is used to find the best among these estimates. A result greater than 50% is selected. You can think of it as the situation that exists in logistical regression.
For example, the image dimensions are 10×10×512 in Conv8_2. It will have outputs (classes + 4) for each bounding box when the 3×3 convolutional operation is applied and using 4 bounding boxes. Thus, in Conv8_2, the output is 10×10×4×(C+4). Assume that there are 10 object classes for object detection and an additional background class. Thus output 10×10×4×(11+4)=6000 will be. Bounding boxes will reach the number 10×10×4 = 400. It ends the image it receives as input as a sizeable Tensor output. In a video I researched, I listened to a descriptive comment about this district election:

Instead of performing different operations for each region, we perform all forecasts on the CNN network at once.

4 bounding boxes are estimated in each cell in the area on the right side, while the image seen on the left in the image above is original [3]. In the grid structures seen here, there are bounding rectangles. In this way, an attempt is made to estimate the actual region in which the object is located.
In the documents I researched, I scratched with the example I gave above. I really wanted to share it with you, because it is an enormous resource for understanding SSD architecture. Look, if you’ve noticed, he’s assigned a percentage to objects that are likely to be in the visual. For example, he gave the car a 50% result. But he will win because the odds above 50% will be higher. So in this visual, the probability that it is a person and a bicycle is more likely than it is a car. I wish you understood the SSD structure. In my next article, I will show you how to code the SSD model.Hope you stay healthy ✨


  1. Face and Object Recognition with computer vision | R-CNN, SSD, GANs, Udemy.
  2. Dive to Deep Learning, 13.7. Single Shot Multibox Detection (SSD),
  6. Single-Shot Bidirectional Pyramid Networks for High-Quality Object Detection,