SSD(Single Shot Multibox Detector) model from A to Z

In this article, we will learn the SSD MultiBox object detection technique from A to Z with all its descriptions. Because the SSD model works much faster than the RCNN or even Faster R-CNN architecture, it is sometimes used when it comes to object detection.
This model, introduced by Liu and his colleagues in 2016, detects an object using background information [2]. Single Shot Multibox Detector i.e. single shot multibox detection (SSD) with fast and easy modeling will be done. And what can be mentioned by one shot? As you can understand from the name, it offers us the ability to detect objects at once.

I’ve collated a lot of documents, videos to give you accurate information, and I’m starting to tell you the whole alphabet of the job. In RCNN networks, regions that are likely to be objects were primarily identified, and then these regions were classified with Fully Connected layers. Object detection is performed in 2 separate stages with the RCNN network, while SSD performs these operations in one step.
As a first step, let’s examine the SSD architecture closely. If the image sounds a little small, you can zoom in and see the contents and dimensions of the convolution layers.

An image is given as input to the architecture as usual. This image is then passed through convolutional neural networks. If you have noticed, the dimensions of convolutional neural networks are different. In this way, different feature maps are extracted in the model. This is a desirable situation. A certain amount of limiting rectangles is obtained using a 3×3 convolutional filter on property maps.
Because these created rectangles are on the activation map, they are extremely good at detecting objects of different sizes. In the first image I gave, an image of 300×300 was sent as input. If you notice, the image sizes have been reduced as you progress. In the most recent convolutional nerve model, the size was reduced to 1. Comparisons are made between the limits set during the training process and the estimates realized as a result of the test. A 50% method is used to find the best among these estimates. A result greater than 50% is selected. You can think of it as the situation that exists in logistical regression.
For example, the image dimensions are 10×10×512 in Conv8_2. It will have outputs (classes + 4) for each bounding box when the 3×3 convolutional operation is applied and using 4 bounding boxes. Thus, in Conv8_2, the output is 10×10×4×(C+4). Assume that there are 10 object classes for object detection and an additional background class. Thus output 10×10×4×(11+4)=6000 will be. Bounding boxes will reach the number 10×10×4 = 400. It ends the image it receives as input as a sizeable Tensor output. In a video I researched, I listened to a descriptive comment about this district election:

Instead of performing different operations for each region, we perform all forecasts on the CNN network at once.

4 bounding boxes are estimated in each cell in the area on the right side, while the image seen on the left in the image above is original [3]. In the grid structures seen here, there are bounding rectangles. In this way, an attempt is made to estimate the actual region in which the object is located.
In the documents I researched, I scratched with the example I gave above. I really wanted to share it with you, because it is an enormous resource for understanding SSD architecture. Look, if you’ve noticed, he’s assigned a percentage to objects that are likely to be in the visual. For example, he gave the car a 50% result. But he will win because the odds above 50% will be higher. So in this visual, the probability that it is a person and a bicycle is more likely than it is a car. I wish you understood the SSD structure. In my next article, I will show you how to code the SSD model.Hope you stay healthy ✨


  1. Face and Object Recognition with computer vision | R-CNN, SSD, GANs, Udemy.
  2. Dive to Deep Learning, 13.7. Single Shot Multibox Detection (SSD),
  6. Single-Shot Bidirectional Pyramid Networks for High-Quality Object Detection,

Leave a Reply

Your email address will not be published. Required fields are marked *