
HTC (Hybrid Task Cascade) Network Architecture

While researching the literature on image segmentation recently, I came across quite a variety of segmentation architectures. In my previous article, I told you about the Mask R-CNN architecture. Alongside Mask R-CNN, the Cascade Mask R-CNN structure has also appeared in the literature. I will try to walk you through it using the information I have gathered from the original papers and the other research I have read.

Cascade is a classic yet powerful architecture that improves performance in a variety of tasks. However, how to introduce cascading into instance segmentation remains an open question. A simple combination of Cascade R-CNN and Mask R-CNN provides only limited gains. In exploring a more effective approach, the authors found that the key to a successful instance segmentation cascade is to take full advantage of the reciprocal relationship between detection and segmentation.
Hybrid Task Cascade for Instance Segmentation proposes a new Hybrid Task Cascade (HTC) framework that differs in two important respects:

  1. Instead of cascading these two tasks separately, it interweaves them for joint multi-stage processing.
  2. It adopts a fully convolutional branch to provide spatial context, which can help distinguish hard foreground from cluttered background.

The basic idea is to improve the flow of information by combining cascading and multi-tasking at each stage, and to leverage spatial context to further improve accuracy. In particular, a cascading pipeline is designed for progressive refinement. At each stage, both bounding box regression and mask prediction are combined in a multi-task manner.

Innovations ✨

The main innovation of HTC’s architecture is a cascading framework that interleaves object detection and segmentation, providing better performance. The information flow is also changed through direct connections between the mask branches of consecutive stages. The architecture also includes a fully convolutional branch that adds spatial context, which can improve performance by better distinguishing instances from cluttered backgrounds.
2017 Winner

Hybrid Task Cascade: Instance Segmentation Framework
  • It interleaves bounding box regression and mask prediction instead of executing them in parallel.
  • It creates a direct path that strengthens the flow of information between mask branches by feeding the mask features from the previous stage into the current one.
  • It aims to gain more contextual information by adding an extra semantic segmentation branch and fusing it with the box and mask branches.
  • In general, these changes in the framework architecture effectively improve the flow of information not only between stages but also between tasks.
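
To make the order of operations in the list above more concrete, here is a minimal sketch that only traces the control flow. It is not the authors' implementation: the function name `run_hybrid_cascade`, the string labels and the default of three stages are assumptions chosen for illustration, and no real network is executed.

```python
from typing import List, Optional


def run_hybrid_cascade(num_stages: int = 3) -> List[str]:
    """Return the order of operations in an interleaved box/mask cascade.

    Hypothetical trace only: every label is a stand-in for a real network head.
    """
    trace: List[str] = []
    # The semantic segmentation branch runs once and is shared by all stages.
    trace.append("semantic(features)")
    boxes = "proposals"
    prev_mask: Optional[str] = None
    for t in range(1, num_stages + 1):
        # 1) Refine the boxes first, starting from the previous stage's boxes.
        trace.append(f"box_head_{t}(features + semantic, {boxes})")
        boxes = f"boxes_{t}"
        # 2) Predict the mask on the freshly refined boxes, reusing the mask
        #    features of the previous stage when they exist.
        mask_inputs = f"features + semantic, {boxes}"
        if prev_mask is not None:
            mask_inputs += f", {prev_mask}"
        trace.append(f"mask_head_{t}({mask_inputs})")
        prev_mask = f"mask_feat_{t}"
    return trace


for step in run_hybrid_cascade():
    print(step)
```

Reading the printed trace from top to bottom shows the two ideas at once: each mask head sees the boxes refined in its own stage, and from the second stage on it also receives the mask features of the stage before it.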

A comparison of the HTC network’s instance segmentation approach with state-of-the-art methods on the COCO dataset can be seen in Table 1. In addition, the Cascade Mask R-CNN described in Section 1 is treated as a strong baseline for the method proposed in the paper. Compared to Mask R-CNN, the naive cascaded baseline brings gains of 3.5% and 1.2% in terms of box AP and mask AP. It is noted that this baseline is already higher than PANet, the state-of-the-art instance segmentation method. HTC achieves consistent improvements on different backbones, which demonstrates its effectiveness: gains of 1.5%, 1.3% and 1.1% for ResNet-50, ResNet-101 and ResNeXt-101, respectively.
📌 Note: Cascade Mask R-CNN extends Cascade R-CNN to instance segmentation by adding a mask head to the cascade [3].


The image below shows the segmentation results on the COCO dataset.
In the conclusion of the paper, the authors summarize the advantages of their HTC model over other models:

We propose Hybrid Task Cascade (HTC), a new cascade architecture for instance segmentation. It interweaves the box and mask branches for joint multi-stage processing and uses a semantic segmentation branch to provide spatial context. This framework progressively refines mask predictions and integrates complementary features at each stage. Without bells and whistles, the proposed method achieves a 1.5% improvement over a strong Cascade Mask R-CNN baseline on the MS COCO dataset. In particular, our overall system reaches 48.6 mask AP on the test-challenge dataset and 49.0 mask AP on test-dev.

📌 Finally, to help you interpret the metrics in the table, I leave you a table of the MS COCO metrics as a note.


  1. Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, Chen Change Loy, Hybrid Task Cascade for Instance Segmentation, April 2019.
  2. Zhaowei Cai and Nuno Vasconcelos, Cascade R-CNN: Delving into High Quality Object Detection, In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  3. https://paperswithcode.com/method/cascade-mask-r-cnn.
  4. https://cocodataset.org/#home

SSD (Single Shot Multibox Detector) model from A to Z

In this article, we will learn the SSD MultiBox object detection technique from A to Z, with all its details. Because the SSD model works much faster than the R-CNN or even the Faster R-CNN architecture, it is sometimes the model of choice for object detection.
This model, introduced by Liu and his colleagues in 2016, detects objects using background information [2]. The Single Shot Multibox Detector (SSD) makes fast and simple modeling possible. So what is meant by "single shot"? As you can understand from the name, it gives us the ability to detect objects in a single pass.

I’ve gathered a lot of documents and videos to give you accurate information, and now I’m going to take you through the subject from start to finish. In R-CNN networks, regions that are likely to contain objects are identified first, and then these regions are classified with fully connected layers. With an R-CNN network, object detection is performed in 2 separate stages, while SSD performs these operations in a single step.
As a first step, let’s examine the SSD architecture closely. If the image looks a little small, you can zoom in to see the contents and dimensions of the convolution layers.

As usual, an image is given to the architecture as input. This image is then passed through convolutional layers. As you may have noticed, the dimensions of these convolutional layers differ. In this way, feature maps of different sizes are extracted in the model, which is a desirable situation. A certain number of bounding boxes is obtained by applying a 3×3 convolutional filter to the feature maps.
Because these boxes are generated on the activation maps, which have different resolutions, they are extremely good at detecting objects of different sizes. In the first image I gave, an image of 300×300 was sent as input. As you may notice, the feature map sizes shrink as you move through the network. In the last convolutional layer, the size is reduced to 1×1. During training, the ground-truth boxes are compared with the boxes the model predicts. A 50% overlap threshold (intersection over union) is used to find the best among these predictions: a prediction whose overlap is greater than 50% is selected. You can think of it as similar to the thresholding used in logistic regression.
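
The 50% rule above can be sketched in a few lines, assuming the overlap is measured as intersection over union (IoU) on boxes given as (x_min, y_min, x_max, y_max). The helper names `iou` and `is_match` and the example boxes are made up for illustration and are not taken from any SSD implementation.

```python
from typing import Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Width and height of the overlapping rectangle (zero if the boxes do not overlap).
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred: Box, truth: Box, threshold: float = 0.5) -> bool:
    """A prediction is kept only if its overlap exceeds the threshold."""
    return iou(pred, truth) > threshold


truth = (0.0, 0.0, 10.0, 10.0)          # illustrative ground-truth box
good = (0.0, 0.0, 10.0, 8.0)            # overlaps 80 of 100 units -> IoU 0.8
poor = (5.0, 5.0, 15.0, 15.0)           # overlaps 25 of 175 units -> IoU ~0.14
print(is_match(good, truth), is_match(poor, truth))
```

The first box passes the 50% test and the second does not, which is the kind of selection described in the paragraph above.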
To take a concrete feature map as an example, the dimensions are 10×10×512 in Conv8_2. When the 3×3 convolution is applied with 4 bounding boxes per cell, each bounding box has (classes + 4) outputs. Thus, in Conv8_2, the output is 10×10×4×(C+4). Assume that there are 10 object classes for object detection plus an additional background class. The output will then be 10×10×4×(11+4) = 6000 values, and the number of bounding boxes will be 10×10×4 = 400. In the end, the network turns the image it receives as input into a sizeable output tensor. In a video I studied, I heard a descriptive comment about this region selection:

Instead of performing different operations for each region, we perform all predictions on the CNN network at once.
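
The Conv8_2 arithmetic can be checked with a tiny helper. This is only a sketch: the function name `ssd_layer_outputs` and its parameter names are assumptions for illustration, and the numbers are the ones used in the text (a 10×10 feature map, 4 boxes per cell, 10 object classes plus background).

```python
from typing import Tuple


def ssd_layer_outputs(height: int, width: int, boxes_per_cell: int,
                      num_classes: int) -> Tuple[int, int]:
    """Return (number of boxes, number of output values) for one feature map.

    num_classes must already include the background class.
    """
    num_boxes = height * width * boxes_per_cell
    # Each box predicts one score per class plus 4 box offsets.
    num_values = num_boxes * (num_classes + 4)
    return num_boxes, num_values


num_boxes, num_values = ssd_layer_outputs(10, 10, 4, 10 + 1)
print(num_boxes, num_values)  # 400 6000
```

All of these values come out of the network in one forward pass, which is what the quote means by making every prediction at once.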

In the image above, the picture on the left is the original, while on the right 4 bounding boxes are predicted in each cell [3]. The grid structures seen here contain the bounding boxes. In this way, the model tries to estimate the actual region in which the object is located.
I came across the example above in the documents I studied. I really wanted to share it with you, because it is an excellent resource for understanding the SSD architecture. As you may have noticed, the model assigns a percentage to each object that is likely to be in the image. For example, it gave the car a score of 50%. But the classes that score above 50% win, because their probability is higher. So in this image, a person and a bicycle are more likely than a car. I hope you now understand the SSD structure. In my next article, I will show you how to code the SSD model. Hope you stay healthy ✨


  1. Face and Object Recognition with computer vision | R-CNN, SSD, GANs, Udemy.
  2. Dive into Deep Learning, 13.7. Single Shot Multibox Detection (SSD), https://d2l.ai/chapter_computer-vision/ssd.html.
  3. https://jonathan-hui.medium.com/ssd-object-detection-single-shot-multibox-detector-for-real-time-processing-9bd8deac0e06.
  4. https://towardsdatascience.com/review-ssd-single-shot-detector-object-detection-851a94607d11.
  5. https://towardsdatascience.com/understanding-ssd-multibox-real-time-object-detection-in-deep-learning-495ef744fab.
  6. Single-Shot Bidirectional Pyramid Networks for High-Quality Object Detection, https://www.groundai.com/project/single-shot-bidirectional-pyramid-networks-for-high-quality-object-detection/1.

What's the difference? Artificial Intelligence, Machine Learning and Deep Learning

Welcome to the world of artificial intelligence! By the end of this article you’ll fully understand the top 3 concepts in technology: artificial intelligence, machine learning and deep learning. Even though most people use these terms interchangeably, they don’t mean the same thing. Let’s dig in deeper.