Featured Image

HTC (Hybrid Task Cascade) Network Architecture

As a result of my recent literature research for image segmentation, I have come across very different segmentation architectures. Before this article, I told you about the architecture of Mask R-CNN. Just like this mask R-CNN architecture, the Cascade Mask R-CNN structure has appeared in the literature. I will try to enlighten you about this with the information I have collected from the original academic documents and research I have read.

Cascade is a classic yet powerful architecture that improves performance in a variety of tasks. However, how to enter sample segmentation with steps remains an open question. A simple combination of Cascade R-CNN and Mask R-CNN provides only limited gains. In exploring a more effective approach, it was found that the key to a successful instance segmentation level is to take full advantage of the mutual relationship between detection and partitioning.
Hybrid Task Cascade for Instance Segmentation proposes a new Hybrid Task Cascade (HTC) framework that differs in two important respects:

  1. Instead of cascading these two tasks separately, it connects them together for common multi-stage processing.
  2. It adopts a fully convoluted branch to provide spatial context, which can help distinguish the rigid foreground from the complex background.

The basic idea is to leverage spatial context to improve the flow of information and further improve accuracy by incorporating steps and multitasking at each stage. In particular, a cascading pipeline is designed for progressive purification. At each stage, both bounding box regression and mask prediction are combined in a multi-tasking person.

Innovations ✨

The main innovation of HTC’s architecture is a cascading framework that connects object detection and segmentation, providing better performance. The information flow is also changed through direct branches between the previous and subsequent mask determinants. Architecture also includes a fully convolutional branch that improves spatial context, which can improve performance by better distinguishing samples from scattered backgrounds.
2017 Winner

Hybrid Task Cascade: Sample Segmentation Framework
  • It combines bounding box regression and mask prediction instead of executing in parallel. 
  • It creates a direct way to strengthen the flow of information between mask branches by feeding the mask features from the previous stage to the existing one.
  • It aims to gain more contextual information by fusing it with box and mask branches by adding an additional branch of semantic segmentation. 
  • In general, these changes in the framework architecture effectively improve the flow of information not only between states but also between tasks.

A comparison of the HTC network’s sample determination approaches with the latest technology products in the COCO dataset in Table 1 can be seen. In addition, the Cascade Mask R-CNN described in Chapter 1 is considered a strong basis for the method used in the article. Compared to Mask R-CNN, the naive cascading baseline brings in 3.5% and 1.2% increases in terms of box AP and mask AP. It is noted that this baseline is higher than PANet, the most advanced method of sample segmentation. HTC is making consistent improvements on different backbones that prove its effectiveness. ResNet-50 provides gains of 1.5%, 1.3% and 1.1%, respectively, for ResNet-101 and ResNeXt-101.
📌 Note: Cascade Mask R-CNN extends Cascade R-CNN to instance segmentation by adding a mask header to the cascade [3].


The image below shows the results of this segmentation in the COCO dataset.
In the results section of the article, the advantages of the HTC model they created over other models are mentioned.

We recommend the hybrid task cascade (HTC), a new graded architecture for Instance Segmentation. It intertwines box and mask branches for common multi-stage processing and uses a semantic partitioning branch to provide spatial context. This framework gradually improves mask estimates and combines complementary features at each stage. The proposed method without bells and whistles achieves a 1.5% improvement over a strong cascade Mask R-CNN baseline in the MS COCO dataset. In particular, our overall system reaches 48.6 masks AP in the test-inquiry dataset and 49.0 mask AP in test-dev.

📌 Finally, in order to understand the changes of variables in the table, I leave you a table of MS COCO metrics as a note.


  1. Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, Chen Change Loy, Hybrid Task Cascade for Instance Segmentation, April 2019.
  2. Zhaowei Cai and Nuno Vasconcelos, Cascader-cnn:Delving into high quality object detection, In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  3. https://paperswithcode.com/method/cascade-mask-r-cnn.
  4. https://cocodataset.org/#home