Multi-Scale Image Segmentation


Disclaimer

I tried to reproduce the results from Hierarchical Multi-Scale Attention for Semantic Segmentation (paper link). This is a reproduction only: I claim no contribution beyond the original paper; all I did was recreate its results.


1. Introduction

Image segmentation is a method for categorizing each pixel in a frame. Object detection is one of its many applications. Instead of processing the entire image, a common practice is to first use image segmentation to find an approximation of the object's region; the object detector can then operate on a bounding box already defined by the segmentation algorithm. This keeps the detector from processing the entire image, improving accuracy and reducing inference time. Many segmentation models exist, each with unique capabilities and constraints. The paper by Andrew Tao et al. introduces a model that takes the image at several scales as input and combines the resulting outputs to produce a more accurate segmentation mask.

1.1 Hierarchical multi-scale attention

One of the main contributions of Andrew Tao et al. concerns how multi-scale predictions are combined. In the earlier approach of Chen et al., which Tao et al. call explicit, a dense attention mask is learned for each fixed scale in the scale set, and the predictions are combined by pixel-wise weighted summation across scales. The hierarchical method instead learns a relative attention mask between adjacent scales, and during training the network only sees adjacent scale pairs. This lets the network predict relative attention for a whole range of image scales at inference time.
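The fusion step itself is simple. Below is a minimal sketch of fusing one adjacent scale pair, assuming the relative attention mask is a sigmoid output predicted alongside the lower-scale logits; the tensor names and the exact mask convention are our assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def fuse_adjacent_scales(logits_low, attn_low, logits_high):
    """Combine predictions from two adjacent scales.

    logits_low:  (N, C, h, w)  logits from the lower-resolution input
    attn_low:    (N, 1, h, w)  relative attention mask (sigmoid output)
                 predicted alongside the lower-scale logits
    logits_high: (N, C, H, W)  logits from the higher-resolution input
    """
    # Upsample the low-scale logits and attention mask to the high-scale size.
    size = logits_high.shape[-2:]
    up_logits = F.interpolate(logits_low, size=size, mode="bilinear",
                              align_corners=False)
    up_attn = F.interpolate(attn_low, size=size, mode="bilinear",
                            align_corners=False)
    # Pixel-wise weighted sum: the mask decides, per pixel, how much to
    # trust the low-scale prediction versus the high-scale one.
    return up_attn * up_logits + (1.0 - up_attn) * logits_high
```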

2. Architecture

Figure 1, taken from the original paper, shows the architecture of the entire model.
Figure 1: Model architecture (image from the original paper)

2.1 Model components
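The model is assembled from three components: a trunk (the backbone feature extractor), a segmentation block, and an attention head that produces the relative attention mask; Section 4 describes the concrete choices we made for each. As a rough sketch of how one single-scale branch could fit together (the class and layer shapes below are illustrative assumptions, not the paper's code):

```python
import torch
import torch.nn as nn

class SingleScaleBranch(nn.Module):
    """One branch of the model: trunk + segmentation head + attention head.
    Illustrative sketch only; `trunk` is any backbone feature extractor
    (e.g. a ResNet or MobileNet) and `seg_head` any segmentation block
    (e.g. DeepLab v3+)."""

    def __init__(self, trunk, seg_head, feat_channels):
        super().__init__()
        self.trunk = trunk
        self.seg_head = seg_head
        # Attention head: predicts a single-channel relative attention mask.
        self.attn_head = nn.Sequential(
            nn.Conv2d(feat_channels, 256, 3, padding=1, bias=False),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 1, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        feats = self.trunk(x)
        logits = self.seg_head(feats)   # (N, num_classes, h, w)
        attn = self.attn_head(feats)    # (N, 1, h, w)
        return logits, attn
```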

3. Dataset

We used the Cityscapes dataset, which is labeled for semantic urban scene understanding. We did not need the entire dataset, since semantic segmentation was our only objective.
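For anyone reproducing this, torchvision ships a Cityscapes loader; a minimal sketch follows (the root path is an assumption, and the data must be downloaded manually from the Cityscapes website):

```python
from torchvision import datasets, transforms

# Point `root` at the folder containing leftImg8bit/ and gtFine/.
dataset = datasets.Cityscapes(
    root="./data/cityscapes",
    split="train",
    mode="fine",               # only the finely annotated subset is needed
    target_type="semantic",    # semantic segmentation maps only
    transform=transforms.ToTensor(),
)
image, seg_map = dataset[0]    # seg_map is a PIL image unless transformed
```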

3.1 Data samples

Figure 2 shows a sample image and its segmentation map.
Figure 2

3.2 Class Definitions

| Group | Classes |
| --- | --- |
| Flat | road, sidewalk, parking, rail track |
| Human | person, rider |
| Vehicle | car, truck, bus, on rails, motorcycle |
| Construction | building, wall, fence, guard rail, caravan, bridge |
| Object | pole, pole group, traffic sign, traffic light |
| Nature | vegetation, terrain |
| Sky | sky |
| Void | ground, dynamic, static |

Table 1
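The official cityscapesscripts package encodes the same grouping via each label's `category` field, so the mapping in Table 1 does not need to be hand-written; a minimal sketch:

```python
# `labels` is the official list of Cityscapes label definitions; each entry
# carries a class name and the group ("category") it belongs to.
from cityscapesscripts.helpers.labels import labels

group_of = {label.name: label.category for label in labels}
print(group_of["road"])    # 'flat'
print(group_of["person"])  # 'human'
```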

4. Training

In the beginning, our goal was to build two-stage and three-stage hierarchical multi-scale attention models, trying both ResNet-50 and HRNet-OCR as the trunk and DeepLab v3+ as the segmentation block. However, this was not possible due to our computational limitations (especially the small GPU memory). We therefore replaced the HRNet-OCR architecture with MobileNet in our experiments, due to its smaller size and fewer parameters. Even then, we were still unable to train our models end to end; instead, we relied on transfer learning to obtain acceptable results. Our main approach to training the hierarchical models was to train three separate segmentation models, one per image scale (2, 1, and 0.5), and eventually reuse their trained trunks and segmentation blocks in the original models. For training these trunks and segmentation blocks on images of various sizes, we used the same architecture as the first stage of the hierarchical models; a sketch of this single-scale training step follows.
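This is a hedged illustration of that single-scale training step, assuming a branch model like the Section 2.1 sketch that returns (logits, attention) and a loader yielding images with integer class-index labels (255 as the ignore index, as in Cityscapes train IDs):

```python
import torch
import torch.nn.functional as F

def train_one_epoch(branch, loader, optimizer, device, scale):
    """Train one single-scale branch on inputs resized by `scale`
    (0.5, 1, or 2). `branch` returns (logits, attention)."""
    branch.train()
    for images, labels in loader:
        images = F.interpolate(images.to(device), scale_factor=scale,
                               mode="bilinear", align_corners=False)
        labels = labels.to(device)
        logits, _ = branch(images)
        # Bring logits back to label resolution before computing the loss.
        logits = F.interpolate(logits, size=labels.shape[-2:],
                               mode="bilinear", align_corners=False)
        loss = F.cross_entropy(logits, labels, ignore_index=255)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Transfer learning: keep a pretrained trunk frozen and train only the
# heads, mirroring how we reused trained trunks in the hierarchical model:
# for p in branch.trunk.parameters():
#     p.requires_grad = False
```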


5. Results

Since the entire model was huge, we were not able to train it end to end. Instead, we trained each part separately and combined the parts at the end. Figure 3 shows an example output of the network we trained: the top image is the original image, the middle image is the ground-truth segmentation, and the bottom image is the segmentation predicted by our model.
Figure 3
Table 2 shows the results for the different architectures.
| Trunk | Segmentation Block | Image Scale | #Epochs | Mean IoU | Average Accuracy |
| --- | --- | --- | --- | --- | --- |
| MobileNet | DeepLab V3+ | 0.5 | 7 | 51.4% | 71.3% |
| MobileNet | DeepLab V3+ | 1 | 7 | 50.6% | 68.7% |
| MobileNet | DeepLab V3+ | 2 | 5 | 50.3% | 68.2% |
| ResNet | DeepLab V3+ | 0.5 | 6 | 50.1% | 68.4% |
| ResNet | DeepLab V3+ | 1 | 6 | 48.0% | 76.3% |
| ResNet | DeepLab V3+ | 2 | 5 | 49.0% | 68.1% |

Table 2
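For reference, both metrics in Table 2 can be computed from a per-class confusion matrix. This is a minimal sketch using one common definition of each metric; details such as which classes are ignored may differ from our actual evaluation code.

```python
import numpy as np

def miou_and_accuracy(conf):
    """Compute mean IoU and overall pixel accuracy from a (C, C) confusion
    matrix whose rows are ground truth and columns are predictions."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    iou = tp / np.maximum(tp + fp + fn, 1)  # avoid division by zero
    accuracy = tp.sum() / conf.sum()
    return iou.mean(), accuracy
```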

More information