Dual Attention Network for Scene Segmentation: Paper Notes

@Jimmy 2019-03-12 15:01:40

1. Basic Information

Title: Dual Attention Network for Scene Segmentation

Year: 2019

Venue: CVPR 2019

Field: Semantic segmentation (scene segmentation)

Main links:

homepage: None
arXiv(Paper): https://arxiv.org/abs/1809.02983
github(Official): https://github.com/junfu1115/DANet

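2. Research Background

Problem: To perform scene segmentation effectively, a model has to distinguish easily confused categories and account for objects with varying appearance. For example, grassland and pasture are sometimes hard to tell apart, and cars on a road vary in scale, viewpoint, occlusion and illumination. Pixel-level recognition therefore needs more discriminative feature representations.
Existing approaches:

Multi-scale context fusion, e.g. PSPNet.
Enlarging the kernel size with a decomposed structure, or adding an effective encoding layer on top of the network, to capture richer global context.
Encoder-decoder structures.
Q: What is the drawback of the approaches above? A: They can capture objects at different scales, but they do not exploit the relationships between objects, which also matter for scene representation.
Recurrent neural networks that capture long-range dependencies, e.g. 2D LSTM.
Q: What is the drawback of this approach? A: Its effectiveness depends heavily on what the long-term memory manages to learn.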

3. Contributions

3.1 Overview

Key idea: The paper captures contextual dependencies with a self-attention mechanism and proposes the Dual Attention Network (DANet) to adaptively integrate local features with their global dependencies. The method adaptively aggregates long-range contextual information and thereby improves the feature representation for scene segmentation.
Components: Two types of attention module are appended to a standard dilated FCN. The position attention module selectively aggregates the feature at each position by a weighted sum of the features at all positions, so similar features are related to each other regardless of their distance. The channel attention module selectively emphasizes interdependent channel maps by integrating associated features among all channel maps. The outputs of the two attention modules are summed to further improve the feature representation, which contributes to more precise segmentation results.
Contributions:
Proposes the Dual Attention Network (DANet), which captures global feature dependencies in both the spatial and the channel dimension.
Proposes a position attention module to learn the spatial interdependencies of features and a channel attention module to model channel interdependencies.
Achieves state-of-the-art results on three datasets: Cityscapes, PASCAL Context and COCO Stuff.

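3.2 Details

Network architecture (figure not reproduced here):

The backbone is a pretrained dilated ResNet with the down-sampling operations removed (the same as in DeepLab), so the final feature map is 1/8 the size of the input image. It is followed by two parallel attention modules that capture spatial and channel dependencies respectively, and the outputs of the two modules are then fused to obtain a better feature representation.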
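To make the 1/8 output size concrete, here is a small arithmetic sketch. It assumes a full-resolution 2048 × 1024 Cityscapes image is fed to the network, which is an illustrative choice and not the training crop size; the helper name `feature_map_shape` is made up for this note.

```python
def feature_map_shape(height, width, output_stride=8):
    """Spatial size of the backbone output for a given input size."""
    return height // output_stride, width // output_stride


# Assumption: a full 1024 x 2048 (H x W) Cityscapes image as input.
h, w = feature_map_shape(1024, 2048)
n = h * w  # number of spatial positions N seen by the attention modules

print(h, w)   # 128 256
print(n)      # 32768
print(n * n)  # 1073741824 entries in an N x N spatial attention map
```

The N × N count shows why the spatial attention map dominates memory for large inputs.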

3.2.1 Position Attention Module

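This module captures the spatial dependency between any two positions of the feature map. The feature at a given position is updated by a weighted sum of the features at all positions, where each weight is the feature similarity between the two positions involved. Any two positions with similar features can therefore reinforce each other, regardless of the distance between them.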

The position attention module encodes a wider range of contextual information into local features, thus enhancing their representative capability.

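The feature map A (C×H×W) is first passed through three separate convolution layers (with BN and ReLU) to produce three feature maps B, C and D, each of shape C×H×W. These are then reshaped to C×N, where N = H×W.
The transpose of C is multiplied by B, and a softmax is applied to obtain the spatial attention map S (N×N).
D is multiplied by the transpose of S, and the result is reshaped back to C×H×W, scaled by the coefficient α, and added to A to give the final output E. α is initialized to 0 and gradually learns a larger weight. Each position of E is therefore a weighted sum of the features at all positions plus the original feature at that position.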

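Each element of the matrix S is:

$$s_{ji} = \frac{\exp(B_i \cdot C_j)}{\sum_{i=1}^{N} \exp(B_i \cdot C_j)}$$

$s_{ji}$ measures the impact of position $i$ on position $j$.

Each element of E is:

$$E_j = \alpha \sum_{i=1}^{N} (s_{ji} D_i) + A_j$$

where $\alpha$ is the scale factor, $D_i$ is an element of D, and $A_j$ is an element of A.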

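Simplified diagram (figure not reproduced here)

The matrix S acts as an attention map. Each row holds the dependencies between one pixel and all pixels, and the softmax turns them into probabilities: the larger the softmax value, the more reliable the match and the stronger the dependency.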
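The position attention steps above can be sketched in NumPy as follows. This is a minimal illustration, not the official implementation: the three convolution layers are stood in for by plain C×C matrices (`Wb`, `Wc`, `Wd`, hypothetical names, equivalent to 1×1 convolutions without BN or ReLU), and a single image is processed without a batch dimension.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def position_attention(A, Wb, Wc, Wd, alpha=0.0):
    """A: (C, H, W) feature map. Wb, Wc, Wd: (C, C) stand-ins for the conv layers."""
    C, H, W = A.shape
    A_flat = A.reshape(C, H * W)   # (C, N)
    B = Wb @ A_flat                # (C, N)
    Cm = Wc @ A_flat               # (C, N), named Cm to avoid clashing with C
    D = Wd @ A_flat                # (C, N)

    energy = Cm.T @ B              # (N, N), energy[j, i] = B_i . C_j
    S = softmax(energy, axis=1)    # S[j, i] = s_ji, each row sums to 1
    out = D @ S.T                  # (C, N), out[:, j] = sum_i s_ji * D_i
    E = alpha * out + A_flat       # residual connection with A
    return E.reshape(C, H, W), S
```

With `alpha=0.0`, its initial value, the module returns A unchanged, so training starts from the plain backbone features and the attention branch is blended in gradually.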

3.2.2 Channel Attention Module

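Each channel map of a high-level feature can be regarded as a class-specific response, and different semantic responses are associated with each other. By exploring the interdependencies between channel maps, the module can emphasize interdependent feature maps and improve the feature representation of specific semantics.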

By exploiting the interdependencies between channel maps, we could emphasize interdependent feature maps and improve the feature representation of specific semantics.

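In the channel attention module, A is reshaped to C×N, and a second copy is reshaped and transposed to N×C.
The two resulting matrices are multiplied and a softmax is applied to obtain the channel attention map X (C×C).
X is multiplied with A, the result is scaled by the coefficient β and reshaped back to C×H×W, and A is added to give the final output E. β is initialized to 0 and gradually learns a larger weight.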

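Each element of the matrix X is:

$$x_{ji} = \frac{\exp(A_i \cdot A_j)}{\sum_{i=1}^{C} \exp(A_i \cdot A_j)}$$

$x_{ji}$ measures the impact of channel $i$ on channel $j$.

Each element of E is:

$$E_j = \beta \sum_{i=1}^{C} (x_{ji} A_i) + A_j$$

where $\beta$ is the scale factor, $A_i$ is an element (channel) of A, and $x_{ji}$ is an element of X.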
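The channel attention steps above can be sketched the same way. This NumPy version follows the steps as written in these notes for a single image without a batch dimension; it is an illustration, not the official code, and the function names are made up for this note.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def channel_attention(A, beta=0.0):
    """A: (C, H, W) feature map. No convolution is applied before the attention."""
    C, H, W = A.shape
    A_flat = A.reshape(C, H * W)   # (C, N)

    energy = A_flat @ A_flat.T     # (C, C), energy[j, i] = A_i . A_j
    X = softmax(energy, axis=1)    # X[j, i] = x_ji, each row sums to 1
    out = X @ A_flat               # (C, N), out[j] = sum_i x_ji * A_i
    E = beta * out + A_flat        # residual connection with A
    return E.reshape(C, H, W), X
```

Unlike the position branch, the attention map here is only C×C, so its cost does not grow with the N×N term.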

3.2.3 Attention Module Embedding with Networks

The output of each attention module is first transformed by a convolution layer, and the two results are combined by an element-wise sum to accomplish feature fusion. A final convolution layer then generates the prediction map.
The authors do not adopt a cascading operation because it needs more GPU memory. The attention modules are simple and can be inserted directly into an existing FCN pipeline, where they improve results noticeably without adding many parameters.
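The fusion step can be sketched as below. This is a hedged illustration only: every convolution is modelled as a 1×1 convolution (an assumption, since the notes do not give kernel sizes), and `conv1x1` and `fuse_and_predict` are hypothetical helper names.

```python
import numpy as np


def conv1x1(x, weight, bias):
    """1x1 convolution. x: (C_in, H, W), weight: (C_out, C_in), bias: (C_out,)."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]


def fuse_and_predict(pam_out, cam_out, w_p, b_p, w_c, b_c, w_cls, b_cls):
    """Transform each branch, sum element-wise, then map to per-class scores."""
    p = conv1x1(pam_out, w_p, b_p)      # transformed position-attention features
    c = conv1x1(cam_out, w_c, b_c)      # transformed channel-attention features
    fused = p + c                       # element-wise sum fusion
    return conv1x1(fused, w_cls, b_cls)  # (num_classes, H, W) score map
```

Summation keeps the fused tensor at C channels, whereas concatenating the two branches would double the channel count of the layer that follows.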

3.3 Datasets:

Cityscapes: The dataset has 5,000 images captured from 50 different cities. Each image has 2048 × 1024 pixels, with high-quality pixel-level labels for 19 semantic classes. There are 2,975 images in the training set, 500 images in the validation set and 1,525 images in the test set.
PASCAL VOC 2012: The dataset has 10,582 images for training, 1,449 images for validation and 1,456 images for testing, which involves 20 foreground object classes and one background class.
PASCAL Context: The dataset provides detailed semantic labels for whole scenes and contains 4,998 images for training and 5,105 images for testing. The paper evaluates the method on the 59 most frequent classes plus one background category (60 classes in total).

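3.4 Training:

Poly learning rate policy: after each iteration, the initial learning rate is multiplied by $(1-\frac{iter}{total\_iter})^{0.9}$.
Initial learning rate 0.01, momentum 0.9, weight decay 0.0001.
Batch size 8 for Cityscapes and 16 for the other datasets.
Random cropping and horizontal flipping.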
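The poly learning rate policy listed above is easy to write down as a function. This is a minimal sketch; the function name `poly_lr` and the step counts in the usage lines are illustrative, and the same formula applies whatever unit the step counter uses.

```python
def poly_lr(base_lr, iteration, total_iters, power=0.9):
    """Poly schedule: base_lr * (1 - iteration / total_iters) ** power."""
    return base_lr * (1 - iteration / total_iters) ** power


# Illustrative usage with the base learning rate from the notes (0.01)
# and a made-up total of 1000 steps.
for step in (0, 250, 500, 750, 1000):
    print(step, poly_lr(0.01, step, 1000))
```

The rate starts at the base value and decays smoothly to zero at the last step, slightly faster than a linear ramp early on because the exponent is below 1.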

4. Experimental Results

4.1 Ablation Study for Attention Modules

4.2 Ablation Study for Improvement Strategies

Improvement strategies:

(1) DA: data augmentation with random scaling.
(2) Multi-Grid: a hierarchy of grids of different sizes (4, 8, 16) is employed in the last ResNet block.
(3) MS: multi-scale testing. The segmentation probability maps from 8 image scales {0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, 2.2} are averaged for inference.
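The multi-scale (MS) testing strategy can be sketched as follows. This is an illustration under stated assumptions: `predict_probs` is a hypothetical callable standing in for the trained network (it takes a (3, h, w) image and returns (K, h, w) class probabilities), and nearest-neighbour resizing is used only to keep the example dependency-free; real pipelines typically use bilinear interpolation.

```python
import numpy as np

SCALES = (0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0, 2.2)


def resize_nearest(x, out_h, out_w):
    """Nearest-neighbour resize of a (K, H, W) array to (K, out_h, out_w)."""
    _, h, w = x.shape
    rows = np.minimum((np.arange(out_h) * h / out_h).astype(int), h - 1)
    cols = np.minimum((np.arange(out_w) * w / out_w).astype(int), w - 1)
    return x[:, rows][:, :, cols]


def multi_scale_predict(image, predict_probs, scales=SCALES):
    """Average class-probability maps over several input scales."""
    _, H, W = image.shape
    total = None
    for s in scales:
        h = max(1, int(round(H * s)))
        w = max(1, int(round(W * s)))
        scaled = resize_nearest(image, h, w)
        # Bring each prediction back to the original resolution before averaging.
        probs = resize_nearest(predict_probs(scaled), H, W)
        total = probs if total is None else total + probs
    return total / len(scales)
```

The final label map would then be `multi_scale_predict(image, predict_probs).argmax(axis=0)`.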

4.3 Visualization and Comparing with State-of-the-art

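5. Conclusions and Reflections

5.1 Authors' Conclusions

The paper proposes a Dual Attention Network (DANet) for scene segmentation, which uses the self-attention mechanism to adaptively integrate local semantic features. The authors introduce a position attention module and a channel attention module to capture global dependencies in the spatial and channel dimensions.

5.2 Highlights and Possible Improvements

DANet achieves outstanding results on Cityscapes, PASCAL VOC 2012, PASCAL Context and COCO Stuff. As future work, the authors hope to reduce the computational complexity and improve the robustness of the model.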
