BiDAF

A machine reading comprehension paper from ICLR 2017.

Project page: allenai.github.io

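Attention mechanisms have led to good results on many machine reading comprehension and question answering tasks, but in earlier work the use of attention typically has the following three characteristics:

attention is used to summarize the context into a fixed-size representation;
attention is usually temporally coupled, so the attention at the current time step depends on the attention result from the previous step;
attention is usually unidirectional, typically with the query attending on the context paragraph or the image.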

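The paper therefore proposes BiDAF, which computes both context2query and query2context attention. The attention at each time step is computed independently of the other steps and flows on to the next layer, which avoids the information loss caused by summarizing too early. The overall model structure is shown in the figure below; each layer is described in turn.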

Char Embedding Layer

Character embeddings are learned with a CNN, following Kim (2014).
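As a rough illustration of the Kim (2014) style character encoder, the sketch below slides 1D filters over a word's character vectors and max-pools each filter's responses over positions. The function name `char_cnn_features`, the list-of-lists layout, the missing nonlinearity and bias, and the fallback value for words shorter than a filter are all assumptions made for this example, not details taken from the paper.

```python
def char_cnn_features(char_vecs, filters):
    """Max-over-time pooled responses of 1D filters over a word's characters.

    char_vecs: list of character vectors, one per character in the word.
    filters:   list of filters; each filter is a list of `width` weight
               vectors with the same length as a character vector.
    Returns one number per filter (the word's character-level features).
    """
    features = []
    for filt in filters:
        width = len(filt)
        responses = []
        # Slide the filter over every window of `width` consecutive characters.
        for start in range(len(char_vecs) - width + 1):
            window = char_vecs[start:start + width]
            responses.append(sum(
                w * x
                for w_vec, x_vec in zip(filt, window)
                for w, x in zip(w_vec, x_vec)
            ))
        # Max-pool over positions. Assumption: a word shorter than the
        # filter contributes 0.0 instead of being padded.
        features.append(max(responses, default=0.0))
    return features
```

Each word ends up with one feature per filter regardless of its length, which is what lets the result be concatenated with a fixed-size word vector.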

Word Embedding Layer

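Word embeddings use pre-trained GloVe vectors.

The word and character vectors are concatenated and passed through a two-layer Highway Network, which outputs the context $\mathbf{X} \in \mathbb{R}^{d \times T}$ and the query $\mathbf{Q} \in \mathbb{R}^{d \times J}$.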
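A highway layer mixes a transformed version of its input with the input itself through a learned gate. The sketch below is a minimal plain-Python version under stated assumptions: the helper names (`matvec`, `highway_layer`, `highway_network`) are hypothetical, and the ReLU candidate activation is an illustrative choice rather than something the notes above specify.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def highway_layer(x, W_h, b_h, W_t, b_t):
    """y = t * relu(W_h x + b_h) + (1 - t) * x, with t = sigmoid(W_t x + b_t)."""
    # Candidate transformation (ReLU is an assumption for this sketch).
    h = [max(0.0, v + b) for v, b in zip(matvec(W_h, x), b_h)]
    # Transform gate in (0, 1), one value per dimension.
    t = [1.0 / (1.0 + math.exp(-(v + b))) for v, b in zip(matvec(W_t, x), b_t)]
    # Gate between the transformed candidate and the untouched input.
    return [ti * hi + (1.0 - ti) * xi for ti, hi, xi in zip(t, h, x)]


def highway_network(x, layers):
    """Apply several highway layers; `layers` is a list of (W_h, b_h, W_t, b_t)."""
    for W_h, b_h, W_t, b_t in layers:
        x = highway_layer(x, W_h, b_h, W_t, b_t)
    return x
```

With a strongly negative gate bias the layer simply passes its input through, which is the property that makes stacking such layers easy to train.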

Contextual Embedding Layer

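A Bi-LSTM is applied on top of the Highway Network outputs to better capture the relationships between words within a sentence. It outputs the context $\mathbf{H} \in \mathbb{R}^{2d \times T}$ and the query $\mathbf{U} \in \mathbb{R}^{2d \times J}$.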

Attention Flow Layer

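First compute the pairwise similarity between every context word and every query word:

$\mathbf{S}_{tj} = \alpha(\mathbf{H}_{:t}, \mathbf{U}_{:j}) \in \mathbb{R}$, where $\alpha$ is trainable; in this paper $\alpha(\mathbf{h}, \mathbf{u}) = \mathbf{w}_{(S)}^\top[\mathbf{h}; \mathbf{u}; \mathbf{h} \odot \mathbf{u}]$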
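The similarity function can be sketched in a few lines of plain Python. Here `H` is stored as a list of T context vectors (so `H[t]` plays the role of column t) and `U` as a list of J query vectors; this layout and the function names are assumptions made for illustration.

```python
def similarity(h, u, w):
    """alpha(h, u) = w^T [h; u; h * u], where w has length 3 * len(h)."""
    # Concatenate h, u and their elementwise product.
    feats = h + u + [hi * ui for hi, ui in zip(h, u)]
    return sum(wi * fi for wi, fi in zip(w, feats))


def similarity_matrix(H, U, w):
    """Return S as T rows of J scores, with S[t][j] = alpha(H[t], U[j])."""
    return [[similarity(h, u, w) for u in U] for h in H]
```

For example, with `w` set to zero on the first two thirds and one on the last third, the score reduces to the dot product of the two vectors.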

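Next compute the context-to-query attention. For each context word, the query words are summed with the attention coefficients as weights, and the result serves as that word's query-aware representation:

$\mathbf{a}_t = \text{softmax}(\mathbf{S}_{t:}) \in \mathbb{R}^J$

$\tilde{\mathbf{U}}_{:t} = \sum_j \mathbf{a}_{tj}\mathbf{U}_{:j} \in \mathbb{R}^{2d}$, so that $\tilde{\mathbf{U}} \in \mathbb{R}^{2d \times T}$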
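A minimal sketch of context-to-query attention, assuming the similarity scores are given as T rows of J numbers and the query as a list of J vectors. The names `softmax` and `context_to_query` are hypothetical helpers for this example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of numbers."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def context_to_query(S, U):
    """For each context position t, return sum_j a[t][j] * U[j].

    S: T rows of J similarity scores.
    U: J query vectors of equal length.
    Returns T attended query vectors (one per context word).
    """
    dim = len(U[0])
    attended = []
    for row in S:
        a = softmax(row)  # attention over the J query words
        attended.append(
            [sum(a[j] * U[j][k] for j in range(len(U))) for k in range(dim)]
        )
    return attended
```

Note that the output has one vector per context word, not per query word: every context position gets its own weighted summary of the query.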

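Next compute the query-to-context attention. The similarity between the query and a context word is taken to be the largest similarity between that context word and any query word. A softmax over these per-word maxima gives the attention coefficients, which are then used to form a weighted sum of the context words:

$\mathbf{b} = \text{softmax}(\text{max}_{col}(\mathbf{S}))$

$\tilde{\mathbf{h}} = \sum_t \mathbf{b}_t \mathbf{H}_{:t} \in \mathbb{R}^{2d}$

Tiling $\tilde{\mathbf{h}}$ $T$ times gives $\tilde{\mathbf{H}} \in \mathbb{R}^{2d \times T}$.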
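The query-to-context step can be sketched the same way. The code below assumes the similarity scores come as T rows of J numbers and the context as a list of T vectors; `softmax` and `query_to_context` are hypothetical helper names.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of numbers."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def query_to_context(S, H):
    """Return the attended context vector tiled once per context position.

    S: T rows of J similarity scores.
    H: T context vectors of equal length.
    """
    # For each context word, keep its best match over all query words.
    max_sim = [max(row) for row in S]
    b = softmax(max_sim)  # attention over the T context words
    dim = len(H[0])
    h_tilde = [sum(b[t] * H[t][k] for t in range(len(H))) for k in range(dim)]
    # Tile the single summary vector T times (independent copies).
    return [list(h_tilde) for _ in H]
```

Every position receives the same vector here, so the tiling only exists to make the shapes line up with the other per-position representations.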

The several representations of the context are then combined into the final query-aware representation:

$\mathbf{G}_{:t} = \beta(\mathbf{H}_{:t}, \tilde{\mathbf{U}}_{:t}, \tilde{\mathbf{H}}_{:t})$, where $\beta$ may be any trainable function; in this paper $\beta(\mathbf{h}, \tilde{\mathbf{u}}, \tilde{\mathbf{h}}) = [\mathbf{h}; \tilde{\mathbf{u}}; \mathbf{h} \odot \tilde{\mathbf{u}}; \mathbf{h} \odot \tilde{\mathbf{h}}] \in \mathbb{R}^{8d}$, so that $\mathbf{G} \in \mathbb{R}^{8d \times T}$
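The concatenation form of the merge step is short enough to write out directly. This sketch assumes all three inputs are lists of T vectors of the same length; `merge` is a hypothetical name.

```python
def merge(H, U_tilde, H_tilde):
    """Build G[t] = [h; u~; h * u~; h * h~] for every context position t."""
    G = []
    for h, u_t, h_t in zip(H, U_tilde, H_tilde):
        G.append(
            h
            + u_t
            + [a * b for a, b in zip(h, u_t)]  # h elementwise-times u~
            + [a * b for a, b in zip(h, h_t)]  # h elementwise-times h~
        )
    return G
```

Each output vector is four times as long as an input vector, which matches the jump from 2d to 8d.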

Modeling Layer

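$\mathbf{G}$ is passed directly through a BiLSTM to obtain $\mathbf{M}$. The structure is the same as in the Contextual Embedding Layer, but the output here captures not only the relationships among context words but also information from the query.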

Output Layer

$\mathbf{p}^1 = \text{softmax}(\mathbf{w}_{(p^1)}^\top[\mathbf{G}; \mathbf{M}])$

$\mathbf{p}^2 = \text{softmax}(\mathbf{w}_{(p^2)}^\top[\mathbf{G}; \mathbf{M}^2])$, where $\mathbf{M}^2$ is obtained by passing $\mathbf{M}$ through another BiLSTM

$L(\theta) = -\frac{1}{N}\sum_i^N \left[\log(\mathbf{p}^1_{y_i^1}) + \log(\mathbf{p}^2_{y_i^2})\right]$
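To make the output layer and the loss concrete, here is a small sketch under stated assumptions: `G`, `M` are lists of T per-position vectors, the distributions for a batch are passed as lists, and the names `span_distribution` and `span_loss` are hypothetical. The same `span_distribution` function serves for both the start and the end distribution, given the matching weight vector and modeling output.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of numbers."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def span_distribution(w, G, M):
    """p = softmax(w^T [G; M]) over the T context positions."""
    # Concatenate G[t] and M[t] per position, then score with w.
    scores = [sum(wi * fi for wi, fi in zip(w, g + m)) for g, m in zip(G, M)]
    return softmax(scores)


def span_loss(p1_batch, p2_batch, starts, ends):
    """Mean negative log-likelihood of the true start and end indices."""
    total = 0.0
    for p1, p2, y1, y2 in zip(p1_batch, p2_batch, starts, ends):
        total += math.log(p1[y1]) + math.log(p2[y2])
    return -total / len(starts)
```

With uniform start and end distributions over two positions, the loss for one example is 2 log 2, the sum of the two log-losses.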

Experiment and Result

At the time, the model achieved the best results on both the SQuAD dataset and the cloze-style CNN/Daily Mail datasets.

Visualization and Discussion

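To show the effect of the hierarchical encoding, the authors finish by analyzing word-to-word similarity at the Word Embedding layer and at the Contextual Embedding layer. For example, at the Word Embedding layer the words most similar to When are when, after, before and the like, whereas at the Contextual Embedding layer its similar words include many years and numbers. This suggests that the Contextual Embedding layer integrates contextual information and gives the subsequent Attention Flow layer a better representation to work with.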

At the word embedding layer, query words such as When, Where and Who are not well aligned to possible answers in the context, but this dramatically changes in the contextual embedding layer which has access to context from surrounding words and is just 1 layer below the attention layer.

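The authors also give the example of may and May: at the Word Embedding layer the two lie fairly close together, while at the Contextual Embedding layer they are far apart.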

An interesting pattern emerges in the Word space, where May is separated from the rest of the months because May has multiple meanings in the English language. The contextual embedding layer uses contextual cues from surrounding words and is able to separate the usages of the word May.

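The authors also visualize the effect of the Attention Flow layer. In the first example, where matches place names well; in the second, many matches quantity words; and entity nouns mostly match themselves. All of this provides rich features for locating the answer in the later layers.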