R-Net

A machine reading comprehension paper from Microsoft Research Asia. It is a follow-up to the ACL 2017 paper "Gated Self-Matching Networks for Reading Comprehension and Question Answering". https://www.microsoft.com/en-us/research/wp-content/uploads/2017/05/r-net.pdf?


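The paper makes two main contributions:

First, it uses a gated attention mechanism when computing the passage-question interaction, which filters out irrelevant information early on.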

account for the fact that words in the passage are of different importance to answer a particular question for reading comprehension and question answering

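Second, because passages tend to be long, the recurrent network before the output layer may fail to integrate information from the whole passage, so the paper applies a self-attention mechanism over the passage.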

However, recurrent networks can only memorize limited passage context in practice despite its theoretical capability. One answer candidate is often unaware of the clues in other parts of the passage. To address this problem, we propose a self-matching layer to dynamically refine passage representation with information from the whole passage

The overall architecture is shown in the figure below. It has four main parts: Encoding Layer, Gated Matching Layer, Self-Matching Layer, and Boundary Prediction Layer, each covered in turn below.

Encoding Layer

$Q = \lbrace w_t^Q \rbrace_{t=1}^m$

$P = \lbrace w_t^P \rbrace_{t=1}^n$

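The words are first mapped into vector space, then the word embedding and character embedding of each word are concatenated and encoded with an RNN. The paper uses GRU, since it performs about the same in experiments and saves some computation.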

$u_t^Q = \text{BiRNN}_Q(u_{t-1}^Q, [e_t^Q, c_t^Q])$

$u_t^P = \text{BiRNN}_P(u_{t-1}^P, [e_t^P, c_t^P])$
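To make the encoding step concrete, here is a minimal NumPy sketch of a bidirectional GRU over concatenated word and character embeddings. It is only an illustration under assumptions: the helper names (`gru_step`, `run_gru`, `bi_gru_encode`, `init_gru_params`), the random weights, the omission of bias terms, and all dimensions are hypothetical, and the character embeddings are taken as given rather than computed by a character-level RNN.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h_prev, x, params):
    """One GRU step; each weight matrix acts on the concatenation [h, x]."""
    W_z, W_r, W_h = params
    hx = np.concatenate([h_prev, x])
    z = sigmoid(W_z @ hx)                                   # update gate
    r = sigmoid(W_r @ hx)                                   # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x]))
    return (1.0 - z) * h_prev + z * h_tilde

def run_gru(inputs, params, hidden):
    """Run a GRU over a (length, in_dim) sequence, starting from a zero state."""
    h = np.zeros(hidden)
    states = []
    for x in inputs:
        h = gru_step(h, x, params)
        states.append(h)
    return np.stack(states)

def bi_gru_encode(word_emb, char_emb, fwd_params, bwd_params, hidden):
    """u_t = BiRNN(u_{t-1}, [e_t, c_t]): forward and backward states, concatenated."""
    inputs = np.concatenate([word_emb, char_emb], axis=1)   # [e_t, c_t]
    fwd = run_gru(inputs, fwd_params, hidden)
    bwd = run_gru(inputs[::-1], bwd_params, hidden)[::-1]   # re-align to time order
    return np.concatenate([fwd, bwd], axis=1)

def init_gru_params(rng, in_dim, hidden):
    """Hypothetical random initialization, for illustration only."""
    return tuple(rng.normal(scale=0.1, size=(hidden, hidden + in_dim))
                 for _ in range(3))

rng = np.random.default_rng(0)
n, d_word, d_char, hidden = 5, 8, 4, 6
e_P = rng.normal(size=(n, d_word))   # word embeddings of the passage
c_P = rng.normal(size=(n, d_char))   # character-level embeddings (assumed given)
u_P = bi_gru_encode(e_P, c_P,
                    init_gru_params(rng, d_word + d_char, hidden),
                    init_gru_params(rng, d_word + d_char, hidden),
                    hidden)
print(u_P.shape)  # (5, 12)
```

The question is encoded the same way with its own parameters, giving $u_t^Q$.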

Gated Matching Layer

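First, compute the question-aware passage representation:

$v_t^P = \text{RNN}(v_{t-1}^P, c_t)$

$s_j^t = \text{v}^\top \text{tanh}(W_u^Q u_j^Q + W_u^P u_t^P + W_v^P v_{t-1}^P)$ computes the similarity between the $t$-th passage word and the $j$-th question word. Besides the representations of these two words, it also uses $v_{t-1}^P$, the question-aware representation of the previous passage word.

$a_i^t = \text{exp}(s_i^t) / \sum_{j=1}^{m} \text{exp}(s_j^t)$ normalizes the similarities.

$c_t = \sum_{i=1}^{m} a_i^t u_i^Q$ gives the attention of the $t$-th passage word over the question.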

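Up to this point, this is the approach of Rocktäschel et al. (2015). Then, following Wang & Jiang (2016a), $u_t^P$ is fed in directly as an additional input when computing $v_t^P$:

$v_t^P = \text{RNN}(v_{t-1}^P, [u_t^P, c_t])$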

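A gate is then added on top of this:

$g_t = \text{sigmoid}(W_g [u_t^P, c_t])$

All subsequent computations use the gated input $[u_t^P, c_t]^* = g_t \odot [u_t^P, c_t]$ in place of $[u_t^P, c_t]$.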
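The gated matching step can be sketched in NumPy as follows. This is a rough illustration rather than the authors' implementation: the function name `gated_match_input`, the random weights, and the dimensions are made up, and a plain tanh cell stands in for the GRU to keep the example short.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_match_input(u_Q, u_P_t, v_prev, W_uQ, W_uP, W_vP, v, W_g):
    """Return the gated RNN input g_t * [u_t^P, c_t] and the attention weights a^t."""
    # s_j^t = v^T tanh(W_u^Q u_j^Q + W_u^P u_t^P + W_v^P v_{t-1}^P), for all j at once
    s = np.tanh(u_Q @ W_uQ.T + W_uP @ u_P_t + W_vP @ v_prev) @ v
    a = softmax(s)                      # a_i^t
    c_t = a @ u_Q                       # c_t = sum_i a_i^t u_i^Q
    x = np.concatenate([u_P_t, c_t])    # [u_t^P, c_t]
    g = sigmoid(W_g @ x)                # g_t
    return g * x, a

rng = np.random.default_rng(0)
m, n, d, k, h = 4, 6, 5, 7, 3           # hypothetical sizes
u_Q, u_P = rng.normal(size=(m, d)), rng.normal(size=(n, d))
W_uQ, W_uP = rng.normal(size=(k, d)), rng.normal(size=(k, d))
W_vP, v = rng.normal(size=(k, h)), rng.normal(size=k)
W_g = rng.normal(size=(2 * d, 2 * d))
W_x, W_h = rng.normal(size=(h, 2 * d)), rng.normal(size=(h, h))

v_prev = np.zeros(h)
states = []
for t in range(n):
    x_gated, a = gated_match_input(u_Q, u_P[t], v_prev, W_uQ, W_uP, W_vP, v, W_g)
    # Assumption: a simple tanh cell replaces the GRU used in the paper.
    v_prev = np.tanh(W_x @ x_gated + W_h @ v_prev)
    states.append(v_prev)
v_P = np.stack(states)
print(v_P.shape)  # (6, 3)
```

Because the gate is computed from both $u_t^P$ and $c_t$, it can scale down input dimensions of passage words that have little to do with the question before they reach the recurrent cell.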

Self-Matching Layer

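$h_t^P = \text{BiRNN}(h_{t-1}^P, [v_t^P, c_t])$. Note that $c_t$ here is not the same as the $c_t$ in the previous section.

$s_j^t = \text{v}^\top \text{tanh}(W_v^P v_j^P + W_v^{\tilde{P}} v_t^P)$ computes the similarity between the $j$-th and the $t$-th passage words.

$a_i^t = \exp(s_i^t) / \sum_{j=1}^{n} \exp(s_j^t)$ normalizes the similarities.

$c_t = \sum_{i=1}^{n} a_i^t v_i^P$ gives the self-attention of the $t$-th passage word.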
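Since the self-matching scores depend only on $v^P$, they can be computed for all pairs $(t, j)$ at once. The sketch below shows one way to do that in NumPy; the function name `self_matching_context`, the random weights, and the sizes are assumptions made for illustration.

```python
import numpy as np

def self_matching_context(v_P, W_v, W_v_tilde, v):
    """Return contexts c (n, h) and attention A (n, n); row t attends over all j."""
    proj_j = v_P @ W_v.T          # (n, k): W_v^P v_j^P
    proj_t = v_P @ W_v_tilde.T    # (n, k): W_v^{~P} v_t^P
    # scores[t, j] = v^T tanh(W_v^P v_j^P + W_v^{~P} v_t^P)
    scores = np.tanh(proj_t[:, None, :] + proj_j[None, :, :]) @ v
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)             # softmax over j for each t
    return A @ v_P, A                             # c_t = sum_i a_i^t v_i^P

rng = np.random.default_rng(0)
n, h, k = 6, 3, 4                                 # hypothetical sizes
v_P = rng.normal(size=(n, h))
c, A = self_matching_context(v_P,
                             rng.normal(size=(k, h)),
                             rng.normal(size=(k, h)),
                             rng.normal(size=k))
print(c.shape, A.shape)  # (6, 3) (6, 6)
```

Each row of `c` would then be concatenated with the matching row of `v_P` and fed to the BiRNN that produces $h_t^P$. The attention matrix is $n \times n$, so memory grows quadratically with passage length.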

Boundary Prediction Layer

$s_j^t = \text{v}^\top \text{tanh}(W_h^P h_j^P + W_h^a h_{t-1}^a)$, where $t = 0$ computes the start position and $t = 1$ computes the end position.

$a_i^t = \text{exp}(s_i^t) / \sum_{j=1}^{n} \text{exp}(s_j^t)$

$p^t = \text{argmax}(a_1^t, \ldots, a_n^t)$

$c_t = \sum_{i=1}^{n} a_i^t h_i^P$

$h_t^a = \text{RNN}(h_{t-1}^a, c_t)$

$h_{t-1}^a$ is initialized with an attention-pooled vector over the question, computed as follows:

$s_j = \text{v}^\top \text{tanh}(W_u^Q u_j^Q + W_v^Q V_r^Q)$

$a_i = \text{exp}(s_i) / \sum_{j=1}^{m} \text{exp}(s_j)$

$r^Q = \sum_{i=1}^{m} a_i u_i^Q$

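This part is essentially a Pointer Network; only the initialization is somewhat more involved.

Laying out the order of computation makes it a bit clearer: first compute $r^Q$ and use it as the initial value of $h_{t-1}^a$, then compute $s_j^t$, $a_i^t$, $p^t$, $c_t$, and $h_t^a$ in turn.

The two resulting values of $p^t$ are the start and end positions of the answer span.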
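The computation order described above can be traced in a short NumPy sketch. It is only a rough illustration: `question_pooling`, `predict_span`, and `tanh_cell` are hypothetical names, the weights are random, the sizes are arbitrary, and a tanh cell again stands in for the GRU.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def question_pooling(u_Q, W_uQ, W_vQ, V_rQ, v):
    """r^Q: attention-pooled question vector, used as the initial answer state."""
    s = np.tanh(u_Q @ W_uQ.T + W_vQ @ V_rQ) @ v   # s_j
    return softmax(s) @ u_Q                       # r^Q = sum_i a_i u_i^Q

def predict_span(h_P, h_a, W_hP, W_ha, v, rnn_step):
    """Run two pointer steps and return (start, end) indices into the passage."""
    pointers = []
    for _ in range(2):                            # first step: start, second: end
        s = np.tanh(h_P @ W_hP.T + W_ha @ h_a) @ v
        a = softmax(s)
        pointers.append(int(np.argmax(a)))        # p^t
        c = a @ h_P                               # c_t = sum_i a_i^t h_i^P
        h_a = rnn_step(h_a, c)                    # h_t^a = RNN(h_{t-1}^a, c_t)
    return tuple(pointers)

rng = np.random.default_rng(0)
m, n, d, hp, k, q = 4, 9, 5, 6, 7, 3              # hypothetical sizes
u_Q, h_P = rng.normal(size=(m, d)), rng.normal(size=(n, hp))
r_Q = question_pooling(u_Q, rng.normal(size=(k, d)), rng.normal(size=(k, q)),
                       rng.normal(size=q), rng.normal(size=k))

W_c, W_r = rng.normal(size=(d, hp)), rng.normal(size=(d, d))

def tanh_cell(h, c):
    # Assumption: a simple tanh cell replaces the GRU used in the paper.
    return np.tanh(W_c @ c + W_r @ h)

start, end = predict_span(h_P, r_Q, rng.normal(size=(k, hp)),
                          rng.normal(size=(k, d)), rng.normal(size=k), tanh_cell)
print(start, end)
```

Taking the argmax independently at each step does not guarantee that the start index comes before the end index, so a practical decoder would likely add a constraint or a search over valid spans.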

Result and Discussion

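At the time, R-Net achieved the best results on both the SQuAD and MS-MARCO datasets. Of the listed entries, R-NET (Wang et al., 2017) is the model submitted to ACL 2017, and R-NET (March 2017) is a later improved version, whose main change is a few extra GRU layers before the Output Layer to further integrate the passage representation.

Finally, in the discussion section the authors list some ideas they tried that did not help on the SQuAD dataset: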

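sentence ranking: since the answer lies within a single sentence, the authors tried ranking sentences to help locate the answer span. Neither training a separate sentence ranking model nor training sentence ranking jointly improved the final result. According to the authors' analysis, the best sentence model reaches only 86% accuracy, while the span prediction model is already 92% accurate at predicting the sentence, so it is not surprising that this did not help.
syntax information: adding syntactic features, a tree-LSTM, and dependency parsing all failed to help, which is rather surprising. The three attempts were: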

Adding syntactic features, such as POS tags, NER results, linearized PCFG tree tags and dependency labels.
Adding a tree-LSTM-like module after the encoding layer: "We use a multi-input LSTM to build hidden states following dependency tree paths in both top-down and bottom-up passes."
Jointly training dependency parsing.