BiDAF

An ICLR 2017 paper on machine reading comprehension.

Link: BiDAF (allenai.github.io)

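Attention mechanisms have brought strong results to machine reading comprehension and question answering. In earlier work, however, attention was typically used with the following three characteristics: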

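Attention is used to summarize the context into a single fixed-size representation;
Attention is usually temporally dynamic: the attention weights at the current time step depend on the attention result of the previous time step;
Attention is usually uni-directional: typically the query attends to the context paragraph or the image.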

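To address this, the paper proposes BiDAF (Bi-Directional Attention Flow), which computes attention in both directions: context-to-query and query-to-context. The attention at each time step is computed independently of the previous time steps, and the attended vectors flow on to the next layer, which avoids the information loss caused by summarizing too early. The original post shows the overall architecture in a figure; the layers are described one by one below.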

Char Embedding Layer

Character-level embeddings are obtained with a CNN, following Kim (2014).
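As a rough illustration of the character-level CNN idea, the sketch below slides width-k filters over one word's character embeddings and max-pools over positions, so every word maps to a fixed-size vector regardless of its length. The function name `char_cnn_word_vector`, the array shapes, and the toy sizes are assumptions made for illustration, not the paper's configuration.

```python
import numpy as np

def char_cnn_word_vector(char_embs, filters):
    """Embed one word from its characters (assumes word length >= filter width).

    char_embs: (L, c) array, one c-dimensional embedding per character.
    filters:   (n_filters, k, c) array of width-k convolution filters.
    Returns a (n_filters,) vector.
    """
    L = char_embs.shape[0]
    k = filters.shape[1]
    # All length-k character windows of the word (a "valid" 1D convolution).
    windows = np.stack([char_embs[i:i + k] for i in range(L - k + 1)])
    # Dot every window with every filter: shape (L - k + 1, n_filters).
    feature_maps = np.einsum("wkc,fkc->wf", windows, filters)
    # Max-over-time pooling keeps the strongest response of each filter.
    return feature_maps.max(axis=0)

rng = np.random.default_rng(0)
filters = rng.normal(size=(8, 3, 4))    # 8 filters, width 3, char dim 4 (toy sizes)
short_word = rng.normal(size=(5, 4))    # a 5-character word
long_word = rng.normal(size=(11, 4))    # an 11-character word
print(char_cnn_word_vector(short_word, filters).shape)
print(char_cnn_word_vector(long_word, filters).shape)
```

Both calls return a vector with one entry per filter, which is what lets words of different lengths share the same embedding size.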

Word Embedding Layer

Uses pre-trained GloVe word vectors.

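The word and character embeddings are concatenated and passed through a two-layer Highway Network, producing the context matrix $\mathbf{X} \in \mathbb{R}^{d \times T}$ and the query matrix $\mathbf{Q} \in \mathbb{R}^{d \times J}$, where $T$ and $J$ are the numbers of context and query words.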
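A minimal sketch of this step, assuming the standard highway formulation $y = t \odot g(W_h x + b_h) + (1 - t) \odot x$ with gate $t = \sigma(W_t x + b_t)$. The ReLU nonlinearity, the random weights, and the names `highway_layer`, `word_emb`, and `char_emb` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_h, b_h, W_t, b_t):
    """One highway layer applied to every column of x (shape d x T)."""
    h = np.maximum(0.0, W_h @ x + b_h)   # candidate transform (ReLU is an assumption)
    t = sigmoid(W_t @ x + b_t)           # transform gate, values in (0, 1)
    return t * h + (1.0 - t) * x         # mix the transformed and the carried input

rng = np.random.default_rng(0)
d, T = 6, 4
word_emb = rng.normal(size=(d // 2, T))           # stand-in for GloVe vectors
char_emb = rng.normal(size=(d // 2, T))           # stand-in for char-CNN vectors
x = np.concatenate([word_emb, char_emb], axis=0)  # concatenated input, d x T

# Two stacked highway layers with hypothetical random parameters.
layers = []
for _ in range(2):
    layers.append({
        "W_h": rng.normal(size=(d, d)), "b_h": np.zeros((d, 1)),
        "W_t": rng.normal(size=(d, d)), "b_t": np.zeros((d, 1)),
    })

X = x
for p in layers:
    X = highway_layer(X, p["W_h"], p["b_h"], p["W_t"], p["b_t"])
print(X.shape)
```

The output keeps the input shape, so the layer can be stacked freely; when the gate is close to zero the layer simply passes its input through.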

Contextual Embedding Layer

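A Bi-LSTM is applied on top of the Highway Network output to better capture the relationships between words within a sentence. Concatenating the two directions gives the context $\mathbf{H} \in \mathbb{R}^{2d \times T}$ and the query $\mathbf{U} \in \mathbb{R}^{2d \times J}$.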
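To make the $2d$ output dimension concrete, here is a bare-bones bidirectional LSTM sketch. The gate ordering inside `W`, the zero initial states, and the random parameters are assumptions chosen for brevity; a real implementation would use a deep learning framework.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_pass(X, W, b, d):
    """Run one LSTM direction over the columns of X (d_in, T); returns (d, T)."""
    T = X.shape[1]
    h, c = np.zeros(d), np.zeros(d)
    out = np.zeros((d, T))
    for t in range(T):
        z = W @ np.concatenate([X[:, t], h]) + b   # all four gates at once, (4d,)
        i, f, o = sigmoid(z[:d]), sigmoid(z[d:2 * d]), sigmoid(z[2 * d:3 * d])
        g = np.tanh(z[3 * d:])
        c = f * c + i * g          # update the cell state
        h = o * np.tanh(c)         # expose the hidden state
        out[:, t] = h
    return out

def bilstm(X, fwd_params, bwd_params, d):
    """Concatenate a left-to-right and a right-to-left pass: (d_in, T) -> (2d, T)."""
    W_f, b_f = fwd_params
    W_b, b_b = bwd_params
    forward = lstm_pass(X, W_f, b_f, d)
    backward = lstm_pass(X[:, ::-1], W_b, b_b, d)[:, ::-1]  # reverse in, reverse back
    return np.concatenate([forward, backward], axis=0)

rng = np.random.default_rng(0)
d, T = 4, 6
X = rng.normal(size=(d, T))
fwd = (rng.normal(size=(4 * d, 2 * d)), np.zeros(4 * d))
bwd = (rng.normal(size=(4 * d, 2 * d)), np.zeros(4 * d))
H = bilstm(X, fwd, bwd, d)
print(H.shape)
```

The first $d$ rows of each column only see words to the left, the last $d$ rows only see words to the right, and together they give each word a view of the whole sentence.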

Attention Flow Layer

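First, compute the similarity between every pair of context word and query word:

$S_{tj} = \alpha(\mathbf{H}_{:t}, \mathbf{U}_{:j}) \in \mathbb{R}$, where $\alpha$ is a trainable scalar function. In this paper, $\alpha(\mathbf{h}, \mathbf{u}) = \mathbf{w}_{(S)}^\top [\mathbf{h}; \mathbf{u}; \mathbf{h} \odot \mathbf{u}]$, where $\mathbf{w}_{(S)} \in \mathbb{R}^{6d}$ is a trainable weight vector, $[;]$ denotes concatenation, and $\odot$ is elementwise multiplication.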
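The similarity matrix can be sketched directly from the formula above. The explicit double loop is chosen for readability rather than speed, and the function name `similarity_matrix`, the random inputs, and the toy sizes are illustrative assumptions.

```python
import numpy as np

def similarity_matrix(H, U, w_s):
    """H: (2d, T) context, U: (2d, J) query, w_s: (6d,) weights -> S: (T, J)."""
    T, J = H.shape[1], U.shape[1]
    S = np.empty((T, J))
    for t in range(T):
        for j in range(J):
            h, u = H[:, t], U[:, j]
            # alpha(h, u) = w_s^T [h; u; h * u]
            S[t, j] = w_s @ np.concatenate([h, u, h * u])
    return S

rng = np.random.default_rng(0)
two_d, T, J = 4, 5, 3                 # toy sizes: 2d = 4
H = rng.normal(size=(two_d, T))
U = rng.normal(size=(two_d, J))
w_s = rng.normal(size=3 * two_d)      # 6d weights, random here
S = similarity_matrix(H, U, w_s)
print(S.shape)
```

If the first two thirds of `w_s` are zero and the last third is all ones, the score reduces to a plain dot product between the two word vectors, which is a handy sanity check.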

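Next, compute the context-to-query (C2Q) attention. For each context word, the attention weights are used to form a weighted sum of the query word vectors, which serves as that word's query-aware representation:

$\mathbf{a}_t = \text{softmax}(\mathbf{S}_{t:}) \in \mathbb{R}^J$

$\tilde{\mathbf{U}}_{:t} = \sum_j \mathbf{a}_{tj} \mathbf{U}_{:j} \in \mathbb{R}^{2d}$, so that $\tilde{\mathbf{U}} \in \mathbb{R}^{2d \times T}$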
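A small sketch of the context-to-query step, assuming the similarity matrix `S` has shape (T, J) and the query matrix `U` has shape (2d, J). The random inputs and the helper names `softmax` and `context_to_query` are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis):
    shifted = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def context_to_query(S, U):
    """S: (T, J) similarities, U: (2d, J) query -> attended query, (2d, T)."""
    A = softmax(S, axis=1)   # row t holds a_t, a distribution over query words
    return U @ A.T           # column t is sum_j a_tj * U[:, j]

rng = np.random.default_rng(0)
two_d, T, J = 4, 5, 3
S = rng.normal(size=(T, J))
U = rng.normal(size=(two_d, J))
U_tilde = context_to_query(S, U)
print(U_tilde.shape)
```

There is one attended query vector per context word, which is why the result has $T$ columns rather than $J$.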

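Then compute the query-to-context (Q2C) attention. For each context word, take its maximum similarity over all query words; this value is treated as the similarity between the query and that context word. A softmax over these maxima gives attention weights over the context words, which are then used to form a weighted sum of the context word vectors:

$\mathbf{b} = \text{softmax}(\max_{col}(\mathbf{S})) \in \mathbb{R}^T$

$\tilde{\mathbf{h}} = \sum_t \mathbf{b}_t \mathbf{H}_{:t} \in \mathbb{R}^{2d}$

$\tilde{\mathbf{h}}$ is tiled $T$ times across the columns to obtain $\tilde{\mathbf{H}} \in \mathbb{R}^{2d \times T}$.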
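The query-to-context step can be sketched the same way, again assuming `S` has shape (T, J) and the context matrix `H` has shape (2d, T). The names `softmax` and `query_to_context` and the random inputs are assumptions for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # subtract max for numerical stability
    return e / e.sum()

def query_to_context(S, H):
    """S: (T, J) similarities, H: (2d, T) context -> tiled summary, (2d, T)."""
    T = S.shape[0]
    b = softmax(S.max(axis=1))    # (T,): max over query words, then softmax over context
    h_tilde = H @ b               # (2d,): weighted sum of context vectors
    return np.tile(h_tilde[:, None], (1, T))   # repeat the same vector T times

rng = np.random.default_rng(0)
two_d, T, J = 4, 5, 3
S = rng.normal(size=(T, J))
H = rng.normal(size=(two_d, T))
H_tilde = query_to_context(S, H)
print(H_tilde.shape)
```

Every column of the result is identical: the step yields a single context summary, and tiling only makes its shape line up with $\mathbf{H}$ and $\tilde{\mathbf{U}}$.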

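Finally, the context representations are combined into the final query-aware representation:

$\mathbf{G}_{:t} = \beta(\mathbf{H}_{:t}, \tilde{\mathbf{U}}_{:t}, \tilde{\mathbf{H}}_{:t})$, where $\beta$ may be any trainable function. In this paper it is a simple concatenation, $\beta(\mathbf{h}, \tilde{\mathbf{u}}, \tilde{\mathbf{h}}) = [\mathbf{h}; \tilde{\mathbf{u}}; \mathbf{h} \odot \tilde{\mathbf{u}}; \mathbf{h} \odot \tilde{\mathbf{h}}] \in \mathbb{R}^{8d}$, so that $\mathbf{G} \in \mathbb{R}^{8d \times T}$.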
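With the concatenation form of $\beta$, the merge is a one-liner. The sketch below uses random stand-ins for the three inputs, each of shape (2d, T); the function name `merge` is hypothetical.

```python
import numpy as np

def merge(H, U_tilde, H_tilde):
    """Stack [h; u~; h * u~; h * h~] for every context position: (2d, T) x 3 -> (8d, T)."""
    return np.concatenate([H, U_tilde, H * U_tilde, H * H_tilde], axis=0)

rng = np.random.default_rng(0)
two_d, T = 4, 5
H = rng.normal(size=(two_d, T))         # contextual embeddings of the context
U_tilde = rng.normal(size=(two_d, T))   # stand-in for the C2Q output
H_tilde = rng.normal(size=(two_d, T))   # stand-in for the tiled Q2C output
G = merge(H, U_tilde, H_tilde)
print(G.shape)
```

Because the original $\mathbf{H}$ is kept as the first block of rows, no context information is discarded at this stage.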

Modeling Layer

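$\mathbf{G}$ is passed through a BiLSTM to obtain $\mathbf{M}$. The structure is the same as that of the Contextual Embedding Layer, but $\mathbf{M}$ captures not only the interactions among context words but also information from the query.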

Output Layer

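$\mathbf{p}^1 = \text{softmax}(\mathbf{w}_{(p^1)}^\top [\mathbf{G}; \mathbf{M}])$

$\mathbf{p}^2 = \text{softmax}(\mathbf{w}_{(p^2)}^\top [\mathbf{G}; \mathbf{M}^2])$, where $\mathbf{M}^2$ is obtained by passing $\mathbf{M}$ through another BiLSTM. $\mathbf{p}^1$ and $\mathbf{p}^2$ are the probability distributions over the start and end positions of the answer span.

$L(\theta) = -\frac{1}{N} \sum_i^N \left[ \log(\mathbf{p}^1_{y_i^1}) + \log(\mathbf{p}^2_{y_i^2}) \right]$, where $y_i^1$ and $y_i^2$ are the true start and end indices of the $i$-th example.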
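The output layer and the loss can be sketched as follows. The matrices `G`, `M`, and `M2` are random stand-ins with the shapes used above, the weight vectors are random rather than learned, and the names `boundary_distribution` and `span_loss` are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # subtract max for numerical stability
    return e / e.sum()

def boundary_distribution(w, G, M):
    """w: (10d,), G: (8d, T), M: (2d, T) -> distribution over the T positions."""
    return softmax(w @ np.concatenate([G, M], axis=0))

def span_loss(p1_list, p2_list, starts, ends):
    """Mean negative log-likelihood of the true start and end indices."""
    total = 0.0
    for p1, p2, y1, y2 in zip(p1_list, p2_list, starts, ends):
        total += np.log(p1[y1]) + np.log(p2[y2])
    return -total / len(starts)

rng = np.random.default_rng(0)
d, T = 2, 7
G = rng.normal(size=(8 * d, T))
M = rng.normal(size=(2 * d, T))
M2 = rng.normal(size=(2 * d, T))     # stand-in for the output of the extra BiLSTM
p1 = boundary_distribution(rng.normal(size=10 * d), G, M)
p2 = boundary_distribution(rng.normal(size=10 * d), G, M2)
loss = span_loss([p1], [p2], starts=[2], ends=[4])
print(loss)
```

As a sanity check, if both distributions are uniform over $T$ positions, the loss equals $2 \log T$ regardless of where the true span lies.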

Experiment and Result

At the time of publication, the model achieved state-of-the-art results on both SQuAD and the cloze-style CNN/Daily Mail datasets.

Visualization and Discussion

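To show the effect of hierarchical encoding, the authors analyze word-to-word similarity at the Word Embedding layer and at the Contextual Embedding layer. For example, at the Word Embedding layer the words most similar to When are when, after, before, and the like, whereas at the Contextual Embedding layer many year numbers appear among its most similar words. This indicates that contextual embedding integrates information from the surrounding context and gives the subsequent Attention Flow layer a better representation to work with.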

At the word embedding layer, query words such as When, Where and Who are not well aligned to possible answers in the context, but this dramatically changes in the contextual embedding layer which has access to context from surrounding words and is just 1 layer below the attention layer.

The authors further give the example of may and May: at the Word Embedding layer the two are relatively close, while at the Contextual Embedding layer they are far apart.

An interesting pattern emerges in the Word space, where May is separated from the rest of the months because May has multiple meanings in the English language. The contextual embedding layer uses contextual cues from surrounding words and is able to separate the usages of the word May.

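The authors also visualize the effect of Attention Flow. In the first example, the words that match where best are location nouns; in the second, the words that match many best are quantity words. In addition, entity nouns are best matched by themselves. All of this provides rich features for locating the answer in the later layers.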