Attention Series Reading Notes, Part 1 (Attention NMT)

Paper title

NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE

Link

https://arxiv.org/pdf/1409.0473.pdf

0. Motivation

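The problem this paper addresses: earlier machine translation relied on hard, one-to-one segmentation and alignment, whereas this paper lets the model automatically learn soft correspondences between words.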

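1. Notes on the Introduction

The paper points out the problem with the traditional encoder-decoder architecture: compressing the source sentence into a fixed-length vector makes very long inputs hard to handle, and I think information is also lost along the way. The paper also notes that performance degrades as sentences get longer.

The paper therefore proposes that, each time a word is decoded, the model searches the source sentence for the set of words relevant to it. The benefit is that the source sentence no longer has to be squeezed into one fixed-length vector, and decoding becomes a dynamic process, so long sentences are handled better.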

2. Notes on the Background

2.1 RNN Encoder-Decoder architecture

The commonly used architecture

Encoder:

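$h_t = f(x_t, h_{t-1}) \quad\quad (1)$

$c = q(\{h_1, \dots, h_{T_x}\})$

where $x_t$ is the word vector at time step $t$, $h_t$ is the hidden state at time step $t$, and $f$ and $q$ are generally nonlinear functions.

Decoder:

Predict the $t$-th word from the previous $t-1$ words and the context vector $c$.

$p(y) = \prod_{t=1}^{T_y} p(y_t \mid \{y_1, \dots, y_{t-1}\}, c) \quad (2)$

where $y = (y_1, \dots, y_{T_y})$.

The conditional probability is modeled with an RNN:

$p(y_t \mid \{y_1, \dots, y_{t-1}\}, c) = g(y_{t-1}, s_t, c) \quad (3)$

where $g$ is a nonlinear function and $s_t$ is the hidden state of the RNN.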

3. Learning to align and translate (the key part of the model)

Encoder:

Bidirectional RNN

Decoder:

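a decoder that emulates searching through a source sentence during decoding a translation. (This is hard to translate well, so I pasted the original.) In essence, the decoder dynamically searches the source sentence as it decodes.

The process of generating $y_t$ given X

3.1 Decoder walkthrough

Each conditional probability in Eq. (2) becomes

$p(y_i \mid y_1, \dots, y_{i-1}, x) = g(y_{i-1}, s_i, c_i) \quad (4)$

where $s_i$ is the RNN hidden state at time step $i$. (To stay consistent with the paper, $i$ is used here; it means the same thing as $t$ above.)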

$s_i = f(s_{i-1},y_{i-1},c_i)$

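Note: the difference between Eq. (3) and Eq. (4) is that one uses $c$ and the other uses $c_i$; with $c_i$, each $y_i$ gets its own dedicated context.

$c_i$ is a weighted sum of the $h_j$:

$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j \quad (5)$

The weight $\alpha_{ij}$ can be read as the probability that target word $y_i$ comes from source word $x_j$, so $c_i$ is effectively an expectation:

$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})} \quad (6)$

where $e_{ij}$ measures how well input position $j$ matches output position $i$:

$e_{ij} = a(s_{i-1}, h_j)$

The meanings of $s_{i-1}$ and $h_j$ are as shown in the figure.

The alignment model $a$ here is simply a feedforward NN, so during training it can be updated directly with the gradients from backpropagation.

This weighting is effectively an attention operation.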
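To make Eq. (5) and Eq. (6) concrete, here is a minimal pure-Python sketch. The helper names `softmax` and `context_vector` and the toy numbers are my own assumptions for illustration, not anything from the paper or the reference code.

```python
import math

def softmax(scores):
    """Normalize raw alignment scores e_ij into weights alpha_ij (Eq. 6)."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(weights, annotations):
    """Weighted sum of the annotations h_j (Eq. 5)."""
    dim = len(annotations[0])
    return [sum(w * h[d] for w, h in zip(weights, annotations)) for d in range(dim)]

# Toy values (assumed): three source positions, 2-dimensional annotations.
scores = [2.0, 1.0, 0.1]
annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

alpha = softmax(scores)                 # weights sum to 1
c = context_vector(alpha, annotations)  # same dimension as one annotation
print(alpha, c)
```

The position with the largest score gets the largest weight, so its annotation contributes most to $c_i$.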

3.2 Encoder walkthrough

Bidirectional RNN

forward hidden states $(\overrightarrow{h}_1, \dots, \overrightarrow{h}_{T_x})$

backward hidden states $(\overleftarrow{h}_1, \dots, \overleftarrow{h}_{T_x})$

Concatenating them gives $h_j = [\overrightarrow{h}_j^\top; \overleftarrow{h}_j^\top]^\top$
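The concatenation can be sketched with a toy scalar RNN. This is only an illustration under my own assumptions: `run_rnn`, its weights `w` and `u`, and the input sequence are made up, and for brevity the same toy weights are used in both directions, whereas a real BiRNN has separate parameters per direction.

```python
import math

def run_rnn(inputs, w=0.5, u=0.5):
    """Toy scalar RNN: h_t = tanh(w * x_t + u * h_{t-1}), with h_0 = 0."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w * x + u * h)
        states.append(h)
    return states

def annotations(inputs):
    """Pair the forward and backward state at each source position j."""
    forward = run_rnn(inputs)
    # Run over the reversed sequence, then flip back so index j lines up.
    backward = run_rnn(inputs[::-1])[::-1]
    return [[f, b] for f, b in zip(forward, backward)]

xs = [1.0, -1.0, 0.5]  # assumed toy input sequence
hs = annotations(xs)   # one [forward, backward] pair per position
```

Each `hs[j]` therefore summarizes both the words before position j and the words after it.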

4. Experiments

Dataset: ACL WMT '14

Preprocessing: a usual tokenization, no other special preprocessing, such as lowercasing or stemming

Models: two maximum sentence lengths (30 and 50) and two network structures: RNNsearch, proposed in the paper, and the baseline RNNencdec

Architecture

RNNsearch: 1000 hidden units

RNNencdec: 1000 hidden units

In both cases, we use a multilayer network with a single maxout (Goodfellow et al., 2013) hidden layer to compute the conditional probability of each target word (Pascanu et al., 2014)

Optimization: SGD + Adadelta (minibatch = 80)

Once a model is trained, we use a beam search to find a translation that approximately maximizes the conditional probability (see, e.g., Graves, 2012; Boulanger-Lewandowski et al., 2013). Sutskever et al. (2014) used this approach to generate translations from their neural machine translation model
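As a rough illustration of the beam search mentioned in the quote above, here is a minimal sketch over a hypothetical toy model. The names `beam_search` and `toy_model`, the `</s>` end token, and the probability table are all assumptions for illustration; the paper's actual decoding setup (beam width, any length handling) is not reproduced here.

```python
import math

def beam_search(next_log_probs, beam_size, max_len, eos="</s>"):
    """Return the best (tokens, log_prob) hypothesis found by beam search."""
    beams = [([], 0.0)]  # unfinished hypotheses with their log-probabilities
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, lp in next_log_probs(tokens).items():
                candidates.append((tokens + [tok], score + lp))
        # Keep only the beam_size highest-scoring extensions.
        candidates.sort(key=lambda cand: cand[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == eos:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    return max(finished + beams, key=lambda cand: cand[1])

def toy_model(prefix):
    """Hypothetical next-token log-probabilities, keyed by the prefix."""
    table = {
        (): {"a": math.log(0.6), "b": math.log(0.4)},
        ("a",): {"x": math.log(0.5), "y": math.log(0.5)},
        ("b",): {"z": math.log(0.9), "w": math.log(0.1)},
    }
    return table.get(tuple(prefix), {"</s>": 0.0})

greedy = beam_search(toy_model, beam_size=1, max_len=5)
wider = beam_search(toy_model, beam_size=2, max_len=5)
```

With a beam of 1 the search commits to "a" and ends with probability 0.3, while a beam of 2 keeps "b" alive and finds "b z" with probability 0.36. This is why the result is only an approximate maximum.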

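A Model architecture

Reference code

bentrevett/pytorch-seq2seq

This code is very thoroughly commented and well worth studying. All the code in this post comes from that repository.

A.1.1 Decoder (in the code, this part is simply replaced by the framework's built-in GRU)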

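State of the RNN: $s_i = f(s_{i-1}, y_{i-1}, c_i) = (1 - z_i) \circ s_{i-1} + z_i \circ \tilde{s}_i$

$z_i$ is the output of the update gates, and the proposed updated state is $\tilde{s}_i = \tanh(W e(y_{i-1}) + U[r_i \circ s_{i-1}] + C c_i)$

$e(y_{i-1})$ is the $m$-dimensional embedding of the word $y_{i-1}$,

and $r_i$ is the output of the reset gates.

Update gates let each hidden unit keep its value from the previous step: $z_i = \sigma(W_z e(y_{i-1}) + U_z s_{i-1} + C_z c_i)$

Reset gates control how much information from the previous step is kept: $r_i = \sigma(W_r e(y_{i-1}) + U_r s_{i-1} + C_r c_i)$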
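The gated update above can be traced with scalars. This is a sketch under assumptions: `gru_decoder_step`, the parameter dictionary, and all numeric values are hypothetical, and real models use vectors and matrices rather than single numbers.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_decoder_step(s_prev, e_y, c, p):
    """One scalar decoder update: s_i = (1 - z) * s_prev + z * s_tilde."""
    z = sigmoid(p["Wz"] * e_y + p["Uz"] * s_prev + p["Cz"] * c)  # update gate
    r = sigmoid(p["Wr"] * e_y + p["Ur"] * s_prev + p["Cr"] * c)  # reset gate
    # Candidate state: the reset gate scales how much of s_prev is used.
    s_tilde = math.tanh(p["W"] * e_y + p["U"] * (r * s_prev) + p["C"] * c)
    return (1.0 - z) * s_prev + z * s_tilde, z, r

# Assumed toy parameters: every weight set to 0.5.
params = {k: 0.5 for k in ("W", "U", "C", "Wz", "Uz", "Cz", "Wr", "Ur", "Cr")}
s_new, z, r = gru_decoder_step(s_prev=0.2, e_y=1.0, c=-0.3, p=params)
```

When `z` is close to 0 the new state stays near `s_prev`; when it is close to 1 the state is replaced by the candidate `s_tilde`.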

At each step of the decoder, we compute the output probability (Eq. (4)) as a multi-layered function (Pascanu et al., 2014). We use a single hidden layer of maxout units (Goodfellow et al., 2013) and normalize the output probabilities (one for each word) with a softmax function (see Eq. (6)).

A.1.2 Alignment Model

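For a sentence pair of lengths $T_x$ and $T_y$, the alignment model has to be evaluated $T_x \times T_y$ times; to reduce computation, a single-layer MLP is used.

$a(s_{i-1}, h_j) = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)$

where $W_a \in \mathbb{R}^{n \times n}, U_a \in \mathbb{R}^{n \times 2n}, v_a \in \mathbb{R}^n$ are all weights. $U_a h_j$ does not depend on $i$, so it can be computed in advance to reduce the amount of computation.

Note how this part is implemented in the code; in vectorized form it is

$E_t = \tanh(\mathrm{attn}(s_{t-1}, H))$

$\hat{a}_t = v E_t$

$a_t = \mathrm{softmax}(\hat{a}_t)$

Here attn is a linear layer.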
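A small pure-Python sketch of the additive score may help here. The helpers `matvec`, `precompute_keys`, and `additive_scores`, and the toy matrices with $n = 2$, are my own assumptions. Note that applying one linear layer to the concatenation $[s_{t-1}; h_j]$ computes the same thing as $W_a s_{t-1} + U_a h_j$ (plus a bias), just with the two matrices stacked side by side.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def precompute_keys(H, U_a):
    """U_a h_j does not depend on the decoder step, so compute it once."""
    return [matvec(U_a, h) for h in H]

def additive_scores(s_prev, keys, W_a, v_a):
    """e_j = v_a^T tanh(W_a s_prev + U_a h_j) for every source position j."""
    Ws = matvec(W_a, s_prev)  # shared across all j
    return [sum(v * math.tanh(a + b) for v, a, b in zip(v_a, Ws, key))
            for key in keys]

# Assumed toy sizes: n = 2, so each annotation has 2n = 4 entries.
W_a = [[1.0, 0.0], [0.0, 1.0]]
U_a = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
v_a = [1.0, 1.0]
H = [[0.0, 0.0, 0.0, 0.0], [1.0, -1.0, 0.0, 0.0]]

keys = precompute_keys(H, U_a)  # reused at every decoder step
scores = additive_scores([0.5, 0.0], keys, W_a, v_a)
```

The scores are unnormalized; a softmax over them gives the attention weights.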

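A.2 Model details

A.2.1 Encoder (in the code, these formulas are simply replaced by the framework's built-in GRU)

Input: $x = (x_1, \dots, x_{T_x}), x_i \in \mathbb{R}^{K_x}$

Output: $y = (y_1, \dots, y_{T_y}), y_i \in \mathbb{R}^{K_y}$

where $K_x$ and $K_y$ are the vocabulary sizes of the source and target languages, and $T_x$ and $T_y$ are the lengths of the source and target sentences.

First, the forward states of the BiRNN are computed: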

$\overrightarrow{h}_i = \begin{cases} (1 - \overrightarrow{z}_i) \circ \overrightarrow{h}_{i-1} + \overrightarrow{z}_i \circ \overrightarrow{\underline{h}}_i, & \text{if } i > 0, \\ 0, & \text{if } i = 0. \end{cases}$

where

$\overrightarrow{\underline{h}}_i = \tanh(\overrightarrow{W} \overline{E}(x_i) + \overrightarrow{U}[\overrightarrow{r}_i \circ \overrightarrow{h}_{i-1}])$

$\overrightarrow{z}_i = \sigma(\overrightarrow{W}_z \overline{E}(x_i) + \overrightarrow{U}_z \overrightarrow{h}_{i-1})$

$\overrightarrow{r}_i = \sigma(\overrightarrow{W}_r \overline{E}(x_i) + \overrightarrow{U}_r \overrightarrow{h}_{i-1})$

Notation:

$\overline{E} \in \mathbb{R}^{m \times K_x}$ is the word embedding matrix

$\overrightarrow{W}, \overrightarrow{W}_z, \overrightarrow{W}_r \in \mathbb{R}^{n \times m}$ and $\overrightarrow{U}, \overrightarrow{U}_z, \overrightarrow{U}_r \in \mathbb{R}^{n \times n}$ are all weight matrices

$m$ is the word embedding dimensionality

$n$ is the number of hidden units

    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class Encoder(nn.Module):
        def __init__(self, input_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout):
            super().__init__()

            self.input_dim = input_dim
            self.emb_dim = emb_dim
            self.enc_hid_dim = enc_hid_dim
            self.dec_hid_dim = dec_hid_dim
            self.dropout = dropout

            self.embedding = nn.Embedding(input_dim, emb_dim)

            self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional=True)

            self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)

            self.dropout = nn.Dropout(dropout)

        def forward(self, src):

            #src = [src sent len, batch size]

            embedded = self.dropout(self.embedding(src))

            #embedded = [src sent len, batch size, emb dim]

            outputs, hidden = self.rnn(embedded)

            #outputs = [sent len, batch size, hid dim * num directions]
            #hidden = [n layers * num directions, batch size, hid dim]

            #hidden is stacked [forward_1, backward_1, forward_2, backward_2, ...]
            #outputs are always from the last layer

            #hidden [-2, :, : ] is the last of the forwards RNN
            #hidden [-1, :, : ] is the last of the backwards RNN

            #initial decoder hidden is final hidden state of the forwards and backwards encoder RNNs fed through a linear layer
            # concatenate the two final hidden states, pass them through one fully connected layer, then apply tanh; this step differs from the paper
            # the paper's version is the formula for s_0 given below (A.2.2)
            hidden = torch.tanh(self.fc(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1)))

            #outputs = [sent len, batch size, enc hid dim * 2]
            #hidden = [batch size, dec hid dim]

            return outputs, hidden

A.2.2 Decoder

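Similar to A.1.1

Below are the parts not covered above

Initial hidden state:

$s_0 = \tanh(W_s \overleftarrow{h}_1)$

$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$

where

$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}$

$e_{ij} = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)$

To unpack how the weights are computed: for each $i$, we loop over all $j$ to compute the weights.

Notation:

$v_a \in \mathbb{R}^{n}, W_a \in \mathbb{R}^{n \times n}, U_a \in \mathbb{R}^{n \times 2n}$ are all weights.

If $c_i$ is fixed to $\overrightarrow{h}_{T_x}$, the model reduces to the classic RNN Encoder-Decoder.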

    class Attention(nn.Module):
        def __init__(self, enc_hid_dim, dec_hid_dim):
            super().__init__()

            self.enc_hid_dim = enc_hid_dim
            self.dec_hid_dim = dec_hid_dim

            self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim)
            self.v = nn.Parameter(torch.rand(dec_hid_dim))

        def forward(self, hidden, encoder_outputs):

            #hidden = [batch size, dec hid dim]
            #encoder_outputs = [src sent len, batch size, enc hid dim * 2]

            batch_size = encoder_outputs.shape[1]
            src_len = encoder_outputs.shape[0]

            #repeat the previous decoder hidden state src_len times
            hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)

            encoder_outputs = encoder_outputs.permute(1, 0, 2)

            #hidden = [batch size, src sent len, dec hid dim]
            #encoder_outputs = [batch size, src sent len, enc hid dim * 2]
            #compute the tanh(...) part of e_ij
            energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim=2)))

            #energy = [batch size, src sent len, dec hid dim]

            energy = energy.permute(0, 2, 1)

            #energy = [batch size, dec hid dim, src sent len]

            #v = [dec hid dim]

            v = self.v.repeat(batch_size, 1).unsqueeze(1)

            #v = [batch size, 1, dec hid dim]
            #multiplying by v reduces each energy vector to a scalar, giving a score vector of length T_x
            attention = torch.bmm(v, energy).squeeze(1)

            #attention= [batch size, src len]
            #softmax turns the scores into weights
            return F.softmax(attention, dim=1)

$p(y_i \mid s_i, y_{i-1}, c_i) \propto \exp(y_i^\top W_o t_i)$

where:

$t_i = [\max\{\tilde{t}_{i,2j-1}, \tilde{t}_{i,2j}\}]_{j=1,\dots,l}^\top$

$\tilde{t}_{i,k}$ is the $k$-th element of the vector $\tilde{t}_i$,

$\tilde{t}_i = U_o s_{i-1} + V_o E y_{i-1} + C_o c_i$

Notation:

$W_o \in \mathbb{R}^{K_y \times l}, U_o \in \mathbb{R}^{2l \times n}, V_o \in \mathbb{R}^{2l \times m}, C_o \in \mathbb{R}^{2l \times 2n}$ are all weight matrices.

This can be understood as having a deep output (Pascanu et al., 2014) with a single maxout hidden layer.
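The maxout step that turns $\tilde{t}_i$ (length $2l$) into $t_i$ (length $l$) is just a pairwise max. A minimal sketch, where the function name `maxout` and the toy values are assumptions for illustration:

```python
def maxout(t_tilde):
    """Pairwise max over consecutive entries: 2l values in, l values out."""
    if len(t_tilde) % 2 != 0:
        raise ValueError("maxout expects an even number of inputs")
    return [max(t_tilde[k], t_tilde[k + 1]) for k in range(0, len(t_tilde), 2)]

# Assumed toy pre-activations with l = 3, so t_tilde has 6 entries.
t_tilde = [0.3, -1.2, 2.0, 2.5, -0.7, -0.1]
t = maxout(t_tilde)  # one value kept per pair
```

With 0-based indexing, the pair (2j-1, 2j) from the formula corresponds to entries `k` and `k + 1` for even `k`. Note that the reference code does not use this maxout layer; its output is a single linear layer.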

    class Decoder(nn.Module):
        def __init__(self, output_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout, attention):
            super().__init__()

            self.emb_dim = emb_dim
            self.enc_hid_dim = enc_hid_dim
            self.dec_hid_dim = dec_hid_dim
            self.output_dim = output_dim
            self.dropout = dropout
            self.attention = attention

            self.embedding = nn.Embedding(output_dim, emb_dim)

            self.rnn = nn.GRU((enc_hid_dim * 2) + emb_dim, dec_hid_dim)

            self.out = nn.Linear((enc_hid_dim * 2) + dec_hid_dim + emb_dim, output_dim)

            self.dropout = nn.Dropout(dropout)

        def forward(self, input, hidden, encoder_outputs):

            #input = [batch size]
            #hidden = [batch size, dec hid dim]
            #encoder_outputs = [src sent len, batch size, enc hid dim * 2]

            input = input.unsqueeze(0)

            #input = [1, batch size]

            embedded = self.dropout(self.embedding(input))

            #embedded = [1, batch size, emb dim]
            #get the attention weights, written alpha in the paper
            a = self.attention(hidden, encoder_outputs)

            #a = [batch size, src len]

            a = a.unsqueeze(1)

            #a = [batch size, 1, src len]
            #encoder outputs, written h in the paper
            encoder_outputs = encoder_outputs.permute(1, 0, 2)

            #encoder_outputs = [batch size, src sent len, enc hid dim * 2]
            #matrix version of Eq. (5)
            weighted = torch.bmm(a, encoder_outputs)

            #weighted = [batch size, 1, enc hid dim * 2]

            weighted = weighted.permute(1, 0, 2)

            #weighted = [1, batch size, enc hid dim * 2]

            rnn_input = torch.cat((embedded, weighted), dim=2)

            #rnn_input = [1, batch size, (enc hid dim * 2) + emb dim]
            #this is the decoder step shown in the figure: one pass through the RNN
            output, hidden = self.rnn(rnn_input, hidden.unsqueeze(0))

            #output = [sent len, batch size, dec hid dim * n directions]
            #hidden = [n layers * n directions, batch size, dec hid dim]

            #sent len, n layers and n directions will always be 1 in this decoder, therefore:
            #output = [1, batch size, dec hid dim]
            #hidden = [1, batch size, dec hid dim]
            #this also means that output == hidden
            assert (output == hidden).all()

            embedded = embedded.squeeze(0)
            output = output.squeeze(0)
            weighted = weighted.squeeze(0)
            #this is the step through the function f (the output layer, which plays the role of g in Eq. (4))
            output = self.out(torch.cat((output, weighted, embedded), dim=1))