Original post:

The Illustrated Transformer

jalammar.github.io

KIRA: The Transformer on Scratch Paper (草稿紙上的Transformer)

zhuanlan.zhihu.com

The Transformer – a model that uses attention to boost the speed with which sequence-to-sequence models can be trained.

A High-Level Look

Let's begin by looking at the model as a single black box. In a machine translation application, it would take a sentence in one language, and output its translation in another.

Popping open that black box, we see an encoding component, a decoding component, and connections between them.

The encoding component is a stack of encoders; the decoding component is a stack of the same number of decoders.

The encoders are all identical in structure (yet they do not share weights). Each one is broken down into two sub-layers:

The encoder's inputs first flow through a self-attention layer – a layer that helps the encoder look at other words in the input sentence as it encodes a specific word.

The outputs of the self-attention layer are fed to a feed-forward neural network. The exact same feed-forward network is independently applied to each position.

The decoder has both those layers, but between them is an attention layer that helps the decoder focus on relevant parts of the input sentence.

Bringing The Tensors In

Now that we've seen the major components of the model, let's start to look at the various vectors/tensors and how they flow between these components to turn the input of a trained model into an output.

As is the case in NLP applications in general, we begin by turning each input word into a vector using an embedding algorithm.

Each word is embedded into a vector of size 512. We'll represent those vectors with these simple boxes.

The embedding only happens in the bottom-most encoder. The abstraction that is common to all the encoders is that they receive a list of vectors each of size 512 – in the bottom encoder that would be the word embeddings, but in other encoders, it would be the output of the encoder that's directly below. The size of this list is a hyperparameter we can set – basically it would be the length of the longest sentence in our training dataset.

After embedding the words in our input sequence, each of them flows through each of the two layers of the encoder.

Here we begin to see one key property of the Transformer, which is that the word in each position flows through its own path in the encoder. There are dependencies between these paths in the self-attention layer. The feed-forward layer does not have those dependencies, however, and thus the various paths can be executed in parallel while flowing through the feed-forward layer.
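To make the per-position independence concrete, here is a minimal numpy sketch of such a position-wise feed-forward layer. The two-layer ReLU form and the sizes (512-dimensional model, 2048-dimensional hidden layer) follow the original paper's defaults; the random weights and the three-word input are purely illustrative.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """Apply the same two-layer network to every position independently.

    x: (seq_len, d_model) -- one row per word position.
    The matrix products act on each row separately, so all positions
    can flow through this layer in parallel.
    """
    hidden = np.maximum(0, x @ W1 + b1)   # ReLU, shape (seq_len, d_ff)
    return hidden @ W2 + b2               # back to (seq_len, d_model)

# Toy shapes matching the paper's defaults: d_model = 512, d_ff = 2048.
d_model, d_ff, seq_len = 512, 2048, 3
rng = np.random.default_rng(0)
x  = rng.normal(size=(seq_len, d_model))
W1 = rng.normal(size=(d_model, d_ff)); b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)); b2 = np.zeros(d_model)
out = position_wise_ffn(x, W1, b1, W2, b2)   # (3, 512)
```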

Next, we'll switch up the example to a shorter sentence and we'll look at what happens in each sub-layer of the encoder.

Encoding

As we've mentioned already, an encoder receives a list of vectors as input. It processes this list by passing these vectors into a 'self-attention' layer, then into a feed-forward neural network, and then sends the output upwards to the next encoder.

As the model processes each word (each position in the input sequence), self-attention allows it to look at other positions in the input sequence for clues that can help lead to a better encoding for this word.

If you're familiar with RNNs, think of how maintaining a hidden state allows an RNN to incorporate its representation of previous words/vectors it has processed with the current one it's processing. Self-attention is the method the Transformer uses to bake the "understanding" of other relevant words into the one we're currently processing.

Self-Attention in Detail

  • The first step in calculating self-attention is to create three vectors from each of the encoder's input vectors (in this case, the embedding of each word). So for each word, we create a Query vector, a Key vector, and a Value vector. These vectors are created by multiplying the embedding by three matrices that we trained during the training process.

Notice that these new vectors are smaller in dimension than the embedding vector. Their dimensionality is 64, while the embedding and encoder input/output vectors have a dimensionality of 512. They don't have to be smaller; this is an architecture choice to make the computation of multi-headed attention (mostly) constant.

  • The second step in calculating self-attention is to calculate a score. Say we're calculating the self-attention for the first word in this example, "Thinking". We need to score each word of the input sentence against this word. The score determines how much focus to place on other parts of the input sentence as we encode a word at a certain position.

The score is calculated by taking the dot product of the query vector with the key vector of the respective word we're scoring. So if we're processing the self-attention for the word in position #1, the first score would be the dot product of q1 and k1. The second score would be the dot product of q1 and k2.

  • The third and fourth steps are to divide the scores by 8 (the square root of the dimension of the key vectors used in the paper – 64; this leads to more stable gradients, and other values are possible, but this is the default), then pass the result through a softmax operation. Softmax normalizes the scores so they're all positive and add up to 1.

This softmax score determines how much each word will be expressed at this position. Clearly the word at this position will have the highest softmax score, but sometimes it's useful to attend to another word that is relevant to the current word.

  • The fifth step is to multiply each value vector by the softmax score (in preparation to sum them up). The intuition here is to keep intact the values of the word(s) we want to focus on, and drown out irrelevant words (by multiplying them by tiny numbers like 0.001, for example).
  • The sixth step is to sum up the weighted value vectors. This produces the output of the self-attention layer at this position (for the first word).

That concludes the self-attention calculation. The resulting vector is one we can send along to the feed-forward neural network.
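The six steps above fit in a few lines of numpy. This is a minimal single-head sketch with made-up weights; the dimensions follow the text (512-dimensional inputs, 64-dimensional queries/keys/values, so the divisor is sqrt(64) = 8).

```python
import numpy as np

def self_attention(X, WQ, WK, WV):
    """Scaled dot-product self-attention for a single head.

    X: (seq_len, 512) input embeddings (one row per word).
    WQ, WK, WV: (512, 64) trained projection matrices.
    """
    Q, K, V = X @ WQ, X @ WK, X @ WV               # step 1: query/key/value vectors
    scores = Q @ K.T                                # step 2: q_i . k_j for every pair
    scores = scores / np.sqrt(K.shape[-1])          # step 3: divide by sqrt(d_k) = 8
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # step 4: softmax per row
    return weights @ V                              # steps 5-6: weighted sum of values

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 512))                       # e.g. "Thinking", "Machines"
WQ, WK, WV = (rng.normal(size=(512, 64)) for _ in range(3))
Z = self_attention(X, WQ, WK, WV)                   # (2, 64): one output row per word
```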

The Beast with Many Heads

The paper further refined the self-attention layer by adding a mechanism called "multi-headed" attention. This improves the performance of the attention layer in two ways:

  1. It expands the model's ability to focus on different positions. Yes, in the example above, z1 contains a little bit of every other encoding, but it could be dominated by the actual word itself. If we're translating a sentence like "The animal didn't cross the street because it was too tired", it would be useful to know which word "it" refers to.
  2. It gives the attention layer multiple "representation subspaces". As we'll see next, with multi-headed attention we have not just one but multiple sets of Query/Key/Value weight matrices (the Transformer uses eight attention heads, so we end up with eight sets for each encoder/decoder). Each of these sets is randomly initialized. Then, after training, each set is used to project the input embeddings (or vectors from lower encoders/decoders) into a different representation subspace.

If we do the same self-attention calculation we outlined above, just eight different times with different weight matrices, we end up with eight different Z matrices.

This leaves us with a bit of a challenge. The feed-forward layer is not expecting eight matrices – it's expecting a single matrix (a vector for each word). So we need a way to condense these eight down into a single matrix.

How do we do that? We concatenate the matrices and then multiply them by an additional weight matrix WO.
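A compact numpy sketch of that condensation step (eight heads and a 512-by-512 WO as in the text; the random weights and the two-word input are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads, d_k = 2, 512, 8, 64

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

X = rng.normal(size=(seq_len, d_model))
heads = []
for _ in range(n_heads):
    # Each head has its own Q/K/V projections and produces its own Z matrix.
    WQ, WK, WV = (rng.normal(size=(d_model, d_k)) for _ in range(3))
    Q, K, V = X @ WQ, X @ WK, X @ WV
    heads.append(softmax((Q @ K.T) / np.sqrt(d_k)) @ V)   # (seq_len, 64)

Z_concat = np.concatenate(heads, axis=-1)          # (seq_len, 512): eight Zs side by side
WO = rng.normal(size=(n_heads * d_k, d_model))     # 512 x 512 output projection
out = Z_concat @ WO                                # single (seq_len, 512) matrix for the FFN
```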

Representing the order of the sequence using positional encoding

To give the model a sense of word order, the transformer adds a vector to each input embedding. These vectors follow a specific pattern that the model learns, which helps it determine the position of each word, or the distance between different words in the sequence. The intuition here is that adding these values to the embeddings provides meaningful distances between the embedding vectors once they're projected into Q/K/V vectors and during dot-product attention.

In the traditional Encoder-Decoder framework, RNN-family networks (LSTM/GRU) serve as the main tools in the encoder and decoder, while an attention mechanism takes care of issues such as aligning the text on the two sides. This seemed to solve most sequence-to-sequence (seq2seq) problems fairly well, but RNNs have an inherent limitation: they are sequential. Taking a text sequence as an example, we must first compute the hidden state at time t before we can compute anything for the next time step, which makes the model rather inefficient on large-scale data and hard to parallelize. Against this background the Transformer emerged; just as the title of its paper says, it replaces the RNNs in both the encoder and decoder entirely with attention, so that the model becomes "Attention is all you need".

1. The general form of attention, and self-attention

  • The general form of attention

To describe the attention used in the various parts in a more unified way, we first abstract the many attention variants seen so far into one form. Recall how the attention mechanism in seq2seq, discussed earlier, works: for the word vector to be decoded, $E(x_i)$, we take the hidden state of the previous step, $s_{i-1}$, compare it with every encoder hidden state $h_j$ through some scoring function $a(\cdot)$ to obtain the weights $a_{ij} = a(s_{i-1}, h_j)$, and then assign these weights to the corresponding encoder hidden states $h_j$ to form the weighted sum $c_i = \sum_{j} a_{ij} h_j$.

In this process there are really three kinds of variables. The first is the one that has to be compared against a whole set of variables; we call it the query, written q. Each variable in that set, the one q is compared against, is a key, written k. The last kind is the variable that gets weighted by $a_{ij}$ at position j; we call it the value, written v. (Defined this way, keys and values are clearly in one-to-one correspondence: the weight obtained by comparing the query with the j-th key is assigned to the j-th value.) Stated abstractly this sounds opaque, but mapped onto the concrete process above it is simply $q = s_{i-1}$ and $k = v = h_j$, which makes it much more tangible.

With this abstraction, stacking the q, k, v vectors into matrices gives:

$Q = (q_1, q_2, \dots, q_m)^{T}$, $K = (k_1, k_2, \dots, k_n)^{T}$, $V = (v_1, v_2, \dots, v_n)^{T}$

The general form of attention can then be written as:

$\mathrm{Attention}(Q, K, V) = \big(a(q_1, K)V,\ a(q_2, K)V,\ \dots,\ a(q_m, K)V\big)$

Now for the most commonly used scoring functions. The previous column article used an additive scoring function: $a(q_i, k_j) = \mathrm{softmax}\big(u_a^{T}\tanh(W_q q_i + W_k k_j)\big)$

The Transformer paper instead adopts a multiplicative scoring function:

$a(q_i, k_j) = \mathrm{softmax}(q_i \cdot k_j)$, i.e., the softmax of the inner product of $q_i$ and $k_j$.

The formula can then be rewritten as:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(QK^{T})V$

The paper also applies a small scaling step. The reason is that raw inner products of q and k make the entries of $QK^{T}$ have a large variance, which pushes the softmax into a region with small gradients (I will try to explain this in [Note 2]) and hinders parameter updates. The formula is therefore scaled by dividing $QK^{T}$ by $\sqrt{d_k}$, where $d_k$ is the dimension of the q and k vectors. This gives the attention form stated in the paper:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\dfrac{QK^{T}}{\sqrt{d_k}}\right)V$

The attention operations that follow are all based on this formula.
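As an aside on why the divisor is $\sqrt{d_k}$ specifically, here is a standard back-of-the-envelope sketch (my own addition, under the simplifying assumption that the components of $q$ and $k$ are independent with zero mean and unit variance):

$q \cdot k = \sum_{t=1}^{d_k} q_t k_t, \qquad \mathbb{E}[\,q \cdot k\,] = 0, \qquad \operatorname{Var}(q \cdot k) = \sum_{t=1}^{d_k} \operatorname{Var}(q_t k_t) = d_k$

So the raw scores have standard deviation $\sqrt{d_k}$; dividing by $\sqrt{d_k}$ rescales them to unit variance and keeps the softmax out of its saturated, small-gradient region.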

  • Self-attention

The core idea of the Transformer is to replace the RNNs in the encoder and decoder with attention. The RNN's main job was to combine the word vector at the current time step with the previous hidden state to form the hidden state for this step; in other words, to turn word-vector information into higher-level information the model can use. How can attention alone accomplish this?

Figure 1: Illustration of self-attention

Take the example from the previous article again, "我" (I), "很" (very), "帥" (handsome). As shown in Figure 1, for the word "我" we use its word vector as the query q, and take the word vectors of every token in "我很帥", including the sentence start and end symbols, as the keys to compare against; the values are the same as the keys. Let h denote the vector produced by the attention operation; then, by the scaled dot-product formula above:

$h(\text{我}) = \mathrm{softmax}\!\left(\dfrac{E(\text{我})\,\big(E(\langle s\rangle),\ E(\text{我}),\ E(\text{很}),\ E(\text{帥}),\ E(\langle/s\rangle)\big)}{\sqrt{d_k}}\right)\begin{pmatrix} E(\langle s\rangle)\\ E(\text{我})\\ E(\text{很})\\ E(\text{帥})\\ E(\langle/s\rangle)\end{pmatrix}$

$= \mathrm{softmax}\!\left(\dfrac{E(\text{我})E(\langle s\rangle)}{\sqrt{d_k}}\right)E(\langle s\rangle) + \mathrm{softmax}\!\left(\dfrac{E(\text{我})E(\text{我})}{\sqrt{d_k}}\right)E(\text{我}) + \mathrm{softmax}\!\left(\dfrac{E(\text{我})E(\text{很})}{\sqrt{d_k}}\right)E(\text{很}) + \mathrm{softmax}\!\left(\dfrac{E(\text{我})E(\text{帥})}{\sqrt{d_k}}\right)E(\text{帥}) + \mathrm{softmax}\!\left(\dfrac{E(\text{我})E(\langle/s\rangle)}{\sqrt{d_k}}\right)E(\langle/s\rangle)$

(Here each softmax(...) term is shorthand for the corresponding component of the softmax taken over all five scaled scores.)

Combining the formula above with Figure 1, we can see that when the word "我" is turned into hidden-layer information, it is compared with every word in its own sentence to examine which positions its representation should pay more attention to; the resulting weights are assigned to the words at the corresponding positions and a weighted sum is taken, which is exactly the basic idea of attention. Because every word's attention operation is carried out within its own sentence, this mechanism is called self-attention.

The Transformer replaces the main parts of both the encoder and the decoder with this kind of self-attention. The advantage is obvious: all the vectors to be processed can be handled in parallel as matrix operations, speeding up what the encoder and decoder used to do. One problem remains, though. An RNN naturally preserves the order of the words, but after parallelizing with self-attention this information is hard to express, so the original Transformer paper introduces positional vectors to mark where each word sits in the sequence.

2. Positional vectors and multi-head attention

  • Positional vectors

In the original paper, the authors introduce a positional vector PE(pos) that is added to the input word vector before being passed to the next layer. This requires the positional vector to have the same dimension $d_{model}$ as the word vector. Its components are defined differently at even and odd positions, as follows:

For the even components: $PE(pos, 2i) = \sin\!\left(\dfrac{pos}{10000^{2i/d_{model}}}\right)$

For the odd components: $PE(pos, 2i+1) = \cos\!\left(\dfrac{pos}{10000^{2i/d_{model}}}\right)$

where pos is the position of the current word in the sentence (1, 2, 3, 4, ...) and i is the component index of the positional vector.

This may look puzzling at first, but it becomes more reasonable when viewed in terms of positional offsets. Suppose we want the positional vector at position pos shifted by k, i.e. $PE(pos+k)$. Taking the even components as an example:

$PE(pos+k, 2i) = \sin\!\left(\dfrac{pos+k}{10000^{2i/d_{model}}}\right)$

$= \sin\!\left(\dfrac{pos}{10000^{2i/d_{model}}}\right)\cos\!\left(\dfrac{k}{10000^{2i/d_{model}}}\right) + \cos\!\left(\dfrac{pos}{10000^{2i/d_{model}}}\right)\sin\!\left(\dfrac{k}{10000^{2i/d_{model}}}\right)$

$= PE(pos, 2i)\,PE(k, 2i+1) + PE(pos, 2i+1)\,PE(k, 2i)$

Thus the positional vector at an offset of k from position pos can be expressed as a linear combination of the positional vector at pos, so relative position information is captured to some extent.
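A minimal numpy sketch of these sinusoidal positional vectors (the sentence length of 50 is an arbitrary illustrative choice; the base of 10000 and the sin/cos alternation follow the formulas above):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE[pos, 2i]   = sin(pos / 10000^(2i/d_model)),
       PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)    # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even components
    pe[:, 1::2] = np.cos(angles)                       # odd components
    return pe

pe = positional_encoding(max_len=50, d_model=512)
# The positional vector is simply added to the word vector:
# x_input = word_embedding + pe[:seq_len]
```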

  • Multi-head attention

After introducing the positional vector PE(pos), we add it to the word vector and feed the sum into self-attention to build the abstract representation. In most cases a single self-attention operation struggles to capture the diversity of information in a sequence, so several identical self-attention operations are run in parallel to extract information for each word, and their results are then concatenated for the following layers. This resembles the idea of applying several convolution kernels to the same matrix in a convolutional neural network: both try to collect information from multiple perspectives.

  • Residual connections and Layer Normalization

Residual connections (short-cuts) are already widely used in ResNets for images; they were introduced to prevent parameter degradation as networks get deeper. The idea is to let low-level features skip over some layers and feed directly into higher layers, guaranteeing that in the worst case the network can still learn the low-level feature itself. Previously a layer computed a linear plus nonlinear transform of the layer-$l$ input $x^{(l)}$, namely $z^{(l+1)} = \sigma(Wx^{(l)} + b)$; with a residual connection, the lower-level feature (say $x^{(l-1)}$ from layer $l-1$) is fed directly into layer $l$'s computation, $z^{(l+1)} = \sigma(Wx^{(l)} + b + x^{(l-1)})$. Even if the parameters W and b degrade to zero, the low-level feature still reaches the higher layers, so information is not lost too heavily as it propagates through a deep network.

The other operation, Layer Normalization (LN), plays a role similar to Batch Normalization (BN). BN standardizes data across the batch dimension so that values fall in a region with moderate gradients; LN instead standardizes the values of all the neurons in a layer's output. The advantage is that it does not depend on the batch size and still helps speed up convergence to some degree. (My understanding of LN is still not very deep; I will update this part once I understand it better.)
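A small numpy sketch of the residual-plus-LayerNorm pattern described in this sub-section (the epsilon, the scalar gamma/beta, and the random sub-layer are my own illustrative choices, not the paper's exact parameterization):

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-6):
    """Standardize each position across its feature dimension (not across the batch)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def residual_block(x, sublayer):
    """Add the sub-layer's output to its input, then layer-normalize.

    Even if the sub-layer's parameters degrade toward zero, the input x
    still passes through unchanged -- the point of the short-cut."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 512))
W = rng.normal(size=(512, 512)) * 0.01       # stand-in for any sub-layer's weights
out = residual_block(x, lambda h: h @ W)     # (3, 512)
```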

  • The masking operation

Masking is described rather briefly in the original paper. The idea is that during decoding, the prediction at step i may only depend on the words before step i, to keep the decoding process properly ordered. Concretely, a mask vector with the same length as the sequence is constructed, with the components after position i set to 0 and the components before position i set to 1, marking which parts of the decoded sequence take part in the computation inside the network.
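A sketch of this decoder-side masking. The text describes a 0/1 mask vector; the common implementation shown here (my framing, not the paper's exact wording) instead replaces the scores at forbidden positions with a large negative number before the softmax, so their weights become effectively zero:

```python
import numpy as np

def causal_mask(seq_len):
    """mask[i, j] is True where position i may attend to position j (i.e. j <= i)."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_softmax(scores, mask):
    scores = np.where(mask, scores, -1e9)    # forbidden positions get ~zero weight
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.random.default_rng(0).normal(size=(4, 4))   # raw q.k / sqrt(d_k) scores
weights = masked_softmax(scores, causal_mask(4))
# Each row i now places zero weight on positions after i.
```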
