Why are BERT's word vectors dynamic, and how do they differ from word2vec?

How does BERT produce dynamic word vectors? How does it differ from word2vec, and how are the two related?

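Start with word2vec. Whether you use CBOW or skip-gram, all of its parameters live in two matrices, an input matrix and an output matrix (the server is down, so I can't post the figure). The trained word vectors are simply that matrix, and inference is just an embedding_lookup into it. The same word always gets the same vector, so the vectors are static.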
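A minimal sketch of that lookup, in NumPy rather than the real word2vec code. The vocabulary, the 8-dimensional size, and the random matrix standing in for trained weights are all made up for illustration.

```python
import numpy as np

# Hypothetical toy vocabulary; a random matrix stands in for the trained input matrix.
vocab = {"i": 0, "have": 1, "a": 2, "apple": 3, "pen": 4}
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), 8))  # [vocab_size, embedding_size]

def lookup(sentence):
    """Return one matrix row per token; the surrounding words are never consulted."""
    ids = [vocab[token] for token in sentence.lower().split()]
    return embedding_matrix[ids]

# The token "a" sits at index 2 in both sentences.
vec_in_apple_sentence = lookup("I have a apple")[2]
vec_in_pen_sentence = lookup("I have a pen")[2]
same_vector = np.array_equal(vec_in_apple_sentence, vec_in_pen_sentence)
print(same_vector)  # True
```

Since the function only indexes rows of a fixed matrix, nothing about the rest of the sentence can change the result.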

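BERT's inference is much more involved. The input is encoded by the transformer using context information (that is, the tokens interact through self-attention), so the same word gets a different embedding when its context differs. That is why the vectors are dynamic.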

Reading this alongside the BERT code gives an intuitive picture: https://github.com/google-research/bert/blob/master/modeling.py#L156

    with tf.variable_scope(scope, default_name="bert"):
      with tf.variable_scope("embeddings"):
        # Perform embedding lookup on the word ids.
        (self.embedding_output, self.embedding_table) = embedding_lookup(
            input_ids=input_ids,
            vocab_size=config.vocab_size,
            embedding_size=config.hidden_size,
            initializer_range=config.initializer_range,
            word_embedding_name="word_embeddings",
            use_one_hot_embeddings=use_one_hot_embeddings)

        # Add positional embeddings and token type embeddings, then layer
        # normalize and perform dropout.
        self.embedding_output = embedding_postprocessor(
            input_tensor=self.embedding_output,
            use_token_type=True,
            token_type_ids=token_type_ids,
            token_type_vocab_size=config.type_vocab_size,
            token_type_embedding_name="token_type_embeddings",
            use_position_embeddings=True,
            position_embedding_name="position_embeddings",
            initializer_range=config.initializer_range,
            max_position_embeddings=config.max_position_embeddings,
            dropout_prob=config.hidden_dropout_prob)

      with tf.variable_scope("encoder"):
        # This converts a 2D mask of shape [batch_size, seq_length] to a 3D
        # mask of shape [batch_size, seq_length, seq_length] which is used
        # for the attention scores.
        attention_mask = create_attention_mask_from_input_mask(
            input_ids, input_mask)

        # Run the stacked transformer.
        # `sequence_output` shape = [batch_size, seq_length, hidden_size].
        self.all_encoder_layers = transformer_model(
            input_tensor=self.embedding_output,
            attention_mask=attention_mask,
            hidden_size=config.hidden_size,
            num_hidden_layers=config.num_hidden_layers,
            num_attention_heads=config.num_attention_heads,
            intermediate_size=config.intermediate_size,
            intermediate_act_fn=get_activation(config.hidden_act),
            hidden_dropout_prob=config.hidden_dropout_prob,
            attention_probs_dropout_prob=config.attention_probs_dropout_prob,
            initializer_range=config.initializer_range,
            do_return_all_layers=True)

      self.sequence_output = self.all_encoder_layers[-1]
      # The "pooler" converts the encoded sequence tensor of shape
      # [batch_size, seq_length, hidden_size] to a tensor of shape
      # [batch_size, hidden_size]. This is necessary for segment-level
      # (or segment-pair-level) classification tasks where we need a fixed
      # dimensional representation of the segment.
      with tf.variable_scope("pooler"):
        # We "pool" the model by simply taking the hidden state corresponding
        # to the first token. We assume that this has been pre-trained
        first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
        self.pooled_output = tf.layers.dense(
            first_token_tensor,
            config.hidden_size,
            activation=tf.tanh,
            kernel_initializer=create_initializer(config.initializer_range))

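Suppose you have two sentences, "I have a apple" and "I have a pen", and you feed each of them to BERT as input. Do you think the embeddings for "a" in the two sentences will be exactly equal (note: exactly equal)? They will not. BERT is dynamic, which is why it handles polysemy well.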
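To make the "exactly equal" question concrete, here is a toy sketch, not BERT itself. It is a single self-attention step with no learned projections, no position embeddings, and a random table standing in for trained weights. All names and sizes are made up. It is only meant to show that mixing in context is enough to make the two "a" vectors diverge.

```python
import numpy as np

# Hypothetical vocabulary and random static input embeddings (stand-ins for trained weights).
vocab = {"i": 0, "have": 1, "a": 2, "apple": 3, "pen": 4}
rng = np.random.default_rng(0)
table = rng.normal(size=(len(vocab), 8))

def softmax(x):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def contextualize(sentence):
    """Return (static input vectors, context-mixed output vectors) for a sentence."""
    ids = [vocab[token] for token in sentence.lower().split()]
    x = table[ids]                             # [seq_len, dim], plain lookup
    scores = x @ x.T / np.sqrt(x.shape[-1])    # [seq_len, seq_len] dot-product similarities
    weights = softmax(scores)                  # each row sums to 1
    return x, weights @ x                      # outputs are weighted mixes of all tokens

in_apple, out_apple = contextualize("I have a apple")
in_pen, out_pen = contextualize("I have a pen")

# "a" is at index 2 in both sentences.
inputs_equal = np.array_equal(in_apple[2], in_pen[2])
outputs_equal = np.allclose(out_apple[2], out_pen[2])
print(inputs_equal, outputs_equal)
```

The input row for "a" is the same lookup in both sentences, but its output row is a weighted mix that includes "apple" in one case and "pen" in the other, so the two outputs differ.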

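word2vec is easy to understand by comparison. Once a lookup table has been trained on your corpus, getting the embedding of "a" in either of the sentences above is just a query against that table. word2vec is static, and polysemy is a big problem for it.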

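One more point: BERT is a much deeper network. Following the reasoning behind ELMo, each layer of BERT may capture different information, and self-attention improves its ability to extract that information. word2vec, by contrast, is plainly just a shallow neural network.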

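If you are not familiar with BERT's architecture, there are plenty of explanations online; the blog post linked in the answer above is a good start. One thing to note is that BERT's transformer block differs slightly from the block in the original Transformer. For example, the activation functions differ (BERT uses GELU, the original Transformer uses ReLU). There is also a small detail on the input side. The point is not that BERT sums three embeddings, but that the Transformer scales up the input word embedding (by sqrt(d_model)) before adding the positional embedding. BERT has no such step.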
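A rough NumPy sketch of those two input-side recipes and the two activations, for comparison. The function names are made up. The layer norm omits the learned scale and bias and uses an assumed epsilon, and dropout is left out entirely. The sqrt(d_model) factor follows the original Transformer paper, and the GELU is the tanh approximation.

```python
import numpy as np

def layer_norm(x, eps=1e-12):
    """Normalize the last axis to zero mean and unit variance (no learned gain/bias here)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def transformer_input(tok, pos):
    """Original Transformer: scale token embeddings by sqrt(d_model), then add positions."""
    d_model = tok.shape[-1]
    return tok * np.sqrt(d_model) + pos

def bert_input(tok, pos, seg):
    """BERT: unscaled sum of token, position and segment embeddings, then layer norm."""
    return layer_norm(tok + pos + seg)

def relu(x):
    """Feed-forward activation in the original Transformer."""
    return np.maximum(x, 0.0)

def gelu(x):
    """Feed-forward activation in BERT (tanh approximation)."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

# Made-up embeddings for a 3-token sequence with hidden size 4.
rng = np.random.default_rng(0)
tok, pos, seg = (rng.normal(size=(3, 4)) for _ in range(3))
print(transformer_input(tok, pos).shape, bert_input(tok, pos, seg).shape)
```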

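"Dynamic" means the vectors have to be computed by running the model every time. "Static" means using already trained weights directly. In fact, BERT's embedding layer can itself be used as a set of static word vectors.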

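Why the vectors are dynamic comes down to how the model is built. Too much changed between word2vec and BERT to cover here; see my blog post for details: https://zhuanlan.zhihu.com/p/164632541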

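Put simply, in word2vec a word has the same vector representation no matter what context it appears in. The vector is independent of context, which is what "static" means. In BERT, a word's vector representation also carries information about the surrounding words, so it differs from one context to another, which is what "dynamic" means.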

Thanks for the invite. It's a long story. You first need to understand the attention mechanism, so start with jalammar's blog.

The Illustrated Transformer (jalammar.github.io)

I'll come back over the weekend to fill in the details.

######################## Divider #######################

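It's the weekend, and I'm back. Teacher, I know this one! Just yesterday a colleague and I were discussing why BERT's word vectors can capture polysemy, and my colleague's answer was vague too. "You're not the only one cursed with knowledge", which I think is what Thanos said to Iron Man. But I digress; back to the topic.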

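First, BERT is essentially the encoder part of the Transformer. The most important piece of the encoder is the self-attention mechanism, so let's start with what self-attention is. The name sounds fancy, but at its core attention is just the similarity between two vectors, and the most naive way to compute similarity is a dot product. Suppose there are two sentences, "I want to eat an apple" and "I want to buy an Apple phone". "Apple" means different things in the two sentences, so how does self-attention deal with them?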
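As a rough illustration of "attention is just dot-product similarity", here is a sketch with hand-made 2-dimensional vectors. The two axes ("food-ness" and "device-ness"), the vectors themselves, and the function names are invented purely for this example. A real model learns its vectors and also applies query/key/value projections, which are skipped here.

```python
import numpy as np

# Made-up vectors: axis 0 is "food-ness", axis 1 is "device-ness".
vectors = {
    "I": np.array([0.0, 0.0]),
    "want": np.array([0.0, 0.0]),
    "eat": np.array([2.0, 0.0]),
    "buy": np.array([0.0, 1.0]),
    "apple": np.array([1.0, 1.0]),   # ambiguous on its own
    "phone": np.array([0.0, 2.0]),
}

def softmax(x):
    """Softmax over a 1-D array of scores."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(query_token, sentence_tokens):
    """Return (attention weights, context vector) for one query token in a sentence."""
    keys = np.stack([vectors[t] for t in sentence_tokens])   # [seq_len, 2]
    query = vectors[query_token]
    scores = keys @ query / np.sqrt(query.shape[-1])         # scaled dot products
    weights = softmax(scores)                                # sums to 1
    return weights, weights @ keys                           # weighted mix of the sentence

w_eat, apple_in_eat = attend("apple", ["I", "want", "eat", "apple"])
w_buy, apple_in_buy = attend("apple", ["I", "want", "buy", "apple", "phone"])
print(apple_in_eat, apple_in_buy)
```

In the first sentence the context vector for "apple" leans toward the food axis, and in the second toward the device axis, even though the starting vector for "apple" is identical.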

I'm too lazy to draw a proper figure, so I'll just paste a photo of my scratch paper.