are fixed values generated by sin/cos functions, whereas in this implementation they are randomly initialized and trainable, just like ordinary word embeddings. The author presumably chose this because BERT's training corpus is much larger than the one used in the Transformer paper, so the model can simply learn the positions on its own.
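For contrast, here is a minimal numpy sketch (not from the BERT code; the function name is made up) of the fixed sinusoidal encoding used in the original Transformer paper:

import numpy as np

def sinusoidal_position_encoding(max_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    pos = np.arange(max_len)[:, None]          # [max_len, 1]
    i = np.arange(d_model)[None, :]            # [1, d_model]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])       # even dimensions use sin
    pe[:, 1::2] = np.cos(angle[:, 1::2])       # odd dimensions use cos
    return pe                                   # fixed, never trained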
def embedding_postprocessor(input_tensor,  # [batch_size, seq_length, embedding_size]
                            use_token_type=False,
                            token_type_ids=None,
                            token_type_vocab_size=16,  # usually 2
                            token_type_embedding_name="token_type_embeddings",
                            use_position_embeddings=True,
                            position_embedding_name="position_embeddings",
                            initializer_range=0.02,
                            max_position_embeddings=512,  # maximum position; must be >= max_seq_len
                            dropout_prob=0.1):
  input_shape = get_shape_list(input_tensor, expected_rank=3)  # [batch_size, seq_length, embedding_size]
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  width = input_shape[2]

  output = input_tensor

  # Segment (token type) information
  if use_token_type:
    if token_type_ids is None:
      raise ValueError("`token_type_ids` must be specified if "
                       "`use_token_type` is True.")
    token_type_table = tf.get_variable(
        name=token_type_embedding_name,
        shape=[token_type_vocab_size, width],
        initializer=create_initializer(initializer_range))
    # The token type vocabulary is small, so a one-hot matmul is used here
    # to speed up the embedding lookup.
    flat_token_type_ids = tf.reshape(token_type_ids, [-1])
    one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size)
    token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)
    token_type_embeddings = tf.reshape(token_type_embeddings,
                                       [batch_size, seq_length, width])
    output += token_type_embeddings

  # Position embedding information
  if use_position_embeddings:
    # Make sure seq_length <= max_position_embeddings
    assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)
    with tf.control_dependencies([assert_op]):
      full_position_embeddings = tf.get_variable(
          name=position_embedding_name,
          shape=[max_position_embeddings, width],
          initializer=create_initializer(initializer_range))
      # The position embedding is a learned variable of shape
      # [max_position_embeddings, width], but the actual input sequence is
      # usually shorter than max_position_embeddings, so tf.slice takes only
      # the first seq_length rows to speed up training.
      position_embeddings = tf.slice(full_position_embeddings, [0, 0],
                                     [seq_length, -1])
      num_dims = len(output.shape.as_list())
      # The tensor after word embedding is [batch_size, seq_length, width].
      # Position encodings are independent of the input content, so their
      # shape is always [seq_length, width]; they cannot be added to the word
      # embeddings directly. We therefore reshape them to
      # [1, seq_length, width] so broadcasting can handle the addition.
      position_broadcast_shape = []
      for _ in range(num_dims - 2):
        position_broadcast_shape.append(1)
      position_broadcast_shape.extend([seq_length, width])
      position_embeddings = tf.reshape(position_embeddings,
                                       position_broadcast_shape)
      output += position_embeddings

  output = layer_norm_and_dropout(output, dropout_prob)
  return output
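A minimal, hypothetical call just to illustrate the shapes (the tensor values are made up; get_shape_list, create_initializer and layer_norm_and_dropout from modeling.py are assumed to be in scope):

word_embeddings = tf.random_normal([8, 128, 768])   # pretend output of embedding_lookup
segment_ids = tf.zeros([8, 128], dtype=tf.int32)     # every token in segment 0
embeddings = embedding_postprocessor(
    input_tensor=word_embeddings,
    use_token_type=True,
    token_type_ids=segment_ids,
    token_type_vocab_size=2,
    max_position_embeddings=512,
    dropout_prob=0.1)
# embeddings: [8, 128, 768] = word + segment + position, then layer norm + dropout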
4. Building the attention_mask
This part of the code builds the attention_mask that defines each token's attention scope. Every sample has been padded, and during self-attention no position should attend to the padded positions.
The inputs are the padded input_ids of shape [batch_size, from_seq_length, ...] and the mask vector to_mask of shape [batch_size, to_seq_length].
def create_attention_mask_from_input_mask(from_tensor, to_mask):
  from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
  batch_size = from_shape[0]
  from_seq_length = from_shape[1]

  to_shape = get_shape_list(to_mask, expected_rank=2)
  to_seq_length = to_shape[1]

  to_mask = tf.cast(
      tf.reshape(to_mask, [batch_size, 1, to_seq_length]), tf.float32)

  broadcast_ones = tf.ones(
      shape=[batch_size, from_seq_length, 1], dtype=tf.float32)

  mask = broadcast_ones * to_mask
  return mask
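For intuition, a tiny worked example (values computed by hand, batch_size=1, seq_length=3, last token padded):

# input_ids  = [[31, 51, 0]]
# input_mask = [[1, 1, 0]]
# create_attention_mask_from_input_mask(input_ids, input_mask) returns
#   [[[1., 1., 0.],
#     [1., 1., 0.],
#     [1., 1., 0.]]]          # shape [1, 3, 3]
# Every query position may attend to tokens 0 and 1, never to the padded position 2.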
5. The attention layer
This part of the code implements multi-head attention, following the paper "Attention is All You Need". It is a key-query-value style attention: the input from_tensor serves as the query and to_tensor serves as the key and value; when the two are the same tensor, this is self-attention. For a more detailed introduction to attention, see the companion article 理解Attention機制原理及模型.
def attention_layer(from_tensor,   # [batch_size, from_seq_length, from_width]
                    to_tensor,     # [batch_size, to_seq_length, to_width]
                    attention_mask=None,  # [batch_size, from_seq_length, to_seq_length]
                    num_attention_heads=1,   # number of attention heads
                    size_per_head=512,       # size of each head
                    query_act=None,          # activation for the query transform
                    key_act=None,            # activation for the key transform
                    value_act=None,          # activation for the value transform
                    attention_probs_dropout_prob=0.0,  # dropout on the attention probabilities
                    initializer_range=0.02,  # initializer range
                    do_return_2d_tensor=False,
                    # If True, the output shape is [batch_size*from_seq_length, num_attention_heads*size_per_head].
                    # If False, the output shape is [batch_size, from_seq_length, num_attention_heads*size_per_head].
                    batch_size=None,
                    # If the inputs are 3D, the batch is the first dimension; but the 3D
                    # tensors may already have been flattened to 2D, in which case
                    # batch_size, from_seq_length and to_seq_length must be passed in.
                    from_seq_length=None,
                    to_seq_length=None):

  def transpose_for_scores(input_tensor, batch_size, num_attention_heads,
                           seq_length, width):
    output_tensor = tf.reshape(
        input_tensor, [batch_size, seq_length, num_attention_heads, width])
    # [batch_size, num_attention_heads, seq_length, width]
    output_tensor = tf.transpose(output_tensor, [0, 2, 1, 3])
    return output_tensor

  from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
  to_shape = get_shape_list(to_tensor, expected_rank=[2, 3])

  if len(from_shape) != len(to_shape):
    raise ValueError(
        "The rank of `from_tensor` must match the rank of `to_tensor`.")

  if len(from_shape) == 3:
    batch_size = from_shape[0]
    from_seq_length = from_shape[1]
    to_seq_length = to_shape[1]
  elif len(from_shape) == 2:
    if (batch_size is None or from_seq_length is None or to_seq_length is None):
      raise ValueError(
          "When passing in rank 2 tensors to attention_layer, the values "
          "for `batch_size`, `from_seq_length`, and `to_seq_length` "
          "must all be specified.")

  # Shorthand used in the shape comments below:
  #   B = batch size (number of sequences)
  #   F = `from_tensor` sequence length
  #   T = `to_tensor` sequence length
  #   N = `num_attention_heads`
  #   H = `size_per_head`

  # Flatten from_tensor and to_tensor into 2D matrices
  from_tensor_2d = reshape_to_matrix(from_tensor)  # [B*F, hidden_size]
  to_tensor_2d = reshape_to_matrix(to_tensor)      # [B*T, hidden_size]

  # Project from_tensor through a dense layer to get the queries
  # `query_layer` = [B*F, N*H]
  query_layer = tf.layers.dense(
      from_tensor_2d,
      num_attention_heads * size_per_head,
      activation=query_act,
      name="query",
      kernel_initializer=create_initializer(initializer_range))

  # Project to_tensor through a dense layer to get the keys
  # `key_layer` = [B*T, N*H]
  key_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=key_act,
      name="key",
      kernel_initializer=create_initializer(initializer_range))

  # Same for the values
  # `value_layer` = [B*T, N*H]
  value_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=value_act,
      name="value",
      kernel_initializer=create_initializer(initializer_range))

  # Split the queries into heads: [B*F, N*H] ==> [B, F, N, H] ==> [B, N, F, H]
  query_layer = transpose_for_scores(query_layer, batch_size,
                                     num_attention_heads, from_seq_length,
                                     size_per_head)

  # Split the keys into heads: [B*T, N*H] ==> [B, T, N, H] ==> [B, N, T, H]
  key_layer = transpose_for_scores(key_layer, batch_size, num_attention_heads,
                                   to_seq_length, size_per_head)

  # Dot-product the queries with the keys, then scale by 1/sqrt(size_per_head)
  # (see the original paper for the formula).
  # `attention_scores` = [B, N, F, T]
  attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)
  attention_scores = tf.multiply(attention_scores,
                                 1.0 / math.sqrt(float(size_per_head)))

  if attention_mask is not None:
    # `attention_mask` = [B, 1, F, T]
    attention_mask = tf.expand_dims(attention_mask, axis=[1])
    # Where attention_mask is 1, adder is (1 - 1) * -10000 = 0.
    # Where attention_mask is 0, adder is (1 - 0) * -10000 = -10000.
    adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0
    # The attention scores are never very large, so adding -10000 makes the
    # masked positions effectively negative infinity.
    attention_scores += adder

  # Negative infinity becomes 0 after softmax, so masked positions receive no
  # attention weight at all.
  # `attention_probs` = [B, N, F, T]
  attention_probs = tf.nn.softmax(attention_scores)

  # Dropout on the attention probabilities. This may look odd, but it is
  # exactly what the original Transformer paper does.
  attention_probs = dropout(attention_probs, attention_probs_dropout_prob)

  # `value_layer` = [B, T, N, H]
  value_layer = tf.reshape(
      value_layer,
      [batch_size, to_seq_length, num_attention_heads, size_per_head])

  # `value_layer` = [B, N, T, H]
  value_layer = tf.transpose(value_layer, [0, 2, 1, 3])

  # `context_layer` = [B, N, F, H]
  context_layer = tf.matmul(attention_probs, value_layer)

  # `context_layer` = [B, F, N, H]
  context_layer = tf.transpose(context_layer, [0, 2, 1, 3])

  if do_return_2d_tensor:
    # `context_layer` = [B*F, N*H]
    context_layer = tf.reshape(
        context_layer,
        [batch_size * from_seq_length, num_attention_heads * size_per_head])
  else:
    # `context_layer` = [B, F, N*H]
    context_layer = tf.reshape(
        context_layer,
        [batch_size, from_seq_length, num_attention_heads * size_per_head])

  return context_layer
To summarize, the attention layer proceeds as follows:
Validate the shapes of the input tensors and extract batch_size, from_seq_length and to_seq_length.
If the inputs are 3D tensors, flatten them into 2D matrices.
Use from_tensor as the query and to_tensor as the key and value; a dense layer on each produces query_layer, key_layer and value_layer.
Reshape these tensors into multi-head form with transpose_for_scores.
Compute attention_scores and attention_probs following the formula in the paper (note the attention_mask trick; see the sketch after this list).
Multiply attention_probs by the values and return a 2D or 3D tensor.
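As a standalone illustration of the score computation and the -10000 mask trick described above, here is a toy single-head sketch in plain numpy (not from the repo; names and sizes are made up):

import numpy as np

def toy_attention(q, k, v, mask):
    # q: [F, H], k: [T, H], v: [T, H], mask: [F, T] with 1 = attend, 0 = ignore
    size_per_head = q.shape[-1]
    scores = q @ k.T / np.sqrt(size_per_head)     # scaled dot product, [F, T]
    scores += (1.0 - mask) * -10000.0             # masked positions become ~ -inf
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)    # softmax over the T axis
    return probs @ v                               # weighted sum of values, [F, H]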
6. Transformer
The next piece of code is the famous core of the Transformer. It can be read as a reimplementation of the original "Attention is All You Need" code; see the original paper and the original code for reference.
def transformer_model(input_tensor,          # [batch_size, seq_length, hidden_size]
                      attention_mask=None,   # [batch_size, seq_length, seq_length]
                      hidden_size=768,
                      num_hidden_layers=12,
                      num_attention_heads=12,
                      intermediate_size=3072,
                      intermediate_act_fn=gelu,  # activation of the feed-forward layer
                      hidden_dropout_prob=0.1,
                      attention_probs_dropout_prob=0.1,
                      initializer_range=0.02,
                      do_return_all_layers=False):
  # The final output must have hidden_size units. There are num_attention_heads
  # heads, each with size_per_head hidden units, so
  # hidden_size = num_attention_heads * size_per_head.
  if hidden_size % num_attention_heads != 0:
    raise ValueError(
        "The hidden size (%d) is not a multiple of the number of attention "
        "heads (%d)" % (hidden_size, num_attention_heads))

  attention_head_size = int(hidden_size / num_attention_heads)
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  input_width = input_shape[2]

  # The encoder uses residual connections, so the input width must match
  # hidden_size.
  if input_width != hidden_size:
    raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
                     (input_width, hidden_size))

  # Reshapes are cheap on CPU/GPU but unfriendly on TPU, so to avoid constantly
  # converting between 2D and 3D, all 3D tensors are represented as 2D matrices.
  prev_output = reshape_to_matrix(input_tensor)

  all_layer_outputs = []
  for layer_idx in range(num_hidden_layers):
    with tf.variable_scope("layer_%d" % layer_idx):
      layer_input = prev_output

      with tf.variable_scope("attention"):
        # multi-head attention
        attention_heads = []
        with tf.variable_scope("self"):
          # self-attention
          attention_head = attention_layer(
              from_tensor=layer_input,
              to_tensor=layer_input,
              attention_mask=attention_mask,
              num_attention_heads=num_attention_heads,
              size_per_head=attention_head_size,
              attention_probs_dropout_prob=attention_probs_dropout_prob,
              initializer_range=initializer_range,
              do_return_2d_tensor=True,
              batch_size=batch_size,
              from_seq_length=seq_length,
              to_seq_length=seq_length)
          attention_heads.append(attention_head)

        attention_output = None
        if len(attention_heads) == 1:
          attention_output = attention_heads[0]
        else:
          # If there are multiple attention heads, concatenate them.
          attention_output = tf.concat(attention_heads, axis=-1)

        # Linearly project the attention output back to the input shape,
        # then dropout + residual + layer norm.
        with tf.variable_scope("output"):
          attention_output = tf.layers.dense(
              attention_output,
              hidden_size,
              kernel_initializer=create_initializer(initializer_range))
          attention_output = dropout(attention_output, hidden_dropout_prob)
          attention_output = layer_norm(attention_output + layer_input)

      # feed-forward
      with tf.variable_scope("intermediate"):
        intermediate_output = tf.layers.dense(
            attention_output,
            intermediate_size,
            activation=intermediate_act_fn,
            kernel_initializer=create_initializer(initializer_range))

      # Project the feed-forward output back down to hidden_size,
      # then dropout + residual + layer norm.
      with tf.variable_scope("output"):
        layer_output = tf.layers.dense(
            intermediate_output,
            hidden_size,
            kernel_initializer=create_initializer(initializer_range))
        layer_output = dropout(layer_output, hidden_dropout_prob)
        layer_output = layer_norm(layer_output + attention_output)
        prev_output = layer_output
        all_layer_outputs.append(layer_output)

  if do_return_all_layers:
    final_outputs = []
    for layer_output in all_layer_outputs:
      final_output = reshape_from_matrix(layer_output, input_shape)
      final_outputs.append(final_output)
    return final_outputs
  else:
    final_output = reshape_from_matrix(prev_output, input_shape)
    return final_output
This is best read alongside the architecture diagram from the Transformer paper (the figure in the original article); since BERT uses only the encoder, the decoder never makes an appearance here.
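A hypothetical standalone call, just to show the interface (the shapes are made up, and the helpers from modeling.py are assumed to be in scope):

embeddings = tf.random_normal([2, 16, 768])   # [batch_size, seq_length, hidden_size]
attn_mask = tf.ones([2, 16, 16])              # allow every position to attend everywhere
all_layers = transformer_model(
    input_tensor=embeddings,
    attention_mask=attn_mask,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    do_return_all_layers=True)
sequence_output = all_layers[-1]              # [2, 16, 768], output of the last block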
7. Entry point (__init__)
This is the constructor of the BertModel class. With the building blocks from the previous sections in place, we can now assemble the BERT model.
def __init__(self,
             config,                        # a BertConfig instance
             is_training,
             input_ids,                     # [batch_size, seq_length]
             input_mask=None,               # [batch_size, seq_length]
             token_type_ids=None,           # [batch_size, seq_length]
             use_one_hot_embeddings=False,  # use one-hot embeddings, otherwise tf.gather()
             scope=None):
  config = copy.deepcopy(config)
  if not is_training:
    config.hidden_dropout_prob = 0.0
    config.attention_probs_dropout_prob = 0.0

  input_shape = get_shape_list(input_ids, expected_rank=2)
  batch_size = input_shape[0]
  seq_length = input_shape[1]

  # If no mask is given, nothing is masked, i.e. all elements are 1.
  if input_mask is None:
    input_mask = tf.ones(shape=[batch_size, seq_length], dtype=tf.int32)

  if token_type_ids is None:
    token_type_ids = tf.zeros(shape=[batch_size, seq_length], dtype=tf.int32)

  with tf.variable_scope(scope, default_name="bert"):
    with tf.variable_scope("embeddings"):
      # word embedding
      (self.embedding_output, self.embedding_table) = embedding_lookup(
          input_ids=input_ids,
          vocab_size=config.vocab_size,
          embedding_size=config.hidden_size,
          initializer_range=config.initializer_range,
          word_embedding_name="word_embeddings",
          use_one_hot_embeddings=use_one_hot_embeddings)

      # Add the position embeddings and segment embeddings,
      # then layer norm + dropout.
      self.embedding_output = embedding_postprocessor(
          input_tensor=self.embedding_output,
          use_token_type=True,
          token_type_ids=token_type_ids,
          token_type_vocab_size=config.type_vocab_size,
          token_type_embedding_name="token_type_embeddings",
          use_position_embeddings=True,
          position_embedding_name="position_embeddings",
          initializer_range=config.initializer_range,
          max_position_embeddings=config.max_position_embeddings,
          dropout_prob=config.hidden_dropout_prob)

    with tf.variable_scope("encoder"):
      # input_ids are padded word ids, e.g. [25, 120, 34, 0, 0]
      # input_mask marks the valid tokens, e.g. [1, 1, 1, 0, 0]
      attention_mask = create_attention_mask_from_input_mask(
          input_ids, input_mask)

      # Stack of transformer blocks.
      # `sequence_output` shape = [batch_size, seq_length, hidden_size].
      self.all_encoder_layers = transformer_model(
          input_tensor=self.embedding_output,
          attention_mask=attention_mask,
          hidden_size=config.hidden_size,
          num_hidden_layers=config.num_hidden_layers,
          num_attention_heads=config.num_attention_heads,
          intermediate_size=config.intermediate_size,
          intermediate_act_fn=get_activation(config.hidden_act),
          hidden_dropout_prob=config.hidden_dropout_prob,
          attention_probs_dropout_prob=config.attention_probs_dropout_prob,
          initializer_range=config.initializer_range,
          do_return_all_layers=True)

    # `self.sequence_output` is the output of the last layer,
    # shape [batch_size, seq_length, hidden_size].
    self.sequence_output = self.all_encoder_layers[-1]

    # The "pooler" converts the encoder output
    # [batch_size, seq_length, hidden_size] into [batch_size, hidden_size].
    with tf.variable_scope("pooler"):
      # Take the tensor of the first token ([CLS]) of the last layer;
      # this matters a lot for classification tasks.
      # sequence_output[:, 0:1, :] has shape [batch_size, 1, hidden_size],
      # so we squeeze away the second dimension.
      first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
      # Then add one more dense layer; the output is still
      # [batch_size, hidden_size].
      self.pooled_output = tf.layers.dense(
          first_token_tensor,
          config.hidden_size,
          activation=tf.tanh,
          kernel_initializer=create_initializer(config.initializer_range))
Wrapping up
With this deeper understanding of the source code, using BertModel becomes much more straightforward. Here is a simple example of using the model:
# Assume the input has already been tokenized into word_ids. shape = [2, 3]
input_ids = tf.constant([[31, 51, 99], [15, 5, 0]])
input_mask = tf.constant([[1, 1, 1], [1, 1, 0]])
# Segment embedding ids. For the first sample, the first two tokens belong to
# sentence 1 and the last token to sentence 2. For the second sample, the first
# token belongs to sentence 1, the second has token type 2, and the trailing 0
# is padding.
token_type_ids = tf.constant([[0, 0, 1], [0, 2, 0]])

# Create a BertConfig instance
# (hidden_size must be divisible by num_attention_heads, so 8 heads are used here)
config = modeling.BertConfig(vocab_size=32000, hidden_size=512,
                             num_hidden_layers=8, num_attention_heads=8,
                             intermediate_size=1024)

# Create a BertModel instance
model = modeling.BertModel(config=config, is_training=True,
                           input_ids=input_ids, input_mask=input_mask,
                           token_type_ids=token_type_ids)

label_embeddings = tf.get_variable(...)
# The first token ([CLS]) of the last layer can be viewed as a sentence embedding
pooled_output = model.get_pooled_output()
logits = tf.matmul(pooled_output, label_embeddings)
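For token-level tasks (sequence labelling, question answering and so on) you would use the per-token output instead of the pooled one; BertModel exposes it via get_sequence_output():

sequence_output = model.get_sequence_output()   # [batch_size, seq_length, hidden_size]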
The main flow of building the BERT model:
Embed the input sequence (three embeddings summed together); everything after that is "Attention is All You Need".
In short: feed the embeddings into the Transformer and take its output.
In a bit more detail: embedding --> N * [multi-head attention --> Add (residual) & Norm --> Feed-Forward --> Add (residual) & Norm].
See, not that hard, right?
There are a few other helper functions in the source code; they are easy to follow, so I won't go through them here.
References
bert repository
谷歌BERT預訓練源碼解析(一):訓練數據生成 (Google BERT pre-training source code walkthrough, part 1: generating the training data)
Bert系列(一)——demo運行 (BERT series, part 1: running the demo)
That's all~