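Introduction:

This article has three parts. The first explains how the text is converted after it is read in. The second covers preparing the training and test sets for the skip-gram model. The third covers the underlying theory and the main program.

Part 1:

  1. First, read the Chinese text with the following program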

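# read_data is a hypothetical name; the original snippet showed only the function body.
def read_data(path):
    raw_word_list = []
    with open(path, "r", encoding="UTF-8") as f:
        line = f.readline()
        while line:
            # Data cleaning can be added here, e.g. removing formatting characters and spaces.
            # Placeholder: split on whitespace; substitute a real Chinese word segmenter.
            raw_words = line.split()
            raw_word_list.extend(raw_words)
            line = f.readline()
    return raw_word_list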

Reading the text produces a list of words.

2. Next, rank the words by how often they occur in the text. Here we keep the 50,000 most frequent words.

import collections

vocabulary_size = 50000

def build_dataset(words):
    count = [["UNK", -1]]  # slot for words outside the top 50,000; the real count is filled in below
    count.extend(collections.Counter(words).most_common(vocabulary_size - 1))  # sort by frequency
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)  # map each word to its frequency rank
    data = list()
    unk_count = 0
    # Convert words to indices and count the words outside the 50,000 most frequent
    for word in words:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0  # index 0 is reserved for UNK
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count
    # zip() pairs up corresponding elements of its arguments as tuples; here it inverts the dictionary.
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reverse_dictionary
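
To make the four return values concrete, here is a minimal sketch on a toy corpus. It repeats the same logic with a vocabulary size of 3 instead of 50,000. The name `build_toy_dataset` and the sample sentence are illustrative assumptions, not part of the original program.

```python
import collections

def build_toy_dataset(words, vocabulary_size):
    # Same steps as build_dataset, with the vocabulary size passed in as an argument.
    count = [["UNK", -1]]
    count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
    dictionary = {word: i for i, (word, _) in enumerate(count)}
    data = [dictionary.get(word, 0) for word in words]  # unknown words map to index 0
    count[0][1] = data.count(0)
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reverse_dictionary

words = "the cat sat on the mat the cat the".split()
data, count, dictionary, reverse_dictionary = build_toy_dataset(words, 3)
print(data)                # word indices, 0 wherever the word fell outside the vocabulary
print(count)               # [word, frequency] pairs, UNK first
print(reverse_dictionary)  # index -> word
```

Only "the" and "cat" make it into the vocabulary here, so "sat", "on" and "mat" all collapse to index 0.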

Part 2: Generating the training and test data

import collections
import random

import numpy as np

data_index = 0

def generate_batch(batch_size, num_skips, skip_window):
    # Relies on the global `data` list of word indices returned by build_dataset.
    global data_index
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1  # [ skip_window target skip_window ]
    buffer = collections.deque(maxlen=span)
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    for i in range(batch_size // num_skips):
        target = skip_window  # target label at the center of the buffer
        targets_to_avoid = [skip_window]
        for j in range(num_skips):
            while target in targets_to_avoid:
                target = random.randint(0, span - 1)
            targets_to_avoid.append(target)
            batch[i * num_skips + j] = buffer[skip_window]  # position skip_window is always the center word
            labels[i * num_skips + j, 0] = buffer[target]
        buffer.append(data[data_index])  # push one new element; the oldest drops off the front
        data_index = (data_index + 1) % len(data)
    return batch, labels

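A global variable, data_index, is defined first because the read position has to move forward after every batch. The window itself is a queue, buffer = collections.deque(maxlen=span), which acts as a sliding window over the data.

The skip-gram model assumes that a word generates the words around it in the text sequence. For example, suppose the text sequence is "the", "man", "loves", "his", "son". Take "loves" as the center word and set the context window size to 2. As shown in Figure 10.1, the skip-gram model is concerned with the conditional probability of generating the context words "the", "man", "his" and "son", which lie no more than 2 words away, given the center word "loves", that is,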
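
The sliding behaviour of a bounded deque can be seen in isolation. This is a small sketch with made-up word indices, not the training data from the article:

```python
import collections

data = [10, 11, 12, 13, 14, 15]   # hypothetical word indices
skip_window = 1
span = 2 * skip_window + 1        # window width: 3

buffer = collections.deque(maxlen=span)
windows = []
for index in data:
    buffer.append(index)          # once full, appending drops the oldest element on the left
    if len(buffer) == span:
        windows.append(list(buffer))

for window in windows:
    print(window, "center:", window[skip_window])
```

Because maxlen is fixed, each append moves the window one position to the right without any manual slicing.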

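$$P(\textrm{``the"}, \textrm{``man"}, \textrm{``his"}, \textrm{``son"} \mid \textrm{``loves"}).$$

Assuming that the context words are generated independently of one another given the center word, the expression above can be rewritten as

$$P(\textrm{``the"} \mid \textrm{``loves"}) \cdot P(\textrm{``man"} \mid \textrm{``loves"}) \cdot P(\textrm{``his"} \mid \textrm{``loves"}) \cdot P(\textrm{``son"} \mid \textrm{``loves"}).$$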
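
The factorized form can be evaluated numerically. The sketch below assumes each conditional probability is a softmax over dot products of word vectors, which is the usual skip-gram parameterization but is not spelled out in the text above. The random vectors and the five-word vocabulary are placeholders.

```python
import numpy as np

sentence = ["the", "man", "loves", "his", "son"]
window = 2
center_pos = sentence.index("loves")

# Context words: everything within `window` positions of the center word.
context = [w for i, w in enumerate(sentence)
           if i != center_pos and abs(i - center_pos) <= window]

rng = np.random.default_rng(0)
dim = 4
center_vecs = rng.normal(size=(len(sentence), dim))   # vector used when a word is the center
context_vecs = rng.normal(size=(len(sentence), dim))  # vector used when a word is context

scores = context_vecs @ center_vecs[center_pos]
probs = np.exp(scores - scores.max())
probs /= probs.sum()                                  # P(w | "loves") for every word w

# Product of the four conditional probabilities from the factorized expression.
joint = np.prod([probs[sentence.index(w)] for w in context])
print(context)
print(joint)
```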

Part 3

import math

import tensorflow as tf  # TensorFlow 1.x API (placeholders and sessions)

# batch_size, embedding_size, vocabulary_size, num_sampled and valid_examples
# are assumed to be defined before this point.
graph = tf.Graph()
with graph.as_default():
    # Input data.
    train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
    valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

    # Ops and variables pinned to the CPU because of missing GPU implementation
    with tf.device('/cpu:0'):
        # Look up embeddings for inputs.
        embeddings = tf.Variable(
            tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))

        # tf.nn.embedding_lookup selects the elements of a tensor at the given indices.

        embed = tf.nn.embedding_lookup(embeddings, train_inputs)

        # Construct the variables for the NCE loss
        nce_weights = tf.Variable(
            tf.truncated_normal([vocabulary_size, embedding_size],
                                stddev=1.0 / math.sqrt(embedding_size)))
        nce_biases = tf.Variable(tf.zeros([vocabulary_size]), dtype=tf.float32)

    # Compute the average NCE loss for the batch.
    # tf.nn.nce_loss automatically draws a new sample of the negative labels each
    # time we evaluate the loss.
    loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weights,
                                         biases=nce_biases,
                                         inputs=embed,
                                         labels=train_labels,
                                         num_sampled=num_sampled,
                                         num_classes=vocabulary_size))

    optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
    # Normalize each embedding row to unit length.
    norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
    normalized_embeddings = embeddings / norm
    valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)

    # Add variable initializer.
    init = tf.global_variables_initializer()
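
The embedding lookup used in the graph is, in effect, row selection. The sketch below shows the same idea with NumPy indexing; it is an analogy under that assumption, not TensorFlow's implementation, and the tiny matrix is made up for illustration.

```python
import numpy as np

vocabulary_size, embedding_size = 5, 3
embeddings = np.arange(vocabulary_size * embedding_size, dtype=np.float32)
embeddings = embeddings.reshape(vocabulary_size, embedding_size)  # row i is the vector of word i

train_inputs = np.array([3, 0, 3])   # hypothetical batch of word indices
embed = embeddings[train_inputs]     # one row per index, repeated indices allowed
print(embed.shape)
print(embed)
```

The result has one row per input index, so a batch of shape [batch_size] becomes a matrix of shape [batch_size, embedding_size].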

num_steps = 3000000

with tf.Session(graph=graph) as session:
    # We must initialize all variables before we use them.
    init.run()
    average_loss = 0
    for step in range(num_steps):
        batch_inputs, batch_labels = generate_batch(batch_size, num_skips, skip_window)
        feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}

        # We perform one update step by evaluating the optimizer op (including it
        # in the list of returned values for session.run())
        _, loss_val = session.run([optimizer, loss], feed_dict=feed_dict)
        average_loss += loss_val

        if step % 2000 == 0:
            if step > 0:
                average_loss /= 2000
            # The average loss is an estimate of the loss over the last 2000 batches.
            print("Average loss at step ", step, ": ", average_loss)
            average_loss = 0
    final_embeddings = normalized_embeddings.eval()
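
Because every row of the normalized embedding matrix has length 1, a dot product between two rows is their cosine similarity. The following sketch illustrates how such a matrix could be queried for nearest neighbours. It uses random stand-in vectors and a hypothetical query index, since the trained values are not available here.

```python
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(6, 4))                  # stand-in for trained embeddings
norm = np.sqrt(np.sum(np.square(embeddings), axis=1, keepdims=True))
final_embeddings = embeddings / norm                  # every row now has length 1

query = 2                                             # hypothetical word index
similarity = final_embeddings @ final_embeddings[query]
nearest = np.argsort(-similarity)[1:4]                # skip position 0, the query itself
print(nearest, similarity[nearest])
```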

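This step is fairly simple to implement, and I expect most readers can follow it easily. In my view the harder part is the tf.nn.nce_loss loss function, which leads into the underlying theory of the skip-gram model. To begin, the link below introduces where cross-entropy comes from.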

https://blog.csdn.net/tsyccnh/article/details/79163834?


It explains the origin of cross-entropy in some detail.
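
As a quick numerical reminder of what cross-entropy measures, here is a minimal sketch using the standard definition H(p, q) = -sum_i p_i log q_i. The distributions are made-up examples.

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log(q_i), with p the true distribution."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(-np.sum(p * np.log(q)))

true_dist = [1.0, 0.0, 0.0]   # one-hot label: the first class is correct
good_pred = [0.8, 0.1, 0.1]   # puts most of its mass on the correct class
bad_pred = [0.2, 0.4, 0.4]    # puts most of its mass elsewhere

print(cross_entropy(true_dist, good_pred))
print(cross_entropy(true_dist, bad_pred))
```

The prediction that assigns more probability to the correct class gets the lower cross-entropy, which is why it serves as a training loss.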

word2vec Principles (3): Models Based on Negative Sampling (www.cnblogs.com)

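Negative sampling speeds up the training of word vectors. For the detailed theory behind NCE, see the following link.

A detailed introduction to NCE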
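
To show why sampling helps, here is a sketch of the simplified negative-sampling objective for a single (center, context) pair. It is not the exact computation performed by tf.nn.nce_loss, which additionally corrects for the noise distribution it samples from. The function name and the random vectors are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(center_vec, true_vec, noise_vecs):
    """-log s(u_o . v_c) - sum_k log s(-u_k . v_c) for one (center, context) pair."""
    positive = np.log(sigmoid(true_vec @ center_vec))                # true context word
    negative = np.sum(np.log(sigmoid(-(noise_vecs @ center_vec))))   # sampled noise words
    return float(-(positive + negative))

rng = np.random.default_rng(0)
embedding_size, num_sampled = 8, 5
center_vec = rng.normal(size=embedding_size)
true_vec = rng.normal(size=embedding_size)
noise_vecs = rng.normal(size=(num_sampled, embedding_size))

loss = negative_sampling_loss(center_vec, true_vec, noise_vecs)
print(loss)
```

Each update touches only 1 + num_sampled output vectors instead of all vocabulary_size of them, which is where the speed-up comes from.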

