The Skip-Gram Model in Word2Vec: Implementation and Underlying Principles Explained

Introduction:

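This article has three parts. The first explains how the text is converted after it is read from a file; the second covers preparing the training and test sets for the skip-gram model; the third covers the underlying principles and the main program.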

Part one:

1. First, read the Chinese text with the following program:

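def read_data(path):
    # path: location of the .txt file. The function name read_data is only
    # illustrative; the original snippet shows the body without a def line.
    raw_word_list = []
    with open(path, "r", encoding="UTF-8") as f:
        line = f.readline()
        while line:
            # Data cleaning can be added here, for example removing
            # formatting characters and whitespace.
            # Assumption: the original does not show how raw_words is built.
            # This placeholder splits on whitespace; unsegmented Chinese text
            # needs a word segmenter at this step instead.
            raw_words = line.split()
            raw_word_list.extend(raw_words)
            line = f.readline()
    return raw_word_list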

Reading the text produces a single list of words.

2. After the text is read, the words are ranked by how often they occur in it. Here we keep the 50,000 most frequent words.

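import collections

vocabulary_size = 50000  # keep the 50,000 most frequent words

def build_dataset(words):
    # count[0] records how many words fall outside the top 50,000;
    # -1 is a placeholder that is overwritten below.
    count = [['UNK', -1]]
    # Rank words by frequency and keep the most common ones.
    count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
    # Build the dictionary: each word maps to its rank, and 'UNK' maps to 0.
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)
    data = list()
    unk_count = 0
    # Encode the text as indices and count the words that are not among
    # the 50,000 most frequent ones.
    for word in words:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0  # index of 'UNK'
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count
    # zip() pairs up corresponding elements of its arguments as tuples;
    # here it swaps keys and values to give an index-to-word lookup.
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reverse_dictionary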
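
To make the encoding step concrete, here is a minimal sketch of the same idea on a toy word list. The helper name toy_build_dataset, the sample words, and the vocabulary size of 3 are all assumptions chosen for illustration; they are not part of the original program.

```python
import collections

def toy_build_dataset(words, vocab_size):
    # Hypothetical helper mirroring the frequency-ranking logic above,
    # with the vocabulary size passed in so a tiny value can be used.
    count = [["UNK", -1]]
    count.extend(collections.Counter(words).most_common(vocab_size - 1))
    dictionary = {word: i for i, (word, _) in enumerate(count)}
    data = []
    unk_count = 0
    for word in words:
        index = dictionary.get(word, 0)  # 0 is the index of "UNK"
        if index == 0:
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reverse_dictionary

# Illustrative sample: "the" appears 3 times, "cat" twice, the rest once.
words = ["the", "cat", "the", "dog", "the", "cat", "bird"]
data, count, dictionary, reverse_dictionary = toy_build_dataset(words, vocab_size=3)

print(data)                # [1, 2, 1, 0, 1, 2, 0]
print(count[0])            # ['UNK', 2]
print(reverse_dictionary)  # {0: 'UNK', 1: 'the', 2: 'cat'}
```

With a vocabulary of 3, only "the" and "cat" get their own indices; "dog" and "bird" both collapse to index 0, which is why the 'UNK' count ends up as 2. The reverse dictionary is what later lets you turn indices back into words.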