Fast loading of gensim word2vec word-vector models

1. Background:

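When you use a word2vec model that someone else has already trained, such as Chinese Word Vectors (Embedding/Chinese-Word-Vectors on github.com), the model is often very large. In the Skip-Gram with Negative Sampling (SGNS) table at that link, the file in the Mixed-large row and the Word + Ngram column is 3.43G after downloading and extracting. Loading all of it into memory just to look up word vectors is clearly slow, and model.most_similar likewise has to search the whole in-memory vocabulary.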

On Stack Overflow, the question about speeding up the loading and use of a trained word2vec model is "How to speed up Gensim Word2vec model load time?" (stackoverflow.com).

The problem this post tries to solve:

How to use a pre-trained word2vec model to quickly get the word vectors for a limited vocabulary.

2. Method:

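Build a limited vocabulary, dump the model's vectors for just those words to CSV or another format, and load that file into a dictionary at use time for fast lookups.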
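The three steps can be sketched end to end with only the standard library. This is a minimal sketch under assumptions, not the code used in this post: it swaps the gensim load for a streaming filter over the file, and it assumes the model is in the word2vec text format (a header line with the vocabulary size and the dimension, then one line per word followed by its values). The function names `filter_word2vec_text`, `save_vectors_csv`, and `load_vectors_csv` are hypothetical, and the CSV layout here (one row per word) is an illustrative choice.

```python
import csv


def filter_word2vec_text(model_path, vocabulary):
    """Stream a word2vec text-format file, keeping only words in `vocabulary`."""
    wanted = set(vocabulary)
    kept = {}
    with open(model_path, encoding="utf8", errors="ignore") as f:
        # Header line: "<vocab_size> <dimension>"
        dim = int(next(f).split()[1])
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) <= dim:
                continue  # blank or malformed line
            # The last `dim` fields are the vector; the rest is the word.
            word = " ".join(parts[:-dim])
            if word in wanted:
                kept[word] = [float(x) for x in parts[-dim:]]
    return kept


def save_vectors_csv(vectors, csv_path):
    """Write one row per word: the word, then its vector components."""
    with open(csv_path, "w", newline="", encoding="utf8") as f:
        writer = csv.writer(f)
        for word, vec in vectors.items():
            writer.writerow([word, *vec])


def load_vectors_csv(csv_path):
    """Read the CSV back into a {word: [float, ...]} dictionary."""
    with open(csv_path, newline="", encoding="utf8") as f:
        return {row[0]: [float(x) for x in row[1:]] for row in csv.reader(f) if row}
```

The full model is never held in memory at once, and the returned values are plain Python lists; wrapping one in `np.array(...)` gives a numpy vector if the downstream code needs it.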

3. Code:

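import numpy as np
import pandas as pd
from gensim.models import KeyedVectors


# Keep only the words that exist in the word2vec model.
# words_cut is the tokenized corpus: a list whose elements are space-separated strings.
# Config is the project's own settings class (not shown here); model_path is the model directory.
def get_vocabulary_vector(words_cut):
    # Load the downloaded and extracted word2vec model (C text format)
    print("start word2vec load ......")
    wv_from_text = KeyedVectors.load_word2vec_format(
        Config().model_path + "sgns.merge.bigram",
        binary=False, encoding="utf8", unicode_errors="ignore")
    print("word2vec load succeed")

    # Build the vocabulary from all texts
    vocabulary = {word for item in words_cut for word in item.split()}

    # Build the word -> vector dictionary
    vocabulary_vector = {}
    for word in vocabulary:
        if word in wv_from_text:
            vocabulary_vector[word] = wv_from_text[word]
    # Save the word -> vector dictionary. JSON does not store numpy vectors well, so use CSV.
    # Each word becomes one column; index=False keeps the row index out of the file,
    # otherwise it would come back as an extra "Unnamed: 0" key when read.
    pd.DataFrame(vocabulary_vector).to_csv(
        Config().model_path + "vocabulary_vector.csv", index=False)


if __name__ == "__main__":
    # Read the word -> vector dictionary back: CSV to dict
    vocabulary_vector = dict(pd.read_csv(Config().model_path + "vocabulary_vector.csv"))
    # Each value is a pandas Series at this point; convert it back to np.array for later use
    for key, value in vocabulary_vector.items():
        vocabulary_vector[key] = np.array(value)
    print("vocabulary vector load succeed")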
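Once the dictionary is in memory, a vector lookup is a plain dict access, and a similarity search only has to scan the limited vocabulary instead of the full model. The following is a minimal sketch of such a search using cosine similarity. The helper names `cosine` and `most_similar` are hypothetical and are not gensim's API, and the toy vectors in the usage lines are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (lists or numpy arrays)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(word, vocabulary_vector, topn=3):
    """Rank every other word in the dictionary by similarity to `word`."""
    target = vocabulary_vector[word]
    scores = [(other, cosine(target, vec))
              for other, vec in vocabulary_vector.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy dictionary standing in for the loaded vocabulary_vector
toy = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
ranking = most_similar("a", toy, topn=2)
```

The scan is linear in the size of the limited vocabulary, which is the point of the approach: its cost depends on the words you kept, not on the size of the original model.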