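Text Representation: The Bag-of-Words Model

Some thoughts on the bag-of-words model, drawing on the official scikit-learn documentation (go easy on me if it is not well written).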

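1. BoW (Bag-of-Words) model

It is the unigram (1-gram) special case of the n-gram language model.

The model ignores grammar, word order, and similar elements of a text, treating it purely as a collection of words in which each word's occurrence is independent of the others. BoW represents a passage or a document as an unordered collection of words.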
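To make the unigram/n-gram relationship concrete, here is a minimal sketch in plain Python. The helper name `ngrams` is made up for illustration and is not a scikit-learn function. With n=1 the result is just the individual words, which is all a bag-of-words model looks at; with n=2 some local word order is retained.

```python
def ngrams(tokens, n):
    """Return the list of contiguous n-token tuples found in `tokens`."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "John likes to watch movies".split()
unigrams = ngrams(tokens, 1)   # the bag-of-words case
bigrams = ngrams(tokens, 2)    # pairs of adjacent words

print(unigrams)
print(bigrams)
```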

[Figure: illustration of the bag-of-words model]

Consider the following two simple text documents:

John likes to watch movies. Mary likes too.
John also likes to watch football games.

Based on the words that appear in these two documents, build the following dictionary:

{"John": 1, "likes": 2,"to": 3, "watch": 4, "movies": 5,"also": 6, "football": 7, "games": 8,"Mary": 9, "too": 10}

The dictionary contains 10 words, each with a unique index, so each text can be represented by a 10-dimensional vector:

[1, 2, 1, 1, 1, 0, 0, 0, 1, 1]
[1, 1, 1, 1, 0, 1, 1, 1, 0, 0]

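The dimensionality of the vector equals the number of distinct words in the dictionary.

The order of the elements in the vector has nothing to do with the order in which words appear in the original text; it follows the order of the dictionary, one position per word.

Each number in the vector is how often the corresponding dictionary word appears in the text, that is, a term-frequency representation.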
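The two vectors above can be reproduced with a few lines of plain Python. This is only a sketch: the function name `bow_vector` and the letters-only tokenizer are assumptions made for illustration, not how scikit-learn tokenizes.

```python
import re

# The dictionary from the text: word -> 1-based index.
dictionary = {"John": 1, "likes": 2, "to": 3, "watch": 4, "movies": 5,
              "also": 6, "football": 7, "games": 8, "Mary": 9, "too": 10}

docs = ["John likes to watch movies. Mary likes too.",
        "John also likes to watch football games."]

def bow_vector(text, dictionary):
    """Count each dictionary word in `text`; vector position = index - 1."""
    vector = [0] * len(dictionary)
    for word in re.findall(r"[A-Za-z]+", text):  # assumed tokenizer: runs of letters
        if word in dictionary:
            vector[dictionary[word] - 1] += 1
    return vector

vectors = [bow_vector(d, dictionary) for d in docs]
for v in vectors:
    print(v)
```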

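2. Ways to implement the bag-of-words model

Take the sentence "我是一個喜歡學習的人，喜歡學習的人" ("I am a person who likes studying, a person who likes studying") as an example, as shown in the figure.

The first form: any dictionary word that appears in the text counts as 1 in the BoW vector no matter how many times it appears, and a word that does not appear counts as 0. The vector built from the dictionary therefore contains only 0s and 1s.

Code, using the API from the official scikit-learn documentation:

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(binary=True)  # binary=True: record only presence (1) or absence (0)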

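Characteristic: it captures only the difference between words that appear and words that do not (which can work well for some short-text classification tasks); it says nothing about differences among the words that do appear.

Improvement: use the term-frequency vector of the second form.

The second form: each dictionary word counts as many times as it appears in the text, and a word that does not appear counts as 0. The vector is then built from the dictionary.

Code:

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()  # default: raw occurrence counts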
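The difference between the first (0/1) and second (count) forms can be seen on the example sentence with a short plain-Python sketch. The word segmentation and the extra dictionary word "工作" are assumptions made for illustration; a real segmenter may split the sentence differently.

```python
from collections import Counter

# Hypothetical segmentation of the example sentence (an assumption).
tokens = ["我", "是", "一個", "喜歡", "學習", "的", "人", "喜歡", "學習", "的", "人"]

# Hypothetical dictionary: the sentence's words plus one absent word ("工作").
vocabulary = ["我", "是", "一個", "喜歡", "學習", "的", "人", "工作"]

counts = Counter(tokens)
count_vector = [counts[w] for w in vocabulary]                    # second form: raw counts
binary_vector = [1 if counts[w] > 0 else 0 for w in vocabulary]   # first form: 0/1

print(count_vector)
print(binary_vector)
```

The binary vector cannot tell "喜歡" (twice) apart from "我" (once), which is exactly the limitation the count form addresses.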

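A possible problem: a long text has higher counts for each word than a short text even when both discuss the same topic, so raw counts are biased by document length.

Improvement: divide each word's count in a document by the total number of words in that document; more generally, normalizing the term frequencies reduces this effect.

Code:

from sklearn.feature_extraction.text import TfidfVectorizer
# use_idf=False keeps plain term frequencies; norm="l2" scales each document vector to unit length
# (norm="l1" would instead divide by the document's total); sublinear_tf=True replaces tf with 1 + log(tf)
vectorizer = TfidfVectorizer(norm="l2", use_idf=False, sublinear_tf=True)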
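A small plain-Python sketch of the two normalizations mentioned here may help. The function names and the toy counts are hypothetical; the point is only that a short and a long document with the same word proportions end up with the same vector after normalization.

```python
import math

def l1_normalize(counts):
    """Divide each count by the total number of words in the document."""
    total = sum(counts)
    return [c / total for c in counts] if total else list(counts)

def l2_normalize(counts):
    """Scale the vector to unit Euclidean length."""
    norm = math.sqrt(sum(c * c for c in counts))
    return [c / norm for c in counts] if norm else list(counts)

short_doc = [1, 2, 1]      # hypothetical counts for a short document
long_doc = [10, 20, 10]    # same proportions, ten times longer

print(l1_normalize(short_doc), l1_normalize(long_doc))
print(l2_normalize(short_doc), l2_normalize(long_doc))
```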

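A new problem: some common words appear in a great many documents, such as 我, 的, 是, 中國, 因為 ("I", "of", "is", "China", "because"). These high-frequency words contribute little to a document's topic, whereas low-frequency words such as 蜜蜂, 養殖, 機床 ("bee", "farming", "machine tool") say much more about it.

Improvements:

1. Remove stop words: this removes some uninformative high-frequency words.

2. On top of term frequency, give each word a weight that reduces the influence of common words. Words that appear in many documents of the training set (such as "中國") get a small weight; words that appear in only a few documents (such as "蜜蜂" and "養殖") get a large weight. This weight is the inverse document frequency (idf).

Multiplying term frequency by this weight gives tf-idf: the more important a word is to a document, the larger its tf-idf value. This corresponds to the third form in the figure above.

Code:

from sklearn.feature_extraction.text import TfidfVectorizer
# Note: for Chinese, stop words are usually removed in a separate step before TfidfVectorizer is applied;
# max_df helps drop uninformative high-frequency words that are not in the stop-word list.
vectorizer = TfidfVectorizer(stop_words=["你", "我", "它"], max_df=0.9, use_idf=True, sublinear_tf=True, smooth_idf=True)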
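As a rough illustration of how the weighting behaves, the sketch below computes a smoothed idf by hand, assuming the formula documented for scikit-learn's smooth_idf=True, idf(t) = ln((1 + n) / (1 + df(t))) + 1. The toy corpus is hypothetical, and the sketch leaves out the sublinear tf scaling and the final vector normalization.

```python
import math

# Hypothetical toy corpus, already tokenized.
corpus = [
    ["china", "bee", "farming"],
    ["china", "machine_tool"],
    ["china", "bee"],
]

n_docs = len(corpus)
vocabulary = sorted({w for doc in corpus for w in doc})

# Document frequency: number of documents that contain each word.
df = {w: sum(1 for doc in corpus if w in doc) for w in vocabulary}

# Smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocabulary}

def tfidf(doc):
    """Raw term count times idf for each vocabulary word (no normalization)."""
    return {w: doc.count(w) * idf[w] for w in vocabulary}

weights = tfidf(corpus[0])
for w in vocabulary:
    print(w, df[w], round(idf[w], 3), round(weights[w], 3))
```

The word that occurs in every document ("china") gets the smallest idf, while the words seen in a single document get the largest.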

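Another new problem:

When there are many documents, the number of distinct words in the corpus is very large, while any single document uses far fewer words than the whole vocabulary. The tf-idf vectors therefore have very high dimensionality and most of their entries are 0. The vectors produced by the bag-of-words model are high-dimensional sparse data.

Note: tf-idf neither reduces the dimensionality of the vectors nor reduces the number of zeros in them. It only recomputes each word's importance (previously measured by its raw count) in a more sensible way.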
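The point that re-weighting leaves both the dimensionality and the zero pattern untouched can be checked with a tiny sketch. The count vector and idf weights are hypothetical numbers chosen for illustration, and the dict-based storage only mimics the idea of a sparse row.

```python
# Hypothetical count vector for one document over a 12-word vocabulary.
count_vector = [0, 3, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0]
# Hypothetical per-word idf weights (all positive).
idf = [1.0, 1.2, 2.1, 1.7, 2.5, 1.1, 1.9, 2.2, 1.4, 2.0, 1.3, 1.6]

tfidf_vector = [c * w for c, w in zip(count_vector, idf)]

# Sparse storage keeps only the non-zero entries as {column index: value}.
sparse_counts = {i: c for i, c in enumerate(count_vector) if c != 0}
sparse_tfidf = {i: v for i, v in enumerate(tfidf_vector) if v != 0}

print(len(count_vector), len(tfidf_vector))          # dimensionality is unchanged
print(sorted(sparse_counts), sorted(sparse_tfidf))   # same non-zero positions
```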

Improvement: use distributed word representations such as word2vec or GloVe.

Code (using word2vec as an example):

from gensim.models import Word2Vec
# Input data
raw_sentences = ["the quick brown fox jumps over the lazy dogs", "yoyoyo you go home now to sleep"]
# Split into words, giving a list of token lists
sentences = [s.split() for s in raw_sentences]
# Build the model
model = Word2Vec(sentences, min_count=1)

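Another new problem:

word2vec mostly learns which words share similar contexts, so the learned vectors still fall short of true semantics. For example, "他的技術水平很好" ("his technical skill is very good") and "他的技術水平很差" ("his technical skill is very poor") end up with high similarity.

Improvement: beyond the scope of this post.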

3. Using the bag-of-words model for text classification, following the "20news classification" example from the official scikit-learn documentation

Load the training and test sets

from sklearn import datasets
twenty_train = datasets.load_files("./data/20news-bydate/20news-bydate-train")
twenty_test = datasets.load_files("./data/20news-bydate/20news-bydate-test")

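API notes:

sklearn.datasets.load_files(container_path, description=None, categories=None, load_content=True, shuffle=True, encoding=None, decode_error='strict', random_state=0)

- container_path: path to the container folder;
- load_content=True: whether to load the file contents into memory;
- encoding=None: the text encoding. Text files today are usually encoded as "utf-8". If no encoding is given (encoding=None), the file contents are treated as bytes rather than unicode.

Return value: a Bunch, a dictionary-like object. Its main attributes are:

- data: the raw data;
- filenames: the name of each file;
- target: the class labels (integer indices starting from 0);
- target_names: the meaning of each class label (taken from the subfolder names, category_1_folder and so on)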
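The folder-per-category layout can be mimicked with the standard library alone. The function `load_text_folder` below is a hypothetical stand-in written for illustration, not scikit-learn's implementation; it skips shuffling and returns a plain dict, but it shows how subfolder names become target_names and how target holds integer indices into that list.

```python
import os
import tempfile

def load_text_folder(container_path, encoding="utf-8"):
    """Rough stdlib approximation of the folder-per-category layout:
    each subfolder is a category, each file inside it is one document."""
    target_names = sorted(
        d for d in os.listdir(container_path)
        if os.path.isdir(os.path.join(container_path, d))
    )
    data, target, filenames = [], [], []
    for label, name in enumerate(target_names):
        folder = os.path.join(container_path, name)
        for fname in sorted(os.listdir(folder)):
            path = os.path.join(folder, fname)
            with open(path, encoding=encoding) as f:
                data.append(f.read())
            target.append(label)
            filenames.append(path)
    return {"data": data, "target": target,
            "target_names": target_names, "filenames": filenames}

# Build a tiny hypothetical dataset in a temporary directory.
with tempfile.TemporaryDirectory() as root:
    for category, text in [("sci.space", "rocket launch"), ("rec.autos", "engine oil")]:
        os.makedirs(os.path.join(root, category))
        with open(os.path.join(root, category, "0001.txt"), "w", encoding="utf-8") as f:
            f.write(text)
    bunch = load_text_folder(root)

print(bunch["target_names"])
print(bunch["target"])
```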

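Turn the text files into numeric features (bag-of-words model)

1) Build term-frequency vectors with CountVectorizer

CountVectorizer supports counting N-grams of words or of consecutive characters. It uses a scipy.sparse matrix that stores only the non-zero elements of the feature vectors, which saves memory.

from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer(stop_words="english", decode_error="ignore")  # create the term-count vectorizer
X_train_counts = count_vect.fit_transform(twenty_train.data)  # fit on and transform the training set
X_train_counts.shape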

Result

(2257, 35788)

2) Convert to TF-IDF feature vectors

Use TfidfTransformer to convert the term-frequency vectors to tf-idf form

from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

Result

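(2257, 35788) # 2257 documents, 35788 distinct terms

This shape appears to match the four-category subset used in the official tutorial; loading all 20 categories with load_files as above gives 11314 training documents and a different vocabulary size.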

Train a classifier

Naive Bayes

1 - Step-by-step workflow

from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X_train_tfidf, twenty_train.target)
docs_new = ["God is love", "OpenGL on the GPU is fast"]
X_new_counts = count_vect.transform(docs_new)  # compute term counts
X_new_tfidf = tfidf_transformer.transform(X_new_counts)  # compute tf-idf
y_pred = clf.predict(X_new_tfidf)
for doc, category in zip(docs_new, y_pred):  # category is an integer index
    print("%r => %s" % (doc, twenty_train.target_names[category]))

Result

'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics

2 - Using a pipeline

With a pipeline, the test set does not need to repeat the training set's preprocessing step by step; the pipeline applies it directly.

from sklearn.pipeline import Pipeline
text_clf = Pipeline([
    ("vect", CountVectorizer(stop_words="english", decode_error="ignore")),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)  # training set
text_clf.score(twenty_test.data, twenty_test.target)  # test set

Result

0.81691449814126393 # classification accuracy of naive Bayes
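What a pipeline saves can be sketched with a toy version in plain Python. All three classes below (`MiniPipeline`, `VocabCounter`, `DotProductClassifier`) are hypothetical and written only for illustration; they are not scikit-learn's code. The idea they show is that fit learns each step's state from the training data, while predict reuses that state on new text without refitting.

```python
class VocabCounter:
    """Hypothetical transformer: learns a vocabulary, then maps text to counts."""
    def fit(self, docs):
        self.vocab = sorted({w for d in docs for w in d.lower().split()})
        return self

    def transform(self, docs):
        return [[d.lower().split().count(w) for w in self.vocab] for d in docs]

class DotProductClassifier:
    """Hypothetical classifier: picks the class whose summed training counts
    overlap most with the input vector."""
    def fit(self, X, y):
        self.sums = {}
        for row, label in zip(X, y):
            acc = self.sums.setdefault(label, [0] * len(row))
            for i, v in enumerate(row):
                acc[i] += v
        return self

    def predict(self, X):
        def score(row, label):
            return sum(a * b for a, b in zip(row, self.sums[label]))
        return [max(self.sums, key=lambda lab: score(row, lab)) for row in X]

class MiniPipeline:
    """Toy stand-in for the idea behind a pipeline."""
    def __init__(self, transformers, classifier):
        self.transformers = transformers
        self.classifier = classifier

    def fit(self, X, y):
        for t in self.transformers:
            X = t.fit(X).transform(X)   # learn state from the training data only
        self.classifier.fit(X, y)
        return self

    def predict(self, X):
        for t in self.transformers:
            X = t.transform(X)          # reuse the fitted state, no refitting
        return self.classifier.predict(X)

train_docs = ["god is love", "opengl on the gpu"]
train_labels = ["religion", "graphics"]
pipe = MiniPipeline([VocabCounter()], DotProductClassifier()).fit(train_docs, train_labels)
pred = pipe.predict(["the gpu is fast", "love of god"])
print(pred)
```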

Linear-kernel SVM

For large datasets, use a linear SVM trained with SGDClassifier

from sklearn.linear_model import SGDClassifier
from sklearn import metrics

text_clf_2 = Pipeline([
    ("vect", CountVectorizer(stop_words="english", decode_error="ignore")),  # remove stop words
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier(loss="hinge", penalty="l2", alpha=1e-3, max_iter=5, random_state=42)),
])
text_clf_2.fit(twenty_train.data, twenty_train.target)
# text_clf_2.score(twenty_test.data, twenty_test.target)
y_pred = text_clf_2.predict(twenty_test.data)
metrics.accuracy_score(twenty_test.target, y_pred)

Result

0.82355284121083383 # accuracy improves slightly
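The settings loss="hinge" and penalty="l2" correspond to a linear SVM objective. The sketch below evaluates that objective for the binary case, assuming the form given in the scikit-learn SGD documentation: the mean hinge loss plus alpha times 0.5 * ||w||^2. The data and weights are hypothetical; SGDClassifier itself minimizes this by stochastic gradient descent and handles multiple classes one-versus-rest.

```python
def hinge_loss(y, score):
    """Hinge loss for a label y in {-1, +1} and a decision value `score`."""
    return max(0.0, 1.0 - y * score)

def objective(w, b, X, y, alpha):
    """Mean hinge loss plus alpha * 0.5 * ||w||^2 (the L2 penalty)."""
    scores = [sum(wj * xj for wj, xj in zip(w, x)) + b for x in X]
    data_term = sum(hinge_loss(yi, s) for yi, s in zip(y, scores)) / len(X)
    penalty = 0.5 * sum(wj * wj for wj in w)
    return data_term + alpha * penalty

# Hypothetical two-feature data with labels in {-1, +1}.
X = [[2.0, 0.0], [0.0, 2.0], [0.5, 0.0]]
y = [1, -1, 1]
w, b = [1.0, -1.0], 0.0

value = objective(w, b, X, y, alpha=1e-3)
print(value)
```

Only the third point lies inside the margin, so it alone contributes to the data term; a larger alpha pushes the weights toward zero at the cost of more margin violations.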

Classification report

Precision, recall, F-score and so on for each class

print(metrics.classification_report(twenty_test.target, y_pred, target_names=twenty_test.target_names))  # print is needed so the report renders as a table

Confusion matrix

print(metrics.confusion_matrix(twenty_test.target, y_pred))

Result

[[225 1 0 1 0 1 2 0 2 3 0 2 1 8 6 47 2 6 1 11]
 [ 2 273 20 8 9 28 3 1 4 7 3 9 4 1 9 2 2 3 0 1]
 [ 0 10 304 24 11 11 1 2 1 5 3 8 2 1 7 1 0 1 0 2]
 [ 3 8 32 265 19 4 17 2 3 3 2 3 20 1 5 0 1 2 1 1]
 [ 1 4 8 26 315 2 10 0 1 2 3 1 6 1 1 0 2 0 2 0]
 [ 1 29 41 0 3 303 2 0 1 1 1 2 1 1 7 1 1 0 0 0]
 [ 0 3 0 18 6 0 340 8 1 2 3 1 3 2 2 0 1 0 0 0]
 [ 1 1 1 2 1 0 10 354 7 1 0 0 9 1 3 0 4 0 1 0]
 [ 0 0 0 1 0 0 4 5 385 1 0 0 1 1 0 0 0 0 0 0]
 [ 0 0 0 0 1 0 3 0 0 361 31 0 0 0 0 0 1 0 0 0]
 [ 0 0 0 0 1 0 0 0 0 3 393 0 0 0 0 2 0 0 0 0]
 [ 1 1 1 0 2 0 3 3 0 0 1 380 1 1 0 0 1 0 1 0]
 [ 8 5 9 27 11 4 7 9 6 5 4 26 243 5 13 6 2 1 2 0]
 [ 2 4 0 0 2 2 4 0 3 3 4 1 6 343 2 7 3 4 5 1]
 [ 0 3 0 0 1 0 2 0 0 0 1 1 0 3 380 2 0 0 1 0]
 [ 11 0 2 1 0 0 0 0 1 0 0 0 2 1 4 371 0 0 0 5]
 [ 0 0 0 1 1 0 2 2 1 2 2 5 0 1 3 0 340 1 2 1]
 [ 11 1 0 0 1 4 0 1 0 2 2 1 0 1 1 1 1 348 1 0]
 [ 3 1 0 0 1 1 1 0 1 0 3 4 0 4 7 3 102 5 174 0]
 [ 43 1 1 0 0 0 2 2 0 1 1 1 0 4 6 60 21 6 4 98]]
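The per-class numbers in the classification report can be read straight off a confusion matrix whose rows are true labels and whose columns are predictions. In the matrix above, for instance, the last row sums to 251 with 98 on the diagonal, a recall of about 0.39 for that class. The sketch below does the same arithmetic on a small hypothetical 3-class matrix; the function name `per_class_metrics` is made up for illustration.

```python
def per_class_metrics(cm):
    """Precision, recall and F1 per class from a confusion matrix whose
    rows are true labels and whose columns are predicted labels."""
    n = len(cm)
    results = []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        results.append((precision, recall, f1))
    return results

# Hypothetical 3-class confusion matrix.
cm = [[8, 2, 0],
      [1, 7, 2],
      [0, 1, 9]]

metrics_by_class = per_class_metrics(cm)
accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))

for k, (p, r, f) in enumerate(metrics_by_class):
    print(k, round(p, 3), round(r, 3), round(f, 3))
print(accuracy)
```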

Hyperparameter optimization with grid search

The n-gram range in CountVectorizer()
use_idf in TfidfTransformer()
The regularization coefficient alpha in SGDClassifier()

from sklearn.model_selection import GridSearchCV
parameters = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "tfidf__use_idf": (True, False),
    "clf__alpha": (1e-2, 1e-3),
}
gs_clf = GridSearchCV(text_clf_2, parameters, n_jobs=-1, cv=5)  # text_clf_2: the SVM pipeline; n_jobs=-1 uses all cores; 5-fold cross-validation
grid_result = gs_clf.fit(twenty_train.data, twenty_train.target)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))  # best_params_ is a dict

Result

Best: 0.906399 using {'tfidf__use_idf': True, 'vect__ngram_range': (1, 2), 'clf__alpha': 0.01}
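A grid search simply tries every combination in the grid. The sketch below enumerates the combinations for the grid used here with itertools.product, which makes the cost visible: 8 candidate settings, each fitted once per fold (the final refit on the best setting is not counted). It illustrates the enumeration only and does not train anything.

```python
from itertools import product

param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "tfidf__use_idf": (True, False),
    "clf__alpha": (1e-2, 1e-3),
}

names = list(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*(param_grid[n] for n in names))]

cv = 5
print(len(candidates))        # number of parameter combinations
print(len(candidates) * cv)   # number of model fits during the search
```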

Hyperparameter optimization with random search

from sklearn.model_selection import RandomizedSearchCV
parameters = {
    "vect__ngram_range": [(1, 1), (1, 2), (2, 2)],
    "tfidf__use_idf": (True, False),
    "clf__alpha": (1e-1, 1e-2, 1e-3, 1e-4, 1e-5),
}
rs_clf = RandomizedSearchCV(text_clf_2, parameters, n_jobs=-1, cv=5)
rs_result = rs_clf.fit(twenty_train.data, twenty_train.target)
print("Best: %f using %s" % (rs_result.best_score_, rs_result.best_params_))

Result

Best: 0.924783 using {'clf__alpha': 1e-05, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}
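Random search evaluates only a sample of the grid rather than all of it. The sketch below draws 10 of the 30 possible settings, 10 being the default n_iter of RandomizedSearchCV. The fixed seed and the use of random.sample are choices made for this illustration and are not how scikit-learn draws its candidates internally.

```python
import random
from itertools import product

param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2), (2, 2)],
    "tfidf__use_idf": (True, False),
    "clf__alpha": (1e-1, 1e-2, 1e-3, 1e-4, 1e-5),
}

names = list(param_grid)
all_candidates = [dict(zip(names, values))
                  for values in product(*(param_grid[n] for n in names))]

rng = random.Random(0)   # fixed seed so the sketch is repeatable
n_iter = 10              # default number of sampled settings
sampled = rng.sample(all_candidates, n_iter)

print(len(all_candidates), len(sampled))
```

With a third of the grid evaluated, the search is cheaper than the full grid but may miss the best combination.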

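Grid search with more hyperparameters

use_idf=True in TfidfTransformer has already been found to work better.

Added:

max_df in CountVectorizer: drop words whose document frequency (df) exceeds max_df, given as a proportion or an absolute count
max_features in CountVectorizer: keep the max_features terms with the highest term frequency
norm in TfidfTransformer: normalization of the vectors

penalty in SGDClassifier: the regularization term
max_iter in SGDClassifier (n_iter in older scikit-learn versions): number of iterations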
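The effect of max_df can be sketched in plain Python. The function `filter_by_max_df` and the toy documents are hypothetical; the rule it applies is the one described above, where a float is read as a proportion of documents, an integer as an absolute count, and a word is dropped only when its document frequency is strictly above the threshold.

```python
def filter_by_max_df(tokenized_docs, max_df):
    """Drop words whose document frequency exceeds max_df.
    A float is a proportion of documents; an int is an absolute count."""
    n_docs = len(tokenized_docs)
    limit = max_df * n_docs if isinstance(max_df, float) else max_df
    df = {}
    for doc in tokenized_docs:
        for w in set(doc):               # count each word once per document
            df[w] = df.get(w, 0) + 1
    return sorted(w for w, count in df.items() if count <= limit)

docs = [["the", "bee", "farm"], ["the", "machine", "tool"],
        ["the", "bee", "hive"], ["the", "market"]]

kept = filter_by_max_df(docs, max_df=0.9)
print(kept)
```

Here "the" occurs in all four documents, above the 0.9 * 4 = 3.6 limit, so it is removed even though it is not in any stop-word list.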

Important parameters:

clf__alpha
clf__penalty
vect__max_df
vect__ngram_range

Among the hyperparameters below, those marked with # can be left untuned.

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

text_clf_3 = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier()),
])

parameters = {
    "vect__max_df": (0.5, 0.75, 1.0),
    "vect__max_features": (None, 10000, 50000, 60000),  #
    "vect__ngram_range": [(1, 1), (1, 2), (2, 2)],
    "tfidf__norm": ("l1", "l2"),  #
    "clf__alpha": (0.0001, 0.00001, 0.000001, 0.0000001),
    "clf__penalty": ("l2", "elasticnet"),
    "clf__max_iter": (10, 50, 80),  #
}

gs_clf = GridSearchCV(text_clf_3, parameters, n_jobs=-1, cv=5)  # text_clf_3: the SVM pipeline; n_jobs=-1 uses all cores; 5-fold cross-validation
grid_result = gs_clf.fit(twenty_train.data, twenty_train.target)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

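Result

Best: 0.931766 using {'vect__ngram_range': (1, 2), 'clf__alpha': 1e-07, 'vect__max_df': 0.75, 'vect__max_features': None, 'clf__n_iter': 50, 'clf__penalty': 'elasticnet', 'tfidf__norm': 'l1'}

Conclusion

Accuracy rose from 0.82 before tuning to 0.93 after tuning, which shows that careful tuning pays off. Note that the 0.82 figure is test-set accuracy while 0.93 is the 5-fold cross-validation score on the training set, so the two are not strictly comparable.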
