1. Concept

First, let's look at what stop words are, and then at how to remove English stop words with nltk:

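Some common characters and words occur extremely often: in English, for example, a, the, and he; in Chinese, 我 (I), 它 (it), and 個 (a measure word). Almost every page contains them. If a search engine indexed them as keywords, every site would match and the results would have no discriminating power, so these words are usually dropped and never treated as keywords.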
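To make the idea concrete, here is a minimal sketch that uses a tiny hand-picked stop-word set. The names TOY_STOP_WORDS and index_terms and the two sample pages are made up for illustration; a real stop-word list is much longer.

```python
# Hypothetical toy stop-word set, for illustration only.
TOY_STOP_WORDS = {"a", "the", "he", "is", "on"}

def index_terms(text):
    """Return the lowercased tokens of text that are not stop words."""
    return [tok for tok in text.lower().split() if tok not in TOY_STOP_WORDS]

# Two made-up pages; both contain "the", so "the" cannot tell them apart.
pages = {
    "page1": "The cat is on a mat",
    "page2": "He reads the paper",
}

# After dropping stop words, only the distinguishing terms remain.
terms = {name: index_terms(body) for name, body in pages.items()}
print(terms)
```

Once the stop words are gone, each page is described only by words that separate it from the other page.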

2. Removing English stop words with nltk

First, import stopwords. The code is as follows:

# The corpus must be downloaded once: import nltk; nltk.download('stopwords')
from nltk.corpus import stopwords
words = stopwords.words('english')
print(words)

Here is the printed list of stop words:

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn', 'haven', 'isn', 'ma', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn']

In many tasks (dialogue tasks, for example), the stop words also include the following symbols and suffixes:

['!', ',', '.', '?', '-s', '-ly', '</s>', 's']

The following code adds them:

# self.stopwords is assumed to be a set attribute of the enclosing class
for w in ['!', ',', '.', '?', '-s', '-ly', '</s>', 's']:
    self.stopwords.add(w)

Removing them is then straightforward. If the corpus is in word_list, the following code is all that is needed:

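from nltk.corpus import stopwords
# Build the set once; a local set stands in for self.stopwords here
stop_words = set(stopwords.words('english'))
for w in ['!', ',', '.', '?', '-s', '-ly', '</s>', 's']:
    stop_words.add(w)
# Filter against the extended set so the added symbols are removed too
filtered_words = [word for word in word_list if word not in stop_words]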
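The snippets above refer to self.stopwords without showing the class around it. One way the pieces might fit together is sketched below. StopwordFilter, EXTRA, and the filter method are hypothetical names, not part of nltk, and a short stand-in list replaces stopwords.words('english') so the sketch runs without the corpus. Lowercasing each token before the lookup is an addition to the original code; it is there because the nltk English list is all lowercase.

```python
class StopwordFilter:
    """Hypothetical helper that holds an extended stop-word set."""

    # Extra symbols and suffixes mentioned in the text above.
    EXTRA = ['!', ',', '.', '?', '-s', '-ly', '</s>', 's']

    def __init__(self, base_words):
        # base_words would typically be stopwords.words('english').
        self.stopwords = set(base_words)
        for w in self.EXTRA:
            self.stopwords.add(w)

    def filter(self, word_list):
        # Lowercase each token for the lookup so that "The" matches "the".
        return [w for w in word_list if w.lower() not in self.stopwords]


# Stand-in base list (an assumption) instead of the real nltk corpus.
flt = StopwordFilter(['i', 'the', 'a', 'is'])
result = flt.filter(['The', 'cat', 'is', 'happy', '!', '</s>'])
print(result)
```

Keeping the stop words in a set makes each membership check a constant-time lookup, whereas calling stopwords.words('english') inside the list comprehension rebuilds and scans a list for every word.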
