What is the K-Means Algorithm?

K-Means is the most classic partition-based clustering method. Its central idea is to cluster the data around k points in the feature space, iteratively updating each cluster center until the best clustering is reached:

  • Arbitrarily choose k of the n data objects as the initial cluster centers; assign each remaining object to the cluster whose center it is most similar to.
  • Recompute the center of every new cluster.
  • Repeat this process until the clustering criterion function converges (a minimal sketch follows below).
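Here is a minimal NumPy sketch of that loop (a toy version for exposition, not the scikit-learn implementation used below; the function name and defaults are made up, and it assumes no cluster goes empty):

import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    rng = np.random.RandomState(seed)
    # step 1: pick k of the n points as the initial cluster centers
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(n_iter):
        # assign every point to its nearest center (Euclidean distance)
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # step 2: recompute each center as the mean of its cluster
        new_centers = np.array([points[labels == i].mean(axis=0)
                                for i in range(k)])
        # step 3: stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers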

1. Data Collection and Cleaning

The Alexa top 1000 domains are used as white samples, labeled 0; because the minimum domain length is set to 10 characters, only 679 samples are kept.

dga-cryptolocker (1000 samples) and dga-post-tovar-goz (1000 samples) are used as black samples, both labeled 1 (the complete code below clusters into two classes), for a total of 2679 samples.

x1_domain_list = load_alexa("../data/top-1000.csv")
x2_domain_list = load_dga("../data/dga-cryptolocke-1000.txt")
x3_domain_list = load_dga("../data/dga-post-tovar-goz-1000.txt")

x_domain_list = np.concatenate((x1_domain_list, x2_domain_list, x3_domain_list))

y1 = [0] * len(x1_domain_list)
y2 = [1] * len(x2_domain_list)
y3 = [1] * len(x3_domain_list)
y = np.concatenate((y1, y2, y3))

2. Feature Extraction

Process the DGA domains as character-level 2-grams: split each domain into single characters, take the 2-grams over the whole dataset as the vocabulary, and map each domain onto it to obtain feature vectors:

cv = CountVectorizer(ngram_range=(2, 2), decode_error="ignore",
                     token_pattern=r"\w", min_df=1)
x = cv.fit_transform(x_domain_list).toarray()

The feature names look like this:

[u'0 0', u'0 1', u'0 2', u'0 3', u'0 4', u'0 5', u'0 6', u'0 7', u'0 8', u'0 9', ..., u'z u', u'z v', u'z w', u'z x', u'z y', u'z z']
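To make the featurization concrete, here is a tiny demo (the example domain is made up, and get_feature_names_out assumes a recent scikit-learn) showing how this vectorizer turns one domain into character 2-grams:

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(ngram_range=(2, 2), decode_error="ignore",
                     token_pattern=r"\w", min_df=1)
demo = cv.fit_transform(["google.com"]).toarray()
# the vocabulary is every adjacent character pair (the dot is not matched by \w)
print(cv.get_feature_names_out())
# ['c o' 'e c' 'g l' 'g o' 'l e' 'o g' 'o m' 'o o']
print(demo)  # one count per 2-gram: [[1 1 1 1 1 1 1 1]]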

3. Training

Instantiate the K-Means algorithm:

model = KMeans(n_clusters=2, random_state=random_state)
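Here n_clusters=2 matches the white/black split. When the number of clusters is not known in advance, a common heuristic (not part of the original text) is the elbow method: fit K-Means for a range of k and look for the bend in the inertia curve. A sketch, assuming x and random_state are defined as above:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# within-cluster sum of squares for k = 1..9
inertias = [KMeans(n_clusters=k, random_state=random_state).fit(x).inertia_
            for k in range(1, 10)]
plt.plot(range(1, 10), inertias, marker="o")
plt.xlabel("k")
plt.ylabel("inertia")
plt.show()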

4. Validation

Compare the clustering result with the ground-truth labels:

y_pred = model.fit_predict(x)
print(y_pred)
print(y)
print(np.mean(y_pred == y) * 100)

The accuracy is 78%, which is not great.

[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 1 1 0 0 0 0 0 0 0 1 0
0 1 0 0 0 0 1 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0
0 0 0 0 0 1 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 1 1 1 1 1 1 0 1 1 1
1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 0 1 0 1 1 1 1 1 1 0 1 1 1 1 0
1 0 0 1 1 1 0 1 1 1 0 1 1 1 0 1 1 1 0 1 1 1 0 1 1 1 0 1 1 1 0 1 1 1 0 1 1
1 0 1 1 1 0 1 1 1 0 1 1 1 0 1]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
78.0
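One caveat: np.mean(y_pred == y) only makes sense if cluster 0 happens to line up with label 0; K-Means cluster IDs are arbitrary, so the same clustering with flipped IDs would score 22%. A permutation-invariant metric such as the adjusted Rand index avoids this. A sketch, assuming y and y_pred from above:

import numpy as np
from sklearn.metrics import adjusted_rand_score

# invariant to relabeling of the clusters; 1.0 means a perfect match
print(adjusted_rand_score(y, y_pred))
# with two clusters, score the better of the two possible labelings
print(max(np.mean(y_pred == y), np.mean((1 - y_pred) == y)) * 100)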

5. Complete Code

import csv
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

# minimum length of a domain to keep
MIN_LEN = 10

# fixed random seed for reproducibility
random_state = 170

def load_alexa(filename):
    domain_list = []
    csv_reader = csv.reader(open(filename))
    for row in csv_reader:
        domain = row[1]
        if len(domain) >= MIN_LEN:
            domain_list.append(domain)
    return domain_list

def load_dga(filename):
    domain_list = []
    # xsxqeadsbgvpdke.co.uk,Domain used by Cryptolocker - Flashback DGA for 13 Apr 2017,2017-04-13,
    # http://osint.bambenekconsulting.com/manual/cl.txt
    with open(filename) as f:
        for line in f:
            domain = line.split(",")[0]
            if len(domain) >= MIN_LEN:
                domain_list.append(domain)
    return domain_list

def kmeans_dga():
    x1_domain_list = load_alexa("../data/dga/top-100.csv")
    x2_domain_list = load_dga("../data/dga/dga-cryptolocke-50.txt")
    x3_domain_list = load_dga("../data/dga/dga-post-tovar-goz-50.txt")

    x_domain_list = np.concatenate((x1_domain_list, x2_domain_list, x3_domain_list))

    y1 = [0] * len(x1_domain_list)
    y2 = [1] * len(x2_domain_list)
    y3 = [1] * len(x3_domain_list)
    y = np.concatenate((y1, y2, y3))

    # character 2-gram features over the whole corpus
    cv = CountVectorizer(ngram_range=(2, 2), decode_error="ignore",
                         token_pattern=r"\w", min_df=1)
    x = cv.fit_transform(x_domain_list).toarray()
    model = KMeans(n_clusters=2, random_state=random_state)

    y_pred = model.fit_predict(x)
    print(y_pred)
    print(y)
    print(np.mean(y_pred == y) * 100)

if __name__ == "__main__":
    kmeans_dga()
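To inspect the clusters visually, one option (not in the original text; the helper name is made up, and it assumes kmeans_dga is modified to return x and y_pred) is to project the 2-gram features to 2-D with t-SNE and color each point by its cluster:

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def show_tsne(x, y_pred):
    # project the high-dimensional 2-gram features to 2-D
    x_2d = TSNE(n_components=2, random_state=random_state).fit_transform(x)
    plt.scatter(x_2d[:, 0], x_2d[:, 1], c=y_pred)
    plt.show()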
