Three Clustering Methods in R

Summary: notes on hierarchical clustering, kmeans, and DBSCAN

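1. Distances and similarity coefficients

In R, dist(x, method = "euclidean", diag = FALSE, upper = FALSE, p = 2) computes distances. x is the sample matrix or data frame, and method selects which distance to compute. The possible values of method are: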

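euclidean: Euclidean distance, the square root of the sum of squared differences.

maximum: Chebyshev distance

manhattan: absolute-value (city-block) distance

canberra: Lance (Canberra) distance

minkowski: Minkowski distance; the value of p must be specified when it is used

binary: distance for qualitative (0/1) variables.

Distance for qualitative variables: among the m items, let m0 be the number of 0:0 pairs, m1 the number of 1:1 pairs, and m2 the number of mismatched pairs. The distance is m2/(m1+m2); the 0:0 pairs are ignored.

When diag is TRUE the diagonal of the distance matrix is printed. When upper is TRUE the upper triangle is printed.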
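As a quick check of the method values listed above, here is a minimal sketch on two made-up points. The matrices x and b are illustrative assumptions, not data from these notes.

```r
# Two hypothetical points in the plane; each row is one sample.
x <- rbind(a = c(0, 0), b = c(3, 4))

d_euc  <- dist(x, method = "euclidean")          # sqrt(3^2 + 4^2) = 5
d_max  <- dist(x, method = "maximum")            # max(|3|, |4|)   = 4
d_man  <- dist(x, method = "manhattan")          # |3| + |4|       = 7
d_mink <- dist(x, method = "minkowski", p = 3)   # (3^3 + 4^3)^(1/3)

# Binary distance: b1 and b2 share one 1:1 pair (m1 = 1) and disagree in
# two positions (m2 = 2). The 0:0 position is ignored, so the distance is 2/3.
b <- rbind(b1 = c(1, 1, 0, 0), b2 = c(1, 0, 1, 0))
d_bin <- dist(b, method = "binary")

# diag and upper only change how the result is printed.
print(dist(x, diag = TRUE, upper = TRUE))
```

The object returned by dist stores only the lower triangle, which is why diag and upper affect printing rather than the stored values.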

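In R, scale(x, center = TRUE, scale = TRUE) centers and standardizes a data matrix.

To center only, use scale(x, scale = FALSE).

In R, sweep(x, MARGIN, STATS, FUN = "-", ...) applies an operation across a matrix. MARGIN = 1 operates along rows and MARGIN = 2 along columns. STATS holds the values used in the operation, and FUN is the operation itself, subtraction by default. Below, sweep is used to apply a range standardization to the matrix x: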

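> center <- sweep(x, 2, apply(x, 2, mean))  # subtract the column means
> R <- apply(x, 2, max) - apply(x, 2, min)  # range of each column: maximum minus minimum
> x_star <- sweep(center, 2, R, "/")        # divide the centered matrix column-wise by the range vector

Range normalization subtracts the column minimum instead of the mean, so every column ends up in [0, 1]:

> center <- sweep(x, 2, apply(x, 2, min))
> R <- apply(x, 2, max) - apply(x, 2, min)
> x_star <- sweep(center, 2, R, "/")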
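The transformations above can be checked on a small matrix. This is a sketch only; the matrix x and the names z, col_min, col_range and x01 are assumptions made for illustration.

```r
# Hypothetical 4 x 2 data matrix (filled column by column).
x <- matrix(c(1, 2, 3, 4,
              10, 20, 30, 50), ncol = 2)

# Standardization: each column gets mean 0 and standard deviation 1.
z <- scale(x, center = TRUE, scale = TRUE)

# Range normalization with sweep: subtract the column minimum,
# then divide by the column range.
col_min   <- apply(x, 2, min)
col_range <- apply(x, 2, max) - col_min
x01 <- sweep(sweep(x, 2, col_min), 2, col_range, "/")

print(round(z, 3))
print(x01)
```

After range normalization every column has minimum 0 and maximum 1, which keeps a variable measured on a large scale from dominating the distance.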

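Sometimes we want to classify variables rather than samples. In that case we compute similarity coefficients between variables instead of distances. The two most common are the cosine of the angle between variables and the correlation coefficient.

Computing the cosine of the angle between column vectors in R:

> y <- scale(x, center = FALSE, scale = TRUE) / sqrt(nrow(x) - 1)
> C <- t(y) %*% y

Correlation coefficients are computed with the cor function.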
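The cosine recipe works because scale with center = FALSE divides each column by its root mean square, sqrt(sum(x^2)/(n-1)), so the extra division by sqrt(n-1) leaves every column with unit length. A minimal sketch on made-up variables u, v and w (assumed names and values):

```r
# Three hypothetical variables measured on three samples.
x <- cbind(u = c(1, 2, 3), v = c(2, 4, 6), w = c(3, -1, 2))

# After this step every column of y has Euclidean length 1.
y <- scale(x, center = FALSE, scale = TRUE) / sqrt(nrow(x) - 1)
C <- t(y) %*% y          # matrix of cosines between the columns

# Direct computation for one pair, as a cross-check.
cos_uw <- sum(x[, "u"] * x[, "w"]) /
  sqrt(sum(x[, "u"]^2) * sum(x[, "w"]^2))

R <- cor(x)              # correlation matrix of the same variables

print(round(C, 3))
print(round(R, 3))
```

Here u and v are parallel, so their cosine is 1, while u and w give 7/14 = 0.5. The correlation matrix differs from C because cor centers each variable first.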

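2. Hierarchical clustering

Hierarchical clustering first computes the distances between samples and merges the closest points into one cluster. It then computes the distances between clusters and merges the closest clusters into a larger one, repeating until everything has been merged into a single cluster. The distance between clusters can be defined in several ways: the shortest-distance method, the longest-distance method, the median method, the group-average method, and so on. With the shortest-distance method, for example, the distance between two clusters is the smallest distance between a sample in one and a sample in the other.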

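In R, hclust(d, method = "complete", members = NULL) performs hierarchical clustering.

Here d is the distance matrix (a dist object).

method is the rule for merging clusters. The options are:

single: shortest-distance method (single linkage)

complete: longest-distance method (complete linkage)

median: median-distance method

mcquitty: McQuitty's similarity method

average: group-average method

centroid: centroid method

ward: sum-of-squared-deviations (Ward) method. Since R 3.1.0 this option is named ward.D, and a second variant, ward.D2, is also available.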
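To see how the merging rule changes the result, the sketch below runs two of the methods on the same one-dimensional points. The data in pts are made up and chosen so that no merge distances tie.

```r
# Five hypothetical one-dimensional samples.
pts <- matrix(c(0, 1, 3, 8, 20), ncol = 1)
d <- dist(pts)

hc_single   <- hclust(d, method = "single")
hc_complete <- hclust(d, method = "complete")

# Heights at which successive merges happen.
print(hc_single$height)    # 1 2 5 12: gaps between neighbouring points
print(hc_complete$height)  # 1 3 8 20: largest distance inside each merged cluster
```

Both methods merge the points in the same order here, but the longest-distance method reports larger merge heights because it measures clusters by their farthest members.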

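> x <- c(1, 2, 6, 8, 11)  # a quick trial
> dim(x) <- c(5, 1)
> d <- dist(x)
> hc1 <- hclust(d, "single")
> plot(hc1)
> plot(hc1, hang = -1)  # with hang < 0 all labels are drawn hanging down from the bottom of the tree
> plot(as.dendrogram(hc1), type = "triangle")
# type = c("rectangle", "triangle") belongs to the plot method for dendrogram objects. The default tree is drawn with rectangular branches; the alternative is triangular.
# horiz = TRUE (also a dendrogram plot argument) draws the tree horizontally; FALSE, the default, draws it vertically.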

> z <- scan()
1: 1.000 0.846 0.805 0.859 0.473 0.398 0.301 0.382
9: 0.846 1.000 0.881 0.826 0.376 0.326 0.277 0.415
17: 0.805 0.881 1.000 0.801 0.380 0.319 0.237 0.345
25: 0.859 0.826 0.801 1.000 0.436 0.329 0.327 0.365
33: 0.473 0.376 0.380 0.436 1.000 0.762 0.730 0.629
41: 0.398 0.326 0.319 0.329 0.762 1.000 0.583 0.577
49: 0.301 0.277 0.237 0.327 0.730 0.583 1.000 0.539
57: 0.382 0.415 0.345 0.365 0.629 0.577 0.539 1.000
65:
Read 64 items
> names
[1] "shengao" "shoubi" "shangzhi" "xiazhi" "tizhong"
[6] "jingwei" "xiongwei" "xiongkuang"
> r <- matrix(z, nrow = 8, dimnames = list(names, names))
> d <- as.dist(1-r)
> hc <- hclust(d)
> plot(hc)

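Next, rect.hclust(tree, k = NULL, which = NULL, x = NULL, h = NULL, border = 2, cluster = NULL) can be used to mark the clusters on the dendrogram and settle on their number. tree is the object returned by hclust, k is the number of clusters, and h is the threshold on the distance between clusters. border is the color of the rectangles drawn around each cluster.

> plot(hc)
> rect.hclust(hc, k = 2)
> rect.hclust(hc, h = 0.5)

The cutree function extracts the cluster that each sample belongs to:

> result <- cutree(hc, k = 3)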
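cutree accepts either a number of clusters k or a cut height h, mirroring the k and h arguments of rect.hclust. A minimal sketch on made-up points (pts, by_k and by_h are illustrative names):

```r
# Five hypothetical one-dimensional samples, clustered by single linkage.
pts <- matrix(c(0, 1, 3, 8, 20), ncol = 1)
hc <- hclust(dist(pts), method = "single")   # merges at heights 1, 2, 5, 12

by_k <- cutree(hc, k = 2)   # ask for two clusters: {0, 1, 3, 8} and {20}
by_h <- cutree(hc, h = 4)   # cut at height 4: {0, 1, 3}, {8}, {20}

print(by_k)
print(by_h)
print(table(by_k))          # cluster sizes
```

Cutting by height is convenient when a distance threshold has a natural meaning, while cutting by k is convenient when the number of groups is known in advance.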

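3. Dynamic clustering: kmeans

In hierarchical clustering a merge is never revised once it has been made. It also uses more memory on larger data sets, because it works from the full distance matrix.

Dynamic clustering starts from a few chosen points and gathers the surrounding points around them. It then computes the centroid (mean) of each cluster, uses those centroids as the new reference points, and repeats until the assignment converges. In R this is mainly done with kmeans(x, centers, iter.max = 10, nstart = 1, algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen")). centers is either the number of clusters or a set of initial cluster centers. iter.max is the maximum number of iterations. When centers is a number, nstart is how many random sets of initial centers are tried. algorithm selects the algorithm; the default is the first one.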
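The assign-then-recompute loop described above can be sketched in a few lines of base R. This is a bare-bones Lloyd-style illustration under stated assumptions, not what kmeans runs by default (its default is Hartigan-Wong). The function simple_kmeans and the toy data are hypothetical.

```r
# Illustrative Lloyd-style iteration. `centers` is a matrix of initial centers.
simple_kmeans <- function(x, centers, iter_max = 10) {
  x <- as.matrix(x)
  k <- nrow(centers)
  cluster <- integer(nrow(x))
  for (iter in seq_len(iter_max)) {
    # Squared distance from every point to every center (n x k matrix).
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_cluster <- max.col(-d2, ties.method = "first")  # nearest center per point
    if (all(new_cluster == cluster)) break              # assignments stopped changing
    cluster <- new_cluster
    # Move each center to the mean of the points assigned to it.
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = centers)
}

# Two well-separated toy groups of three points each.
x <- rbind(cbind(c(0, 0.1, 0.2), c(0, 0.1, 0)),
           cbind(c(5, 5.1, 5.2), c(5, 5.1, 5)))
res <- simple_kmeans(x, centers = x[c(1, 4), ])
print(res$cluster)
print(res$centers)
```

Starting from one point in each group, the loop settles after the second pass, with each center sitting at the mean of its own group.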

> newiris <- iris
> model <- kmeans(scale(newiris[1:4]), 3)
> model
K-means clustering with 3 clusters of sizes 50, 47, 53

Cluster means:
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1  -1.01119138  0.85041372   -1.3006301  -1.2507035
2   1.13217737  0.08812645    0.9928284   1.0141287
3  -0.05005221 -0.88042696    0.3465767   0.2805873

Clustering vector:
  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 3 3 3 2 3 3 3 3 3 3 3 3 2 3 3 3 3 2 3 3 3
 [75] 3 2 2 2 3 3 3 3 3 3 3 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2 2 3 2 2 2 2
[112] 2 2 3 3 2 2 2 2 3 2 3 2 3 2 2 3 2 2 2 2 2 2 3 3 2 2 2 3 2 2 2 3 2 2 2 3 2
[149] 2 3

Within cluster sum of squares by cluster:
[1] 47.35062 47.45019 44.08754
 (between_SS / total_SS =  76.7 %)

Available components:
[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"
> table(iris$Species, model$cluster)  # compare with the true species
              1  2  3
  setosa     50  0  0
  versicolor  0 11 39
  virginica   0 36 14
> plot(newiris[c("Sepal.Length", "Sepal.Width")], col = model$cluster)  # plot the clusters
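Because kmeans starts from random centers, repeated runs can give different cluster labels and occasionally a different partition. The sketch below fixes the seed and uses nstart to keep the best of several random starts. The seed value and nstart = 25 are arbitrary choices made for illustration.

```r
set.seed(42)                       # arbitrary seed, only for reproducibility
iris_scaled <- scale(iris[1:4])

# Try 25 random sets of initial centers and keep the best solution.
fit <- kmeans(iris_scaled, centers = 3, nstart = 25)

print(fit$size)                    # cluster sizes
print(fit$tot.withinss)            # total within-cluster sum of squares
print(table(iris$Species, fit$cluster))

# The total sum of squares splits into within and between parts.
print(fit$tot.withinss + fit$betweenss)
print(fit$totss)
```

The cluster numbers themselves carry no meaning, so a cross-table against the known species is the easiest way to read the result.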

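4. DBSCAN

Dynamic clustering tends to produce clusters that are roughly circular or elliptical. A density-based scanning algorithm can get around this. The idea is to fix a distance radius and a minimum number of points, then link together all the points that can be reached from one another and treat them as one cluster. The implementation in R:

dbscan(data, eps, MinPts, scale, method, seeds, showplot, countmode)

Here eps is the distance radius and MinPts is the minimum number of points. scale controls whether the data are scaled first. method takes one of three values, raw, dist, or hybrid, which mean respectively: the data are raw data and no distance matrix is computed; the data are already a distance matrix; the data are raw data but parts of the distance matrix are computed. showplot controls plotting: 0 draws nothing, while 1 and 2 both draw. countmode can be given a vector and is used to report progress. Let's try it on the iris data: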

> install.packages("fpc", dependencies = TRUE)
> library(fpc)
> newiris <- iris[1:4]
> model <- dbscan(newiris, 1.5, 5, scale = TRUE, showplot = TRUE, method = "raw")
> # the plot is clearly wrong, so make the radius a bit smaller
> model <- dbscan(newiris, 0.5, 5, scale = TRUE, showplot = TRUE, method = "raw")
> model  # still not ideal
dbscan Pts=150 MinPts=5 eps=0.5
        0  1  2
border 34  5 18
seed    0 40 53
total  34 45 71
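To make the "radius plus minimum number of points" idea from the start of this section concrete, here is a minimal base-R sketch of the density-based procedure. It is an illustration under stated assumptions, not the fpc implementation: the function simple_dbscan is hypothetical, each neighbourhood is assumed to include the point itself, and the toy data are made up.

```r
# Illustrative DBSCAN: returns a cluster number per point, 0 meaning noise.
simple_dbscan <- function(x, eps, min_pts) {
  d <- as.matrix(dist(x))
  n <- nrow(d)
  # Indices within eps of each point (the point itself is included).
  neighbors <- lapply(seq_len(n), function(i) which(d[i, ] <= eps))
  is_core <- lengths(neighbors) >= min_pts
  cluster <- integer(n)              # 0 = noise or not yet assigned
  current <- 0L
  for (i in seq_len(n)) {
    if (cluster[i] != 0L || !is_core[i]) next
    current <- current + 1L          # start a new cluster at an unassigned core point
    cluster[i] <- current
    queue <- neighbors[[i]]
    while (length(queue) > 0) {
      j <- queue[1]
      queue <- queue[-1]
      if (cluster[j] != 0L) next     # already claimed by some cluster
      cluster[j] <- current
      # Only core points pass the cluster on to their own neighbours.
      if (is_core[j]) queue <- c(queue, neighbors[[j]])
    }
  }
  cluster
}

# Two line-shaped groups plus one isolated point.
x <- rbind(cbind(seq(0, 1, by = 0.1), 0),
           cbind(seq(0, 1, by = 0.1), 5),
           c(10, 10))
res <- simple_dbscan(x, eps = 0.15, min_pts = 3)
print(res)
```

The two elongated groups come out as clusters 1 and 2 and the isolated point stays at 0. The endpoints of each line are not core points, since they have only one neighbour besides themselves, but they are still attached as border points.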