很久之前就用kmeans演算法做文本聚類,最近挖掘數據過程中用到了dbscan聚類方法,發現這是兩種差異很大的聚類方法。 這裡簡單介紹下dbscan。
Density-based spatial clustering of applications with noise (DBSCAN) 是一種基於密度的聚類演算法。就像名字中提到的一樣,可以有效的剔除雜訊點(離群點)。 聚類演算法多種策略演算法,例如Hierarchical methods(例如BIRCH),Partition-based methods(例如kmeans), Density-based methods(例如dbscan),Grid-based methods(例如STING),Model-based methods(如GMM)。
考慮空間總有很多待聚類的點。 dbscan的基本概念是core point,即核心點,是指周圍一定距離類鄰居點達到一定數量的點。假設有數據
例1: 下圖中, 所有紅點、B、C屬於一個cluster
例二:下圖中,綠色箭頭相連的紅點即其鄰域內的點屬於一個cluster;圖中共兩個cluster。
package dbscan
const ( NOISE = false CLUSTERED = true )
type Clusterable interface { Distance(c interface{}) float64 GetID() string }
type Cluster []Clusterable
func Clusterize(objects []Clusterable, minPts int, eps float64) []Cluster { clusters := make([]Cluster, 0) visited := map[string]bool{} var cls_label = make([]int, len(objects)) for index, _ := range objects { cls_label[index] = -1 //未被聚類 } cls_id := 1 for index, _ := range objects { if cls_label[index] == -1 { cluster := make(Cluster, 0) //以該核心點創建一個新簇 cluster = expandCluster(index, cluster, cls_label, cls_i d, objects, visited, minPts, eps) if len(cluster) >= minPts { clusters = append(clusters, cluster) cls_id += 1 } else { cls_label[index] = -2 //離群點 } } } return clusters }
func findNeighbours(pointID int, points []Clusterable, eps float64) []int { neighbours_ids := make([]int, 0) for index, potNeigb := range points { if index != pointID && potNeigb.Distance(points[pointID]) <= eps { neighbours_ids = append(neighbours_ids, index) } } return neighbours_ids }
func expandCluster(pointID int, cluster Cluster, cls_label []int, cluster_id int, object []Clusterable, visited map[string]bool, minPts int, eps float64) Cluster { seeds := findNeighbours(pointID, object, eps) if len(seeds)+1 < minPts { cls_label[pointID] = -2 //雜訊 return cluster } else { cls_label[pointID] = cluster_id cluster = append(cluster, object[pointID]) for _, seed_id := range seeds { cluster = append(cluster, object[seed_id]) cls_label[seed_id] = cluster_id } for len(seeds) > 0 { cur_id := seeds[0] nei_nei := findNeighbours(cur_id, object, eps) if len(nei_nei)+1 > minPts { for index, _ := range nei_nei { nei_nei_id := nei_nei[index] if cls_label[nei_nei_id] == -1 || cls_label[nei_nei_id] == -2 { if cls_label[nei_nei_id] == -1 { seeds = append(seeds, nei_nei_id) } cls_label[nei_nei_id] = cluster_id cluster = append(cluster, object[nei_nei_id]) } } } seeds = seeds[1:] } } return cluster }
func merge(one []Clusterable, two []Clusterable) []Clusterable { mergeMap := make(map[string]Clusterable) putAll(mergeMap, one) putAll(mergeMap, two) merged := make([]Clusterable, 0) for _, val := range mergeMap { merged = append(merged, val) } return merged }
func putAll(m map[string]Clusterable, list []Clusterable) { for _, val := range list { m[val.GetID()] = val } }
(1)對於一個點可能作為多個cluster的edge。 上面給出的實現代碼會將給點都放進cluster,但是cls_label(記錄數據所屬cluster的編號)會被更新,即該點會再cls_label中被指定為新的cluster。 這種情況,可能有多種實現。
(2) 使用時需要指定 和minPt來表示密度,可以根據業務多次試驗; 不需要像kmeans一樣指定K的大小。
參考:
DBSCAN - Wikipedia
choffstein/dbscan
sohlich/go-dbscan