Cluster Analysis. 2014.12.30, Week 19. Data Mining and Analysis (31 slides in total)

Cluster Validity

- How can the results of a cluster analysis be validated and evaluated, that is, how do we judge the "goodness" of the resulting clusters?
- But "clusters are in the eye of the beholder"!
- Why evaluate clustering results?
- To avoid finding patterns that are produced by noise.
- To compare different clustering algorithms.
- To compare different sets of clusters.
- To compare individual clusters.

Clusters Found in Random Data

[Figure: four panels titled Random Points, K-means, DBSCAN, Complete Link]

Important Issues for Cluster Validation

1. Determine the clustering tendency of a data set, i.e. whether non-random structure actually exists in the data.
2. Determine the correct number of clusters.
3. Evaluate how well the results of a cluster analysis fit the data, using only the data.
4. Compare the results of a cluster analysis with externally known results (e.g. externally supplied class labels).
5. Compare different clustering methods to determine which is better.

Items 1, 2 and 3 are unsupervised. For items 3, 4 and 5, a further distinction is whether the whole clustering or an individual cluster is being evaluated.

Measures of Cluster Validity

There are three kinds of measures:

- External index (supervised): used to measure the extent to which cluster labels match externally supplied class labels. Example: entropy.
- Internal index (unsupervised): used to measure the goodness of a clustering structure without respect to external information. Example: sum of squared error (SSE).
- Relative index: used to compare two different clusterings or clusters. Often an external or internal index is used for this function, e.g. SSE or entropy.

Sometimes these are referred to as criteria instead of indices. However, sometimes "criterion" is the general strategy and "index" is the numerical measure that implements the criterion.

Unsupervised Cluster Evaluation: Via Correlation

- Two matrices are used: the proximity matrix and the ideal proximity matrix, called the "incidence" matrix.
- The incidence matrix has one row and one column for each data point. An entry is 1 if the associated pair of points belongs to the same cluster and 0 otherwise.
- Compute the correlation between the two matrices. Since the matrices are symmetric, only the correlation between $n(n-1)/2$ entries needs to be calculated.
- High correlation indicates that points in the same cluster are close to each other.
- Not a good measure for some density-based or contiguity-based clusters.

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets: Corr = -0.9235 and Corr = -0.5810.

[Figure: the two data sets]

Unsupervised Cluster Evaluation: Via the Similarity Matrix

- Order the similarity matrix with respect to cluster labels and inspect it visually.
- Clusters in random data are not so crisp.
- Clusters in more complicated figures aren't well separated.

[Figures: similarity matrices sorted by cluster label for DBSCAN, K-means and Complete Link on random data, and for DBSCAN on a more complicated data set]

Internal Measures: SSE

- An internal index needs no external information: it measures the goodness of a clustering structure without respect to external information. SSE is one such index.
- SSE is suitable for comparing several clusterings or several clusters (average SSE).
- It can also be used to estimate the number of clusters.

[Figures: SSE of clusters found using K-means; SSE curve for a more complicated data set]

Framework for Cluster Validity

- A framework is needed to interpret any measure. For example, if our measure of evaluation has the value 10, is that good, fair, or poor?
- Statistics provide such a framework: the more atypical a clustering result is, the more likely it represents valid structure in the data.
- Compare the value of an index on random data with its value on the clustered data. If the value of the index is unlikely under random data, then the cluster results are valid.
- These approaches are more complicated and harder to understand.
- When two different sets of clusters are compared, a framework is less necessary. However, there is the question of whether the difference between two index values is significant.

Statistical Framework for SSE

- Example: compare an SSE of 0.005 against the SSE of three clusters in random data.
- The histogram shows the SSE of three clusters in 500 sets of random data points of size 100, distributed over the range 0.2 to 0.8 for x and y values.

[Figure: histogram of SSE values]

Statistical Framework for Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets: Corr = -0.9235 and Corr = -0.5810.

Internal Measures: Cohesion and Separation

- Cluster cohesion measures how closely related the objects in a cluster are. Example: SSE.
- Cluster separation measures how distinct or well separated a cluster is from other clusters. Example: squared error.
- Cohesion is measured by the within-cluster sum of squares (SSE): $WSS = \sum_i \sum_{x \in C_i} (x - m_i)^2$, where $m_i$ is the centroid of cluster $C_i$.
- Separation is measured by the between-cluster sum of squares: $BSS = \sum_i |C_i| (m - m_i)^2$, where $|C_i|$ is the size of cluster $i$ and $m$ is the overall mean.
- Example (SSE): BSS + WSS = constant.

[Figure: points on a number line marked 1 to 5 with cluster centroids $m_1$, $m_2$ and overall mean $m$, evaluated for K = 1 cluster and K = 2 clusters]

A proximity-graph-based approach can also be used for cohesion and separation:

- Cluster cohesion is the sum of the weights of all links within a cluster.
- Cluster separation is the sum of the weights of the links between nodes in the cluster and nodes outside the cluster.

[Figure: cohesion and separation in a proximity graph]

Internal Measures: Silhouette Coefficient

- The silhouette coefficient combines the ideas of both cohesion and separation, but for individual points as well as for clusters and clusterings.
- For an individual point $i$:
- Calculate $a$ = average distance of $i$ to the points in its own cluster.
- Calculate $b$ = min (average distance of $i$ to the points in another cluster).
- The silhouette coefficient for the point is then $s = 1 - a/b$ if $a < b$ (or $s = b/a - 1$ if $a \ge b$, which is not the usual case).
- It is typically between 0 and 1. The closer to 1 the better.
- The average silhouette width can be calculated for a cluster or for a clustering.

External Measures of Cluster Validity: Entropy and Purity

Determining the Correct Number of Clusters

Final Comment on Cluster Validity

"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."

Algorithms for Clustering Data, Jain and Dubes

K-means Clustering in R

    newiris <- iris
    newiris$Species <- NULL            # drop the class label from the training data
    kc <- kmeans(newiris, 3)           # fit the clustering model
    fitted(kc)                         # inspect the cluster centre assigned to each row
    table(iris$Species, kc$cluster)    # cross-tabulate true species against clusters

    # Visualise the clustering: colour shows the cluster found by k-means,
    # plotting symbol shows the original species of each observation.
    plot(newiris[c("Sepal.Length", "Sepal.Width")],
         col = kc$cluster, pch = as.integer(iris$Species))
    points(kc$centers[, c("Sepal.Length", "Sepal.Width")],
           col = 1:3, pch = 8, cex = 2)

Source: /yucan1001/article/details/23123043

    require(graphics)

    # a 2-dimensional example
    x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
               matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
    colnames(x) <- c("x", "y")
    (cl <- kmeans(x, 2))
    plot(x, col = cl$cluster)
    points(cl$centers, col = 1:2, pch = 8, cex = 2)

    # sum of squares
    # scale() centres the data: each value has the column mean subtracted.
    # It can also standardise the data, i.e. divide the centred values by the
    # column standard deviation. See /10/1834.htm.
    ss <- function(x) sum(scale(x, scale = FALSE)^2)

    # cluster centers "fitted" to each obs.:
    fitted.x <- fitted(cl)
    head(fitted.x)
    resid.x <- x - fitted(cl)

    # Equalities:
    cbind(cl[c("betweenss", "tot.withinss", "totss")],  # the same two columns
          c(ss(fitted.x), ss(resid.x), ss(x)))

    # a k-means clustering satisfies the following conditions
    stopifnot(all.equal(cl$totss, ss(x)),
              all.equal(cl$tot.withinss, ss(resid.x)))
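The identity BSS + WSS = constant from the cohesion and separation slide can be checked by hand. The numbers below are an assumption made for illustration: the slide's figure did not survive, so the four one-dimensional points 1, 2, 4, 5 are not necessarily the original data. Here $m$ is the overall mean, $m_i$ the centroid of cluster $C_i$, $WSS = \sum_i \sum_{x \in C_i} (x - m_i)^2$ and $BSS = \sum_i |C_i| (m - m_i)^2$.

$$
\begin{aligned}
K = 1:\quad & m = 3 \\
& WSS = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10 \\
& BSS = 4 \times (3-3)^2 = 0 \\
& WSS + BSS = 10 \\[6pt]
K = 2:\quad & C_1 = \{1, 2\},\ m_1 = 1.5, \qquad C_2 = \{4, 5\},\ m_2 = 4.5 \\
& WSS = (1-1.5)^2 + (2-1.5)^2 + (4-4.5)^2 + (5-4.5)^2 = 1 \\
& BSS = 2 \times (3-1.5)^2 + 2 \times (4.5-3)^2 = 9 \\
& WSS + BSS = 10
\end{aligned}
$$

Splitting the data into two clusters moves 9 units of the total sum of squares from the within-cluster term to the between-cluster term, while the total stays at 10.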
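The silhouette coefficient described in the slides can be computed with base R alone. The following is a minimal sketch, not the lecture's own code: it assumes Euclidean distance, a 3-cluster k-means result on the four iris measurements, and a hypothetical helper named `silhouette_point`. The expression `(b - a) / max(a, b)` equals `1 - a/b` when `a < b` and `b/a - 1` otherwise.

```r
# Silhouette coefficient computed by hand (illustrative sketch).
set.seed(42)                          # arbitrary seed, only for reproducibility
x  <- as.matrix(iris[, 1:4])          # numeric measurements, species label dropped
cl <- kmeans(x, centers = 3, nstart = 10)$cluster
d  <- as.matrix(dist(x))              # pairwise Euclidean distances

# Hypothetical helper: silhouette value of point i.
silhouette_point <- function(i, d, cl) {
  same <- cl == cl[i]
  same[i] <- FALSE                    # exclude the point itself
  if (!any(same)) return(0)           # singleton cluster: use 0 by convention
  a <- mean(d[i, same])               # cohesion: mean distance within own cluster
  others <- setdiff(unique(cl), cl[i])
  # separation: smallest mean distance to the points of any other cluster
  b <- min(sapply(others, function(k) mean(d[i, cl == k])))
  (b - a) / max(a, b)
}

s <- sapply(seq_len(nrow(x)), silhouette_point, d = d, cl = cl)
mean(s)                # average silhouette width of the whole clustering
tapply(s, cl, mean)    # average silhouette width of each cluster
```

Values of `s` close to 1 indicate points that sit well inside their cluster, and negative values indicate points that are on average closer to another cluster.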
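The correlation between the incidence matrix and the proximity matrix, used in the slides as an unsupervised validity measure, can be sketched in a few lines of R. This is an illustration under stated assumptions: the two-blob data set mirrors the `kmeans` example above, proximity is taken to be Euclidean distance, and the seed is arbitrary. Because distance is small for pairs in the same cluster, a good clustering gives a strongly negative correlation, which matches the sign of the values quoted in the slides.

```r
# Correlation of incidence and proximity matrices (illustrative sketch).
set.seed(1)                                   # arbitrary seed
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
cl <- kmeans(x, 2, nstart = 10)$cluster

prox <- as.matrix(dist(x))                    # proximity matrix (distances)
inc  <- outer(cl, cl, "==") * 1               # incidence: 1 if same cluster, else 0

# Both matrices are symmetric, so only the n(n-1)/2 entries
# above the diagonal are correlated.
upper <- upper.tri(prox)
r <- cor(inc[upper], prox[upper])
r
```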