Modern Regression and Classification

Uploaded: 2021-11-05. Format: PPTX. Pages: 93.


Regression and Classification by Algorithmic Modeling
Wu Xizhi (吳喜之)

Classical regression and classification (discriminant) models can be written down as formulas. Other regression and classification methods are embodied in algorithms, and their concrete form is a computer program. Broadly speaking, algorithmic models include the classical models.

Classical statistics originated in the pre-computer era and now benefits greatly from the development of computers; algorithmic modeling, by contrast, could not exist at all without computers.

Algorithmic modeling has clear advantages over classical modeling when handling very large data sets, when coping with the large numbers of variables known as the curse of dimensionality, when no population distribution is assumed, and when many competing models must be compared.

Decision trees: classification trees and regression trees

Example (data shuttle.txt)

    library(MASS); shuttle[1:10, ]

These data concern the decision whether to use the autolander on the US space shuttle under various conditions [1]. There are 256 rows and 7 columns. The first six columns are qualitative predictors and the last column is the response. The predictors are stability (stability: stab / xstab), size of error (error: MM / SS / LX / XL), sign (sign: pp / nn), wind direction (wind: head / tail), wind strength (magn: Light / Medium / Strong / Out) and visibility (vis: yes / no). The response is whether to use the autolander (use: auto / noauto).

[1] Data from D. Michie (1989), Problems of computer-aided concept formation. In Applications of Expert Systems 2, ed. J. R. Quinlan, Turing Institute Press / Addison-Wesley, pp. 310-333.

Example (data shuttle.txt)

[Figure: classification tree for use, with splits on vis, error and stability; leaves labelled auto / noauto.]

    library(MASS); shuttle[1:10, ]
    m = 256; set.seed(2)
    samp = sample(1:m, floor(m/10)); tsamp = setdiff(1:m, samp)   # samp: test rows, tsamp: training rows
    library(rpart.plot)                                           # attaches rpart as well
    (b = rpart(use ~ ., shuttle, subset = tsamp)); b
    plot(b); text(b, use.n = T)
    t(table(predict(b, shuttle[tsamp, ], type = "class"), shuttle[tsamp, 7]))

[Figure: the fitted tree; splits vis = a, error = c, stabilit = a; node labels auto 0.43, auto 0.00, noauto 0.86, noauto 0.60, auto 0.25, noauto 1.00, noauto 0.95.]

    rpart.plot(b, type = 4, extra = 6, faclen = T); rpart.plot(b, type = 0, extra = 6, faclen = T)
    rpart.plot(b, type = 1, extra = 6, faclen = T); rpart.plot(b, type = 2, extra = 6, faclen = T)

[Figure: the same tree drawn with the four rpart.plot types.]

Example 10.1 (data shuttle.txt)

    t(table(predict(b, shuttle[tsamp, ], type = "class"), shuttle[tsamp, 7]))   # training set
    t(table(predict(b, shuttle[samp, ], type = "class"), shuttle[samp, 7]))     # test set

The kyphosis data

The kyphosis data frame has 81 rows and 4 columns, representing data on children who have had corrective spinal surgery.

    Kyphosis: a factor with levels absent and present, indicating whether a kyphosis (a type of deformation) was present after the operation
    Age: age in months
    Number: the number of vertebrae involved
    Start: the number of the first (topmost) vertebra operated on

    library(rpart.plot)
    fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
    fit2 <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                  parms = list(prior = c(.65, .35), split = "information"))
    # fit3 <- (the right-hand side was lost in extraction)

[Figure: two trees. The first has split labels Start 8.5, Start 14.5 and Age 111 (comparison operators lost) and leaves absent 29/0, absent 12/0, absent 12/2, present 3/4, present 8/11. The second has leaves absent 56/6 and present 8/11.]

Pruning and plotting

    par(mfrow = c(1, 3), xpd = NA)
    rpart.plot(fit, type = 2, extra = 6)
    rpart.plot(fit2, type = 2, extra = 6); rpart.plot(fit3, type = 2, extra = 6)
    par(mfrow = c(1, 1))

[Figure: the three trees side by side, with splits on Start and Age and node labels such as absent 0.21, absent 0.10, absent 0.00, present 0.57, present 0.58.]
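The option split = "information" in the fit2 call above switches rpart's splitting criterion from its default (Gini) to an entropy-based one. The following is a minimal sketch of the two impurity measures and of the impurity decrease produced by a split. The helper names gini, entropy and impurity_gain are hypothetical, and the class counts 56/6 and 8/11 are read off the tree output above on the assumption that they are the two children of a root node holding 64 absent and 17 present cases.

```r
# Node impurity for a vector of class counts.
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

entropy <- function(counts) {
  p <- counts[counts > 0] / sum(counts)   # skip empty classes: 0 * log(0) is taken as 0
  -sum(p * log(p))
}

# Decrease in impurity when a parent node is split into two children.
impurity_gain <- function(parent, left, right, impurity = gini) {
  n <- sum(parent)
  impurity(parent) -
    (sum(left) / n) * impurity(left) -
    (sum(right) / n) * impurity(right)
}

# Counts (absent, present); assumed to be a root node and its two children.
parent <- c(absent = 64, present = 17)
left   <- c(absent = 56, present = 6)
right  <- c(absent = 8,  present = 11)

gain_gini <- impurity_gain(parent, left, right, gini)
gain_info <- impurity_gain(parent, left, right, entropy)
print(c(gini = gain_gini, information = gain_info))
```

A tree-growing algorithm evaluates a gain of this kind for every candidate split and keeps the largest; the two criteria usually agree on the best split but need not.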

    library(rpart)
    fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
    predict(fit, type = "prob")     # class probabilities (default)
    predict(fit, type = "vector")   # level numbers
    predict(fit, type = "class")    # factor
    predict(fit, type = "matrix")   # level number, class frequencies, probabilities

Prediction

    library(rpart)
    library(rpart.plot)
    data(kyphosis)
    # kyphosis.rp <- (the right-hand side was lost in extraction)

[Figure: the tree kyphosis.rp; split labels Start 12.5, Age 51.5 and Age 86 (comparison operators lost); node labels absent 0.21, absent 0.03, absent 0.44, absent 0.09, present 0.62, absent 0.43, present 0.71.]

    rpart.plot(kyphosis.rp, type = 2, extra = 6)
    library(rpart)
    kyphosis1 <- kyphosis[71:81, ]
    predict(kyphosis.rp, kyphosis1, type = "class")
    table(predict(kyphosis.rp, kyphosis1, type = "class"), kyphosis[71:81, 1])

Prediction (2)

Example 10.2 (data iris.txt, from Example 9.5)

[Figure: classification tree with splits on Petal.Length (2.45) and Petal.Width (1.75); leaves setosa, versicolor, virginica.]

    library(MASS); m = 150; set.seed(10)
    samp <- c(sample(1:50, 25), sample(51:100, 25), sample(101:150, 25))   # 25 test rows per species
    tsamp = setdiff(1:m, samp)
    library(rpart.plot)
    (b = rpart(Species ~ ., iris, subset = tsamp))
    plot(b); text(b, use.n = T)
    rpart.plot(b, type = 2, extra = 6)

[Figure: rpart.plot of the tree; split labels Petal.Le 2.4 and Petal.Le 4.8; node labels setosa 0.33, setosa 0.00, versicol 0.50, versicol 1.00, virginic 0.11.]

IRIS

    t(table(predict(b, iris[tsamp, ], type = "class"), iris[tsamp, 5]))   # training set
    t(table(predict(b, iris[samp, ], type = "class"), iris[samp, 5]))     # test set

Example (data wine.txt). These are wine data from a region of Italy.

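The data are the results of a chemical analysis of 13 constituents of wines made from three different grape cultivars grown in that region, with 178 observations in total. We want to use these data to build a model that identifies, from these constituents, which cultivar a wine was made from. The response is Class (cultivar), coded 1, 2 or 3. The 13 predictors are Alcohol, Malic.acid (malic acid), Ash, Alcalinity of ash, Magnesium, Total phenols, Flavanoids, Nonflavanoid phenols, Proanthocyanins, Color intensity, Hue, OD280/OD315 of diluted wines (a spectral measure of protein concentration in the diluted wine) and Proline. From these 13 predictors we built the following decision tree:

    w = read.table("f:/adbook/data/wine1.txt", header = T, sep = ",")
    w$Class = factor(w$Class)
    library(rpart)
    # (fit <- ...)   the rpart call was lost in extraction

[Figure: the fitted tree, with splits on Proline (755), Flavanoids (2.165) and Color.intensity (4.82); comparison operators and some node labels lost.]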
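The examples above repeatedly cross-tabulate predicted against actual classes with t(table(predict(...), ...)). As a rough, self-contained illustration of what such a table measures, the sketch below applies the two thresholds printed for the iris tree (2.45 on Petal.Length, 1.75 on Petal.Width) as hand-written rules to the built-in iris data. The direction of each comparison is an assumption, because the comparison operators were lost from the printed tree, and predict_rule is a hypothetical helper, not part of rpart.

```r
# Hand-coded version of a two-split iris tree.
# Thresholds come from the printed tree; the "<" directions are assumed.
predict_rule <- function(d) {
  out <- ifelse(d$Petal.Length < 2.45, "setosa",
         ifelse(d$Petal.Width  < 1.75, "versicolor", "virginica"))
  factor(out, levels = levels(iris$Species))
}

pred <- predict_rule(iris)

# Confusion matrix: rows are the true species, columns the predicted species.
conf <- table(actual = iris$Species, predicted = pred)
print(conf)

# Misclassification rate: share of observations off the diagonal.
error_rate <- 1 - sum(diag(conf)) / sum(conf)
print(error_rate)
```

Because the rules are applied here to all 150 rows, the rate is a resubstitution error; the slides instead report separate tables for the training rows (tsamp) and the held-out rows (samp).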

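Think about it:
Whether the predictors are continuous or qualitative, the principle of a classification tree is the same. Once a continuous variable has been discretized as needed, it behaves much like a categorical variable. The response of the regression trees introduced below is also effectively discretized (though not in advance), so a regression tree resembles a classification tree in some respects.

Regression trees

When the output variable (response) of a decision tree is categorical, the tree is called a classification tree; when it is continuous, the tree is called a regression tree. Although the response of a regression tree is continuous, the number of leaf nodes is finite, so the predicted value is the mean of the observations that fall in the corresponding leaf. A regression tree does not need the assumptions of classical regression, such as independence, normality, linearity or smoothness, and it applies equally to quantitative and qualitative variables. It does, however, need more data to give reasonable results.

Example (data B1.txt). This is part of the Boston suburban housing data. The original data have 506 census tracts (observations) and 14 variables. Here we take only three: per-capita crime rate (crim), percentage of lower-status population (lstat) and house price (medv, in thousands of US dollars). We try to predict medv from crim and lstat. First we log-transform crim; Figure 10.3 shows scatterplots of crim against the other two variables before and after the transformation. The transformation removes the extremely uneven distribution of the points.

Figure 10.3: Scatterplots of the variable crim from the Example 9.3 data against the other two variables, before and after the log transformation.

    B1 = read.table("f:/hepbook/data/Boston.txt", header = T)

[Figure 10.3 panels: crim against medv and crim against lstat before the transformation; log(crim) against medv and log(crim) against lstat after it.]

Data B1.txt

    library(tree)
    (b = tree(medv ~ lstat + crim, B1))

    Deviance = \sum_i (y_i - \bar{y})^2

[Figure: regression tree with split labels lstat 9.725, lstat 4.65, crim -0.673987, crim -0.941113, lstat 5.495, crim -0.489198, lstat 16.085, crim 1.75238 (comparison operators lost) and leaf means 37.42, 47.86, 29.92, 24.62, 37.37, 28.20, 20.30, 16.68, 11.98.]

    plot(b); text(b)

[Figure: partition of the (lstat, crim) plane into rectangles labelled with the leaf means 37.4, 47.9, 29.9, 24.6, 37.4, 28.2, 20.3, 16.7, 12.0.]

    plot(B1$lstat, B1$crim, xlab = "lstat", ylab = "crim"); partition.tree(b, add = TRUE, cex = 1.5)

Think about it:
How are regression trees and classification trees alike, and how do they differ? How does a regression tree differ from classical regression? Is there a distinction between linear and nonlinear regression-tree models, and why?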
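The statement above that a regression tree predicts the mean of the observations in each leaf can be made concrete with a single split. The sketch below searches all thresholds on one predictor for the one that minimises the within-leaf sum of squares (the deviance) and then predicts with the two leaf means. The data are made up for illustration and are not the Boston data; best_split and predict_stump are hypothetical helpers, not functions of the tree package.

```r
# Find the threshold on x that minimises the total within-leaf sum of squares.
best_split <- function(x, y) {
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2        # midpoints between neighbouring x values
  sse <- sapply(cuts, function(cut) {
    left  <- y[x < cut]
    right <- y[x >= cut]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  cut <- cuts[which.min(sse)]
  list(cut = cut,
       left_mean  = mean(y[x < cut]),
       right_mean = mean(y[x >= cut]))
}

# Prediction of the one-split tree: the mean of the leaf the new point falls in.
predict_stump <- function(newx, s) ifelse(newx < s$cut, s$left_mean, s$right_mean)

# Toy data, invented for this illustration.
x <- c(1, 2, 3, 4, 10, 11, 12, 13)
y <- c(30, 32, 31, 29, 15, 14, 16, 13)

s <- best_split(x, y)
print(s)
print(predict_stump(c(2.5, 20), s))
```

A full regression tree repeats this search recursively inside each leaf and over all predictors, which is why the fitted surface is piecewise constant rather than linear.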

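Ensemble methods: AdaBoost, bagging and random forests

Why combine?

Suppose someone is running for office in a region where 49% of the people do not support him. Each randomly chosen person then has a 49% chance of not voting for him (a Bernoulli trial). If 1000 people are drawn at random from the region to vote and a simple majority decides, what is the probability that he is not elected? Assume the number of votes against him follows a binomial distribution with parameters 1000 and 0.49. It is easy to compute that the probability that more than half of the 1000 (at least 501) vote against him is about 0.2532498, far below the 0.49 probability that any single person votes against him.

    n = 10000; p = .51
    s = seq(1, n, 2)
    x = NULL
    for (i in s) x = c(x, pbinom(floor(i/2), i, p, lower.tail = F))
    plot(s, x, type = "l")

    n = 10000; p = .49
    s = seq(1, n, 2)
    x1 = NULL
    for (i in s) x1 = c(x1, pbinom(floor(i/2), i, p, lower.tail = F))
    plot(s, x1, type = "l")

(For intuition only.)

[Figure: x against s and x1 against s, for s from 0 to 10000.]

    par(mfrow = c(1, 2)); plot(s, x, type = "l"); plot(s, x1, type = "l")

(For intuition only.)

    par(mfrow = c(2, 3)); a = seq(.01, 1, .01); n = c(5, 11, 51, 101, 1001, 9999)
    for (i in 1:6) {
      plot(a, 1 - pbinom(floor(n[i]/2), n[i], a), type = "l",
           main = substitute(n == q, list(q = n[i])),
           ylab = expression(paste("Group voting decision probability ", p[g])),
           xlab = expression(paste("Individual decision probability ", p)))
      abline(h = 0.5, lty = 3); abline(v = 0.5, lty = 3); segments(0, 0, 1, 1, lty = 2)
    }

(For intuition only.)

[Figure: six panels (n = 5, 11, 51, 101, 1001, 9999), each plotting the group voting decision probability p_g against the individual decision probability p.]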
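The election example can be checked directly. The sketch below wraps the binomial tail probability used in the plotting code in a small function and evaluates it for the 1000-voter case and for the group sizes used in the six-panel figure. The function name majority_prob is hypothetical; pbinom is base R.

```r
# Probability that a strict majority of n independent voters say "yes",
# when each voter says "yes" with probability p.
majority_prob <- function(n, p) {
  pbinom(floor(n / 2), n, p, lower.tail = FALSE)   # P(X > floor(n/2))
}

# The election example: at least 501 of 1000 voters vote against the candidate,
# each doing so with probability 0.49. The text quotes a value of about 0.253.
p_against_wins <- majority_prob(1000, 0.49)
print(p_against_wins)

# With an individual probability slightly above one half, the majority
# probability grows with the size of the group.
sizes <- c(5, 11, 51, 101, 1001, 9999)
print(sapply(sizes, majority_prob, p = 0.51))
```

The same mechanism motivates ensembles: many classifiers that are each only slightly better than chance can produce a reliable majority vote, provided their errors are not strongly correlated, an independence assumption that the binomial calculation makes and real classifiers only approximate.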

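AdaBoost (adaptive boosting)

Suppose the goal is classification. We may start with a weak classifier (one with a high error rate). Then, over successive iterations, the classifier is improved through weighted resampling. Each iteration corrects the misclassifications of the previous classifier, usually by giving the misclassified observations more weight when the sample is drawn (with replacement), which amounts to giving correctly classified observations less weight. This produces a new classifier that enters the next round. Each round also records the error rate of the classifier it produced. The final result is a vote of the classifiers from all rounds, weighted by error rate so that classifiers with large error rates are penalized. This is what "adaptive" means.

A drawback of AdaBoost is that it is sensitive to outliers and anomalous points; an advantage is that it is less sensitive to overfitting. AdaBoost has much in common with boosting in general, and these ideas have given rise to a whole series of methods, each with its own strengths and weaknesses.

IRIS

    library(adabag); library(rpart); set.seed(0)
    samp <- c(sample(1:50, 25), sample(51:100, 25), sample(101:150, 25))
    a = adaboost.M1(Species ~ ., data = iris[samp, ], mfinal = 15, maxdepth = 5)
    a.pred <- predict.boosting(a, newdata = iris[-samp, ]); a.pred[-1]    # test set
    a.predt <- predict.boosting(a, newdata = iris[samp, ]); a.predt[-1]   # training set

IRIS: variable importance

[Figure: bar plot of variable importance for Sepal.Length, Sepal.Width, Petal.Length and Petal.Width, vertical axis from 0 to 40.]

    barplot(a$importance)

Example 9.5 (data Vehicle.txt). The data have 846 observations on 19 variables. We want to use 18 numeric measurements of the vehicles to classify them.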

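The last variable in the data, Class, gives the known category (four types: bus, opel, saab, van).

Example 9.5 (data Vehicle.txt)

    library(mlbench); data(Vehicle); n <- length(Vehicle[, 1])
    samp <- sample(1:n, n/2)
    a = adaboost.M1(Class ~ ., data = Vehicle[samp, ], mfinal = 25, maxdepth = 5)
    a.pred <- predict.boosting(a, newdata = Vehicle[-samp, ]); a.pred[-1]   # test set
    b.pred <- predict.boosting(a, newdata = Vehicle[samp, ]); b.pred[-1]    # training set
    barplot(a$importance)

[Figure: bar plot of variable importance; visible labels Comp, D.Circ, Pr.Axis.Ra, Scat.Ra, Pr.Axis.Rect, Sc.Var.maxis, Skew.maxis, Kurt.Maxis; vertical axis from 0 to 12.]

Think about it:
Besides the predicted classes, the output of the adabag package used here includes the classification tree from every iteration and the "votes" of these trees for each observation. Although the description of AdaBoost is rather long-winded, using it is not much different from using other classification methods.

Bagging (bootstrap aggregating)

Bagging uses bootstrap sampling with replacement. It draws many (say k) samples with replacement from the training sample, each of the same size as the training sample (because sampling is with replacement, about one third of the observations are not drawn each time), giving k different samples. A decision tree is then grown on each sample, so each tree produces a prediction for a new observation. For classification, the bagging prediction is the majority ("vote") of the trees' classifications; for regression, the predicted response is the average of the trees' results.

Example 9.6 (continuing the Vehicle.txt data of Example 9.5)

    n = length(Vehicle[, 1])
    samp = sample(1:n, n/2)
    a = bagging(Class ~ ., data = Vehicle[samp, ], mfinal = 25, maxdepth = 5)
    a.pred = predict.bagging(a, newdata = Vehicle[-samp, ]); a.pred[-1]   # test set
    b.pred = predict.bagging(a, newdata = Vehicle[samp, ]); b.pred[-1]    # training set
    barplot(a$importance)

[Figure: bar plot of variable importance; visible labels Comp, D.Circ, Max.L.Ra, Elong, Sc.Var.Maxis, Skew.Maxis, Kurt.Maxis; vertical axis from 0 to 12.]

Think about it:
The sub-classifiers in bagging vote with equal weights, whereas the classifiers in AdaBoost vote with weights based on their individual performance. Think about why roughly one third of the observations fail to appear in each bootstrap sample.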
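One way to approach the question about the missing third: in n draws with replacement, a given observation is missed by every draw with probability (1 - 1/n)^n, which tends to exp(-1), about 0.368, as n grows. The sketch below evaluates this expression and checks it by simulation. The function name oob_prob, the seed and the number of replications are arbitrary choices made for this illustration; n = 846 is the size of the Vehicle data.

```r
# Probability that a given observation is never drawn in n draws with replacement.
oob_prob <- function(n) (1 - 1 / n)^n

print(oob_prob(c(10, 100, 846)))
print(exp(-1))                     # the limiting value as n grows

# Empirical check: fraction of observations absent from each bootstrap sample.
set.seed(1)
n <- 846
frac_missing <- replicate(200, {
  idx <- sample(n, n, replace = TRUE)
  1 - length(unique(idx)) / n
})
print(mean(frac_missing))
```

The observations left out of a bootstrap sample are the ones random forests use for their internal ("out-of-bag") error estimate.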

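Random forests

A random forest also draws many bootstrap samples with replacement, and it uses far more samples than bagging. In addition, when a tree is grown, the split at each node is chosen from only a small, randomly selected subset of the variables. Not only are the samples random; every tree and every node is generated with a large element of randomness. A random forest lets each tree grow as large as possible and does not prune it.

Random forests: advantages

It is claimed to be more accurate than any method that preceded it. It is efficient on large databases. It copes with very high dimension: even with thousands of variables, none need to be deleted. It also gives the importance of each variable in the classification. As the forest grows, it produces an internal unbiased estimate of the generalization error. It has an effective method for estimating missing values and remains accurate when a large proportion of the data are missing.

In one gene-chip data set on lymphoma, the number of variables reaches 4682 while the sample size is only 81; data of this kind cannot be handled at all in classical statistics. Diaconis & Efron (1983) once remarked that statistical experience shows it is unwise to fit a model based on 19 variables with only 155 data points. Random forests, however, can find the important genes quite well.

Example 9.7 (continuing the B1.txt data of Example 9.3)

[Figure: two bar plots of variable importance for crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black and lstat; axes from 0 to 35 and from 0 to 10000.]

    library(randomForest)
    B1 = read.table("f:/hepbook/data/Boston.txt", header = T)   # MASS
    attach(B1)
    w = randomForest(medv ~ ., data = B1, importance = TRUE, proximity = TRUE)
    par(mfrow = c(1, 2))
    for (i in 1:2) barplot(t(importance(w))[i, ], cex.names = 0.5)   # argument name partly lost; cex.names assumed
    print(w)

Random forests: IRIS

[Figure: two bar plots of variable importance for Sepal.Length, Sepal.Width, Petal.Length and Petal.Width.]

    set.seed(117)
    w = randomForest(Species ~ ., data = iris, importance = TRUE, proximity = TRUE)
    par(mfrow = c(1, 2)); for (i in 1:2) barplot(t(importance(w))[i, ], cex.names = 0.7)   # cex.names assumed, as above

Random forests: IRIS

[Figure: two scatterplots of the iris observations in the first two coordinates derived from the proximity matrix, one from eigen() and one from MDSplot().]

    par(mfrow = c(1, 2))
    aa = eigen(w$proximity)[[2]]   # eigenvectors of the proximity matrix
    plot(aa[, 1:2], pch = (1:3)[as.numeric(iris$Species)], cex = 0.7, main = "", xlab = "", ylab = "")
    MDSplot(w, iris$Species, palette = rep(1, 3), pch = as.numeric(iris$Species), xlab = "", ylab = "")

Example 9.9 (data airquality.txt). Air quality data for New York City from May to September 1973. The variables are ozone (Ozone, in ppb), solar radiation (Solar.R, in langleys), wind speed (Wind, in mph), temperature (Temp, in degrees F), month (Month) and day (Day). We take ozone as the response and use the meteorological data as predictors in a random-forest regression.

[Figure: two bar plots of variable importance for Solar.R, Wind, Temp, Month and Day; axes from 0 to 40 and from 0 to 50000.]

    ozone.rf = randomForest(Ozone ~ ., data = airquality, mtry = 3, importance = TRUE, na.action = na.omit)
    par(mfrow = c(1, 2)); for (i in 1:2) barplot(importance(ozone.rf)[, i])

(These data are available in R without reading a file.)

Think about it:
Random forests, bagging and AdaBoost are all ensemble methods built on decision trees, yet they differ greatly from traditional statistical thinking. Can you describe your impression? As you understand them, what are the strengths and weaknesses of these methods?

Nearest-neighbor methods

The nearest-neighbor method (nearest neighbor algorithm) is probably the simplest of all algorithmic-modeling methods. It classifies or regresses the test set on the basis of the training set. Every regression or classification problem has some predictors, which span a multidimensional space. First a distance is defined on this space; for continuous predictors the Euclidean distance is usually used. In classification, a test point is assigned to the class held by the majority of its k nearest training points. In the simplest case, k = 1, the point takes the class of its single nearest training point. In regression, the predicted response of a test point is the mean of the responses of its k nearest training points. The value of k is generally chosen by cross-validation.

Example 9.10 (data radar.txt). Data with 351 observations on whether radar signals were reflected by the ionosphere. There are 34 predictors, and the response class is categorical with two values: "good" if the signal was indeed reflected by the ionosphere and "bad" otherwise. We use the nearest-neighbor method to build a classification model that predicts this response from the 34 predictors. Again we randomly take about half of the data as the training set and the rest as the test set.

Example 9.10 (data radar.txt)

    library(kknn); data(ionosphere)
    n = nrow(ionosphere); set.seed(1); test = sample(1:n, n/2)
    a = kknn(class ~ ., ionosphere[-test, ], ionosphere[test, ])
    table(ionosphere[test, ]$class, a$fit)

    library(mlbench); data(BreastCancer)
    w = BreastCancer

Example 3.3 (BreastCancer.txt). These are breast-cancer data from the University of Wisconsin Hospitals, Madison, provided through the University of California, Irvine (UCI) machine learning repository (for applications see Wolberg & Mangasarian, 1990, and Zhang, J., 1992). There are 699 observations and 11 variables. The target (response) is the binary variable Class, which records whether the patient's tumor is benign or malignant. The covariates are the patient ID, clump thickness (Cl.thickness), cell size (Cell.size), cell shape (Cell.shape), marginal adhesion (Marg.adhesion), single epithelial cell size (Epith.c.size), bare nuclei (Bare.nuclei), bland chromatin (Bl.cromatin), normal nucleoli (Normal.nucleoli) and mitoses (Mitoses). Apart from ID, all covariates are quantitative and have been converted to integers from 1 to 10. The data contain 16 missing values. We remove the missing values, randomly choose one third of the data as the test set and use the remaining two thirds as the training set. With k = 5 and a triangular kernel as the weighting function, we perform weighted kNN classification; the predictions on the test set are tabulated by the code below:

    library(mlbench); data(BreastCancer); w = BreastCancer
    library(kknn)
    w = w[, -1]                   # drop the ID column
    w = na.omit(w)
    m <- dim(w)[1]                # m = 699 - 16 = 683
    set.seed(10)
    val <- sample(1:m, size = round(m/3), replace = FALSE, prob = rep(1/m, m))
    w.learn <- w[-val, ]; dim(w.learn)   # 455 x 10
    w.valid <- w[val, ]; dim(w.valid)    # 228 x 10
    w.kknn <- kknn(Class ~ ., k = 5, w.learn, w.valid, distance = 1, kernel = "triangular")
    summary(w.kknn)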

    fit <- fitted(w.kknn)         # 228 predictions
    table(w.valid$Class, fit)

[Figure: scatterplot of the mixture example data, x1 (from -2 to 4) against x2 (from -2 to 3).]

    library(ElemStatLearn)
    str(mixture.example)   # class(mixture.example); attributes(mixture.example)
    # $names: x, y, xnew, prob, marginal, px1, px2, means
    x <- mixture.example$x
    g <- mixture.example$y
    x.mod <- lm(g ~ x)
    # Figure 2.1:
    plot(x, col = ifelse(g == 1, "red", "green"), xlab = "x1", ylab = "x2")
    coef(x.mod)
    abline((0.5 - coef(x.mod)[1]) / coef(x.mod)[3], -coef(x.mod)[2] / coef(x.mod)[3])
    ghat <- ifelse(fitted(x.mod) > 0.5, 1, 0)   # ?ifelse   (start of this line reconstructed; it was lost in extraction)
    length(ghat)
    sum(ghat == g)
    1 - sum(ghat == g) / length(g)
    # [1] 0.27
    # Training misclassification rate

Source of the example: The Elements of Statistical Learning: Data Mining, Inference, and Prediction, by Trevor Hastie, Robert Tibshirani and Jerome Friedman.

[Figure: scatterplot of the mixture example data, x1 against x2.]

    xnew <- mixture.example$xnew
    dim(xnew)
    colnames(xnew)
    library(class)
    mod15 <- knn(x, xnew, g, k = 15, prob = TRUE)
    summary(mod15)
    # Figure 2.2:
    plot(x, col = ifelse(g == 1, "red", "green"), xlab = "x1", ylab = "x2")
    str(mod15)
    prob <- attr(mod15, "prob")
    prob <- ifelse(mod15 == "1", prob, 1 - prob)   # prob is voting fraction for winning class!
                                                  # Now it is voting fraction for red == 1

[Figure: 15-nearest neighbour decision boundary, x1 against x2.]

    px1 <- mixture.example$px1
    px2 <- mixture.example$px2
    prob15 <- matrix(prob, length(px1), length(px2))
    contour(px1, px2, prob15, levels = 0.5, labels = "", xlab = "x1", ylab = "x2",
            main = "15-nearest neighbour")
    # adding the points to the plot:
    points(x, col = ifelse(g == 1, "red", "green"))
    ghat15 <- ifelse(knn(x, x, k = 15, cl = g) == 1, 1, 0)
    sum(ghat15 == g)
    # [1] 169
    1 - sum(ghat15 == g) / length(g)
    # [1] 0.155
    # Misclassification rate for knn(, k = 15)

[Figure: 1-nearest neighbour decision boundary, x1 against x2.]

    # Then we want the plot for knn with k = 1: (Figure 2.3)
    mod1 <- knn(x, xnew, k = 1, cl = g, prob = TRUE)
    prob <- attr(mod1, "prob")
    prob <- ifelse(mod1 == "1", prob, 1 - prob)    # prob now is voting fraction for "red"
    prob1 <- matrix(prob, length(px1), length(px2))
    contour(px1, px2, prob1, level = 0.5, labels = "", xlab = "x1", ylab = "x2",
            main = "1-nearest neighbour")
    # Adding the points to the plot:
    points(x, col = ifelse(g == 1, "red", "green"))

    # Reproducing figure 2.4, page 17 of the book:
    # The data do not contain a test sample, so we make one,
    # using the description of the oracle page 17 of the book: The centers
    # is in the means component of mixture.example, with green(0) first,
    # so red(1). For a test sample of size 10000 we simulate
    # 5000 observations of each class.
    library(MASS)
    set.seed(123)
    centers <- c(sample(1:10, 5000, replace = TRUE), sample(11:20, 5000, replace = TRUE))
    means <- mixture.example$means
    means <- means[centers, ]
    mix.test <- mvrnorm(10000, c(0, 0), 0.2 * diag(2))
    mix.test <- mix.test + means
    cltest <- c(rep(0, 5000), rep(1, 5000))
    ks <- c(1, 3, 5, 7, 9, 11, 15, 17, 23, 25, 35, 45, 55, 83, 101, 151)   # nearest neighbours to try
    nks <- length(ks)
    misclass.train <- numeric(length = nks)
    misclass.test <- numeric(length = nks)
    names(misclass.train) <- names(misclass.test) <- ks
    for (i in seq(along = ks)) {
      mod.train <- knn(x, x, k = ks[i], cl = g)
      mod.test <- knn(x, mix.test, k = ks[i], cl = g)
      misclass.train[i] <- 1 - sum(mod.train == factor(g)) / 200
      misclass.test[i] <- 1 - sum(mod.test == factor(cltest)) / 10000
    }
    print(cbind(misclass.train, misclass.test))

    # Using package mclust02
    # Note that this package is no longer on CRAN,
    # but must be searched in the archives.
    if (require(mclust02)) {
      x <- mixture.example$x
      g <- mixture.example$y
      xnew <- mixture.example$xnew
      px1 <- mixture.example$px1
      px2 <- mixture.example$px2
      mix.mclust <- mclustDA(x, g, xnew, G = 1:6, verbose = TRUE)
      mix.mclust
    }   # end require (mclust02)

Output of the last block on the presenter's machine:

    Loading required package: mclust02
    Failed with error: 'mclust02' is not a valid installed package

(Failed.)

[Figure 2.4: training and test error (vertical axis labelled "Test error", from 0.00 to 0.30) against the number of nearest neighbours (1 to 151); legend: train, test.]

    # Figure 2.4
    plot(misclass.train, xlab = "Number of NN", ylab = "Test error", type = "n", xaxt = "n")
    axis(1, 1:length(ks), as.character(ks))
    lines(misclass.test, type = "b", col = "blue", pch = 20)
    lines(misclass.train, type = "b", col = "red", pch = 20)
    legend("bottomright", lty = 1, col = c("red", "blue"), legend = c("train ", "test "))
    # Figure 2.5
    prob <- mixture.example$prob
    prob.bayes <- matrix(prob, length(px1), length(px2))
    contour(px1, px2, prob.bayes, levels = 0.5, labels = "", xlab = "x1", ylab = "x2",
            main = "Bayes decision boundary")
    points(x, col = ifelse(g == 1, "red", "green"))

[Figure: Bayes decision boundary, x1 against x2.]

Think about it:
The nearest-neighbor method is the simplest and most intuitive method. The proverb "birds of a feather flock together" seems to describe it. In practice, the most common version does not let the k nearest points simply vote; instead it weights them, so that closer points carry more weight in the vote. You may want to think about how such weighting could be done.

Artificial neural networks

Artificial neural networks imitate natural neural networks. They can effectively solve very complex regression and classification problems that involve large numbers of mutually correlated variables. Below is a schematic of a neural network with two predictors (inputs) and one response (output).

The two nodes on the left, representing the predictors, form the input layer; the three nodes in the middle form the hidden layer; the single node on the right belongs to the output layer and represents the response. The nodes are connected as shown by the arrows.

The most commonly used hidden-layer activation function is the S-shaped logistic function.

The weights are the key: the error must be computed and fed back to the nodes of the previous layer in order to adjust the weights.

More complex artificial neural networks

    library(nnet); library(mlbench); data(Vehicle); n = length(Vehicle[, 1]); set.seed(1)
    samp = sample(1:n, n/2); b = class.ind(Vehicle$Class)
    test.cl = function(true, pred) { true <- max.col(true); cres = max.col(pred); table(true, cres) }
    a = nnet(Vehicl      (the extracted text breaks off here)
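The small network described above (two inputs, three hidden nodes, one output) can be written out as a forward pass in a few lines. This is only a sketch under stated assumptions: the weights are arbitrary random numbers rather than fitted values, the output node also uses the logistic function (the slides specify it only for the hidden layer), and the names logistic, forward, W1, b1, W2 and b2 are hypothetical. It does not show the error feedback that adjusts the weights. The logistic function is f(z) = 1 / (1 + exp(-z)).

```r
# S-shaped logistic activation function.
logistic <- function(z) 1 / (1 + exp(-z))

# Forward pass: 2 inputs -> 3 hidden nodes (logistic) -> 1 output (logistic, assumed).
forward <- function(x, W1, b1, W2, b2) {
  h <- logistic(W1 %*% x + b1)          # hidden-layer activations, a 3 x 1 matrix
  as.numeric(logistic(W2 %*% h + b2))   # single output value in (0, 1)
}

set.seed(1)                        # arbitrary weights, for illustration only
W1 <- matrix(rnorm(6), nrow = 3)   # 3 x 2: weights from the inputs to the hidden nodes
b1 <- rnorm(3)                     # hidden-node biases
W2 <- matrix(rnorm(3), nrow = 1)   # 1 x 3: weights from the hidden nodes to the output
b2 <- rnorm(1)                     # output bias

x <- c(0.5, -1.2)                  # one observation with two predictors
y_hat <- forward(x, W1, b1, W2, b2)
print(y_hat)
```

Training consists of comparing y_hat with the observed response and changing W1, b1, W2 and b2 to reduce the error; packages such as nnet do this numerically.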
