Data Mining Review Questions and Answers

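Consider the training examples in the following table for a binary classification problem.

Table 4-8. Data set for Exercise 3.

Instance  a1  a2  a3   Target class
1         T   T   1.0  +
2         T   T   6.0  +
3         T   F   5.0  -
4         F   F   4.0  +
5         F   T   7.0  -
6         F   T   3.0  -
7         F   F   8.0  -
8         T   F   7.0  +
9         F   T   5.0  -

(a) What is the entropy of the whole training set with respect to the class attribute?
(b) What are the information gains of a1 and a2 on these training examples?
(c) For the continuous attribute a3, compute the information gain for every possible split.
(d) According to information gain, which of a1, a2, a3 gives the best split?
(e) According to the classification error rate, which of a1 and a2 is best?
(f) According to the Gini index, which of a1 and a2 is best?

Answer 1: Examples for computing entropy

$$\mathrm{Entropy}(t) = -\sum_j p(j \mid t)\,\log_2 p(j \mid t)$$

C1 = 0, C2 = 6: $P(C1) = 0/6 = 0$, $P(C2) = 6/6 = 1$, $\mathrm{Entropy} = -0\log_2 0 - 1\log_2 1 = 0$
C1 = 1, C2 = 5: $P(C1) = 1/6$, $P(C2) = 5/6$, $\mathrm{Entropy} = -(1/6)\log_2(1/6) - (5/6)\log_2(5/6) = 0.65$
C1 = 2, C2 = 4: $P(C1) = 2/6$, $P(C2) = 4/6$, $\mathrm{Entropy} = -(2/6)\log_2(2/6) - (4/6)\log_2(4/6) = 0.92$

For the training set, $P(+) = 4/9$ and $P(-) = 5/9$, so the entropy is
$-(4/9)\log_2(4/9) - (5/9)\log_2(5/9) = 0.9911$.

Answer 2: Splitting based on information gain

$$\mathrm{GAIN}_{split} = \mathrm{Entropy}(p) - \sum_{i=1}^{k} \frac{n_i}{n}\,\mathrm{Entropy}(i)$$

The parent node p, holding n records, is split into k partitions; $n_i$ is the number of records in partition i.
- Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN).
- Used in ID3 and C4.5.
- Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure. (Probably not on the exam.)

For attribute a1, the corresponding counts are:

a1  +  -
T   3  1
F   1  4

The entropy for a1 is
$\frac{4}{9}\left[-\frac{3}{4}\log_2\frac{3}{4} - \frac{1}{4}\log_2\frac{1}{4}\right] + \frac{5}{9}\left[-\frac{1}{5}\log_2\frac{1}{5} - \frac{4}{5}\log_2\frac{4}{5}\right] = 0.7616$.
Therefore, the information gain for a1 is $0.9911 - 0.7616 \approx 0.2294$.

For attribute a2, the corresponding counts are:

a2  +  -
T   2  3
F   2  2

The entropy for a2 is
$\frac{5}{9}\left[-\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5}\right] + \frac{4}{9}\left[-\frac{2}{4}\log_2\frac{2}{4} - \frac{2}{4}\log_2\frac{2}{4}\right] = 0.9839$.
Therefore, the information gain for a2 is $0.9911 - 0.9839 = 0.0072$.

Answer 3: Continuous attributes: computing the Gini index

For efficient computation, for each attribute:
- Sort the attribute on its values.
- Linearly scan these values, each time updating the count matrix and computing the Gini index.
- Choose the split position that has the least Gini index.

The same sorted scan, with entropy as the impurity measure, gives the information gain of each candidate split of a3:

a3   Class label  Split point  Entropy  Info gain
1.0  +            2.0          0.8484   0.1427
3.0  -            3.5          0.9885   0.0026
4.0  +            4.5          0.9183   0.0728
5.0  -
5.0  -            5.5          0.9839   0.0072
6.0  +            6.5          0.9728   0.0183
7.0  +
7.0  -            7.5          0.8889   0.1022
8.0  -

Answer 4: According to information gain, a1 produces the best split.

Answer 5: Examples for computing classification error

$$\mathrm{Error}(t) = 1 - \max_i P(i \mid t)$$

C1 = 0, C2 = 6: $P(C1) = 0/6 = 0$, $P(C2) = 6/6 = 1$, $\mathrm{Error} = 1 - \max(0, 1) = 1 - 1 = 0$
C1 = 1, C2 = 5: $P(C1) = 1/6$, $P(C2) = 5/6$, $\mathrm{Error} = 1 - \max(1/6, 5/6) = 1 - 5/6 = 1/6$
C1 = 2, C2 = 4: $P(C1) = 2/6$, $P(C2) = 4/6$, $\mathrm{Error} = 1 - \max(2/6, 4/6) = 1 - 4/6 = 1/3$

Answer 6:
For attribute a1: error rate = 2/9.
For attribute a2: error rate = 4/9.
Therefore, according to error rate, a1 produces the best split.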
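The entropy and information-gain figures above can be re-derived with a short script. This is only a sketch: the names `DATA`, `entropy`, `info_gain`, and `a3_split_gains` are illustrative, and the rows transcribe Table 4-8 under the assumption that the class labels are the ones implied by the count tables in Answer 2. Candidate split points for a3 are taken as midpoints between consecutive distinct values, which matches the split points listed in Answer 3.

```python
from math import log2

# Table 4-8 as (a1, a2, a3, class); labels assumed to match the counts in Answer 2.
DATA = [
    ("T", "T", 1.0, "+"), ("T", "T", 6.0, "+"), ("T", "F", 5.0, "-"),
    ("F", "F", 4.0, "+"), ("F", "T", 7.0, "-"), ("F", "T", 3.0, "-"),
    ("F", "F", 8.0, "-"), ("T", "F", 7.0, "+"), ("F", "T", 5.0, "-"),
]


def entropy(labels):
    """Entropy(t) = -sum_j p(j|t) log2 p(j|t); an empty node counts as 0."""
    n = len(labels)
    if n == 0:
        return 0.0
    total = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        total -= p * log2(p)
    return total


def info_gain(rows, key):
    """Parent entropy minus the size-weighted entropy of the partitions.

    `key` maps a row to the partition it falls into.
    """
    parts = {}
    for r in rows:
        parts.setdefault(key(r), []).append(r[3])
    weighted = sum(len(p) / len(rows) * entropy(p) for p in parts.values())
    return entropy([r[3] for r in rows]) - weighted


def a3_split_gains(rows):
    """Information gain of every binary split a3 <= v at midpoints of sorted values."""
    values = sorted({r[2] for r in rows})
    mids = [(lo + hi) / 2 for lo, hi in zip(values, values[1:])]
    return {m: info_gain(rows, lambda r, m=m: r[2] <= m) for m in mids}


if __name__ == "__main__":
    print("entropy:", round(entropy([r[3] for r in DATA]), 4))
    print("gain a1:", round(info_gain(DATA, lambda r: r[0]), 4))
    print("gain a2:", round(info_gain(DATA, lambda r: r[1]), 4))
    for split, gain in a3_split_gains(DATA).items():
        print("a3 <=", split, "gain:", round(gain, 4))
```

Passing a different `key` function is enough to evaluate any other partitioning of the same rows, which is why the a1, a2, and a3 cases share one `info_gain` helper.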

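The classification-error comparison can be checked the same way. The sketch below uses exact fractions so the results can be compared with 2/9 and 4/9 directly; the function names are illustrative, and every partition is assumed to be non-empty.

```python
from fractions import Fraction


def classification_error(counts):
    """Error(t) = 1 - max_i P(i|t) for a node with the given class counts."""
    n = sum(counts)  # assumed > 0
    return 1 - Fraction(max(counts), n)


def split_error(partitions):
    """Size-weighted classification error of a split.

    Each partition is a list of class counts, e.g. [3, 1] for 3 '+' and 1 '-'.
    """
    total = sum(sum(p) for p in partitions)
    return sum(Fraction(sum(p), total) * classification_error(p) for p in partitions)


if __name__ == "__main__":
    # Counts (+, -) per attribute value, taken from the count tables for a1 and a2.
    print("a1:", split_error([[3, 1], [1, 4]]))
    print("a2:", split_error([[2, 3], [2, 2]]))
```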
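Binary attributes: computing the Gini index
- Splits into two partitions.
- Effect of weighing partitions: larger and purer partitions are sought.

Parent: C1 = 6, C2 = 6, Gini = 0.500

     N1  N2
C1   5   1
C2   2   4

$\mathrm{Gini}(N1) = 1 - (5/7)^2 - (2/7)^2 = 0.408$
$\mathrm{Gini}(N2) = 1 - (1/5)^2 - (4/5)^2 = 0.320$
$\mathrm{Gini}(\text{children}) = 7/12 \times 0.408 + 5/12 \times 0.320 = 0.371$

(Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004, slide 34)

For attribute a1, the Gini index is
$\frac{4}{9}\left[1 - (3/4)^2 - (1/4)^2\right] + \frac{5}{9}\left[1 - (1/5)^2 - (4/5)^2\right] = 0.3444$.
For attribute a2, the Gini index is
$\frac{5}{9}\left[1 - (2/5)^2 - (3/5)^2\right] + \frac{4}{9}\left[1 - (2/4)^2 - (2/4)^2\right] = 0.4889$.
Since the Gini index for a1 is smaller, it produces the better split.

Consider the following data set for a binary classification problem.

A  B  Class label
T  F  +
T  T  +
T  T  +
T  F  -
T  T  +
F  F  -
F  F  -
F  F  -
T  T  -
T  F  -

[Figure 4-13. Comparison of impurity measures for binary classification problems.]

(a) Compute the information gain. Which attribute would the decision tree induction algorithm choose?

The contingency tables after splitting on attributes A and B are:

     A=T  A=F          B=T  B=F
+    4    0       +    3    1
-    3    3       -    1    5

The overall entropy before splitting is:
$E_{orig} = -0.4\log_2 0.4 - 0.6\log_2 0.6 = 0.9710$

The information gain after splitting on A is:
$E_{A=T} = -\frac{4}{7}\log_2\frac{4}{7} - \frac{3}{7}\log_2\frac{3}{7} = 0.9852$
$E_{A=F} = -\frac{3}{3}\log_2\frac{3}{3} - \frac{0}{3}\log_2\frac{0}{3} = 0$ (taking $0\log_2 0 = 0$)
$\Delta = E_{orig} - \frac{7}{10}E_{A=T} - \frac{3}{10}E_{A=F} = 0.2813$

The information gain after splitting on B is:
$E_{B=T} = -\frac{3}{4}\log_2\frac{3}{4} - \frac{1}{4}\log_2\frac{1}{4} = 0.8113$
$E_{B=F} = -\frac{1}{6}\log_2\frac{1}{6} - \frac{5}{6}\log_2\frac{5}{6} = 0.6500$
$\Delta = E_{orig} - \frac{4}{10}E_{B=T} - \frac{6}{10}E_{B=F} = 0.2565$

Therefore, attribute A will be chosen to split the node.

(b) Compute the Gini index. Which attribute would decision tree induction choose?

The overall Gini index before splitting is:
$G_{orig} = 1 - 0.4^2 - 0.6^2 = 0.48$

The gain in Gini after splitting on A is:
$G_{A=T} = 1 - (4/7)^2 - (3/7)^2 = 0.4898$
$G_{A=F} = 1 - (3/3)^2 - (0/3)^2 = 0$
$\Delta = G_{orig} - \frac{7}{10}G_{A=T} - \frac{3}{10}G_{A=F} = 0.1371$

The gain in Gini after splitting on B is:
$G_{B=T} = 1 - (1/4)^2 - (3/4)^2 = 0.3750$
$G_{B=F} = 1 - (1/6)^2 - (5/6)^2 = 0.2778$
$\Delta = G_{orig} - \frac{4}{10}G_{B=T} - \frac{6}{10}G_{B=F} = 0.1633$

Therefore, attribute B will be chosen to split the node.

This answer is fine.

(c) Figure 4-13 shows that entropy and the Gini index are both monotonically increasing on [0, 0.5] and monotonically decreasing on [0.5, 1]. Is it possible that information gain and the gain in the Gini index favor different attributes? Explain your reasoning.

Yes. Even though these measures have a similar range and monotonic behavior, their respective gains, $\Delta$, which are scaled differences of the measures, do not necessarily behave in the same way, as illustrated by the results in parts (a) and (b).

Bayesian classification

Example of a naive Bayes classifier

Given a test record: X = (Refund = No, Married, Income = 120K)

Naive Bayes classifier:
P(Refund = Yes | No) = 3/7
P(Refund = No | No) = 4/7
P(Refund = Yes | Yes) = 0
P(Refund = No | Yes) = 1
P(Marital Status = Single | No) = 2/7
P(Marital Status = Divorced | No) = 1/7
P(Marital Status = Married | No) = 4/7
P(Marital Status = Single | Yes) = 2/3
P(Marital Status = Divorced | Yes) = 1/3
P(Marital Status = Married | Yes) = 0

For taxable income:
If Class = No: sample mean = 110, sample variance = 2975
If Class = Yes: sample mean = 90, sample variance = 25

P(X | Class = No) = P(Refund = No | Class = No) × P(Married | Class = No) × P(Income = 120K | Class = No)
                  = 4/7 × 4/7 × 0.0072 = 0.0024

P(X | Class = Yes) = P(Refund = No | Class = Yes) × P(Married | Class = Yes) × P(Income = 120K | Class = Yes)
                   = 1 × 0 × 1.2 × 10^-9 = 0

Since P(X | No) P(No) > P(X | Yes) P(Yes), it follows that P(No | X) > P(Yes | X), so Class = No.

7. Consider the data set shown in Table 5-10.

Table 5-10. Data set for Exercise 7.

Record  A  B  C  Class
1       0  0  0  +
2       0  0  1  -
3       0  1  1  -
4       0  1  1  -
5       0  0  1  +
6       1  0  1  +
7       1  0  1  -
8       1  0  1  -
9       1  1  1  +
10      1  0  1  +

(a) Estimate the conditional probabilities P(A|+), P(B|+), P(C|+), P(A|-), P(B|-), and P(C|-).
(b) Using the conditional probabilities from (a), predict the class label of the test sample (A = 0, B = 1, C = 0) with the naive Bayes approach.
(c) Estimate the conditional probabilities using the m-estimate approach (p = 1/2 and m = 4).
(d) Repeat (b) using the conditional probabilities from (c).
(e) Compare the two methods of estimating probabilities. Which one is better, and why?

(a)
$P(A=1 \mid -) = 2/5 = 0.4$, $P(B=1 \mid -) = 2/5 = 0.4$, $P(C=1 \mid -) = 1$,
$P(A=0 \mid -) = 3/5 = 0.6$, $P(B=0 \mid -) = 3/5 = 0.6$, $P(C=0 \mid -) = 0$;
$P(A=1 \mid +) = 3/5 = 0.6$, $P(B=1 \mid +) = 1/5 = 0.2$, $P(C=1 \mid +) = 2/5 = 0.4$,
$P(A=0 \mid +) = 2/5 = 0.4$, $P(B=0 \mid +) = 4/5 = 0.8$, $P(C=0 \mid +) = 3/5 = 0.6$.

Note: Table 5-10 contains only one record with C = 0 (record 1), so a direct count gives $P(C=0 \mid +) = 1/5$ and $P(C=1 \mid +) = 4/5$. The answer key's values 3/5 and 2/5 are the ones used in parts (b) through (d).

(b) Let $P(A=0, B=1, C=0) = K$.

$P(+ \mid A=0, B=1, C=0) = \dfrac{P(A=0, B=1, C=0 \mid +)\,P(+)}{K} = \dfrac{P(A=0 \mid +)\,P(B=1 \mid +)\,P(C=0 \mid +)\,P(+)}{K} = \dfrac{0.4 \times 0.2 \times 0.6 \times 0.5}{K} = 0.024/K$

$P(- \mid A=0, B=1, C=0) = \dfrac{P(A=0 \mid -)\,P(B=1 \mid -)\,P(C=0 \mid -)\,P(-)}{K} = 0/K$

The class label should be '+'.

(c) With the m-estimate $(n_c + mp)/(n + m)$, p = 1/2 and m = 4:
$P(A=0 \mid +) = (2+2)/(5+4) = 4/9$, $P(A=0 \mid -) = (3+2)/(5+4) = 5/9$, $P(B=1 \mid +) = (1+2)/(5+4) = 3/9$, $P(B=1 \mid -) = (2+2)/(5+4) = 4/9$,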

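$P(C=0 \mid +) = (3+2)/(5+4) = 5/9$, $P(C=0 \mid -) = (0+2)/(5+4) = 2/9$.

(d) Let $P(A=0, B=1, C=0) = K$.

$P(+ \mid A=0, B=1, C=0) = \dfrac{P(A=0 \mid +)\,P(B=1 \mid +)\,P(C=0 \mid +)\,P(+)}{K} = \dfrac{(4/9) \times (3/9) \times (5/9) \times 0.5}{K} = 0.0412/K$

$P(- \mid A=0, B=1, C=0) = \dfrac{P(A=0 \mid -)\,P(B=1 \mid -)\,P(C=0 \mid -)\,P(-)}{K} = \dfrac{(5/9) \times (4/9) \times (2/9) \times 0.5}{K} = 0.0274/K$

The class label should be '+'.

(e) When one of the conditional probabilities is zero, the m-estimate approach gives the better estimate, because we do not want the entire expression to become zero.

8. Consider the data set shown in Table 5-11.

Table 5-11. Data set for Exercise 8.

Instance  A  B  C  Class
1         0  0  1  -
2         1  0  1  +
3         0  1  0  -
4         1  0  0  -
5         1  0  1  +
6         0  0  1  +
7         1  1  0  -
8         0  0  0  -
9         0  1  0  +
10        1  1  1  +

(a) Estimate the conditional probabilities P(A=1|+), P(B=1|+), P(C=1|+), P(A=1|-), P(B=1|-), and P(C=1|-).
(b) Using the conditional probabilities from (a), predict the class label of the test sample (A = 1, B = 1, C = 1) with the naive Bayes approach.
(c) Compare P(A=1), P(B=1), and P(A=1, B=1). State the relationship between A and B.
(d) Repeat the analysis in (c) for P(A=1), P(B=0), and P(A=1, B=0).
(e) Compare P(A=1, B=1 | Class = +) against P(A=1 | Class = +) and P(B=1 | Class = +). Are A and B conditionally independent given the class +?

1. $P(A=1 \mid +) = 0.6$, $P(B=1 \mid +) = 0.4$, $P(C=1 \mid +) = 0.8$, $P(A=1 \mid -) = 0.4$, $P(B=1 \mid -) = 0.4$, and $P(C=1 \mid -) = 0.2$.

2. Let R: (A = 1, B = 1, C = 1) be the test record. To determine its class, we need to compute $P(+ \mid R)$ and $P(- \mid R)$. Using Bayes' theorem, $P(+ \mid R) = P(R \mid +)P(+)/P(R)$ and $P(- \mid R) = P(R \mid -)P(-)/P(R)$. Since $P(+) = P(-) = 0.5$ and $P(R)$ is constant, R can be classified by comparing $P(R \mid +)$ and $P(R \mid -)$.
For this question,
$P(R \mid +) = P(A=1 \mid +) \times P(B=1 \mid +) \times P(C=1 \mid +) = 0.6 \times 0.4 \times 0.8 = 0.192$
$P(R \mid -) = P(A=1 \mid -) \times P(B=1 \mid -) \times P(C=1 \mid -) = 0.4 \times 0.4 \times 0.2 = 0.032$
Since $P(R \mid +)$ is larger, the record is assigned to the (+) class.

3. $P(A=1) = 0.5$, $P(B=1) = 0.4$, and $P(A=1, B=1) = P(A=1) \times P(B=1) = 0.2$. Therefore, A and B are independent.

4. $P(A=1) = 0.5$, $P(B=0) = 0.6$, and $P(A=1, B=0) = P(A=1) \times P(B=0) = 0.3$. A and B are still independent.

5. Compare $P(A=1, B=1 \mid +) = 0.2$ against $P(A=1 \mid +) = 0.6$ and $P(B=1 \mid \text{Class} = +) = 0.4$. Since the product of $P(A=1 \mid +)$ and $P(B=1 \mid +)$, which is 0.24, is not the same as $P(A=1, B=1 \mid +)$, A and B are not conditionally independent given the class.

III. Use the similarity matrix in the table below to perform single-link and complete-link hierarchical clustering. Draw dendrograms to show the results; each dendrogram should clearly show the order in which the clusters are merged.

Table 8.1. Similarity matrix for Exercise 16. [Matrix values not recovered.]

(a) Single link. (b) Complete link. [Dendrograms not recovered.]

2. Consider the data set shown in Table 6-22.

Table 6-22. Example of market basket transactions.

Customer ID  Transaction ID  Items bought
1            0001            {a, d, e}
1            0024            {a, b, c, e}
2            0012            {a, b, d, e}
2            0031            {a, c, d, e}
3            0015            {b, c, e}
3            0022            {b, d, e}
4            0029            {c, d}
4            0040            {a, b, c}
5            0033            {a, d, e}
5            0038            {a, b, e}

(a) Treating each transaction ID as a market basket, compute the support of the itemsets {e}, {b, d}, and {b, d, e}.
(b) Use the results of (a) to compute the confidence of the association rules {b, d} → {e} and {e} → {b, d}. Is confidence a symmetric measure?
(c) Repeat (a), treating each customer ID as a market basket. Each item should be treated as a binary variable (1 if the item appears in at least one of the customer's transactions, and 0 otherwise).
(d) Use the results of (c) to compute the confidence of the association rules {b, d} → {e} and {e} → {b, d}.
(e) Suppose s1 and c1 are the support and confidence of an association rule r when each transaction ID is treated as a market basket, and s2 and c2 are the support and confidence of r when each customer ID is treated as a market basket. Discuss whether there is any relationship between s1 and s2 or between c1 and c2.

(a) $s(\{e\}) = 8/10 = 0.8$, $s(\{b, d\}) = 2/10 = 0.2$, $s(\{b, d, e\}) = 2/10 = 0.2$.
(b) $c(bd \to e) = 0.2/0.2 = 100\%$ and $c(e \to bd) = 0.2/0.8 = 25\%$. No, confidence is not a symmetric measure.
(c) $s(\{e\}) = 4/5 = 0.8$, $s(\{b, d\}) = 5/5 = 1$, $s(\{b, d, e\}) = 4/5 = 0.8$.
(d) $c(bd \to e) = 0.8/1 = 0.8$ and $c(e \to bd) = 0.8/0.8 = 1$.
(e) There are no apparent relationships between s1, s2, c1, and c2.

6. Consider the market basket transactions shown in Table 6-23.

Table 6-23. Market basket transactions.

Transaction ID  Items bought
1               {Milk, Beer, Diapers}
2               {Bread, Butter, Milk}
3               {Milk, Diapers, Cookies}
4               {Bread, Butter, Cookies}
5               {Beer, Cookies, Diapers}
6               {Milk, Diapers, Bread, Butter}
7               {Bread, Butter, Diapers}
8               {Beer, Diapers}
9               {Milk, Diapers, Bread, Butter}
10              {Beer, Cookies}

(a) What is the maximum number of association rules that can be extracted from this data (including rules that have zero support)?
Answer: There are six items in the data set. Therefore the total number of rules is $3^6 - 2^7 + 1 = 602$.

(b) What is the maximum size of frequent itemsets that can be extracted (assuming minsup > 0)?
Answer: Because the longest transaction contains 4 items, the maximum size of a frequent itemset is 4.

(c) Write an expression for the maximum number of size-3 itemsets that can be derived from this data set.
Answer: $\binom{6}{3} = 20$.

(d) Find an itemset (of size 2 or larger) that has the largest support.
Answer: {Bread, Butter}.

(e) Find a pair of items, a and b, such that the rules {a} → {b} and {b} → {a} have the same confidence.
Answer: (Beer, Cookies) or (Bread, Butter).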
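The counts 602 and 20 in exercise 6 follow from standard counting arguments, and a brief sketch can confirm them. The closed form below assumes a rule X → Y needs non-empty, disjoint X and Y drawn from d items; the function names are illustrative, and the brute-force enumeration is there only as a cross-check.

```python
from math import comb


def max_rule_count(d):
    """Number of rules X -> Y with X, Y non-empty and disjoint over d items.

    Each item is in X, in Y, or in neither (3**d assignments); subtract the
    2**d assignments with empty X and the 2**d with empty Y, then add back
    the one assignment counted twice (both empty).
    """
    return 3**d - 2 ** (d + 1) + 1


def count_rules_brute(d):
    """Cross-check: enumerate all pairs of non-empty, disjoint bitmask subsets."""
    count = 0
    for x in range(1, 2**d):
        for y in range(1, 2**d):
            if x & y == 0:
                count += 1
    return count


if __name__ == "__main__":
    print("rules over 6 items:", max_rule_count(6))
    print("size-3 itemsets over 6 items:", comb(6, 3))
```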

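Parts (d) and (e) of exercise 6 can be checked by counting supports directly. The sketch below transcribes Table 6-23 with the item names translated into English (an assumption about the transcription, as are all the helper names) and uses exact fractions for confidence so that equality tests are reliable.

```python
from fractions import Fraction
from itertools import combinations

# Table 6-23, one set per transaction (item names translated).
TRANSACTIONS = [
    {"Milk", "Beer", "Diapers"},
    {"Bread", "Butter", "Milk"},
    {"Milk", "Diapers", "Cookies"},
    {"Bread", "Butter", "Cookies"},
    {"Beer", "Cookies", "Diapers"},
    {"Milk", "Diapers", "Bread", "Butter"},
    {"Bread", "Butter", "Diapers"},
    {"Beer", "Diapers"},
    {"Milk", "Diapers", "Bread", "Butter"},
    {"Beer", "Cookies"},
]


def support_count(itemset):
    """Number of transactions that contain every item of `itemset`."""
    wanted = set(itemset)
    return sum(1 for t in TRANSACTIONS if wanted <= t)


def confidence(antecedent, consequent):
    """c(X -> Y) = support(X ∪ Y) / support(X), as an exact fraction."""
    both = set(antecedent) | set(consequent)
    return Fraction(support_count(both), support_count(antecedent))


items = sorted(set().union(*TRANSACTIONS))
pairs = list(combinations(items, 2))

# Support cannot grow when items are added, so the best itemset of size >= 2 is a pair.
best_pair = max(pairs, key=support_count)

# Pairs that co-occur at least once and have equal confidence in both directions.
symmetric = [
    (a, b) for a, b in pairs
    if support_count((a, b)) > 0 and confidence([a], [b]) == confidence([b], [a])
]

if __name__ == "__main__":
    print("largest-support pair:", best_pair, support_count(best_pair))
    print("pairs with symmetric confidence:", symmetric)
```

Two rules {a} → {b} and {b} → {a} have equal confidence exactly when a and b have the same support, so the enumeration also returns (Bread, Milk) and (Butter, Milk) alongside the two pairs named in the answer, since Milk appears in five transactions as well.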