第九講 分類與預(yù)測(cè)_第1頁(yè)
第九講 分類與預(yù)測(cè)_第2頁(yè)
第九講 分類與預(yù)測(cè)_第3頁(yè)
第九講 分類與預(yù)測(cè)_第4頁(yè)
第九講 分類與預(yù)測(cè)_第5頁(yè)
已閱讀5頁(yè),還剩89頁(yè)未讀 繼續(xù)免費(fèi)閱讀

下載本文檔

版權(quán)說(shuō)明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請(qǐng)進(jìn)行舉報(bào)或認(rèn)領(lǐng)

文檔簡(jiǎn)介

第八講

分類與預(yù)測(cè)1廈門大學(xué)軟件學(xué)院ClassificationandPredictionWhatisclassification?Whatisprediction?IssuesregardingclassificationandpredictionClassificationbydecisiontreeinductionBayesianclassificationRule-basedclassification(自學(xué))ClassificationbybackpropagationSupportVectorMachines(SVM)(自學(xué))Associativeclassification(自學(xué))Lazylearners(orlearningfromyourneighbors)(自學(xué))Otherclassificationmethods(自學(xué))PredictionAccuracyanderrormeasures(自學(xué))Ensemblemethods(自學(xué))Modelselection(自學(xué))Summary2廈門大學(xué)軟件學(xué)院分類(Classification):

預(yù)測(cè)分類標(biāo)號(hào)(離散值或名詞性詞)建立一個(gè)模型,基于訓(xùn)練集的數(shù)據(jù)和分類屬性的值(類標(biāo)識(shí))來(lái)分類,并在新數(shù)據(jù)中使用該模型。預(yù)測(cè)(Prediction):連續(xù)值函數(shù)上的建模,例如,預(yù)測(cè)未知或缺失的值典型應(yīng)用信用度目標(biāo)市場(chǎng)醫(yī)療診斷分類vs.預(yù)測(cè)3廈門大學(xué)軟件學(xué)院分類的兩個(gè)步驟模型創(chuàng)建:描述預(yù)定的數(shù)據(jù)類集或概念集。假定每個(gè)元組或樣本屬于一個(gè)預(yù)定義的類,由一個(gè)稱為類標(biāo)號(hào)屬性(classlabelattribute)的屬性確定。用于建模的元組集稱為訓(xùn)練集(trainingset)模型表示為:分類規(guī)則、判定樹或?qū)傩怨侥P蛻?yīng)用:用于分類未來(lái)或未知對(duì)象評(píng)估模型的準(zhǔn)確率測(cè)試樣本的已知標(biāo)號(hào)與根據(jù)模型獲得的分類結(jié)果作比較。準(zhǔn)確率定義為正確被模型分類的測(cè)試樣本的百分比測(cè)試集獨(dú)立于訓(xùn)練集,否則,學(xué)習(xí)模型傾向于過(guò)分適合(over-fitting)數(shù)據(jù)如果準(zhǔn)確率可被接受,使用模型用于分類類標(biāo)號(hào)未知的數(shù)據(jù)。4廈門大學(xué)軟件學(xué)院ClassificationProcess(1):ModelConstructionTrainingDataClassificationAlgorithmsIFrank=‘professor’ORyears>6THENtenured=‘yes’Classifier(Model)5廈門大學(xué)軟件學(xué)院ClassificationProcess(2):UsetheModelinPredictionClassifierTestingDataUnseenData(Jeff,Professor,4)Tenured?6廈門大學(xué)軟件學(xué)院決策樹算法基本算法(貪心算法),樹的構(gòu)建方式:自頂向下遞歸的各個(gè)擊破方式樹以代表訓(xùn)練樣本的單個(gè)節(jié)點(diǎn)開始如果樣本都在同一類中,則節(jié)點(diǎn)為葉子節(jié)點(diǎn),并用該類標(biāo)記否則,選擇能夠最好地將樣本分類的屬性(稱為測(cè)試屬性,必須是離散值的)對(duì)測(cè)試屬性的每個(gè)已知值,創(chuàng)建一個(gè)分支,并據(jù)此劃分樣本遞歸形成每個(gè)劃分上的樣本決策樹7廈門大學(xué)軟件學(xué)院遞歸劃分步驟僅當(dāng)下列條件之一成立時(shí)立即停止給定節(jié)點(diǎn)的所有樣本屬于同一個(gè)類沒(méi)有剩余屬性可作進(jìn)一步劃分樣本——majorityvotingisemployedforclassifyingtheleaf沒(méi)有剩余的樣本age?student?creditrating?<=30>40noyesyesyes31..40nofairexcellentyesno8廈門大學(xué)軟件學(xué)院AttributeSelectionMeasure:InformationGain(ID3/C4.5)選擇具有最高信息增益(informationgain

)的屬性。pi

是D中任意元組屬于類Ci的概率,用估計(jì)|Ci,D|/|D|Expectedinformation(熵:entropy)對(duì)D中元組分類所需的期望信息:Informationneeded(afterusingAtosplitDintovpartitions)toclassifyD:InformationgainedbybranchingonattributeA9廈門大學(xué)軟件學(xué)院AttributeSelection:InformationGainClassP:buys_computer=“yes”ClassN:buys_computer=“no”

means“age<=30”has5outof14samples,with2yes’esand3no’s.HenceSimilarly,10廈門大學(xué)軟件學(xué)院ComputingInformation-GainforContinuous-ValueAttributesLetattributeAbeacontinuous-valuedattributeMustdeterminethebestsplitpointforASortthevalueAinincreasingorderTypically,themidpointbetweeneachpairofadjacentvaluesisconsideredasapossiblesplitpoint(ai+ai+1)/2isthemidpointbetweenthevaluesofaiandai+1ThepointwiththeminimumexpectedinformationrequirementforAisselectedasthesplit-pointforASplit:D1isthesetoftuplesinDsatisfyingA≤split-point,andD2isthesetoftuplesinDsatisfyingA>split-point11廈門大學(xué)軟件學(xué)院GainRatioforAttributeSelection(C4.5)信息增益度量偏向具有許多輸出的測(cè)試C4.5(asuccessorofID3)usesgainratiotoovercometheproblem(normalizationtoinformationgain)GainRatio(A)=Gain(A)/SplitInfo(A)Ex.gain_ratio(income)=0.029/0.926=0.031Theattributewiththemaximumgainratioisselectedasthesplittingattribute12廈門大學(xué)軟件學(xué)院Gini指標(biāo)(CART,IBMIntelligentMiner)IfadatasetDcontainsexamplesfromnclasses,giniindex,gini(D)isdefinedas

wherepjistherelativefrequencyofclassjinDIfadatasetDissplitonAintotwosubsetsD1andD2,theginiindexgini(D)isdefinedas不純度降低:Theattributeprovidesthesmallestginisplit(D)(orthelargestreductioninimpurity)ischosentosplitthenode(needtoenumerateallthepossiblesplittingpointsforeachattribute)13廈門大學(xué)軟件學(xué)院Giniindex(CART,IBMIntelligentMiner)Ex.Dhas9tuplesinbuys_computer=“yes”and5in“no”SupposetheattributeincomepartitionsDinto10inD1:{low,medium}and4inD2butgini{medium,high}is0.30andthusthebestsinceitisthelowestAllattributesareassumedcontinuous-valuedMayneedothertools,e.g.,clustering,togetthepossiblesplitvaluesCanbemodifiedforcategoricalattributes14廈門大學(xué)軟件學(xué)院ComparingAttributeSelectionMeasuresThethreemeasures,ingeneral,returngoodresultsbutInformationgain:biasedtowardsmultivaluedattributesGainratio:tendstopreferunbalancedsplitsinwhichonepartitionismuchsmallerthantheothersGiniindex:biasedtomultivaluedattributeshasdifficultywhen#ofclassesislargetendstofavorteststhatresultinequal-sizedpartitionsandpurityinbothpartitions15廈門大學(xué)軟件學(xué)院OverfittingandTreePruningOverfitting:過(guò)份擬合AninducedtreemayoverfitthetrainingdataToomanybranches,somemayreflectanomaliesduetonoiseoroutliersPooraccuracyforunseensamplesTwoapproachestoavoidoverfittingPrepruning:提前停止樹的構(gòu)造,如果劃分一個(gè)節(jié)點(diǎn)的元組導(dǎo)致低于預(yù)定義閾值的分裂,則結(jié)束進(jìn)一步的劃分DifficulttochooseanappropriatethresholdPostpruning:由“完全生長(zhǎng)”的樹減去子樹(用樹葉替代對(duì)應(yīng)的子樹)Useasetofdatadifferentfromthetrainingdatatodecidewhichisthe“bestprunedtree”16廈門大學(xué)軟件學(xué)院決策樹歸納基本算法的加強(qiáng)允許連續(xù)值屬性通過(guò)將連續(xù)屬性值劃分到離散值空間,以動(dòng)態(tài)定義新的離散值屬性處理缺失屬性值缺失值賦值以屬性的最常見值缺失值賦值以屬性的最或然值屬性構(gòu)建在給定的屬性上創(chuàng)建新的屬性,改變給定屬性的首先表示,是數(shù)據(jù)變換的一種形式解決決策樹歸納可能面臨的碎片、復(fù)制和重復(fù)問(wèn)題。17廈門大學(xué)軟件學(xué)院大型數(shù)據(jù)庫(kù)中的分類分類問(wèn)題是統(tǒng)計(jì)學(xué)和機(jī)器學(xué)習(xí)研究者研究的經(jīng)典難題的一個(gè)擴(kuò)展??缮炜s性:在具有數(shù)以百萬(wàn)計(jì)的樣本和成千上百個(gè)屬性的數(shù)據(jù)集上,以可被接受的速度作分類為什么在數(shù)據(jù)挖掘中使用決策樹歸納?學(xué)習(xí)速度相當(dāng)快(與其他分類方法比較)分類規(guī)則簡(jiǎn)單易懂對(duì)數(shù)據(jù)庫(kù)的訪問(wèn)可使用SQL查詢可比較分類的準(zhǔn)確度18廈門大學(xué)軟件學(xué)院可伸縮的決策樹歸納方法SLIQ(EDBT’96—Mehtaetal.)buildsanindexforeachattributeandonlyclasslistandthecurrentattributelistresideinmemorySPRINT(VLDB’96—J.Shaferetal.)constructsanattributelistdatastructurePUBLIC(VLDB’98—Rastogi&Shim)integratestreesplittingandtreepruning:stopgrowingthetreeearlierRainForest(VLDB’98—Gehrke,Ramakrishnan&Ganti)separatesthescalabilityaspectsfromthecriteriathatdeterminethequalityofthetreebuildsanAVC-list(attribute,value,classlabel)19廈門大學(xué)軟件學(xué)院ClassificationandPredictionWhatisclassification?Whatisprediction?IssuesregardingclassificationandpredictionClassificationbydecisiontreeinductionBayesianclassificationRule-basedclassificationClassificationbybackpropagationSupportVectorMachines(SVM)AssociativeclassificationLazylearners(orlearningfromyourneighbors)OtherclassificationmethodsPredictionAccuracyanderrormeasuresEnsemblemethodsModelselectionSummary20廈門大學(xué)軟件學(xué)院貝葉斯分類貝葉斯分類是統(tǒng)計(jì)學(xué)分類方法貝葉斯分類基于貝葉斯定理樸素貝葉斯分類假定一個(gè)屬性值對(duì)給定類的影響?yīng)毩⒂谄渌麑傩缘闹?類條件獨(dú)立)。貝葉斯信念網(wǎng)絡(luò)能表示屬性子集間的依賴,也可用于分類。21廈門大學(xué)軟件學(xué)院貝葉斯定理:BasicsX

是類標(biāo)識(shí)未知的數(shù)據(jù)樣本(或數(shù)據(jù)元組)如:35歲收入$4000的顧客H

是數(shù)據(jù)樣本X屬于某特定類C的某種假定。如:假設(shè)顧客將購(gòu)買計(jì)算機(jī)P(H/X):條件X下H的后驗(yàn)概率如:知道顧客年齡與收入時(shí),顧客X將購(gòu)買計(jì)算機(jī)的概率P(H):H的先驗(yàn)概率,即在我們觀察任何樣本前的初始概率,它反應(yīng)了背景知識(shí)。如:任意給定的顧客將購(gòu)買計(jì)算機(jī)的概率。P(X):被觀察的樣本數(shù)據(jù)的概率如:顧客中年齡35歲收入$4000的概率P(X|H):條件H下,X的后驗(yàn)概率如:已知顧客購(gòu)買計(jì)算機(jī),該顧客為35歲收入$4000的概率貝葉斯定理:22廈門大學(xué)軟件學(xué)院TowardsNa?veBayesianClassifierLetDbeatrainingsetoftuplesandtheirassociatedclasslabels,andeachtupleisrepresentedbyann-DattributevectorX=(x1,x2,…,xn)SupposetherearemclassesC1,C2,…,Cm.Classificationistoderivethemaximumposteriori,i.e.,themaximalP(Ci|X)ThiscanbederivedfromBayes’theoremSinceP(X)isconstantforallclasses,onlyneedstobemaximized23廈門大學(xué)軟件學(xué)院DerivationofNa?veBayesClassifierAsimplifiedassumption:attributesareconditionallyindependent(i.e.,nodependencerelationbetweenattributes):Thisgreatlyreducesthecomputationcost:OnlycountstheclassdistributionIfAkiscategorical,P(xk|Ci)isthe#oftuplesinCihavingvaluexkforAkdividedby|Ci,D|(#oftuplesofCiinD)IfAkiscontinous-valued,P(xk|Ci)isusuallycomputedbasedonGaussiandistributionwithameanμandstandarddeviationσandP(xk|Ci)is24廈門大學(xué)軟件學(xué)院Na?veBayesianClassifier:TrainingDatasetClass:C1:buys_computer=‘yes’C2:buys_computer=‘no’DatasampleX=(age<=30,Income=medium,Student=yesCredit_rating=Fair)25廈門大學(xué)軟件學(xué)院Na?veBayesianClassifier:AnExampleP(Ci):P(buys_computer=“yes”)=9/14=0.643P(buys_computer=“no”)=5/14=0.357ComputeP(X|Ci)foreachclassP(age=“<=30”|buys_computer=“yes”)=2/9=0.222P(age=“<=30”|buys_computer=“no”)=3/5=0.6P(income=“medium”|buys_computer=“yes”)=4/9=0.444P(income=“medium”|buys_computer=“no”)=2/5=0.4P(student=“yes”|buys_computer=“yes)=6/9=0.667P(student=“yes”|buys_computer=“no”)=1/5=0.2P(credit_rating=“fair”|buys_computer=“yes”)=6/9=0.667P(credit_rating=“fair”|buys_computer=“no”)=2/5=0.4X=(age<=30,income=medium,student=yes,credit_rating=fair)

P(X|Ci):P(X|buys_computer=“yes”)=0.222x0.444x0.667x0.667=0.044P(X|buys_computer=“no”)=0.6x0.4x0.2x0.4=0.019P(X|Ci)*P(Ci):P(X|buys_computer=“yes”)*P(buys_computer=“yes”)=0.028

P(X|buys_computer=“no”)*P(buys_computer=“no”)=0.007Therefore,Xbelongstoclass(“buys_computer=yes”)

26廈門大學(xué)軟件學(xué)院Avoidingthe0-ProbabilityProblemNa?veBayesianpredictionrequireseachconditionalprob.benon-zero.Otherwise,thepredictedprob.willbezero

Ex.Supposeadatasetwith1000tuples,income=low(0),income=medium(990),andincome=high(10),UseLaplaciancorrection(orLaplacianestimator)Adding1toeachcaseProb(income=low)=1/1003Prob(income=medium)=991/1003Prob(income=high)=11/1003The“corrected”prob.estimatesareclosetotheir“uncorrected”counterparts27廈門大學(xué)軟件學(xué)院Na?veBayesianClassifier:CommentsAdvantagesEasytoimplementGoodresultsobtainedinmostofthecasesDisadvantagesAssumption:classconditionalindependence,thereforelossofaccuracyPractically,dependenciesexistamongvariablesHowtodealwiththesedependencies?BayesianBeliefNetworks28廈門大學(xué)軟件學(xué)院貝葉斯網(wǎng)絡(luò)貝葉斯信念網(wǎng)絡(luò)允許在變量的子集間定義類條件的獨(dú)立性因果關(guān)系的圖形建模表示了變量間的依賴關(guān)系說(shuō)明聯(lián)合條件概率分布節(jié)點(diǎn):隨機(jī)變量(離散或連續(xù))弧:概率依賴X,YaretheparentsofZ,andYistheparentofPNodependencybetweenZandP有向無(wú)環(huán)圖29廈門大學(xué)軟件學(xué)院AnExampleFamilyHistoryLungCancerPositiveXRaySmokerEmphysemaDyspneaLC~LC(FH,S)(FH,~S)(~FH,S)(~FH,~S)0.10.9BayesianBeliefNetworks對(duì)于每個(gè)變量,信念網(wǎng)絡(luò)有一個(gè)條件概率表CPT,P(Y|Parent(Y))30廈門大學(xué)軟件學(xué)院訓(xùn)練貝葉斯信念網(wǎng)絡(luò)許多情況都有可能如果網(wǎng)絡(luò)結(jié)構(gòu)已知并且變量是可見的:只需計(jì)算CPT(條件概率表)網(wǎng)絡(luò)結(jié)構(gòu)已知,但有些變量隱藏在樣本中:使用梯度下降方法訓(xùn)練信念網(wǎng)絡(luò),類似于神經(jīng)網(wǎng)絡(luò)的學(xué)習(xí)網(wǎng)絡(luò)結(jié)構(gòu)未知,所有變量都可見:從模型空間查找,重新構(gòu)建圖的拓?fù)浣Y(jié)構(gòu)未知網(wǎng)絡(luò)結(jié)構(gòu),所有變量都隱藏:還未有好的算法31廈門大學(xué)軟件學(xué)院ClassificationandPredictionWhatisclassification?Whatisprediction?IssuesregardingclassificationandpredictionClassificationbydecisiontreeinductionBayesianclassificationRule-basedclassification(自學(xué))ClassificationbybackpropagationSupportVectorMachines(SVM)AssociativeclassificationLazylearners(orlearningfromyourneighbors)OtherclassificationmethodsPredictionAccuracyanderrormeasuresEnsemblemethodsModelselectionSummary32廈門大學(xué)軟件學(xué)院UsingIF-THENRulesforClassification以IF-THEN規(guī)則的形式來(lái)表示知識(shí)R:IFage=youthANDstudent=yesTHENbuys_computer=yes規(guī)則前件或前提vs.規(guī)則結(jié)論規(guī)則的覆蓋率與準(zhǔn)確率ncovers=#oftuplescoveredbyR(規(guī)則前件被滿足)ncorrect=#oftuplescorrectlyclassifiedbyRcoverage(R)=ncovers/|D|/*D:trainingdataset*/accuracy(R)=ncorrect/ncovers如果元組被多條規(guī)則所“觸發(fā)”,則需要解決沖突規(guī)模序(Sizeordering):assignthehighestprioritytothetriggeringrulesthathasthe“toughest”requirement(i.e.,withthemostattributetest)基于類的序(Class-basedordering):類按重要性遞減排序規(guī)則序(Rule-basedordering(decisionlist)):rulesareorganizedintoonelongprioritylist,accordingtosomemeasureofrulequalityorbyexperts33廈門大學(xué)軟件學(xué)院age?student?creditrating?<=30>40noyesyesyes31..40nofairexcellentyesno從決策樹中提取規(guī)則以IF-THEN規(guī)則的形式來(lái)表示知識(shí)對(duì)從根到樹葉的每一條路徑創(chuàng)建一條規(guī)則。沿著給定路徑上的每個(gè)屬性——值對(duì),形成規(guī)則前件(“IF”部分)葉子節(jié)點(diǎn)包含類預(yù)測(cè),形成規(guī)則后件(“THEN”部分)IF-THEN規(guī)則易于理解。ExampleIFage=“<=30”ANDstudent=“no”THENbuys_computer=“no”IFage=“<=30”ANDstudent=“yes”THENbuys_computer=“yes”IFage=“31…40” THENbuys_computer=“yes”IFage=“>40”ANDcredit_rating=“excellent”THENbuys_computer=“yes”IFage=“<=30”ANDcredit_rating=“fair”THENbuys_computer=“no”34廈門大學(xué)軟件學(xué)院RuleExtractionfromtheTrainingData順序覆蓋算法:直接從訓(xùn)練數(shù)據(jù)提取IF-THEN規(guī)則,不必產(chǎn)生決策樹典型的順序覆蓋算法:FOIL,AQ,CN2,RIPPER規(guī)則被順序?qū)W習(xí),理想地,在為C類學(xué)習(xí)規(guī)則時(shí),希望規(guī)則覆蓋C類的所有(或許多)訓(xùn)練元組,并且沒(méi)有(或很少)覆蓋其他類的原則Steps:一次學(xué)習(xí)一條規(guī)則每當(dāng)學(xué)習(xí)一個(gè)規(guī)則,就刪除該規(guī)則覆蓋的元組對(duì)剩下的元組重復(fù)該過(guò)程,直到遇到終止條件(無(wú)訓(xùn)練樣本或規(guī)則質(zhì)量低于設(shè)定閾值)35廈門大學(xué)軟件學(xué)院HowtoLearn-One-Rule?Starwiththemostgeneralrulepossible:condition=emptyAddingnewattributesbyadoptingagreedydepth-firststrategyPickstheonethatmostimprovestherulequalityRule-Qualitymeasures:considerbothcoverageandaccuracyFoil-gain(inFOIL&RIPPER):assessesinfo_gainbyextendingconditionItfavorsrulesthathavehighaccuracyandcovermanypositivetuplesRulepruningbasedonanindependentsetoftesttuplesPos/negare#ofpositive/negativetuplescoveredbyR.IfFOIL_PruneishigherfortheprunedversionofR,pruneR36廈門大學(xué)軟件學(xué)院37廈門大學(xué)軟件學(xué)院ClassificationandPredictionWhatisclassification?Whatisprediction?IssuesregardingclassificationandpredictionClassificationbydecisiontreeinductionBayesianclassificationRule-basedclassificationClassificationbybackpropagationSupportVectorMachines(SVM)AssociativeclassificationLazylearners(orlearningfromyourneighbors)OtherclassificationmethodsPredictionAccuracyanderrormeasuresEnsemblemethodsModelselectionSummary38廈門大學(xué)軟件學(xué)院分類:

預(yù)測(cè)分類的類標(biāo)識(shí)典型應(yīng)用{credithistory,salary}->creditapproval(Yes/No){Temp,Humidity}-->Rain(Yes/No)分類的數(shù)學(xué)映射數(shù)學(xué)方式:39廈門大學(xué)軟件學(xué)院線性分類二進(jìn)制分類問(wèn)題紅線上方的數(shù)據(jù)屬于類’x’紅線下方的數(shù)據(jù)屬于類‘o’例子–SVM,Perceptron,ProbabilisticClassifiersxxxxxxxxxxooooooooooooo40廈門大學(xué)軟件學(xué)院DiscriminativeClassifiers優(yōu)點(diǎn)預(yù)測(cè)準(zhǔn)確率一般更高(與貝葉斯方法比較)對(duì)在樣本包含錯(cuò)誤信息的數(shù)據(jù)集上工作時(shí),具有較強(qiáng)的魯棒性快速評(píng)估被學(xué)習(xí)的目標(biāo)函數(shù)(貝葉斯網(wǎng)絡(luò)通常較慢)限制漫長(zhǎng)的訓(xùn)練時(shí)間網(wǎng)絡(luò)構(gòu)建和訓(xùn)練過(guò)程中,參數(shù)設(shè)置多,很難確定可理解性差41廈門大學(xué)軟件學(xué)院ClassificationbyBackpropagation后向傳播(Backpropagation):一種神經(jīng)網(wǎng)絡(luò)(neuralnetwork)學(xué)習(xí)算法最早由心理學(xué)家和神經(jīng)學(xué)家開創(chuàng),旨在尋求開發(fā)和測(cè)試神經(jīng)的計(jì)算模擬神經(jīng)網(wǎng)絡(luò):一組連接的輸入輸出單元,每個(gè)連接都與一個(gè)權(quán)重(weight)關(guān)聯(lián)在學(xué)習(xí)過(guò)程中,通過(guò)調(diào)整權(quán)重,能預(yù)測(cè)輸入元組的正確類標(biāo)號(hào)又稱“連接者學(xué)習(xí)”(connectionistlearning)42廈門大學(xué)軟件學(xué)院NeuralNetworkasaClassifierWeaknessLongtrainingtimeRequireanumberofparameterstypicallybestdeterminedempirically,e.g.,thenetworktopologyor``structure."Poorinterpretability:Difficulttointerpretthesymbolicmeaningbehindthelearnedweightsandof``hiddenunits"inthenetworkStrengthHightolerancetonoisydataAbilitytoclassifyuntrainedpatternsWell-suitedforcontinuous-valuedinputsandoutputsSuccessfulonawidearrayofreal-worlddataAlgorithmsareinherentlyparallelTechniqueshaverecentlybeendevelopedfortheextractionofrulesfromtrainedneuralnetworks43廈門大學(xué)軟件學(xué)院神經(jīng)元(感知器)Then-dimensionalinputvectorxismappedintovariableybymeansofthescalarproductandanonlinearfunctionmappingmk-fweightedsumInputvectorxoutputyActivationfunctionweightvectorw?w0w1wnx0x1xn44廈門大學(xué)軟件學(xué)院AMulti-LayerFeed-ForwardNeuralNetworkOutputlayerInputlayerHiddenlayerOutputvectorInputvector:Xwij45廈門大學(xué)軟件學(xué)院HowAMulti-LayerNeuralNetworkWorks?網(wǎng)絡(luò)的輸入對(duì)應(yīng)于每個(gè)訓(xùn)練元組測(cè)量的屬性輸入同時(shí)提供給稱作輸入層(inputlayer)的單元層。輸入層同時(shí)加權(quán)提供給隱藏層(hiddenlayer)隱藏層的數(shù)量是任意的,實(shí)踐中通常只有一層最后一個(gè)隱藏層的加權(quán)輸出作為輸出層的單元的輸入,輸出層發(fā)布給定元組的網(wǎng)絡(luò)預(yù)測(cè)。前饋網(wǎng)絡(luò):權(quán)重都不回送到輸入單元或前一層的輸出單元的網(wǎng)絡(luò)從統(tǒng)計(jì)學(xué)觀點(diǎn)來(lái)看,神經(jīng)網(wǎng)絡(luò)進(jìn)行的時(shí)非線性回歸。給定足夠多的隱藏單元和足夠的訓(xùn)練樣本,多層前饋網(wǎng)絡(luò)可以逼近任何函數(shù)。46廈門大學(xué)軟件學(xué)院DefiningaNetworkTopologyFirstdecidethenetworktopology:#ofunitsintheinputlayer,#ofhiddenlayers(if>1),#ofunitsineachhiddenlayer,and#ofunitsintheoutputlayerNormalizingtheinputvaluesforeachattributemeasuredinthetrainingtuplesto[0.0—1.0]Oneinputunitperdomainvalue,eachinitializedto0Output,ifforclassificationandmorethantwoclasses,oneoutputunitperclassisusedOnceanetworkhasbeentrainedanditsaccuracyisunacceptable,repeatthetrainingprocesswithadifferentnetworktopologyoradifferentsetofinitialweights47廈門大學(xué)軟件學(xué)院Backpropagation迭代地處理訓(xùn)練元組數(shù)據(jù)集,將每個(gè)元組的網(wǎng)絡(luò)預(yù)測(cè)與實(shí)際已知的目標(biāo)值做比較對(duì)于每個(gè)訓(xùn)練樣本,修改權(quán)重使網(wǎng)絡(luò)預(yù)測(cè)和實(shí)際目標(biāo)間的均方誤差最小修改“后向”進(jìn)行:由輸出層經(jīng)由每個(gè)隱藏藏,到第一個(gè)隱藏層StepsInitializeweights(tosmallrandom#s)andbiasesinthenetworkPropagatetheinputsforward(byapplyingactivationfunction)Backpropagatetheerror(byupdatingweightsandbiases)Terminatingcondition(whenerrorisverysmall,etc.)48廈門大學(xué)軟件學(xué)院49廈門大學(xué)軟件學(xué)院ClassificationandPredictionWhatisclassification?Whatisprediction?IssuesregardingclassificationandpredictionClassificationbydecisiontreeinductionBayesianclassificationRule-basedclassificationClassificationbybackpropagationSupportVectorMachines(SVM)(自學(xué))AssociativeclassificationLazylearners(orlearningfromyourneighbors)OtherclassificationmethodsPredictionAccuracyanderrormeasuresEnsemblemethodsModelselectionSummary50廈門大學(xué)軟件學(xué)院SVM—SupportVectorMachines線性和非線性數(shù)據(jù)的新分類方法使用非線性映射,將原訓(xùn)練數(shù)據(jù)映射到較高的維在新的維上,它搜索線性最佳分離超平面(決策邊界)

使用一個(gè)適當(dāng)?shù)淖銐蚋呔S的非線性映射,兩類數(shù)據(jù)總可以被超平面所分開。SVM使用支持向量

(“essential”trainingtuples)和邊緣(definedbythesupportvectors)發(fā)現(xiàn)超平面51廈門大學(xué)軟件學(xué)院SVM—HistoryandApplicationsVapnikandcolleagues(1992)—groundworkfromVapnik&Chervonenkis’statisticallearningtheoryin1960sFeatures:trainingcanbeslowbutaccuracyishighowingtotheirabilitytomodelcomplexnonlineardecisionboundaries(marginmaximization)UsedbothforclassificationandpredictionApplications:handwrittendigitrecognition,objectrecognition,speakeridentification,benchmarkingtime-seriespredictiontests52廈門大學(xué)軟件學(xué)院SVM—GeneralPhilosophySupportVectorsSmallMarginLargeMargin53廈門大學(xué)軟件學(xué)院SVM—MarginsandSupportVectors54廈門大學(xué)軟件學(xué)院SVM—WhenDataIsLinearlySeparablemLetdataDbe(X1,y1),…,(X|D|,y|D|),whereXiisthesetoftrainingtuplesassociatedwiththeclasslabelsyiThereareinfinitelines(hyperplanes)separatingthetwoclassesbutwewanttofindthebestone(theonethatminimizesclassificationerroronunseendata)SVMsearchesforthehyperplanewiththelargestmargin,i.e.,maximummarginalhyperplane(MMH)55廈門大學(xué)軟件學(xué)院SVM—LinearlySeparableAseparatinghyperplanecanbewrittenasW●X+b=0whereW={w1,w2,…,wn}isaweightvectorandbascalar(bias)For2-Ditcanbewrittenasw0+w1x1+w2x2=0Thehyperplanedefiningthesidesofthemargin:H1:w0+w1x1+w2x2≥1foryi=+1,andH2:w0+w1x1+w2x2≤–1foryi=–1AnytrainingtuplesthatfallonhyperplanesH1orH2(i.e.,the

sidesdefiningthemargin)aresupportvectors56廈門大學(xué)軟件學(xué)院SVM—LinearlyInseparableTransformtheoriginalinputdataintoahigherdimensionalspaceSearchforalinearseparatinghyperplaneinthenewspace57廈門大學(xué)軟件學(xué)院SVM—KernelfunctionsInsteadofcomputingthedotproductonthetransformeddatatuples,itismathematicallyequivalenttoinsteadapplyingakernelfunctionK(Xi,Xj)totheoriginaldata,i.e.,K(Xi,Xj)=Φ(Xi)Φ(Xj)TypicalKernelFunctionsSVMcanalsobeusedforclassifyingmultiple(>2)classesandforregressionanalysis(withadditionaluserparameters)58廈門大學(xué)軟件學(xué)院ClassificationandPredictionWhatisclassification?Whatisprediction?IssuesregardingclassificationandpredictionClassificationbydecisiontreeinductionBayesianclassificationRule-basedclassificationClassificationbybackpropagationSupportVectorMachines(SVM)Associativeclassification(自學(xué))

Lazylearners(orlearningfromyourneighbors)OtherclassificationmethodsPredictionAccuracyanderrormeasuresEnsemblemethodsModelselectionSummary59廈門大學(xué)軟件學(xué)院AssociativeClassificationAssociativeclassification關(guān)聯(lián)規(guī)則的產(chǎn)生和分析旨在用于分類搜索頻繁模式與類標(biāo)號(hào)之間的強(qiáng)關(guān)聯(lián)規(guī)則。Classification:BasedonevaluatingasetofrulesintheformofP1^p2…^pl

“Aclass=C”(conf,sup)Whyeffective?關(guān)聯(lián)規(guī)則考察了多屬性間的高置信度關(guān)聯(lián),可以克服決策樹歸納一次只能考慮一個(gè)屬性的局限性。關(guān)聯(lián)分類比傳統(tǒng)的分類方法更準(zhǔn)確60廈門大學(xué)軟件學(xué)院TypicalAssociativeClassificationMethodsCBA(ClassificationByAssociation:Liu,Hsu&Ma,KDD’98)MineassociationpossiblerulesintheformofCond-set(asetofattribute-valuepairs)classlabelBuildclassifier:OrganizerulesaccordingtodecreasingprecedencebasedonconfidenceandthensupportCMAR(ClassificationbasedonMultipleAssociationRules:Li,Han,Pei,ICDM’01)Classification:StatisticalanalysisonmultiplerulesCPAR(ClassificationbasedonPredictiveAssociationRules:Yin&Han,SDM’03)Generationofpredictiverules(FOIL-likeanalysis)Highefficiency,accuracysimilartoCMARRCBT(Miningtop-kcoveringrulegroupsforgeneexpressiondata,Congetal.SIGMOD’05)Explorehigh-dimensionalclassification,usingtop-krulegroupsAchievehighclassificationaccuracyandhighrun-timeefficiency61廈門大學(xué)軟件學(xué)院ClassificationandPredictionWhatisclassification?Whatisprediction?IssuesregardingclassificationandpredictionClassificationbydecisiontreeinductionBayesianclassificationRule-basedclassificationClassificationbybackpropagationSupportVectorMachines(SVM)AssociativeclassificationLazylearners(orlearningfromyourneighbors)(自學(xué))OtherclassificationmethodsPredictionAccuracyanderrormeasuresEnsemblemethodsModelselectionSummary62廈門大學(xué)軟件學(xué)院Lazyvs.EagerLearning懶惰與急切學(xué)習(xí)法Eagerlearning(theabovediscussedmethods):Givenasetoftrainingset,constructsaclassificationmodelbeforereceivingnew(e.g.,test)datatoclassifyLazylearning(e.g.,instance-basedlearning):Simplystorestrainingdata(oronlyminorprocessing)andwaitsuntilitisgivenatesttupleLazy:lesstimeintrainingbutmoretimeinpredicting63廈門大學(xué)軟件學(xué)院LazyLearner:Instance-BasedMethodsInstance-basedlearning:Storetrainingexamplesanddelaytheprocessing(“l(fā)azyevaluation”)untilanewinstancemustbeclassifiedTypicalapproachesk-nearestneighborapproachInstancesrepresentedaspointsinaEuclideanspace.Case-basedreasoningUsessymbolicrepresentationsandknowledge-basedinference64廈門大學(xué)軟件學(xué)院Thek-NearestNeighborAlgorithmAllinstancescorrespondtopointsinthen-DspaceThenearestneighboraredefinedintermsofEuclideandistance,dist(X1,X2)Targetfunctioncouldbediscrete-orreal-valuedFordiscrete-valued,k-NNreturnsthemostcommonvalueamongthektrainingexamplesnearestto

xq

._+_xq+__+__+65廈門大學(xué)軟件學(xué)院Case-BasedReasoning(CBR)CBR:UsesadatabaseofproblemsolutionstosolvenewproblemsStoresymbolicdescription(tuplesorcases)—notpointsinaEuclideanspaceApplications:Customer-service(product-relateddiagnosis),legalrulingMethodologyInstancesrepresentedbyrichsymbolicdescriptions(e.g.,functiongraphs)Searchforsimilarcases,multipleretrievedcasesmaybecombinedTightcouplingbetweencaseretrieval,knowledge-basedreasoning,andproblemsolvingChallengesFindagoodsimilaritymetricIndexingbasedonsyntacticsimilaritymeasure,andwhenfailure,backtracking,andadaptingtoadditionalcases66廈門大學(xué)軟件學(xué)院ClassificationandPredictionWhatisclassification?Whatisprediction?IssuesregardingclassificationandpredictionClassificationbydecisiontreeinductionBayesianclassificationRule-basedclassificationClassificationbybackpropagationSupportVectorMachines(SVM)AssociativeclassificationLazylearners(orlearningfromyourneighbors)Otherclassificationmethods(自學(xué))PredictionAccuracyanderrormeasuresEnsemblemethodsModelselectionSummary67廈門大學(xué)軟件學(xué)院GeneticAlgorithms(GA)GeneticAlgorithm:basedonananalogytobiologicalevolutionAninitialpopulationiscreatedconsistingofrandomlygeneratedrulesEachruleisrepresentedbyastringofbitsE.g.,ifA1and?A2thenC2canbeencodedas100Ifanattributehask>2values,kbitscanbeusedBasedonthenotionofsurvivalofthefittest,anewpopulationisformedtoconsistofthefittestrulesandtheiroffspringsThefitnessofaruleisrepresentedbyitsclassificationaccuracyonasetoftrainingexamplesOffspringsaregeneratedbycrossoverandmutationTheprocesscontinuesuntilapopulationPevolveswheneachruleinPsatisfiesaprespecifiedthresholdSlowbuteasilyparallelizable68廈門大學(xué)軟件學(xué)院RoughSetApproachRoughsetsareusedtoapproximatelyor“roughly”defineequivalentclassesAroughsetforagivenclassCisapproximatedbytwosets:alowerapproximation(certaintobeinC)andanupperapproximation(cannotbedescribedasnotbelongingtoC)Findingtheminimalsubsets(reducts)ofattributesforfeaturereductionisNP-hardbutadiscernibilitymatrix(whichstoresthedifferencesbetweenattributevaluesforeachpairofdatatuples)isusedtoreducethecomputationintensity69廈門大學(xué)軟件學(xué)院FuzzySetApproachesFuzzylogicusestruthvaluesbetween0.0and1.0torepresentthedegreeofmembership(suchasusingfuzzymembershipgraph)Attributevaluesareconvertedtofuzzyvaluese.g.,incomeismappedintothediscretecategories{low,medium,high}withfuzzyvaluescalculatedForagivennewsample,morethanonefuzzyvaluemayapplyEachapplicablerulecontributesavoteformembershipinthecategoriesTypically,thetruthvaluesforeachpredictedcategoryaresummed,andthesesumsarecombined70廈門大學(xué)軟件學(xué)院ClassificationandPredictionWhatisclassification?Whatisprediction?IssuesregardingclassificationandpredictionClassificationbydecisiontreeinductionBayesianclassificationRule-basedclassificationClassificationbybackpropagationSupportVectorMachines(SVM)AssociativeclassificationLazylearners(orlearningfromyourneighbors)OtherclassificationmethodsPredictionAccuracyanderrormeasuresEnsemblemethodsModelselectionSummary71廈門大學(xué)軟件學(xué)院WhatIsPrediction?(Numerical)predictionissimilartoclassificationconstructamodelusemodeltopredictcontinuousororderedvalueforagiveninputPredictionisdifferentfromclassificationClassificationreferstopredictcategoricalclasslabelPredictionmodelscontinuous-valuedfunctionsMajormethodforprediction:regression回歸分析可以用來(lái)對(duì)一個(gè)或多個(gè)獨(dú)立或預(yù)測(cè)變量和一個(gè)(連續(xù)值的)依賴或相應(yīng)變量之間的聯(lián)系建模。RegressionanalysisLinearandmultipleregressionNon-linearregressionOtherregressionmethods:generalizedlinearmodel,Poissonregression,log-linearmodels,regressiontrees72廈門大學(xué)軟件學(xué)院LinearRegression

Linearregression:involves

溫馨提示

  • 1. 本站所有資源如無(wú)特殊說(shuō)明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請(qǐng)下載最新的WinRAR軟件解壓。
  • 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請(qǐng)聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
  • 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁(yè)內(nèi)容里面會(huì)有圖紙預(yù)覽,若沒(méi)有圖紙預(yù)覽就沒(méi)有圖紙。
  • 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
  • 5. 人人文庫(kù)網(wǎng)僅提供信息存儲(chǔ)空間,僅對(duì)用戶上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對(duì)用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對(duì)任何下載內(nèi)容負(fù)責(zé)。
  • 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請(qǐng)與我們聯(lián)系,我們立即糾正。
  • 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時(shí)也不承擔(dān)用戶因使用這些下載資源對(duì)自己和他人造成任何形式的傷害或損失。

最新文檔

評(píng)論

0/150

提交評(píng)論