Data Mining: Concepts and Techniques
— Chapter 6 —
Jiawei Han, Department of Computer Science, University of Illinois at Urbana-Champaign (…/~hanj)
©2006 Jiawei Han and Micheline Kamber. All rights reserved.

Chapter 6. Classification and Prediction
- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian classification
- Rule-based classification
- Classification by backpropagation
- Support Vector Machines (SVM)
- Associative classification
- Lazy learners (or learning from your neighbors)
- Other classification methods
- Prediction
- Accuracy and error measures
- Ensemble methods
- Model selection
- Summary
Classification vs. Prediction
- Classification
  - predicts categorical class labels (discrete or nominal)
  - classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data
- Prediction
  - models continuous-valued functions, i.e., predicts unknown or missing values
- Typical applications: credit approval, target marketing, medical diagnosis, fraud detection
Classification — A Two-Step Process
- Model construction: describing a set of predetermined classes
  - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  - The set of tuples used for model construction is the training set
  - The model is represented as classification rules, decision trees, or mathematical formulae
- Model usage: classifying future or unknown objects
  - Estimate the accuracy of the model: the known label of each test sample is compared with the model's prediction; the accuracy rate is the percentage of test-set samples correctly classified by the model
  - The test set must be independent of the training set, otherwise over-fitting will occur
  - If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known

Process (1): Model Construction
- (figure: training data → classification algorithm → classifier)
- Example of a learned model: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
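The two-step process can be made concrete in a few lines of Python. This is a minimal sketch: the rule is the one from the slide, but the test tuples are hypothetical stand-ins for the tables that appear as figures in the deck.

```python
# Step 1 (model construction) produced this classifier: a single rule
# learned from the training set.
def classify(rank, years):
    return 'yes' if rank == 'professor' or years > 6 else 'no'

# Step 2a (model usage): estimate accuracy on a labeled test set that is
# independent of the training set. These tuples are made up for illustration.
test_set = [
    ('Tom', 'assistant prof', 2, 'no'),
    ('Merlisa', 'associate prof', 7, 'no'),   # years > 6: the rule gets this wrong
    ('George', 'professor', 5, 'yes'),
    ('Joseph', 'assistant prof', 7, 'yes'),
]
correct = sum(classify(r, y) == label for _, r, y, label in test_set)
print(f'accuracy = {correct}/{len(test_set)}')    # 3/4

# Step 2b: if the accuracy is acceptable, classify unseen data,
# e.g. the slide's (Jeff, Professor, 4).
print(classify('professor', 4))                   # -> 'yes'
```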
Process (2): Using the Model in Prediction
- The classifier is applied first to testing data (to estimate accuracy) and then to unseen data
- E.g., for the unseen tuple (Jeff, Professor, 4): Tenured? → yes

Supervised vs. Unsupervised Learning
- Supervised learning (classification)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  - New data is classified based on the training set
- Unsupervised learning (clustering)
  - The class labels of the training data are unknown
  - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
(Chapter 6 outline repeated)

Issues: Data Preparation
- Data cleaning: preprocess data in order to reduce noise and handle missing values
- Relevance analysis (feature selection): remove irrelevant or redundant attributes
- Data transformation: generalize and/or normalize data

Issues: Evaluating Classification Methods
- Accuracy
  - classifier accuracy: predicting class labels
  - predictor accuracy: guessing the value of predicted attributes
- Speed: time to construct the model (training time) and time to use the model (classification/prediction time)
- Robustness: handling noise and missing values
- Scalability: efficiency on disk-resident databases
- Interpretability: understanding and insight provided by the model
- Other measures, e.g., goodness of rules, such as decision-tree size or compactness of classification rules

(Chapter 6 outline repeated)

Decision Tree Induction: Training Dataset
- This follows an example of Quinlan's ID3 (Playing Tennis)
- (figure: the 14-tuple buys_computer training table)

Output: A Decision Tree for "buys_computer"
- Root test: age?
  - age <= 30 → student? (no → no; yes → yes)
  - age 31..40 → yes
  - age > 40 → credit_rating? (excellent → no; fair → yes)

Algorithm for Decision Tree Induction
- Basic algorithm (a greedy algorithm)
  - The tree is constructed in a top-down, recursive, divide-and-conquer manner
  - At the start, all the training examples are at the root
  - Attributes are categorical (continuous-valued attributes are discretized in advance)
  - Samples are partitioned recursively based on selected attributes
  - Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
- Conditions for stopping partitioning
  - All samples for a given node belong to the same class
  - There are no remaining attributes for further partitioning — majority voting is employed for classifying the leaf
  - There are no samples left

Attribute Selection Measure: Information Gain (ID3/C4.5)
- Select the attribute with the highest information gain
- Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D| / |D|
- Expected information (entropy) needed to classify a tuple in D:
  Info(D) = −Σi pi log2(pi)
- Information needed (after using A to split D into v partitions) to classify D:
  InfoA(D) = Σj=1..v (|Dj| / |D|) · Info(Dj)
- Information gained by branching on attribute A:
  Gain(A) = Info(D) − InfoA(D)

Attribute Selection: Information Gain
- Class P: buys_computer = "yes" (9 tuples); class N: buys_computer = "no" (5 tuples)
  Info(D) = I(9,5) = −(9/14)log2(9/14) − (5/14)log2(5/14) = 0.940
- Infoage(D) = (5/14)·I(2,3) + (4/14)·I(4,0) + (5/14)·I(3,2) = 0.694, where the (5/14)·I(2,3) term means "age <= 30" has 5 out of 14 samples, with 2 yes's and 3 no's; hence
  Gain(age) = Info(D) − Infoage(D) = 0.246
- Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048 (the sketch below reproduces this arithmetic)
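As a check on the arithmetic above, here is a minimal pure-Python sketch of the entropy and information-gain computation for the age attribute; the counts come from the slide, while the function names are my own:

```python
import math

def info(*counts):
    """Expected information (entropy): Info(D) = -sum(p_i * log2(p_i))."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

# buys_computer: 9 "yes" and 5 "no" tuples overall
info_D = info(9, 5)                      # ~0.940

# age partitions D into <=30 (2 yes, 3 no), 31..40 (4 yes, 0 no), >40 (3 yes, 2 no)
partitions = [(2, 3), (4, 0), (3, 2)]
info_age = sum((sum(p) / 14) * info(*p) for p in partitions)   # ~0.694

gain_age = info_D - info_age  # ~0.247 (the slide's 0.246 comes from rounding Info first)
print(round(info_D, 3), round(info_age, 3), round(gain_age, 3))
```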
Computing Information Gain for Continuous-Valued Attributes
- Let attribute A be a continuous-valued attribute
- We must determine the best split point for A
  - Sort the values of A in increasing order
  - Typically, the midpoint between each pair of adjacent values is considered as a possible split point: (ai + ai+1)/2 is the midpoint between the values of ai and ai+1
  - The point with the minimum expected information requirement for A is selected as the split point for A
- Split: D1 is the set of tuples in D satisfying A <= split-point, and D2 is the set of tuples in D satisfying A > split-point

Gain Ratio for Attribute Selection (C4.5)
- The information gain measure is biased towards attributes with a large number of values
- C4.5 (a successor of ID3) uses gain ratio to overcome the problem (a normalization of information gain):
  SplitInfoA(D) = −Σj=1..v (|Dj| / |D|) · log2(|Dj| / |D|)
  GainRatio(A) = Gain(A) / SplitInfoA(D)
- Ex.: income splits the 14 tuples into partitions of sizes 4, 6, and 4, so
  SplitInfoincome(D) = −(4/14)log2(4/14) − (6/14)log2(6/14) − (4/14)log2(4/14) = 1.557
  and GainRatio(income) = 0.029 / 1.557 ≈ 0.019
- The attribute with the maximum gain ratio is selected as the splitting attribute
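A short sketch of the gain-ratio arithmetic for income under the same 14-tuple example; the partition sizes (4 high, 6 medium, 4 low) come from the slides' training set, and the Gain(income) value is taken from the information-gain slide:

```python
import math

# SplitInfo_income(D): income splits the 14 tuples into partitions of
# sizes 4 (high), 6 (medium), 4 (low).
sizes, total = [4, 6, 4], 14
split_info = -sum(s / total * math.log2(s / total) for s in sizes)  # ~1.557

gain_income = 0.029                        # Gain(income) from the earlier slide
print(round(gain_income / split_info, 3))  # GainRatio(income) ~ 0.019
```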
Gini Index (CART, IBM IntelligentMiner)
- If a data set D contains examples from n classes, the gini index gini(D) is defined as
  gini(D) = 1 − Σj pj²
  where pj is the relative frequency of class j in D
- If a data set D is split on A into two subsets D1 and D2, the gini index of the split is defined as
  giniA(D) = (|D1|/|D|)·gini(D1) + (|D2|/|D|)·gini(D2)
- Reduction in impurity: Δgini(A) = gini(D) − giniA(D)
- The attribute providing the smallest ginisplit(D) (or, equivalently, the largest reduction in impurity) is chosen to split the node (we need to enumerate all the possible splitting points for each attribute)

Gini Index: Example
- Ex.: D has 9 tuples with buys_computer = "yes" and 5 with "no":
  gini(D) = 1 − (9/14)² − (5/14)² = 0.459
- Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 tuples in D2: {high}:
  giniincome∈{low,medium}(D) = (10/14)·gini(D1) + (4/14)·gini(D2) = 0.443
  The other two binary splits give 0.458 ({low, high} vs. {medium}) and 0.450 ({medium, high} vs. {low}), so the split on {low, medium} is the best since it has the lowest gini index
- All attributes are assumed continuous-valued
- May need other tools, e.g., clustering, to get the possible split values
- Can be modified for categorical attributes
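A small pure-Python sketch of the gini computation above; the per-class counts within each income partition follow from the AVC tables shown later in the deck (high: 2 yes / 2 no, medium: 4 / 2, low: 3 / 1):

```python
def gini(*counts):
    """gini(D) = 1 - sum(p_j^2) over class frequencies."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# Overall: 9 yes / 5 no
print(round(gini(9, 5), 3))          # 0.459

# Binary split on income into D1 = {low, medium} (7 yes, 3 no)
# and D2 = {high} (2 yes, 2 no)
n = 14
g_split = (10 / n) * gini(7, 3) + (4 / n) * gini(2, 2)
print(round(g_split, 3))             # 0.443, the lowest of the three candidate splits
```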
Comparing Attribute Selection Measures
- The three measures, in general, return good results, but:
  - Information gain: biased towards multivalued attributes
  - Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others
  - Gini index: biased towards multivalued attributes; has difficulty when the number of classes is large; tends to favor tests that result in equal-sized partitions and purity in both partitions

Other Attribute Selection Measures
- CHAID: a popular decision tree algorithm; measure based on the χ² test for independence
- C-SEP: performs better than information gain and gini index in certain cases
- G-statistic: has a close approximation to the χ² distribution
- MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred): the best tree is the one that requires the fewest bits to both (1) encode the tree and (2) encode the exceptions to the tree
- Multivariate splits (partition based on multiple variable combinations), e.g., CART finds multivariate splits based on a linear combination of attributes
- Which attribute selection measure is the best? Most give good results; none is significantly superior to the others

Overfitting and Tree Pruning
- Overfitting: an induced tree may overfit the training data
  - Too many branches, some of which may reflect anomalies due to noise or outliers
  - Poor accuracy for unseen samples
- Two approaches to avoid overfitting
  - Prepruning: halt tree construction early — do not split a node if this would result in the goodness measure falling below a threshold (it is difficult to choose an appropriate threshold)
  - Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees; use a set of data different from the training data to decide which is the "best pruned tree"

Enhancements to Basic Decision Tree Induction
- Allow for continuous-valued attributes: dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals
- Handle missing attribute values: assign the most common value of the attribute, or assign a probability to each of the possible values
- Attribute construction: create new attributes based on existing ones that are sparsely represented; this reduces fragmentation, repetition, and replication

Classification in Large Databases
- Classification is a classical problem extensively studied by statisticians and machine learning researchers
- Scalability: classifying data sets with millions of examples and hundreds of attributes with reasonable speed
- Why decision tree induction in data mining?
  - relatively fast learning speed (compared with other classification methods)
  - convertible to simple and easy-to-understand classification rules
  - can use SQL queries for accessing databases
  - comparable classification accuracy with other methods

Scalable Decision Tree Induction Methods
- SLIQ (EDBT'96 — Mehta et al.): builds an index for each attribute; only the class list and the current attribute list reside in memory
- SPRINT (VLDB'96 — J. Shafer et al.): constructs an attribute-list data structure
- PUBLIC (VLDB'98 — Rastogi & Shim): integrates tree splitting and tree pruning: stop growing the tree earlier
- RainForest (VLDB'98 — Gehrke, Ramakrishnan & Ganti): builds an AVC-list (attribute, value, class label)
- BOAT (PODS'99 — Gehrke, Ganti, Ramakrishnan & Loh): uses bootstrapping to create several small samples

Scalability Framework for RainForest
- Separates the scalability aspects from the criteria that determine the quality of the tree
- Builds an AVC-list: AVC (Attribute, Value, Class_label)
- AVC-set (of an attribute X): projection of the training data set onto attribute X and the class label, where the counts of the individual class labels are aggregated
- AVC-group (of a node n): the set of AVC-sets of all predictor attributes at node n
RainForest: Training Set and Its AVC Sets
- Training examples: the 14-tuple buys_computer training table shown earlier
- AVC-set on age: <=30 → (yes: 2, no: 3); 31..40 → (yes: 4, no: 0); >40 → (yes: 3, no: 2)
- AVC-set on income: high → (yes: 2, no: 2); medium → (yes: 4, no: 2); low → (yes: 3, no: 1)
- AVC-set on student: yes → (yes: 6, no: 1); no → (yes: 3, no: 4)
- AVC-set on credit_rating: fair → (yes: 6, no: 2); excellent → (yes: 3, no: 3)
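A minimal sketch of building an AVC-set by aggregation. The (age, class) pairs below are reconstructed from the aggregated counts in the table above; the Counter-based layout is my illustration, not RainForest's actual on-disk data structure:

```python
from collections import Counter, defaultdict

# (age, class) pairs for the 14-tuple buys_computer training set,
# reconstructed from the aggregated counts above.
tuples = [('<=30', 'yes')] * 2 + [('<=30', 'no')] * 3 + \
         [('31..40', 'yes')] * 4 + \
         [('>40', 'yes')] * 3 + [('>40', 'no')] * 2

# AVC-set on age: project onto (attribute value, class label) and count.
avc = defaultdict(Counter)
for value, label in tuples:
    avc[value][label] += 1

for value, counts in avc.items():
    print(value, dict(counts))
# <=30 {'yes': 2, 'no': 3}  31..40 {'yes': 4}  >40 {'yes': 3, 'no': 2}
```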
Data Cube-Based Decision-Tree Induction
- Integration of generalization with decision-tree induction (Kamber et al. '97)
- Classification at primitive concept levels (e.g., precise temperature, humidity, outlook) gives low-level concepts, scattered classes, bushy classification trees, and semantic interpretation problems
- Cube-based multi-level classification: relevance analysis at multiple levels; information-gain analysis with dimension + level

BOAT (Bootstrapped Optimistic Algorithm for Tree Construction)
- Uses a statistical technique called bootstrapping to create several smaller samples (subsets), each of which fits in memory
- Each subset is used to create a tree, resulting in several trees
- These trees are examined and used to construct a new tree T'; it turns out that T' is very close to the tree that would be generated using the whole data set together
- Adv.: requires only two scans of the DB; an incremental algorithm

Presentation of Classification Results
- (figure slides: classification-result visualizations)

Interactive Visual Mining by Perception-Based Classification (PBC)
- (figure slide)

(Chapter 6 outline repeated)

Bayesian Classification: Why?
- A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
- Foundation: based on Bayes' theorem
- Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable with decision tree and selected neural network classifiers
- Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
- Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

Bayesian Theorem: Basics
- Let X be a data sample ("evidence"): its class label is unknown
- Let H be the hypothesis that X belongs to class C
- Classification is to determine P(H|X), the probability that the hypothesis holds given the observed data sample X
- P(H) (prior probability): the initial probability, e.g., that X will buy a computer, regardless of age, income, …
- P(X): the probability that the sample data is observed
- P(X|H) (likelihood): the probability of observing the sample X given that the hypothesis holds, e.g., given that X will buy a computer, the probability that X is 31..40 with medium income

Bayesian Theorem
- Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:
  P(H|X) = P(X|H) · P(H) / P(X)
- Predicts that X belongs to Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X) for all k classes
- Practical difficulty: requires initial knowledge of many probabilities, at significant computational cost
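As a quick numeric illustration of the theorem (all three input probabilities below are made up):

```python
# P(H|X) = P(X|H) * P(H) / P(X), with hypothetical numbers:
p_h = 0.6          # prior: P(buys_computer = yes)
p_x_given_h = 0.2  # likelihood: P(X | buys_computer = yes)
p_x = 0.15         # evidence: P(X)

p_h_given_x = p_x_given_h * p_h / p_x
print(round(p_h_given_x, 3))   # 0.8
```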
Towards a Naïve Bayesian Classifier
- Let D be a training set of tuples and their associated class labels, with each tuple represented by an n-dimensional attribute vector X = (x1, x2, …, xn)
- Suppose there are m classes C1, C2, …, Cm
- Classification is to derive the maximum posteriori, i.e., the maximal P(Ci|X)
- This can be derived from Bayes' theorem:
  P(Ci|X) = P(X|Ci) · P(Ci) / P(X)
- Since P(X) is constant for all classes, only P(X|Ci) · P(Ci) needs to be maximized

Derivation of the Naïve Bayes Classifier
- A simplified assumption: attributes are conditionally independent (i.e., there is no dependence relation between attributes):
  P(X|Ci) = Πk=1..n P(xk|Ci)
- This greatly reduces the computation cost: only the class distribution is counted
- If Ak is categorical, P(xk|Ci) is the number of tuples in Ci having value xk for Ak, divided by |Ci,D| (the number of tuples of Ci in D)
- If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ (sketched below):
  g(x, μ, σ) = (1 / (√(2π)·σ)) · exp(−(x − μ)² / (2σ²)),  and P(xk|Ci) = g(xk, μCi, σCi)

Naïve Bayesian Classifier: Training Dataset
- Classes: C1: buys_computer = 'yes'; C2: buys_computer = 'no'
- Data sample to classify: X = (age <= 30, income = medium, student = yes, credit_rating = fair)
- (The training tuples are the 14-tuple buys_computer data set shown earlier.)
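For the continuous-valued case, a one-function sketch of the Gaussian class-conditional density just defined; the mean, standard deviation, and query value are hypothetical:

```python
import math

def gaussian(x, mu, sigma):
    """g(x, mu, sigma) = 1/(sqrt(2*pi)*sigma) * exp(-(x-mu)^2 / (2*sigma^2))"""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# e.g. P(age = 35 | Ci) for a class whose ages have mean 38 and std 12 (made-up numbers)
print(gaussian(35, 38, 12))    # ~0.0322
```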
Naïve Bayesian Classifier: An Example
- P(Ci): P(buys_computer = yes) = 9/14 = 0.643; P(buys_computer = no) = 5/14 = 0.357
- Compute P(X|Ci) for each class, for X = (age <= 30, income = medium, student = yes, credit_rating = fair):
  - P(age <= 30 | yes) = 2/9 = 0.222;  P(age <= 30 | no) = 3/5 = 0.600
  - P(income = medium | yes) = 4/9 = 0.444;  P(income = medium | no) = 2/5 = 0.400
  - P(student = yes | yes) = 6/9 = 0.667;  P(student = yes | no) = 1/5 = 0.200
  - P(credit_rating = fair | yes) = 6/9 = 0.667;  P(credit_rating = fair | no) = 2/5 = 0.400
- P(X|yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044;  P(X|no) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019
- P(X|Ci) × P(Ci): P(X|yes)·P(yes) = 0.028;  P(X|no)·P(no) = 0.007
- Therefore, X belongs to class buys_computer = "yes"
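A compact sketch that reproduces the computation above from aggregated counts; the counts come from the AVC tables shown earlier, while the dict layout and loop are my own:

```python
# Class counts and per-attribute-value counts from the AVC tables above.
class_counts = {'yes': 9, 'no': 5}
counts = {
    'age<=30':     {'yes': 2, 'no': 3},
    'income=med':  {'yes': 4, 'no': 2},
    'student=yes': {'yes': 6, 'no': 1},
    'credit=fair': {'yes': 6, 'no': 2},
}
X = ['age<=30', 'income=med', 'student=yes', 'credit=fair']

total = sum(class_counts.values())
scores = {}
for c, n_c in class_counts.items():
    p = n_c / total                        # prior P(Ci)
    for feat in X:                         # naive independence: multiply P(xk|Ci)
        p *= counts[feat][c] / n_c
    scores[c] = p

print({c: round(p, 3) for c, p in scores.items()})   # {'yes': 0.028, 'no': 0.007}
print(max(scores, key=scores.get))                    # 'yes'
```

To add the Laplacian correction discussed on the next slide, one would simply use (count + 1) / (n_c + V) in the inner loop, where V is the number of distinct values of the attribute.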
Avoiding the 0-Probability Problem
- Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise the predicted probability will be zero, since
  P(X|Ci) = Πk=1..n P(xk|Ci)
- Ex.: suppose a data set with 1000 tuples has income = low (0), income = medium (990), and income = high (10)
- Use the Laplacian correction (or Laplacian estimator): add 1 to each case:
  Prob(income = low) = 1/1003;  Prob(income = medium) = 991/1003;  Prob(income = high) = 11/1003
- The "corrected" probability estimates are close to their "uncorrected" counterparts

Naïve Bayesian Classifier: Comments
- Advantages: easy to implement; good results obtained in most of the cases
- Disadvantages
  - Assumption of class-conditional independence, and therefore a loss of accuracy
  - Practically, dependencies exist among variables, e.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
  - Dependencies among these cannot be modeled by a naïve Bayesian classifier
- How to deal with these dependencies? Bayesian belief networks

Bayesian Belief Networks
- A Bayesian belief network allows a subset of the variables to be conditionally independent
- A graphical model of causal relationships: represents dependency among the variables and gives a specification of the joint probability distribution
- Nodes: random variables; links: dependency
- E.g., if X and Y are the parents of Z, and Y is the parent of P, there is no dependency between Z and P
- The graph has no loops or cycles

Bayesian Belief Network: An Example
- Nodes: FamilyHistory, Smoker, LungCancer, Cough, PositiveX-Ray, HardToBreath; FamilyHistory and Smoker are the parents of LungCancer
- The conditional probability table (CPT) for the variable LungCancer shows the conditional probability for each possible combination of its parents:
  P(LC | FH, S) = 0.8;  P(LC | FH, ~S) = 0.5;  P(LC | ~FH, S) = 0.7;  P(LC | ~FH, ~S) = 0.1
  (P(~LC) is the complement in each column: 0.2, 0.5, 0.3, 0.9)
- Derivation of the probability of a particular combination of values of X from the CPT:
  P(x1, …, xn) = Πi P(xi | Parents(xi))

Training Bayesian Networks
- Several scenarios:
  - Given both the network structure and all variables observable: learn only the CPTs
  - Network structure known, some hidden variables: gradient descent (greedy hill-climbing) method, similar to neural network learning
  - Network structure unknown, all variables observable: search through the model space to reconstruct the network topology
  - Unknown structure, all hidden variables: no good algorithms are known for this purpose
- Ref.: D. Heckerman, Bayesian networks for data mining

(Chapter 6 outline repeated)

Using IF-THEN Rules for Classification
- Represent the knowledge in the form of IF-THEN rules, e.g.:
  R: IF age = youth AND student = yes THEN buys_computer = yes
- Rule antecedent (precondition) vs. rule consequent
- Assessment of a rule: coverage and accuracy (see the sketch after this slide)
  - ncovers = number of tuples covered by R; ncorrect = number of tuples correctly classified by R
  - coverage(R) = ncovers / |D|   (D: the training data set)
  - accuracy(R) = ncorrect / ncovers
- If more than one rule is triggered, we need conflict resolution:
  - Size ordering: assign the highest priority to the triggering rule that has the "toughest" requirements (i.e., with the most attribute tests)
  - Class-based ordering: decreasing order of prevalence or misclassification cost per class
  - Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality or by experts
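A minimal sketch of the coverage and accuracy measures for rule R above; the tuples are an abbreviated, hypothetical slice of the buys_computer training set, keeping only the attributes R tests:

```python
# Each tuple: (age, student, buys_computer). "youth" corresponds to
# age <= 30 in the running example.
D = [('youth', 'no', 'no'), ('youth', 'no', 'no'), ('youth', 'no', 'no'),
     ('youth', 'yes', 'yes'), ('youth', 'yes', 'yes'),
     ('middle', 'no', 'yes'), ('senior', 'yes', 'yes')]   # abbreviated

def covers(t):            # R: IF age = youth AND student = yes ...
    return t[0] == 'youth' and t[1] == 'yes'

covered = [t for t in D if covers(t)]
correct = [t for t in covered if t[2] == 'yes']   # ... THEN buys_computer = yes

print(f'coverage = {len(covered)}/{len(D)}')        # 2/7
print(f'accuracy = {len(correct)}/{len(covered)}')  # 2/2
```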
Rule Extraction from a Decision Tree
- Rules are easier to understand than large trees
- One rule is created for each path from the root to a leaf; each attribute-value pair along a path forms a conjunction, and the leaf holds the class prediction
- Rules are mutually exclusive and exhaustive
- Example: rule extraction from our buys_computer decision tree (shown earlier):
  IF age = young AND student = no THEN buys_computer = no
  IF age = young AND student = yes THEN buys_computer = yes
  IF age = mid-age THEN buys_computer = yes
  IF age = old AND credit_rating = excellent THEN buys_computer = no
  IF age = old AND credit_rating = fair THEN buys_computer = yes

Rule Extraction from the Training Data
- Sequential covering algorithm: extracts rules directly from the training data
- Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
- Rules are learned sequentially; each rule for a given class Ci will cover many tuples of Ci but none (or few) of the tuples of other classes
- Steps:
  - Rules are learned one at a time
  - Each time a rule is learned, the tuples covered by the rule are removed
  - The process repeats on the remaining tuples until a termination condition holds, e.g., there are no more training examples, or the quality of a rule returned is below a user-specified threshold
- Contrast with decision-tree induction, which learns a set of rules simultaneously

How to Learn One Rule?
- Start with the most general rule possible: condition = empty
- Add new attribute tests by adopting a greedy depth-first strategy: pick the test that most improves the rule quality
- Rule-quality measures consider both coverage and accuracy. FOIL-gain (in FOIL and RIPPER) assesses the information gained by extending the condition:
  FOIL_Gain = pos' · (log2(pos' / (pos' + neg')) − log2(pos / (pos + neg)))
  It favors rules that have high accuracy and cover many positive tuples
- Rule pruning is based on an independent set of test tuples:
  FOIL_Prune(R) = (pos − neg) / (pos + neg)
  where pos and neg are the numbers of positive and negative tuples covered by R; if FOIL_Prune is higher for the pruned version of R, prune R
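A small sketch of the FOIL-gain computation just defined; the coverage counts are made up for illustration:

```python
import math

def foil_gain(pos, neg, pos_new, neg_new):
    """Information gained by extending a rule's condition:
    FOIL_Gain = pos' * (log2(pos'/(pos'+neg')) - log2(pos/(pos+neg)))."""
    return pos_new * (math.log2(pos_new / (pos_new + neg_new))
                      - math.log2(pos / (pos + neg)))

# Hypothetical: rule R covers 40 pos / 20 neg tuples; adding one more
# attribute test leaves 30 pos / 5 neg covered.
print(round(foil_gain(40, 20, 30, 5), 3))   # ~10.877, a worthwhile extension
```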
(Chapter 6 outline repeated)

Classification: A Mathematical Mapping
- Classification predicts categorical class labels
- E.g., personal homepage classification: xi = (x1, x2, x3, …), yi = +1 or −1, where x1 is the number of occurrences of the word "homepage", x2 the number of occurrences of the word "welcome", …
- Mathematically: x ∈ X, y ∈ Y = {+1, −1}; we want a function f: X → Y

Linear Classification
- Binary classification problem
- (figure: points of class 'x' above a separating line, class 'o' below)
- The data above the red line belongs to class 'x'; the data below the red line belongs to class 'o'
- Examples: SVM, perceptron, probabilistic classifiers

Discriminative Classifiers
- Advantages
  - prediction accuracy is generally high (as compared to Bayesian methods, in general)
  - robust: works when training examples contain errors
  - fast evaluation of the learned target function (Bayesian networks are normally slow)
- Criticism
  - long training time
  - difficult to understand the learned function (weights); Bayesian networks can be used easily for pattern discovery
  - not easy to incorporate domain knowledge (easy in the form of priors on the data or distributions)

Perceptron & Winnow
- Vectors: x, w; scalars: x, y, w
- Input: {(x…
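The slide text is cut off in the source, so as a hedged illustration of what the perceptron does with labeled input pairs, here is a classic perceptron-training sketch; the toy data, learning rate, and bias handling are my choices, not the deck's:

```python
# Perceptron: learn a weight vector w so that sign(w . x) matches y in {+1, -1}.
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

# Toy linearly separable data: (x, y), with a constant bias feature x[0] = 1.
data = [((1, 2.0, 1.0), +1), ((1, 1.5, 2.5), +1),
        ((1, -1.0, -0.5), -1), ((1, -2.0, 1.0), -1)]

w, eta = [0.0, 0.0, 0.0], 1.0
for _ in range(10):                      # a few passes over the data
    for x, y in data:
        if y * dot(w, x) <= 0:           # misclassified -> update w toward y*x
            w = [wi + eta * y * xi for wi, xi in zip(w, x)]

print(w, [1 if dot(w, x) > 0 else -1 for x, _ in data])
# e.g. [1.0, 2.0, 1.0] and predictions [1, 1, -1, -1]
```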