Knowledge discovery & data mining
Tools, methods, and experiences

A tutorial @ EDBT2000
EDBT2000 tutorial, Konstanz, March 2000

Contributors and acknowledgements
• The people @ Pisa KDD Lab: Francesco BONCHI, Giuseppe MANCO, Mirco NANNI, Chiara RENSO, Salvatore RUGGIERI, Franco TURINI and many students
• The many KDD tutorialists and teachers who made their slides available on the web (all of them listed in the bibliography) ;-)
• In particular:
    • Jiawei HAN, Simon Fraser University, whose forthcoming book Data mining: concepts and techniques has influenced the whole tutorial
    • Rajeev RASTOGI and Kyuseok SHIM, Lucent Bell Labs
    • Daniel A. KEIM, University of Halle
    • Daniel Silver, CogNova Technologies
• The EDBT2000 board who accepted our tutorial proposal

Tutorial goals
• Introduce you to major aspects of the Knowledge Discovery Process, and to the theory and applications of Data Mining technology
• Provide a systematization of the many concepts around this area, according to the following lines:
    • the process
    • the methods applied to paradigmatic cases
    • the support environment
    • the research challenges
• Important issues that will not be covered in this tutorial:
    • methods: time series, exception detection, neural nets
    • systems: parallel implementations

Tutorial Outline
• Introduction and basic concepts
    • Motivations, applications, the KDD process, the techniques
• Deeper into DM technology
    • Decision Trees and Fraud Detection
    • Association Rules and Market Basket Analysis
    • Clustering and Customer Segmentation
• Trends in technology
    • Knowledge Discovery Support Environment
    • Tools, Languages and Systems
• Research challenges

Introduction - module outline
• Motivations
• Application Areas
• KDD Decisional Context
• KDD Process
• Architecture of a KDD system
• The KDD steps in short

Evolution of Database Technology: from data management to data analysis
• 1960s: Data collection, database creation, IMS and network DBMS.
• 1970s: Relational data model, relational DBMS implementation.
• 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.).
• 1990s: Data mining and data warehousing, multimedia databases, and Web technology.

Motivations

“Necessity is the Mother of Invention”
• Data explosion problem:
    • Automated data collection tools, mature database technology and the internet lead to tremendous amounts of data stored in databases, data warehouses and other information repositories.
• We are drowning in information, but starving for knowledge! (John Naisbitt)
• Data warehousing and data mining:
    • On-line analytical processing
    • Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases.

A rapidly emerging field
• Also referred to as: data dredging, data harvesting, data archeology
• A multidisciplinary field:
    • Database
    • Statistics
    • Artificial intelligence: machine learning, expert systems and knowledge acquisition
    • Visualization methods

Motivations for DM

• Abundance of business and industry data
• Competitive focus - Knowledge Management
• Inexpensive, powerful computing engines
• Strong theoretical/mathematical foundations:
    • machine learning & logic
    • statistics
    • database management systems

What is DM useful for?
[Figure labels: Marketing, Database Marketing, Data Warehousing, KDD & Data Mining]

• Increase knowledge to base decisions upon, e.g., impact on marketing

The Value Chain
• Data: customer data, store data, demographic data, geographical data
• Information: X lives in Z; S is Y years old; X and S moved; W has money in Z
• Knowledge: a quantity Y of product A is used in region Z; customers of class Y use x% of C during period D
• Decision: promote product A in region Z; mail ads to families of profile P; cross-sell service B to clients C

Application Areas and Opportunities
• Marketing: segmentation, customer targeting, ...
• Finance: investment support, portfolio management
• Banking & Insurance: credit and policy approval
• Security: fraud detection
• Science and medicine: hypothesis discovery, prediction, classification, diagnosis
• Manufacturing: process modeling, quality control, resource allocation
• Engineering: simulation and analysis, pattern recognition, signal processing
• Internet: smart search engines, web marketing

Classes of applications
• Market analysis: target marketing, customer relation management, market basket analysis, cross selling, market segmentation.
• Risk analysis: forecasting, customer retention, improved underwriting, quality control, competitive analysis.
• Fraud detection
• Text (newsgroup, email, documents) and Web analysis.

Market Analysis
• Where are the data sources for analysis?
    • Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies.
• Target marketing
    • Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
• Determine customer purchasing patterns over time
    • Conversion of a single to a joint bank account: marriage, etc.
• Cross-market analysis
    • Associations/co-relations between product sales
    • Prediction based on the association information.

Market Analysis (2)
• Customer profiling
    • Data mining can tell you what types of customers buy what products (clustering or classification).
• Identifying customer requirements
    • identifying the best products for different customers
    • use prediction to find what factors will attract new customers
• Provides summary information
    • various multidimensional summary reports
    • statistical summary information (data central tendency and variation)

Risk Analysis
• Finance planning and asset evaluation:
    • cash flow analysis and prediction
    • contingent claim analysis to evaluate assets
    • cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
• Resource planning:
    • summarize and compare the resources and spending
• Competition:
    • monitor competitors and market directions (CI: competitive intelligence)
    • group customers into classes and class-based pricing procedures
    • set pricing strategy in a highly competitive market

Fraud Detection
• Applications: widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
• Approach: use historical data to build models of fraudulent behavior and use data mining to help identify similar instances.
• Examples:
    • auto insurance: detect a group of people who stage accidents to collect on insurance
    • money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
    • medical insurance: detect professional patients and rings of doctors and rings of references

Fraud Detection (2)
• More examples:
    • Detecting inappropriate medical treatment: the Australian Health Insurance Commission identified that in many cases blanket screening tests were requested (saving Australian $1m/yr).
    • Detecting telephone fraud:
        • Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
        • British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.
    • Retail: analysts estimate that 38% of retail shrink is due to dishonest employees.

Other applications
• Sports: IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for the New York Knicks and Miami Heat.
• Astronomy: JPL and the Palomar Observatory discovered 22 quasars with the help of data mining.
• Internet Web Surf-Aid: IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc.
• Watch for the PRIVACY pitfall!

What is KDD? A process!
• The selection and processing of data for:
    • the identification of novel, accurate, and useful patterns, and
    • the modeling of real-world phenomena.
• Data mining is a major component of the KDD process - automated discovery of patterns and the development of predictive and explanatory models.

The KDD process
[Figure: Data Sources → Data Consolidation → Warehouse (Consolidated Data) → Selection and Preprocessing → Prepared Data → Data Mining → Patterns & Models (e.g., p(x)=0.02) → Interpretation and Evaluation → Knowledge]

The KDD Process: Core Problems & Approaches
• Problems:
    • identification of relevant data
    • representation of data
    • search for valid pattern or model
• Approaches:
    • top-down deduction by expert
    • interactive visualization of data/models *
    • bottom-up induction from data *
[Slide labels: Data Mining, OLAP]

The steps of the KDD process
• Learning the application domain: relevant prior knowledge and goals of application
• Data consolidation: creating a target data set
• Selection and preprocessing
    • Data cleaning (may take 60% of effort!)
    • Data reduction and projection: find useful features, dimensionality/variable reduction, invariant representation.
• Choosing functions of data mining: summarization, classification, regression, association, clustering.
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Interpretation and evaluation: analysis of results (visualization, transformation, removing redundant patterns, …)
• Use of discovered knowledge

The virtuous cycle
[Figure: a cycle with the labels Identify Problem or Opportunity, Act on Knowledge, Measure effect of Action, Problem, Knowledge, Results, Strategy]

[Figure: Applications, operations, techniques]

[Figure: Roles in the KDD process]

Data mining and business intelligence
Increasing potential to support business decisions, from the bottom layer to the top:
• Making Decisions
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration: Statistical Analysis, Querying and Reporting
• Data Warehouses / Data Marts: OLAP, MDA
• Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles shown alongside the layers, from top to bottom: End User, Business Analyst, Data Analyst, DBA

Architecture of a KDD system
[Figure: Graphical User Interface on top of Data Sources, Data Consolidation, Warehouse, Selection and Preprocessing, Data Mining, Interpretation and Evaluation, Knowledge]

[Figure: A business intelligence environment]

Data consolidation and preparation
• Garbage in, garbage out
• The quality of results relates directly to the quality of the data
• 50%-70% of KDD process effort is spent on data consolidation and preparation
• Major justification for a corporate data warehouse

Data consolidation
From data sources to consolidated data repository
[Figure: RDBMS, Legacy DBMS, Flat Files, External → Data Consolidation and Cleansing → Warehouse (Object/Relation DBMS, Multidimensional DBMS, Deductive Database, Flat files)]
• Determine preliminary list of attributes
• Consolidate data into working database
    • Internal and external sources
• Eliminate or estimate missing values
• Remove outliers (obvious exceptions)
• Determine prior probabilities of categories and deal with volume bias

Data selection and preprocessing
• Generate a set of examples
    • choose sampling method
    • consider sample complexity
    • deal with volume bias issues
• Reduce attribute dimensionality
    • remove redundant and/or correlating attributes
    • combine attributes (sum, multiply, difference)
• Reduce attribute value ranges
    • group symbolic discrete values
    • quantize continuous numeric values
• Transform data
    • de-correlate and normalize values
    • map time-series data to static representation
• OLAP and visualization tools play key role

Data mining tasks and methods
• Automated exploration/discovery
    • e.g., discovering new market segments
    • clustering analysis
• Prediction/classification
    • e.g., forecasting gross sales given current factors
    • regression, neural networks, genetic algorithms, decision trees
• Explanation/description
    • e.g., characterizing customers by demographics and purchase history
    • decision trees, association rules
    • example rule: if age > 35 and income < $35k then ...
[Figure: scatter plot over x1, x2 and curve f(x) against x]

Automated exploration and discovery
• Clustering: partitioning a set of data into a set of classes, called clusters, whose members share some interesting common properties.
• Distance-based numerical clustering
    • metric grouping of examples (K-NN)
    • graphical visualization can be used
• Bayesian clustering
    • search for the number of classes which result in best fit of a probability distribution to the data
    • AutoClass (NASA) is one of the best examples

Prediction and classification
• Learning a predictive model
• Classification of a new case/sample
• Many methods:
    • Artificial neural networks
    • Inductive decision tree and rule systems
    • Genetic algorithms
    • Nearest neighbor clustering algorithms
    • Statistical (parametric and non-parametric)

Generalization and regression
• The objective of learning is to achieve good generalization to new unseen cases.
• Generalization can be defined as a mathematical interpolation or regression over a set of training points.
• Models can be validated with a previously unseen test set or using cross-validation methods.
[Figure: f(x) against x]

Classification and prediction
• Classify data based on the values of a target attribute, e.g., classify countries based on climate, or classify cars based on gas mileage.
• Use the obtained model to predict some unknown or missing attribute values based on other information.

Summarizing: inductive modeling = learning
• Objective: develop a general model or hypothesis from specific examples
• Function approximation (curve fitting)
• Classification (concept learning, pattern recognition)
[Figure: classes A and B over x1, x2; curve f(x) against x]

Explanation and description
• Learn a generalized hypothesis (model) from selected data
• Description/interpretation of model provides new knowledge
• Methods:
    • Inductive decision tree and rule systems
    • Association rule systems
    • Link analysis
    • …

Exception/deviation detection
• Generate a model of normal activity
• Deviation from model causes alert
• Methods:
    • Artificial neural networks
    • Inductive decision tree and rule systems
    • Statistical methods
    • Visualization tools

Outlier and exception data analysis
• Time-series analysis (trend and deviation):
    • Trend and deviation analysis: regression, sequential pattern, similar sequences, trend and deviation, e.g., stock analysis.
    • Similarity-based pattern-directed analysis
    • Full vs. partial periodicity analysis
• Other pattern-directed or statistical analysis

Are all the discovered patterns interesting?
• A data mining system/query may generate thousands of patterns; not all of them are interesting.
• Interestingness measures:
    • easily understood by humans
    • valid on new or test data with some degree of certainty
    • potentially useful
    • novel, or validates some hypothesis that a user seeks to confirm
• Objective vs. subjective interestingness measures
    • Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
    • Subjective: based on user’s beliefs in the data, e.g., unexpectedness, novelty, etc.

Completeness vs. optimization
• Find all the interesting patterns: completeness.
    • Can a data mining system find all the interesting patterns?
• Search for only interesting patterns: optimization.
    • Can a data mining system find only the interesting patterns?
• Approaches
    • First generate all the patterns and then filter out the uninteresting ones.
    • Generate only the interesting patterns - mining query optimization.

Interpretation and evaluation
• Evaluation
    • Statistical validation and significance testing
    • Qualitative review by experts in the field
    • Pilot surveys to evaluate model accuracy
• Interpretation
    • Inductive tree and rule models can be read directly
    • Clustering results can be graphed and tabled
    • Code can be automatically generated by some systems (IDTs, regression models)
• Visualization tools can be very helpful
    • sensitivity analysis (I/O relationship)
    • histograms of value distribution
    • time-series plots and animation
    • requires training and practice
[Figure: axes labeled Response, Velocity, Temp]

Important dates of data mining
• 1989: IJCAI Workshop on KDD
    • Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, eds., 1991)
• 1991-1994: Workshops on KDD
    • Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, eds., 1996)
• 1995-1998: AAAI Int. Conf. on KDD and DM (KDD’95-98)
    • Journal of Data Mining and Knowledge Discovery (1997)
• 1998: ACM SIGKDD
• 1999: SIGKDD’99 Conf.

References - general
• P. Adriaans and D. Zantinge. Data Mining. Addison-Wesley: Harlow, England, 1996.
• M. S. Chen, J. Han, and P. S. Yu. Data mining: An overview from a database perspective. IEEE Trans. Knowledge and Data Engineering, 8:866-883, 1996.
• U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
• J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000. To appear.
• T. Imielinski and H. Mannila. A database perspective on knowledge discovery. Communications of ACM, 39:58-64, 1996.
• G. Piatetsky-Shapiro, U. Fayyad, and P. Smyth. From data mining to knowledge discovery: An overview. In U. M. Fayyad, et al. (eds.), Advances in Knowledge Discovery and Data Mining, 1-35. AAAI/MIT Press, 1996.
• G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991.
• Michael Berry & Gordon Linoff. Data Mining Techniques for Marketing, Sales and Customer Support. John Wiley & Sons, 1997.
• Sholom M. Weiss and Nitin Indurkhya. Predictive Data Mining: A Practical Guide. Morgan Kaufmann, 1997.
• W. H. Inmon, J. D. Welch, Katherine L. Glassey. Managing the data warehouse. Wiley, 1997.
• T. Mitchell. Machine Learning. McGraw-Hill, 1997.

[Slide: Main Web resources]
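Worked illustration (not part of the original slides): the tutorial names support and confidence as objective interestingness measures, and describes the approach "first generate all the patterns and then filter out the uninteresting ones". The minimal Python sketch below shows one way these could be computed for market-basket rules. The function names, the toy baskets, the candidate rules, and the thresholds are all assumptions made for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # rule never fires, so treat its confidence as zero
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / base


def filter_rules(transactions, rules, min_support, min_confidence):
    """Generate-then-filter: keep only rules meeting both thresholds."""
    kept = []
    for antecedent, consequent in rules:
        s = support(transactions, set(antecedent) | set(consequent))
        c = confidence(transactions, antecedent, consequent)
        if s >= min_support and c >= min_confidence:
            kept.append((antecedent, consequent, s, c))
    return kept


# Hypothetical toy data: four shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

# Hypothetical candidate rules, written as (antecedent, consequent).
candidate_rules = [({"bread"}, {"milk"}), ({"milk"}, {"butter"})]

interesting = filter_rules(baskets, candidate_rules,
                           min_support=0.5, min_confidence=0.6)
for antecedent, consequent, s, c in interesting:
    print(sorted(antecedent), "->", sorted(consequent),
          f"support={s:.2f} confidence={c:.2f}")
```

With these toy baskets only the rule bread -> milk passes both thresholds. Subjective measures such as unexpectedness depend on the user's prior beliefs and are not captured by a filter of this kind.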
