Overview of Data Mining Standards, Tools, and Development Trends
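Data Mining, Chapter 8: Standards, Tools, and Development Trends

Chapter contents
- 8.1 Data mining standards and specifications
- 8.2 Data mining tools
- 8.3 Research trends in data mining

Basic requirement: understand the standards and specifications relevant to data mining applications, and the future research trends.

8.1 Data Mining Standards and Specifications

A data mining process model is key to ensuring that data mining work proceeds smoothly. Typical process models include:
- The SPSS 5A model: Assess, Access, Analyze, Act, Automate.
- The SAS SEMMA model: Sample, Explore, Modify, Model, Assess.
- CRISP-DM (Cross Industry Standard Process for Data Mining), a cross-industry process standard.
- The Two Crows data mining process model, which has much in common with CRISP-DM (still under development when that model was published).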

Related standards: CRISP-DM

CRISP-DM (Cross Industry Standard Process for Data Mining). SPSS, NCR, and DaimlerChrysler, three companies with extensive data mining experience, founded a consortium to establish a standard for data mining methods and processes.

CRISP-DM deliverables by phase:

Project Objectives (Business Understanding)
  - Background
  - Requirements, assumptions, constraints
  - Terminology
  - Data mining goals and success criteria
  - Project plan
Data Understanding
  - Initial data collection report
  - Data description report
  - Data exploration report
  - Data quality report
Data Preparation
  - Data description report
  - Data pre-processing steps
Modeling
  - Modeling assumptions
  - Test design
  - Model description
  - Model assessment (incl. validation)
Evaluation
  - Assessment of data mining results with respect to objectives
Reporting (Deployment)
  - Final report: summary (objectives, data mining process, data mining results, data mining assessment), conclusions, future work

CRISP-DM is a widely accepted process model for data mining. It provides a framework for describing the modeling process in detail and captures "best practice".

Business Understanding Phase
- Understand the business objectives
  - What is the status quo? Understand business processes and the associated costs/pain
  - Define the success criteria
  - Develop a glossary of terms: speak the language
  - Cost/benefit analysis
- Current systems assessment
  - Identify the key actors (minimum: the sponsor and the key user)
  - What forms should the output take?
  - Integration of output with the existing technology landscape
  - Understand market norms and standards
- Task decomposition
  - Break down the objective into sub-tasks
  - Map sub-tasks to data mining problem definitions
- Identify constraints
  - Resources
  - Law, e.g. data protection
- Build a project plan
  - List assumptions and risk factors (technical, financial, business, organisational)

Data Understanding Phase
- Collect data
  - What are the data sources? Internal and external sources (e.g. Axiom, Experian)
  - Document reasons for inclusions/exclusions
  - Depend on a domain expert
  - Accessibility issues: are there issues regarding data distribution across different databases/legacy systems? Where are the disconnects?
- Data description
  - Document data quality issues
  - Compute basic statistics
- Data exploration
  - Simple univariate data plots/distributions
  - Investigate attribute interactions
- Data quality issues
  - Missing values: understand their source
  - Strange distributions

Data Preparation Phase
- Integrate data
  - Joining multiple data tables
  - Summarisation/aggregation of data
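The "integrate data" step above (summarising one table and joining it to another) can be sketched in plain Python. This is only an illustration under assumed data: the `customers` and `transactions` tables, their column names, and the helper functions `summarise` and `left_join` are hypothetical and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical customer table, one row per customer_id.
customers = [
    {"customer_id": 1, "region": "north"},
    {"customer_id": 2, "region": "south"},
]

# Hypothetical transaction table; several rows may share a customer_id.
transactions = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 5.0},
    {"customer_id": 2, "amount": 12.5},
]


def summarise(rows, key, value):
    """Aggregate `value` per `key`: row count and total."""
    summary = defaultdict(lambda: {"n_rows": 0, "total": 0.0})
    for row in rows:
        entry = summary[row[key]]
        entry["n_rows"] += 1
        entry["total"] += row[value]
    return dict(summary)


def left_join(left_rows, summary, key):
    """Attach the per-key summary to each left-hand row (left join)."""
    empty = {"n_rows": 0, "total": 0.0}
    return [{**row, **summary.get(row[key], empty)} for row in left_rows]


# Summarise first, then join, so the result has one row per customer.
per_customer = summarise(transactions, "customer_id", "amount")
mining_table = left_join(customers, per_customer, "customer_id")
for row in mining_table:
    print(row)
```

Aggregating before joining keeps the result at one row per customer, which is usually the shape a modeling tool expects; joining first would repeat each customer once per transaction.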

- Select data
  - Attribute subset selection (rationale for inclusion/exclusion)
  - Data sampling: training, validation, and test sets
- Data transformation
  - Using functions such as log
  - Factor/principal components analysis
  - Normalization/discretisation/binarisation
- Clean data
  - Handling missing values/outliers
- Data construction
  - Derived attributes

The Modeling Phase
- Build model
  - Choose initial parameter settings
  - Study model behaviour: sensitivity analysis
- Assess the model
  - Beware of over-fitting
  - Investigate the error distribution: identify segments of the state space where the model is less effective
  - Iteratively adjust parameter settings

The Evaluation Phase
- Validate model
  - Human evaluation of results by domain experts
  - Evaluate usefulness of results from a business perspective: define control groups, calculate lift curves, estimate expected return on investment
- Review process
- Determine next steps
  - Potential for deployment
  - Deployment architecture
  - Metrics for success of deployment

PMML (Predictive Model Markup Language)

Data mining applications often need several kinds of data mining software and algorithms to work together, which requires that mined models can be readily inherited, reused, and integrated. The Data Mining Group (DMG) proposed the PMML language for this purpose. The latest version of PMML at the time of writing, 4.1, supports 16 types of data mining model:
- AssociationModel (association rules)
- BaselineModel (baseline model)
- ClusteringModel (clustering model)
- GeneralRegressionModel (general regression model)
- MiningModel (composite model)
- NaiveBayesModel (naive Bayes)
- NearestNeighborModel (nearest-neighbor model)
- NeuralNetwork (neural network)
- RegressionModel (linear, polynomial, and logistic regression)
- RuleSetModel (rule set)
- SequenceModel (sequential patterns)
- Scorecard
- TimeSeriesModel
- SupportVectorMachineModel (support vector machine)
- TextModel (text model)
- TreeModel (decision tree)

A PMML model definition consists of the following parts.

Header: the header element contains general information about the PMML document, such as copyright information for the model, its description, and information about the application used to generate the model, such as its name and version.

    <PMML version="3.2" ...
      <Header copyright="Copyright (c) 2009 Togaware" description="RPart Decision Tree">
        <Extension name="timestamp" value="2009-02-15 06:51:50" extender="Rattle"/>
        <Extension name="description" value="iris tree" extender="Rattle"/>
        <Application name="Rattle/PMML" version="1.2.7"/>
      </Header>

Data dictionary: the data dictionary records information about the data fields from which the model was built.

    <DataDictionary numberOfFields="5">
      <DataField name="Species" ...
        <Value value="setosa"/>
        <Value value="versicolor"/>
        <Value value="virginica"/>
      </DataField>
      <DataField name="Sepal.Length" optype="continuous" dataType="double"/>

Data transformations: transformations map user data into a more desirable form for use by the mining model. PMML defines several kinds of simple data transformation:
- Normalization: map values to numbers; the input can be continuous or discrete.
- Discretization: map continuous values to discrete values.
- Value mapping: map discrete values to discrete values.
- Functions (custom and built-in): derive a value by applying a function to one or more parameters.
- Aggregation: summarize or collect groups of values.

Model: contains the definition of the data mining model, with attributes such as:
- Model name (attribute modelName)
- Algorithm name (attribute algorithmName)
- Number of layers (attribute numberOfLayers)

Mining schema: lists all fields used in the model.
- Name: must refer to a field in the data dictionary.
- Usage type: defines the way a field is to be used in the model. Typical values are active, predicted, and supplementary. Predicted fields are those whose values are predicted by the model.
- Outlier treatment: defines the outlier treatment to be used.
- Missing value replacement policy: if this attribute is specified, a missing value is automatically replaced by the given value.
- Missing value treatment: indicates how the missing value replacement was derived.

Targets: allow for post-processing of the predicted value in the form of scaling, if the output of the model is continuous.

PMML example: association rules

Transactions:
    t1: Cracker, Coke, Water
    t2: Cracker, Water
    t3: Cracker, Water
    t4: Cracker, Coke, Water

The example has four parts: model attributes, items, itemsets, and association rules. [Slide showing the model attributes and items: not preserved.] The itemsets and association rules are:

    <AssocItemset id="1" support="1.0" numberOfItems="1">
      <AssocItemRef itemRef="1"/>
    </AssocItemset>
    <AssocItemset id="2" support="1.0" numberOfItems="1">
      <AssocItemRef itemRef="3"/>
    </AssocItemset>
    <!-- and one frequent itemset with two items. -->
    <AssocItemset id="3" support="1.0" numberOfItems="2">
      <AssocItemRef itemRef="1"/>
      <AssocItemRef itemRef="3"/>
    </AssocItemset>
    <!-- Two rules satisfy the requirements -->
    <AssocRule support="1.0" confidence="1.0" antecedent="1" consequent="2"/>
    <AssocRule support="1.0" confidence="1.0" antecedent="2" consequent="1"/>
    </AssociationModel>
    </PMML>

JDM (Java Data Mining API)

JDM aims to provide a standard API for accessing data mining tools. It supports building and using data mining models, and creating, storing, accessing, and maintaining data and metadata, so that Java applications can readily integrate data mining technology.

Semantic Web related standards

Tim Berners-Lee first presented the layered model of the Semantic Web (the "layer cake") in his talk at the XML 2000 conference. Its defining feature is that ontologies and logical inference rules are built on top of XML and RDF/RDFS to support semantics-based knowledge representation and reasoning, so that content can be understood and processed by computers.

Layer 1: Unicode and URI (Uniform Resource Identifier). Unicode is kept aligned with the international standard ISO/IEC 10646, first published by ISO in 1993; its aim is a single unified encoding for all the world's scripts. A URI has three parts: the naming scheme used to access the resource, the name of the machine hosting the resource, and the name of the resource itself in path form. Unlike a URL, a URI need only identify a resource; it does not necessarily provide a way to access it.

Layer 2: XML (eXtensible Markup Language). XML is simple, self-describing, and extensible, and it separates content, structure, and presentation, which makes it well suited to data representation and exchange. The constraints in XML Schema are mainly used to validate the structure of XML documents.

Layer 3: RDF (Resource Description Framework), the metadata layer. RDF is a framework for describing and exchanging metadata, built on XML. It describes objects as (resource, property, property value) triples. [Example figure not preserved.]

Layer 4: RDF-S (RDF Schema). RDF-S extends RDF. It is the vocabulary description language of RDF, used to define the vocabulary that appears in RDF resource descriptions.

Layer 5: Ontology and Rule, the domain knowledge layer. OWL is used to state explicitly the terms in a vocabulary and the relationships between them; for expressing word meaning and semantics, OWL offers greater expressive power. Rules describe the premises and conclusions in domain knowledge. SPARQL (SPARQL Protocol and RDF Query Language) is the W3C-recommended query language and protocol for RDF data.

8.2 Data Mining Tools

Free open-source data mining software and applications
- GATE: a natural language processing and language engineering tool.
- Orange: a component-based data mining and machine learning software suite written in the Python language.
- R: a programming language and software environment for statistical computing, data mining, and graphics.
- RapidMiner: an environment for machine learning and data mining experiments.
- UIMA: the UIMA (Unstructured Information Management Architecture) is a component framework for analyzing unstructured content such as text, audio, and video, originally developed by IBM.
- Weka: a suite of machine learning software applications written in the Java programming language.

Commercial data mining software and applications
- IBM SPSS Modeler: data mining software provided by IBM.
- Microsoft Analysis Services: data mining software provided by Microsoft.
- Oracle Data Mining: data mining software by Oracle.
- SAS Enterprise Miner: data mining software provided by the SAS Institute.
- STATISTICA Data Miner: data mining software provided by StatSoft.

Weka: main features
- 49 data preprocessing tools
- 76 classification/regression algorithms
- 8 clustering algorithms
- 3 algorithms for finding association rules
- 15 attribute/subset evaluators + 10 search algorithms for feature selection

Weka: main GUI
- "The Explorer" (exploratory data analysis)
- "The Experimenter" (experimental environment)
- "The KnowledgeFlow" (new process-model-inspired interface)

WEKA on
