
Data Mining: Unlimited New Research Frontiers
Jiawei Han
Data Mining Research Group
Department of Computer Science
University of Illinois at Urbana-Champaign
Acknowledgements: NSF, ARL, NASA, AFOSR (MURI), DHS, Microsoft, IBM, Yahoo!, HP Lab & Boeing
04 May 2023

Outline
An Introduction to Data Mining Research Group
Mining and OLAPing Information Networks
  Mining Heterogeneous Information Networks
  Mining Text-Rich Information Networks
  OLAPing (multi-dimensional analysis) of information networks: TextCube, OLAP heterogeneous networks
Taming the Web: WINACS (integrated mining of Web structures and contents)
Mining Cyber-Physical Systems and Networks
Conclusions

Data Mining and Data Warehousing

Jiawei Han’s Group at CS, UIUC

Mining patterns and knowledge discovery from massive data
Data mining in heterogeneous information networks
Exploring broad applications of data mining
Developed many effective data mining algorithms, e.g., FPgrowth, PrefixSpan, gSpan, StarCubing, CrossMine, RankingCube, CrossClus, RankClus, and NetClus
600+ research papers in conferences and journals
Fellow of ACM, Fellow of IEEE, ACM SIGKDD Innovation Award, W. McDowell Award, Daniel Drucker Eminent Faculty Award
Textbook, “Data Mining: Concepts and Techniques,” adopted worldwide
Project lead for NASA EventCube for Aviation Safety [2023-2023]
Director of the Information Network Academic Research Center, funded by the Army Research Lab (ARL) [2023-2023]

Data Mining Research Group at CS, UIUC

New Books on Data Mining & Link Mining

Han, Kamber and Pei, Data Mining, 3rd ed. 2023
Yu, Han and Faloutsos (eds.), Link Mining, 2023
Sun and Han, Mining Heterogeneous Information Networks, 2023

Mining Heterogeneous Information Networks

RankClus/NetClus vs. RankCompete: A Competing Random Walk Model for Rank-Based Clustering

Top-5 ranked conferences
  Database: VLDB, SIGMOD, ICDE, PODS, EDBT
  Data Mining: KDD, SDM, ICDM, PKDD, PAKDD
  AI: IJCAI, AAAI, ICML, CVPR, ECML
  IR: SIGIR, ECIR, CIKM, WWW, WSDM
Top-5 ranked terms
  Database: data, database, query, system, xml
  Data Mining: mining, data, clustering, classification, frequent
  AI: learning, knowledge, reasoning, logic, cognition
  IR: retrieval, information, web, search, text

RankClass [KDD11]: Knowledge Propagation in Heterogeneous Network

Similarity Search and Role Discovery in Information Networks

PathSim [VLDB11]: Meta Path-Guided Similarity Search in Networks
Which images are most similar to me in Flickr?
  Path: ITI
  Path: ITIGITI

Role Discovery in Information Networks [KDD’10]
[Figure: a “dirty” information network (imaginary) is cleaned into an inferred adversarial network; roles such as Chief, Insurgent, and Cell Lead are inferred automatically]

Advisee          Top Ranked Advisor      Time    Note
David M. Blei    1. Michael I. Jordan    01-03   PhD advisor, 2023
                 2. John D. Lafferty     05-06   Postdoc, 2023
Hong Cheng       1. Qiang Yang           02-03   MS advisor, 2023
                 2. Jiawei Han           04-08   PhD advisor, 2023
Sergey Brin      1. Rajeev Motwani       97-98   Unofficial advisor

Interesting Results from Other Domains
RankCompete: Organize your photo album automatically!
Rank treatments for AIDS from MEDLINE
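To make the meta-path similarity idea concrete, here is a minimal sketch of a PathSim-style score. It assumes the usual definition for a symmetric meta-path: twice the number of path instances between x and y, divided by the number of path instances from x to itself plus the number from y to itself. It also assumes that ITI reads as Image-Tag-Image. The function name `pathsim` and the toy tag data are hypothetical and are not taken from the slides.

```python
def pathsim(adj, x, y):
    """PathSim-style similarity under the Image-Tag-Image meta-path.

    adj maps each image to a dict {tag: weight} (hypothetical toy input).
    """
    def path_count(a, b):
        # Number of Image-Tag-Image path instances between a and b.
        return sum(w * adj[b].get(tag, 0) for tag, w in adj[a].items())

    denom = path_count(x, x) + path_count(y, y)
    return 2 * path_count(x, y) / denom if denom else 0.0


# Toy data, invented for illustration only.
tags = {
    "img1": {"beach": 1, "sunset": 1},
    "img2": {"beach": 1, "sunset": 1},
    "img3": {"beach": 1, "city": 1, "night": 1},
}
print(pathsim(tags, "img1", "img2"))
print(pathsim(tags, "img1", "img3"))
```

The normalization by self-paths is what keeps a heavily tagged image from looking similar to everything merely because it has many links.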

Meta-Path Based Co-authorship Prediction in DBLP
Co-authorship prediction problem
  Whether two authors are going to collaborate for the first time
Co-authorship encoded in meta-path
  Author-Paper-Author
Topological features encoded in meta-paths
  Meta-paths between authors under length 4
[Table: Meta-Path | Semantic Meaning (rows not recoverable)]
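The topological features above can be sketched as path-instance counts along a few meta-paths between two authors. This is only an illustration: the choice of the two meta-paths (Author-Paper-Author-Paper-Author and Author-Paper-Venue-Paper-Author), the function names, and the toy bibliography are assumptions, not the feature set used in the talk.

```python
from collections import Counter


def apa_count(papers, a, b):
    # Author-Paper-Author: number of papers listing both a and b.
    return sum(1 for p in papers.values()
               if a in p["authors"] and b in p["authors"])


def apapa_count(papers, a, b):
    # Author-Paper-Author-Paper-Author: paths through any middle author.
    authors = {x for p in papers.values() for x in p["authors"]}
    return sum(apa_count(papers, a, c) * apa_count(papers, c, b)
               for c in authors)


def apvpa_count(papers, a, b):
    # Author-Paper-Venue-Paper-Author: paper pairs that share a venue.
    va = Counter(p["venue"] for p in papers.values() if a in p["authors"])
    vb = Counter(p["venue"] for p in papers.values() if b in p["authors"])
    return sum(n * vb[v] for v, n in va.items())


def meta_path_features(papers, a, b):
    return {"A-P-A-P-A": apapa_count(papers, a, b),
            "A-P-V-P-A": apvpa_count(papers, a, b)}


# Hypothetical toy bibliography.
papers = {
    "p1": {"authors": ["ann", "bob"], "venue": "KDD"},
    "p2": {"authors": ["bob", "cat"], "venue": "VLDB"},
    "p3": {"authors": ["cat"], "venue": "KDD"},
}
print(meta_path_features(papers, "ann", "cat"))
```

Feature vectors like this one, computed for author pairs that have not yet collaborated, could then be fed to a classifier such as logistic regression.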

The Power of PathPredict
Explain the prediction power of each meta-path
  Wald test for logistic regression
Higher prediction accuracy than using the projected homogeneous network
  7% higher in prediction accuracy
Do social relations play a more important role?
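As a worked illustration of the Wald test mentioned above: for one logistic-regression coefficient, the statistic is the squared ratio of the estimate to its standard error, and under the null hypothesis (coefficient equals zero) it is compared against a chi-square distribution with one degree of freedom. The coefficient values below are made up for the example; they are not results from the talk.

```python
import math


def wald_test(coef, std_err):
    """Return (Wald statistic, two-sided p-value) for H0: coef == 0."""
    z = coef / std_err
    stat = z * z
    # P(chi-square with 1 df > stat) equals P(|Z| > |z|) for standard normal Z.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return stat, p_value


# Hypothetical fitted values for two meta-path features.
fitted = {"A-P-A-P-A": (0.84, 0.21), "A-P-V-P-A": (0.10, 0.15)}
for name, (coef, se) in fitted.items():
    stat, p = wald_test(coef, se)
    print(f"{name}: W={stat:.2f}, p={p:.4f}")
```

A small p-value suggests that the corresponding meta-path carries predictive signal beyond the other features in the model.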

Case Study: Predicting Concrete Co-Authors
High quality predictive power for such a difficult task
Using data in T0 = [1989; 1995] and T1 = [1996; 2023]
Predict new coauthor relationship in T2 = [2023; 2023]

Exploring Opposite Opinions and Dissimilarities for Sentiment Analysis

Intuitions:
  Friends tend to hold similar opinions, while foes tend to hold conflicting opinions
  Based on users’ sentiment scores on different objects, we can infer the similarity and dissimilarity (i.e., pseudo-friend and pseudo-foe relationship) between users
  Based on the inferred friendship, we can improve sentiment analysis and user clustering by considering global consistency on heterogeneous networks
State-of-the-Art
  Explores similar opinions instead of opposite opinions
  Typically considers text content while ignoring InfoNet
  Relies on observed friendship (but many friendships are hidden)
Industry Need/Benefits
  Use sentiment analysis to understand and mine public opinions on product/market-related issues
  QoI-aware mining of text-rich multi-genre networks
  Intelligent methods for public opinion assessment
Insight: Exploring opposite opinions may help to discover hidden friendship, which can produce better sentiment scores and user clustering (the two mutually enhance each other).
  1. Consider information and social networks
  2. Explore opposite and similar opinions
  3. Both observed & hidden friendship are valuable
[Figure: what are people thinking about a certain target/issue? Observed friendship is extended to observed + inferred friendship, including pseudo-foes; refining sentiment scores and inferring hidden friendship feed each other]
Long-Term Goals
  Structurally model a text-rich multi-genre network and investigate methods for mining knowledge from such networks
  Enhance search and knowledge discovery capability in the text-rich multi-genre network model
  Similar studies of text-rich multi-genre networks can lead to novel models for mining and searching rich textual data in social media for business applications
Approaches
  Develop a new meta-path-based measure for inferring similarity and dissimilarity between objects and users, based on sentiment scores
  Develop a graph-based semi-supervised refining model to propagate the sentiment scores from labeled data to unlabeled data

Exploring Opposite Opinions and Dissimilarities for Sentiment Analysis (2)

Result: For sentiment classification, our model with 40 labeled data achieves better performance than an SVM-based supervised model with 600 training data; for user clustering, the results on inferred friendship carry more semantic meaning than the results on observed friendship.
[Figure: clustering results of Normalized Cuts on different graphs: observed friendship, observed + inferred, and ideal results (ground truth). Each cluster is a group of users with similar opinions wrt 3 candidates.]

Truth Discovery: From Truth-Finder to Latent Truth Model (1)

State-of-the-Art
  HITS-like random walk methods (e.g., TruthFinder (KDD’07), 3-Estimate (WSDM’10), Investment (COLING’10), etc.)
  Limitations: (1) quality is a single value, which cannot well support multiple true attributes for each entity; and (2) they are based on heuristics, not principled probabilistic models.
Industry Need/Benefits
  Integrate entity-attribute databases from multiple sources, e.g., sales campaign/product opinions, etc.
  Automatically learn the quality of each data source and the most accurate integrated records
Insight: Some sources tend to miss true attributes (False Negatives), while some others tend to produce false attributes (False Positives). Modeling two-sided quality is key to supporting multiple true values per entity for truth finding.
Contributions of Latent Truth Model
  A principled probabilistic model
  Models negative claims and two-sided source quality with Bayesian regularization
  Naturally supports multiple true attribute values
  LTM can naturally incorporate prior domain knowledge through Bayesian priors
  LTM can run in either batch or online streaming modes for incremental truth finding
[Figure: sources IMDB, Netflix, and BadSource make positive claims and implicit negative claims about Harry Potter; claims are marked correct or incorrect; source profiles shown are high precision/high recall, high precision/low recall, and low precision/low recall. Model variables: Quality of Sources, Observation of Claims, Truth of Facts.]
Generative Process in LTM:
  1) For each source, generate false positive rate (with strong prior) and sensitivity (with uniform prior).
  2) For each fact f, generate prior truth probability and truth label.
  3) For each claim, generate observation based on truth label and corresponding source quality.
Long-Term Goals
  Truth-finding models for more general data types (numerical attributes, etc.)
  Model source quality in other data integration tasks, e.g., entity resolution.
  Trustworthiness in multi-genre networks (text-rich networks, social networks, etc.)

Truth Discovery: From Truth-Finder to Latent Truth Model (2)

Result:

Outperforms state-of-the-art methods on two real-world datasets: books and movies
LTM is also very scalable: see LTM_inc
Experimental datasets: large and real
  Book Authors from (1263 books, 879 sources, 48153 claims, 2420 book-author, 100 labeled)
  Movie Directors from Bing (15073 movies, 12 sources, 108873 claims, 33526 movie-director, 100 labeled)
[Figures: varying cutoff threshold (consistently better); running time]
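The three-step generative process listed for LTM can be sketched as a small sampler. This is a forward simulation only, not the inference procedure, and it makes several assumptions that are not in the slides: Beta distributions for all priors, the specific prior parameters, and every source making a claim (positive or implicit negative) about every fact. All names are hypothetical.

```python
import random


def sample_ltm(num_sources, num_facts, seed=0,
               fpr_prior=(1.0, 50.0),    # assumed "strong" prior: low false positive rate
               sens_prior=(1.0, 1.0),    # uniform prior on sensitivity
               truth_prior=(1.0, 1.0)):  # assumed prior on truth probability
    rng = random.Random(seed)

    # 1) Per source: false positive rate and sensitivity.
    sources = [{"fpr": rng.betavariate(*fpr_prior),
                "sensitivity": rng.betavariate(*sens_prior)}
               for _ in range(num_sources)]

    # 2) Per fact: prior truth probability, then a truth label.
    facts = []
    for _ in range(num_facts):
        theta = rng.betavariate(*truth_prior)
        facts.append(rng.random() < theta)

    # 3) Per claim: observation drawn from the source's two-sided quality.
    claims = {}
    for s, quality in enumerate(sources):
        for f, is_true in enumerate(facts):
            p_positive = quality["sensitivity"] if is_true else quality["fpr"]
            claims[(s, f)] = rng.random() < p_positive
    return sources, facts, claims


sources, facts, claims = sample_ltm(num_sources=3, num_facts=5)
print(sum(claims.values()), "positive claims out of", len(claims))
```

Keeping sensitivity and false positive rate as separate parameters is what lets a source be precise but incomplete, or complete but noisy, which a single quality score cannot express.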

Event Cube: An Overview
[Figure: a multidimensional text database organized along Time (e.g., 98.01, 98.02, 99.01, 99.02, rolling up to 1998, 1999), Location (airports LAX, SJC, MIA, AUS, rolling up to states CA, FL, TX), and Topic (overshoot and undershoot under Deviation; birds and turbulence under Encounter), with drill-down and roll-up. The Event Cube representation gives the analyst analysis support: multidimensional OLAP, ranking, cause analysis, topic summarization/comparison, …]

EventCube: An Organized Approach for Mining and Understanding Anomalous Aviation Events
Funded by NASA $1.2M (2023-now)

Text/Topic Cube: General Idea
Heterogeneous: categorical attributes + unstructured text
How to combine?
Our solution:
  Cube: categorical attributes (Time, Location, Place, Environment, …)
  Text/Topic Model: unstructured text (term/topic weights: T1 W1, T2 W2, T3 W3, …)
  Measure
[Figure: an event report with ACN, categorical attributes, and text data]
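One way to picture the text cube idea is a dictionary whose keys are cells over the categorical attributes and whose measure is a bag of term counts, with roll-up implemented by merging the bags. The sketch below is a simplification under stated assumptions: raw whitespace tokens stand in for the term/topic model, and the report fields, function names, and sample reports are all invented for illustration.

```python
from collections import Counter, defaultdict


def build_text_cube(reports, dims):
    # One cell per combination of categorical values; measure = term counts.
    cube = defaultdict(Counter)
    for report in reports:
        cell = tuple(report[d] for d in dims)
        cube[cell].update(report["text"].lower().split())
    return cube


def roll_up(cube, dims, keep):
    # Aggregate away every dimension not listed in `keep`.
    idx = [dims.index(d) for d in keep]
    out = defaultdict(Counter)
    for cell, counts in cube.items():
        out[tuple(cell[i] for i in idx)].update(counts)
    return out


def term_weights(counts, k=3):
    # Top-k terms with relative frequency as the weight.
    total = sum(counts.values())
    return [(term, n / total) for term, n in counts.most_common(k)]


# Hypothetical reports, not real aviation data.
reports = [
    {"Time": "1998", "Location": "LAX", "text": "runway overshoot on landing"},
    {"Time": "1998", "Location": "SJC", "text": "birds near runway"},
    {"Time": "1999", "Location": "LAX", "text": "turbulence and overshoot"},
]
dims = ["Time", "Location"]
cube = build_text_cube(reports, dims)
by_location = roll_up(cube, dims, keep=["Location"])
print(term_weights(by_location[("LAX",)], k=1))
```

Because term counts add up across cells, roll-up needs no access to the original reports; a topic-model measure would need a more careful aggregation scheme.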

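Effective Keyword Search
TopCells (ICDE’10): Ranking aggregated cells (objects) in TextCube.
[Figure: the query “Healthcare Reform” is sent to the TopCells system, which returns ranked cells: (Person: Obama, Year: 2023), (Org: Congress, Year: 2023), (Person: Hillary, Year: 2023), …]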

Effective OLAP Exploration
TEXplorer (CIKM’11): Integrating keyword-based ranking and OLAP exploration
[Figure: the query “Healthcare Reform” is sent to the TEXplorer system, which ranks dimensions: Top-1 Dimension: Person; Top-2 Dimension: Org; Top-3 Dimension: Time (2023, 2023, 2023)]

Effective Event Tracking
PET (KDD’10): tracking popularity and textual representation of events in social communities (Twitter)
[Figure: the Popular Event Tracking system follows “Healthcare Reform” over time (Feb 2023, Mar 2023, Apr 2023), plotting popularity and content; content terms shown include “debate, cost, senate, …”, “pass, success, law, …”, and “benefit, profit, effective, …”]

Growing Parallel Paths (WWW 2023)

Result: [figure not recoverable]

Mapping Pages to Records (CIKM’10)
Database records can be found on link paths!

WinaCS: Web Information Network Analysis for Computer Science
Integration of Web structure mining and information network analysis
Tim Weninger, Marina Danilevsky, et al., “WinaCS: Construction and Analysis of Web-Based Computer Science Information Networks”, ACM SIGMOD’11 (system demo), Athens, Greece, June 2023.

Discovery of Swarms and Periodic Patterns in Moving Object Data
A system that mines moving object patterns: Z. Li, et al., “MoveMine: Mining Moving Object Databases”, SIGMOD’10 (system demo)
Z. Li, B. Ding, J. Han, and R. Kays, “Mining Hidden Periodic Behaviors for Moving Objects”, KDD’10 (sub)
Z. Li, B. Ding, J. Han, and R. Kays, “Swarm: Mining Relaxed Temporal Moving Object Clusters”, VLDB’10 (sub)
[Figure: bird flying paths shown on Google Earth, next to the periodic patterns mined by our new method]
[Figure: Convoy discovers only restricted patterns; Swarm discovers more patterns]

Geo Topic Discovery: Mining Spatial Text
[Figure: LDM, TDM, GeoFolk, and LGTA compared on geo-tagged photos with landscape (coast vs. desert vs. mountain)]

Z. Yin, et al., Geo Topic Discovery and Comparison, WWW’11


Conclusions: Mining Big and Complex Data
Mining big data: A critical part of big data initiatives
Most data objects are interconnected, forming heterogeneous information networks
Most datasets can be “organized” or “transformed” into “structured” multi-typed, heterogeneous
