




Data Mining: Unlimited New Research Frontiers
Jiawei Han
Data Mining Research Group, Department of Computer Science, University of Illinois at Urbana-Champaign
Acknowledgements: NSF, ARL, NASA, AFOSR (MURI), DHS, Microsoft, IBM, Yahoo!, HP Lab & Boeing
04 May 2023
Outline
- An Introduction to Data Mining Research Group
- Mining and OLAPing Information Networks
- Mining Heterogeneous Information Networks
- Mining Text-Rich Information Networks
- OLAPing (multi-dimensional analysis) of information networks: Text Cube, OLAP heterogeneous networks
- Taming the Web: WINACS (integrated mining of Web structures and contents)
- Mining Cyber-Physical Systems and Networks
- Conclusions

Data Mining and Data Warehousing
Jiawei Han’s Group at CS, UIUC
- Mining patterns and knowledge discovery from massive data
- Data mining in heterogeneous information networks
- Exploring broad applications of data mining
- Developed many effective data mining algorithms, e.g., FPgrowth, PrefixSpan, gSpan, StarCubing, CrossMine, RankingCube, CrossClus, RankClus, and NetClus
- 600+ research papers in conferences and journals
- Fellow of ACM, Fellow of IEEE, ACM SIGKDD Innovation Award, W. McDowell Award, Daniel Drucker Eminent Faculty Award
- Textbook, “Data Mining: Concepts and Techniques,” adopted worldwide
- Project lead for NASA EventCube for Aviation Safety [2023-2023]
- Director of Information Network Academic Research Center funded by Army Research Lab (ARL) [2023-2023]
Data Mining Research Group at CS, UIUC

New Books on Data Mining & Link Mining
- Han, Kamber and Pei, Data Mining, 3rd ed., 2023
- Yu, Han and Faloutsos (eds.), Link Mining, 2023
- Sun and Han, Mining Heterogeneous Information Networks, 2023
Mining Heterogeneous Information Networks

RankClus/NetClus vs. RankCompete: A Competing Random Walk Model for Rank-Based Clustering

Top-5 ranked conferences
| Database | Data Mining | AI | IR |
|---|---|---|---|
| VLDB | KDD | IJCAI | SIGIR |
| SIGMOD | SDM | AAAI | ECIR |
| ICDE | ICDM | ICML | CIKM |
| PODS | PKDD | CVPR | WWW |
| EDBT | PAKDD | ECML | WSDM |

Top-5 ranked terms
| Database | Data Mining | AI | IR |
|---|---|---|---|
| data | mining | learning | retrieval |
| database | data | knowledge | information |
| query | clustering | reasoning | web |
| system | classification | logic | search |
| xml | frequent | cognition | text |

RankClass [KDD11]: Knowledge Propagation in Heterogeneous Network

Similarity Search and Role Discovery in Information Networks
- PathSim [VLDB11]: Meta Path-Guided Similarity Search in Networks
  - Which images are most similar to me in Flickr?
  - Path: ITI
  - Path: ITIGITI
- Role Discovery in Information Networks [KDD’10]
  - A “dirty” information network (imaginary)
  - Automatically infer the cleaned/inferred adversarial network (roles: Chief, Insurgent, Cell Lead)

| Advisee | Top Ranked Advisor | Time | Note |
|---|---|---|---|
| David M. Blei | 1. Michael I. Jordan | 01-03 | PhD advisor, 2023 |
| | 2. John D. Lafferty | 05-06 | Postdoc, 2023 |
| Hong Cheng | 1. Qiang Yang | 02-03 | MS advisor, 2023 |
| | 2. Jiawei Han | 04-08 | PhD advisor, 2023 |
| Sergey Brin | 1. Rajeev Motwani | 97-98 | Unofficial advisor |

Interesting Results from Other Domains
- RankCompete: Organize your photo album automatically!
- Rank treatments for AIDS from MEDLINE
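The PathSim slide above only names the meta-paths, so the following is a minimal sketch of how a meta-path-guided similarity could be computed. It reads "ITI" as Image-Tag-Image (an assumption; the slide does not expand the abbreviation) and uses the commonly cited PathSim normalization, s(x, y) = 2 * paths(x, y) / (paths(x, x) + paths(y, y)). The image names and tags are invented toy data, and the function names are hypothetical.

```python
from collections import defaultdict

def iti_path_counts(image_tags):
    """Count Image-Tag-Image path instances for every ordered image pair."""
    tag_to_images = defaultdict(set)
    for image, tags in image_tags.items():
        for tag in tags:
            tag_to_images[tag].add(image)
    counts = defaultdict(int)
    # Every shared tag contributes one path instance between two images.
    for images in tag_to_images.values():
        for x in images:
            for y in images:
                counts[(x, y)] += 1
    return counts

def pathsim(counts, x, y):
    """PathSim score: path count between x and y, normalized by self-path counts."""
    denom = counts[(x, x)] + counts[(y, y)]
    return 2 * counts[(x, y)] / denom if denom else 0.0

# Hypothetical toy data: image -> set of tags.
image_tags = {
    "img1": {"beach", "sunset"},
    "img2": {"beach", "sunset"},
    "img3": {"beach", "city", "night", "street"},
}
counts = iti_path_counts(image_tags)
for other in ("img2", "img3"):
    print("img1 vs", other, pathsim(counts, "img1", other))
```

The normalization is what separates this from a raw path count: an image with many tags is not automatically similar to everything else.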
Meta-Path Based Co-authorship Prediction in DBLP
- Co-authorship prediction problem
  - Whether two authors are going to collaborate for the first time
- Co-authorship encoded in meta-path
  - Author-Paper-Author
- Topological features encoded in meta-paths
  - Meta-paths between authors under length 4
[Table: Meta-Path | Semantic Meaning (rows not preserved)]
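As a rough illustration of "topological features encoded in meta-paths", the sketch below counts path instances between two authors along a given meta-path in a small typed graph. The node types (A = author, P = paper, V = venue), the toy edges, and the function names are all assumptions made for this example; the slide does not specify how the features are computed.

```python
from collections import defaultdict

def build_index(edges):
    """Index undirected typed edges as node -> neighbour type -> neighbours."""
    index = defaultdict(lambda: defaultdict(list))
    for u, v in edges:
        index[u][v[0]].append(v)
        index[v][u[0]].append(u)
    return index

def path_count(index, source, target, meta_path):
    """Count path instances from source to target that follow meta_path."""
    frontier = {source: 1}
    for next_type in meta_path[1:]:
        reached = defaultdict(int)
        for node, n_paths in frontier.items():
            for neighbour in index[node][next_type]:
                reached[neighbour] += n_paths
        frontier = reached
    return frontier.get(target, 0)

# Hypothetical toy network; each node is a (type, name) pair.
edges = [
    (("A", "alice"), ("P", "p1")),
    (("A", "bob"), ("P", "p1")),
    (("A", "bob"), ("P", "p2")),
    (("A", "carol"), ("P", "p2")),
    (("A", "carol"), ("P", "p3")),
    (("P", "p1"), ("V", "KDD")),
    (("P", "p2"), ("V", "KDD")),
    (("P", "p3"), ("V", "VLDB")),
]
index = build_index(edges)
alice, carol = ("A", "alice"), ("A", "carol")
for meta_path in (["A", "P", "A"], ["A", "P", "A", "P", "A"], ["A", "P", "V", "P", "A"]):
    print("-".join(meta_path), path_count(index, alice, carol, meta_path))
```

Each count can serve as one feature of an author pair; here alice and carol have no shared paper yet, but they are connected through a common co-author and a common venue.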
The Power of PathPredict
- Explain the prediction power of each meta-path
  - Wald test for logistic regression
- Higher prediction accuracy than using projected homogeneous network
  - 7% higher in prediction accuracy
- Social relations play more important role?
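The Wald test mentioned above can be sketched as follows: for one logistic regression coefficient, the statistic is the estimate divided by its standard error, compared against a standard normal distribution. The coefficient values below are invented placeholders, not results from the talk, and `wald_test` is a hypothetical helper name.

```python
import math

def wald_test(coefficient, standard_error):
    """Return the Wald z statistic and the two-sided p-value under N(0, 1)."""
    z = coefficient / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical (estimate, standard error) pairs for two meta-path features.
fitted = {
    "A-P-A-P-A": (0.80, 0.20),
    "A-P-V-P-A": (0.05, 0.10),
}
for meta_path, (beta, se) in fitted.items():
    z, p = wald_test(beta, se)
    print(meta_path, round(z, 2), round(p, 4))
```

A small p-value suggests that the corresponding meta-path feature contributes to the prediction beyond chance.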
Case Study: Predicting Concrete Co-Authors
- High-quality predictive power for such a difficult task
- Using data in T0 = [1989; 1995] and T1 = [1996; 2023]
- Predict new coauthor relationship in T2 = [2023; 2023]
Exploring Opposite Opinions and Dissimilarities for Sentiment Analysis
What are people thinking about a certain target/issue?

Intuitions
- Friends tend to hold similar opinions, while foes tend to hold conflicting opinions
- Based on users’ sentiment scores on different objects, we can infer the similarity and dissimilarity (i.e., pseudo-friend and pseudo-foe relationship) between users
- Based on the inferred friendship, we can improve sentiment analysis and user clustering by considering global consistency on heterogeneous networks

State-of-the-Art
- Explore similar opinions instead of opposite opinions
- Typically consider text content while ignoring the information network
- Rely on observed friendship (but many are hidden)

Industry Need/Benefits
- Use sentiment analysis to understand and mine public opinions on product/market-related issues
- QoI-aware mining of text-rich multi-genre networks
- Intelligent methods for public opinion assessment

Insight: Exploring opposite opinions may help to discover hidden friendship, which can produce better sentiment scores and user clustering (the two mutually enhance each other).
1. Consider information and social networks
2. Explore opposite and similar opinions
3. Both observed & hidden friendship are valuable

[Figure: observed friendship; observed + inferred friendship; refining sentiment scores; infer hidden friendship; pseudo-foe]

Long-Term Goals
- Structurally model a text-rich multi-genre network and investigate methods for mining knowledge from such networks
- Enhance search and knowledge discovery capability in the text-rich multi-genre network model
- Similar studies of text-rich multi-genre networks can lead to novel models for mining and searching rich textual data in social media for business applications

Approaches
- Develop a new meta-path-based measure for inferring similarity and dissimilarity between objects and users, based on sentiment scores
- Develop a graph-based semi-supervised refining model to propagate the sentiment scores from labeled data to unlabeled data

Result: For sentiment classification, our model with 40 labeled data achieves better performance than an SVM-based model with 600 training data; for user clustering, the results on inferred friendship carry more semantic meaning than the results on observed friendship.

[Figure: clustering results of Normalized Cuts on different graphs (observed friendship; observed + inferred friendship; ideal results, i.e., ground truth). Each cluster is a group of users with similar opinions w.r.t. 3 candidates. Better than the SVM-based supervised model with 600 training data.]

Exploring Opposite Opinions and Dissimilarities for Sentiment Analysis (2)

Truth Discovery: From Truth-Finder to Latent Truth Model (1)

State-of-the-Art
- HITS-like random walk methods (e.g., TruthFinder (KDD’07), 3-Estimate (WSDM’10), Investment (COLING’10), etc.)
- Limitations: (1) quality as a single value, which cannot well support multiple true attributes for each entity; and (2) based on heuristics, not principled probabilistic models

Industry Need/Benefits
- Integrate entity-attribute databases from multiple sources, e.g., sales campaign/product opinions, etc.
- Automatically learn the quality of each data source and the most accurate integrated records

Insight: Some sources tend to miss true attributes (false negatives), while some others tend to produce false attributes (false positives). Modeling two-sided quality is key to supporting multiple true values per entity for truth finding.

Contributions of Latent Truth Model
- A principled probabilistic model
- Model negative claims and two-sided source quality with Bayesian regularization
- Naturally support multiple true attribute values
- LTM can naturally incorporate prior domain knowledge through Bayesian priors
- LTM can run in either batch or online streaming modes for incremental truth finding

[Figure: generating implicit negative claims for "Harry Potter" (positive claim, negative claim) across the sources IMDB, Netflix and BadSource; legend: correct claim, incorrect claim; source profiles: high precision and high recall, high precision and low recall, low precision and low recall. Graphical model: quality of sources, observation of claims, truth of facts.]

Generative Process in LTM:
1) For each source, generate false positive rate (with strong prior) and sensitivity (with uniform prior).
2) For each fact f, generate prior truth probability and truth label.
3) For each claim, generate observation based on truth label and corresponding source quality.

Long-Term Goals
- Truth-finding models for more general data types (numerical attributes, etc.)
- Model source quality in other data integration tasks, e.g., entity resolution
- Trustworthiness in multi-genre networks (text-rich networks, social networks, etc.)

Truth Discovery: From Truth-Finder to Latent Truth Model (2)

Result:
- Outperform state-of-the-art methods on two real-world datasets: books and movies
- LTM is also very scalable: see LTM_inc

Experimental datasets: large and real
- Book Authors from (1263 books, 879 sources, 48153 claims, 2420 book-author, 100 labeled)
- Movie Directors from Bing (15073 movies, 12 sources, 108873 claims, 33526 movie-director, 100 labeled)

[Figures: varying cutoff threshold (consistently better); running time]
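The three-step generative process of LTM listed above can be sketched as a forward sampler. This is only an illustration under stated assumptions: Beta distributions are assumed for all priors, the prior parameters are invented, the fact identifiers are placeholders, and `sample_ltm` is a hypothetical name. It shows how data would be generated, not how the model is fitted.

```python
import random

def sample_ltm(sources, facts, seed=0,
               fpr_prior=(1.0, 100.0),    # assumed strong prior: false positives are rare
               sens_prior=(1.0, 1.0),     # Beta(1, 1) is the uniform prior on sensitivity
               truth_prior=(1.0, 1.0)):   # assumed prior on each fact's truth probability
    rng = random.Random(seed)

    # Step 1: two-sided quality for each source.
    quality = {
        s: {"fpr": rng.betavariate(*fpr_prior),
            "sensitivity": rng.betavariate(*sens_prior)}
        for s in sources
    }

    # Step 2: prior truth probability and truth label for each fact.
    truth = {}
    for f in facts:
        theta = rng.betavariate(*truth_prior)
        truth[f] = rng.random() < theta

    # Step 3: each source's observation of each fact (True = positive claim).
    claims = {}
    for f in facts:
        for s in sources:
            p_positive = quality[s]["sensitivity"] if truth[f] else quality[s]["fpr"]
            claims[(f, s)] = rng.random() < p_positive
    return quality, truth, claims

# Source names follow the slide; the facts are hypothetical placeholders.
sources = ["IMDB", "Netflix", "BadSource"]
facts = [("movie_a", "person_x"), ("movie_a", "person_y"), ("movie_b", "person_z")]
quality, truth, claims = sample_ltm(sources, facts, seed=1)
print(sum(claims.values()), "positive claims out of", len(claims))
```

A true fact is claimed with probability equal to the source's sensitivity, while a false fact is claimed with probability equal to its false positive rate, which is the two-sided quality idea stated in the insight.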
EventCube: An Overview
[Figure: a multidimensional text database with dimensions Time (98.01, 98.02, 99.01, 99.02), Location (LAX, SJC, MIA, AUS) and Topic (overshoot, undershoot, birds, turbulence) is organized into the EventCube representation. Roll-up and drill-down move between these values and the coarser levels Time (1998, 1999), Location (CA, FL, TX) and Topic (Deviation, Encounter). Analysis support for the analyst: multidimensional OLAP, ranking, cause analysis, topic summarization/comparison, ...]
EventCube: An Organized Approach for Mining and Understanding Anomalous Aviation Events
Funded by NASA, $1.2M (2023-now)

Text/Topic Cube: General Idea
- Heterogeneous: categorical attributes + unstructured text
- How to combine?
- Our solution:
  - Cube: categorical attributes
  - Text/topic model: unstructured text
  - Measure: term/topic weights (T1: W1, T2: W2, T3: W3, ...)
[Figure: event report table with columns ACN, Time, Location, Place, Environment, ..., Text data]
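To make the text cube idea concrete, here is a minimal sketch in which each cell is identified by categorical attribute values and carries an aggregated term measure. Plain term frequency is used as the measure (an assumption; the slides only say "term/topic weight"), the report texts are invented, and `build_cells` is a hypothetical helper. The airport-to-state mapping follows the Location hierarchy shown in the overview figure.

```python
from collections import Counter, defaultdict

# Location hierarchy: airport -> state.
AIRPORT_TO_STATE = {"LAX": "CA", "SJC": "CA", "MIA": "FL", "AUS": "TX"}

def build_cells(reports, cell_key):
    """Group reports into cells and aggregate term frequencies per cell."""
    cells = defaultdict(Counter)
    for report in reports:
        cells[cell_key(report)].update(report["text"].lower().split())
    return cells

# Hypothetical event reports.
reports = [
    {"time": "98.01", "airport": "LAX", "text": "overshoot on landing"},
    {"time": "98.02", "airport": "SJC", "text": "birds near runway"},
    {"time": "99.01", "airport": "MIA", "text": "turbulence on climb"},
    {"time": "99.02", "airport": "LAX", "text": "overshoot after turbulence"},
]

# Base cuboid: one cell per (time, airport).
base_cells = build_cells(reports, lambda r: (r["time"], r["airport"]))

# Roll-up on Location: merge airports into states and re-aggregate the text measure.
state_cells = build_cells(reports, lambda r: AIRPORT_TO_STATE[r["airport"]])

for state, terms in sorted(state_cells.items()):
    print(state, terms.most_common(2))
```

Drill-down is the reverse move: going from a state cell back to the finer airport cells it was aggregated from.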
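Effective Keyword Search
TopCells (ICDE’10): Ranking aggregated cells (objects) in Text Cube.
[Figure: the query "Healthcare Reform" is submitted to the TopCells system, which returns ranked cells: (Person: Obama, Year: 2023), (Org: Congress, Year: 2023), (Person: Hillary, Year: 2023), ...]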
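Effective OLAP Exploration
TEXplorer (CIKM’11): Integrating keyword-based ranking and OLAP exploration
[Figure: the query "Healthcare Reform" is submitted to the TEXplorer system, which returns ranked dimensions: Top-1 dimension: Person; Top-2 dimension: Org; Top-3 dimension: Time (2023, 2023, 2023)]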
Effective Event Tracking
PET (KDD’10): tracking popularity and textual representation of events in social communities (Twitter)
[Figure: the Popular Event Tracking system follows "Healthcare Reform" over time (Feb 2023, Mar 2023, Apr 2023), reporting popularity and content; content keywords shown: "debate, cost, senate, ...", "pass, success, law, ...", "benefit, profit, effective, ..."]
Growing Parallel Paths (WWW 2023)
Result: [figure not preserved]
Mapping Pages to Records (CIKM’10)
Database records can be found on link paths!
WinaCS: Web Information Network Analysis for Computer Science
Integration of Web structure mining and information network analysis
Tim Weninger, Marina Danilevsky, et al., “WinaCS: Construction and Analysis of Web-Based Computer Science Information Networks", ACM SIGMOD'11 (system demo), Athens, Greece, June 2023.
Discovery of Swarms and Periodic Patterns in Moving Object Data
A system that mines moving object patterns:
- Z. Li, et al., “MoveMine: Mining Moving Object Databases", SIGMOD’10 (system demo)
- Z. Li, B. Ding, J. Han, and R. Kays, “Mining Hidden Periodic Behaviors for Moving Objects”, KDD’10 (sub)
- Z. Li, B. Ding, J. Han, and R. Kays, “Swarm: Mining Relaxed Temporal Moving Object Clusters”, VLDB’10 (sub)
[Figure: bird flying paths shown on Google Earth (left); periodic patterns mined by our new method (right)]
[Figure: Convoy discovers only restricted patterns (left); Swarm discovers more patterns (right)]

GeoTopic Discovery: Mining Spatial Text
[Figure: comparison of LDM, TDM, GeoFolk and LGTA on geo-tagged photos with landscape (coast vs. desert vs. mountain)]
Z. Yin, et al., Geo Topic Discovery and Comparison, WWW'11
Conclusions: Mining Big and Complex Data
- Mining big data: A critical part of big data initiatives
- Most data objects are interconnected, forming heterogeneous information networks
- Most datasets can be “organized” or “transformed” into “structured” multi-typed, heterogeneous