Part I Data Mining Fundamentals
Chapter 1: Data Mining: A First View
BUPT AI&DM, 2022/8/3

Content
1.1 What is Data Mining? Definition
1.2 What Can Computers Learn?
1.3 Is Data Mining Appropriate for My Problem?
1.4 Expert Systems or Data Mining?
1.6 Why Not Simple Search?

1.1 What is Data Mining: Motivation
Data explosion problem
- Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories.
- Such amounts of data are beyond human understanding.
- We are drowning in data, but starving for knowledge!
Solution: data warehousing and data mining
- Data warehousing: for data storage
- Data mining: for extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases

1.1 Data mining is a result of the natural evolution of information technology
- 1960s: data collection and database creation
- 1970s to early 1980s: database management systems
- Mid-1980s to present: data warehouses; data analysis and understanding (data mining)

Data Analysis: New Trend
This is a time when one must speak with data.
- The future belongs to the number crunchers (Super Crunchers, Ian Ayres, 2009): everyday decisions will become more and more automated, and the role of human judgment will be limited to supplying data for the computation.
- Predicting the taste and aroma of wine: Orley Ashenfelter, an economist at Princeton University, knows nothing about winemaking, yet he can predict the price of Bordeaux wines from the weather (hot, dry years give very good wine) more accurately than wine experts.
- The book was originally going to be called "The End of Theory". The title was later changed by testing on Google rather than by discussion with the publisher's editors, because the new title drew a 63% higher click-through rate.
- Loan officers used to be well paid and carried the greatest responsibility. Now they are merely call-center operators who repeat the questions prompted by the computer, for low pay.
- Gene sequencing and new species: Craig Venter, using high-speed computers capable of analyzing data, moved from sequencing the genes of single organisms to sequencing the ocean (from 2003) and the air (2005). In the process, thousands of previously unknown bacteria and other life forms were discovered. He advanced biology more than any of his contemporaries.
- Shanghai GM warranty analysis: in the past, Shanghai GM analyzed warranty problems mainly by simple, purely manual calculation, which produced only a handful of problem reports each time. Although vehicle production was far lower than today, this slow and laborious analysis cycle was a root cause of persistently high warranty costs. Without automation, it took 6 to 12 months on average to get from a warranty claim to the cause of the problem, often with support from GM's global organization, and the whole process rested mainly on experience-based analysis. In addition, inaccurate data made it hard for Shanghai GM to forecast warranty costs and to budget sensibly for the next warranty cycle, which tied up large amounts of working capital and reduced cash flow. After adopting the SAS warranty analysis solution, Shanghai GM shortened its warranty analysis cycle by 70% within the first 6 months, effectively lowering warranty costs and meeting the goals set for the system. These improvements also let Shanghai GM recover its entire hardware and software investment in the warranty analysis system within half a year, saving the company RMB 18 million in total.
- Police geographic information systems

Data Mining Definitions
(1) The process of employing one or more computer learning techniques to automatically analyze and extract knowledge from data. (in this textbook)
(2) Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases. (generally accepted)

Induction-based Learning
Data mining methods use induction-based learning: the process of forming general concept definitions by observing specific examples of the concepts to be learned.

What Is Data Mining?
Alternative names:
- Data mining or knowledge mining? Gold mining is a poor analogy.
- Knowledge discovery in databases (KDD), business intelligence

Why Data Mining? Potential Applications (or p4)
Database analysis and decision support
- Market analysis and management: target marketing, cross selling, market segmentation
- Risk analysis and management: forecasting, customer retention, quality control
- Fraud detection and management
Other applications
- Text mining (news groups, email, documents) and Web analysis

1.2 What Can Computers Learn?
- Four Levels of Learning (omitted)
- Three Concept Views (omitted)
- Supervised Learning
- Unsupervised Learning

1.2.1 Supervised Learning
- Build a learner model using data instances of known origin.
- Use the model to determine the outcome of new instances of unknown origin.
- Attributes: input attributes, output attributes
- Process: training data, test data
- Learning outcome: tree, production rules

Decision tree: a tree structure where non-terminal nodes represent tests on one or more attributes and terminal nodes (leaf nodes) reflect decision outcomes.
(Figure: example decision tree with the root node labeled.)

Production Rules
IF Swollen Glands = Yes THEN Diagnosis = Strep Throat
IF Swollen Glands = No & Fever = Yes THEN Diagnosis = Cold
IF Swollen Glands = No & Fever = No THEN Diagnosis = Allergy
Antecedent conditions: the preconditions (the IF part)
Consequent conditions: the conclusion (the THEN part)

1.2.2 Unsupervised Clustering
A data mining method that builds models from data without predefined classes.

The Acme Investors Dataset
(Table: the Acme Investors dataset.)

The Acme Investors Dataset & Supervised Learning
- Can I develop a general profile of an online investor?
- Can I determine if a new customer is likely to open a margin account?
- Can I build a model to accurately predict the average number of trades per month for a new investor?
- What characteristics differentiate female and male investors?

The Acme Investors Dataset & Unsupervised Clustering
- What attribute similarities group customers of Acme Investors together?
- What differences in attribute values segment the customer database?

IF Margin Account = Yes & Age = 20-29 & Annual Income = 40-59k
THEN Cluster = 1
accuracy = 0.80, coverage = 0.50

IF Account Type = Custodial & Favorite Recreation = Skiing & Annual Income = 80-90k
THEN Cluster = 2
accuracy = 0.95, coverage = 0.35

IF Account Type = Joint & Trades/Month > 5 & Transaction Method = Online
THEN Cluster = 3
accuracy = 0.82, coverage = 0.65

(see example clusters on p13)

1.3 Is Data Mining Appropriate for My Problem? (Data Mining vs Data Query)

Data Mining or Data Query?
- Shallow knowledge: shallow knowledge is factual. It can be easily stored and manipulated in a database.
- Multidimensional knowledge: multidimensional knowledge is also factual. On-line analytical processing (OLAP) tools are used to manipulate multidimensional knowledge.
- Hidden knowledge: hidden knowledge represents patterns or regularities in data that cannot be easily found using database query. However, data mining algorithms can find such patterns with ease (example p15).
- Deep knowledge: deep knowledge is knowledge stored in a database that can only be found if we are given some direction about what we are looking for.

Data Mining vs. Data Query: An Example (p16)
- Use data query if you already almost know what you are looking for.
- Use data mining to find regularities in data that are not obvious.
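
The contrast between a data query and data mining can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the investor records are hypothetical, the helper names `query` and `mine_rules` are invented, and the "mining" step is deliberately naive (it only tries single attribute = value tests). Accuracy and coverage are assumed to have their usual rule-based meaning: accuracy is the share of instances matching the rule's condition that belong to the class, and coverage is the share of the class that the condition matches.

```python
# Hypothetical investor records, loosely modeled on the Acme Investors attributes.
INVESTORS = [
    {"account_type": "joint", "method": "online", "margin": "yes"},
    {"account_type": "joint", "method": "online", "margin": "yes"},
    {"account_type": "joint", "method": "broker", "margin": "no"},
    {"account_type": "individual", "method": "online", "margin": "yes"},
    {"account_type": "individual", "method": "broker", "margin": "no"},
    {"account_type": "individual", "method": "broker", "margin": "no"},
    {"account_type": "custodial", "method": "broker", "margin": "no"},
    {"account_type": "custodial", "method": "online", "margin": "no"},
]


def query(rows, **conditions):
    """Data query: we already know which conditions to ask for."""
    return [r for r in rows if all(r[k] == v for k, v in conditions.items())]


def mine_rules(rows, target, target_value, min_coverage=0.5):
    """Naive mining: try every attribute=value test and rank the useful ones.

    Returns tuples (accuracy, coverage, attribute, value), best first.
    """
    in_class = [r for r in rows if r[target] == target_value]
    attrs = [a for a in rows[0] if a != target]
    rules = []
    for attr in attrs:
        for value in sorted({r[attr] for r in rows}):
            covered = [r for r in rows if r[attr] == value]
            hits = [r for r in covered if r[target] == target_value]
            accuracy = len(hits) / len(covered)   # correct among matched
            coverage = len(hits) / len(in_class)  # matched among the class
            if coverage >= min_coverage:
                rules.append((accuracy, coverage, attr, value))
    return sorted(rules, reverse=True)


# Query: the analyst already suspects which group matters.
joint_online = query(INVESTORS, account_type="joint", method="online")
print(len(joint_online), "joint online investors")

# Mining: the program proposes conditions associated with margin accounts.
for accuracy, coverage, attr, value in mine_rules(INVESTORS, "margin", "yes"):
    print(f"IF {attr} = {value} THEN margin = yes "
          f"(accuracy={accuracy:.2f}, coverage={coverage:.2f})")
```

The query returns exactly what was asked for and nothing else, while the mining function surfaces candidate regularities that the analyst did not have to name in advance.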

1.4 Expert Systems or Data Mining? (Data Mining vs ES)
Expert system (ES): a computer program that emulates the problem-solving skills of one or more human experts.
- Used when no (quality) data are available, or in fields where humans already have good knowledge.
- Experts learn their skills through education and experience.
- Human experts often use rules to describe what they know.
- ES and DM can work together.

1.6 Why Not Simple Search? (Data Mining vs Nearest Neighbor Approach)
Store instances, or a generalized model of the data?
Nearest Neighbor Classifier
- Classification is performed by searching the training data for the instance closest in distance to the unknown instance.
- Advantage: suitable for areas where humans have limited knowledge
- Problem: slow when the number of cases is large
Attribu
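
The nearest neighbor classifier described above can be sketched as follows. This is a minimal illustration, assuming numeric attributes and Euclidean distance; the stored instances (age, annual income in thousands) and the class labels "A" and "B" are hypothetical, and no attribute scaling is applied.

```python
import math

# Hypothetical stored instances: ((age, annual income in thousands), class label).
TRAINING = [
    ((25, 50), "A"),
    ((27, 45), "A"),
    ((52, 85), "B"),
    ((48, 90), "B"),
]


def distance(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbor(training, unknown):
    """Return the label of the stored instance closest to the unknown instance.

    No model is built: every classification scans all stored instances,
    so the cost of one prediction grows with the size of the training data.
    """
    _, label = min(training, key=lambda inst: distance(inst[0], unknown))
    return label


print(nearest_neighbor(TRAINING, (30, 55)))
print(nearest_neighbor(TRAINING, (50, 80)))
```

Because the search visits every stored instance for each unknown instance, the sketch also shows why the approach becomes slow when the number of cases is large.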