An Introduction to Data Mining
Discovering hidden value in your data warehouse

Overview

Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by retrospective tools typical of decision support systems. Data mining tools can answer business questions that traditionally were too time consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations.

Most companies already collect and refine massive quantities of data. Data mining techniques can be implemented rapidly on existing software and hardware platforms to enhance the value of existing information resources, and can be integrated with new products and systems as they are brought on-line. When implemented on high performance client/server or parallel processing computers, data mining tools can analyze massive databases to deliver answers to questions such as, "Which clients are most likely to respond to my next promotional mailing, and why?"

This white paper provides an introduction to the basic technologies of data mining. Examples of profitable applications illustrate its relevance to today's business environment, and a basic description shows how data warehouse architectures can evolve to deliver the value of data mining to end users.

The Foundations of Data Mining

Data mining techniques are the result of a long process of research and product development. This evolution began when business data was first stored on computers, continued with improvements in data access, and more recently, generated technologies that allow users to navigate through their data in real time. Data mining takes this evolutionary process beyond retrospective data access and navigation to prospective and proactive information delivery. Data mining is ready for application in the business community because it is supported by three technologies that are now sufficiently mature:

- Massive data collection
- Powerful multiprocessor computers
- Data mining algorithms

Commercial databases are growing at unprecedented rates. A recent META Group survey of data warehouse projects found that 19% of respondents are beyond the 50 gigabyte level, while 59% expect to be there by second quarter of 1996.[1] In some industries, such as retail, these numbers can be much larger. The accompanying need for improved computational engines can now be met in a cost-effective manner with parallel multiprocessor computer technology. Data mining algorithms embody techniques that have existed for at least 10 years, but have only recently been implemented as mature, reliable, understandable tools that consistently outperform older statistical methods.

In the evolution from business data to business information, each new step has built upon the previous one. For example, dynamic data access is critical for drill-through in data navigation applications, and the ability to store
large databases is critical to data mining. From the user's point of view, the four steps listed in Table 1 were revolutionary because they allowed new business questions to be answered accurately and quickly.

| Evolutionary Step | Business Question | Enabling Technologies | Product Providers | Characteristics |
|---|---|---|---|---|
| Data Collection (1960s) | "What was my total revenue in the last five years?" | Computers, tapes, disks | IBM, CDC | Retrospective, static data delivery |
| Data Access (1980s) | "What were unit sales in New England last March?" | Relational databases (RDBMS), Structured Query Language (SQL), ODBC | Oracle, Sybase, Informix, IBM, Microsoft | Retrospective, dynamic data delivery at record level |
| Data Warehousing & Decision Support (1990s) | "What were unit sales in New England last March? Drill down to Boston." | On-line analytic processing (OLAP), multidimensional databases, data warehouses | Pilot, Comshare, Arbor, Cognos, Microstrategy | Retrospective, dynamic data delivery at multiple levels |
| Data Mining (Emerging Today) | "What's likely to happen to Boston unit sales next month? Why?" | Advanced algorithms, multiprocessor computers, massive databases | Pilot, Lockheed, IBM, SGI, numerous startups (nascent industry) | Prospective, proactive information delivery |

Table 1. Steps in the Evolution of Data Mining.

The core components of data mining technology have been under development for decades, in research areas such as statistics, artificial intelligence, and machine learning. Today, the maturity of these techniques, coupled with high-performance relational database engines and broad data integration efforts, makes these technologies practical for current data warehouse environments.

The Scope of Data Mining

Data mining derives its name from the similarities between searching for valuable business information in a large database (for example, finding linked products in gigabytes of store scanner data) and mining a mountain for a vein of valuable ore. Both processes require either sifting through an immense amount of material, or intelligently probing it to find exactly where the value resides. Given databases of sufficient size and quality, data mining technology can generate new business opportunities by providing these capabilities:

- Automated prediction of trends and behaviors. Data mining automates the process of finding predictive information in large databases. Questions that traditionally required extensive hands-on analysis can now be answered directly from the data quickly. A typical example of a predictive problem is targeted marketing. Data mining uses data on past promotional mailings to identify the targets most likely to maximize return on investment in future mailings. Other predictive problems include forecasting bankruptcy and other forms of default, and identifying segments of a population likely to respond similarly to given events.
- Automated discovery of previously unknown patterns. Data mining tools sweep through databases and identify previously hidden patterns in one step. An example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products that are often purchased together. Other pattern discovery problems include detecting fraudulent credit card transactions and identifying anomalous data that could represent data entry keying errors.

Data mining techniques
can yield the benefits of automation on existing software and hardware platforms, and can be implemented on new systems as existing platforms are upgraded and new products developed. When data mining tools are implemented on high performance parallel processing systems, they can analyze massive databases in minutes. Faster processing means that users can automatically experiment with more models to understand complex data. High speed makes it practical for users to analyze huge quantities of data. Larger databases, in turn, yield improved predictions.

Databases can be larger in both depth and breadth:

- More columns. Analysts must often limit the number of variables they examine when doing hands-on analysis due to time constraints. Yet variables that are discarded because they seem unimportant may carry information about unknown patterns. High performance data mining allows users to explore the full depth of a database, without preselecting a subset of variables.
- More rows. Larger samples yield lower estimation errors and variance, and allow users to make inferences about small but important segments of a population.

A recent Gartner Group Advanced Technology Research Note listed data mining and artificial intelligence at the top of the five key technology areas that will clearly have a major impact across a wide range of industries within the next 3 to 5 years.[2] Gartner also listed parallel architectures and data mining as two of the top 10 new technologies in which companies will invest during the next 5 years. According to a recent Gartner HPC Research Note, "With the rapid advance in data capture, transmission and storage, large-systems users will increasingly need to implement new and innovative ways to mine the after-market value of their vast stores of detail data, employing MPP [massively parallel processing] systems to create new sources of business advantage (0.9 probability)."[3]

The most commonly used techniques in data mining are:

- Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
- Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID).
- Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of evolution.
- Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor technique.
- Rule induction: The extraction of useful if-then rules from data based on statistical significance.

Many of these technologies have been in use for more than a decade in specialized analysis tools
that work with relatively small volumes of data. These capabilities are now evolving to integrate directly with industry-standard data warehouse and OLAP platforms. The appendix to this white paper provides a glossary of data mining terms.

How Data Mining Works

How exactly is data mining able to tell you important things that you didn't know or what is going to happen next? The technique that is used to perform these feats in data mining is called modeling. Modeling is simply the act of building a model in one situation where you know the answer and then applying it to another situation that you don't. For instance, if you were looking for a sunken Spanish galleon on the high seas the first thing you might do is to research the times when Spanish treasure had been found by others in the past. You might note that these ships often tend to be found off the coast of Bermuda and that there are certain characteristics to the ocean currents, and certain routes that have likely been taken by the ships' captains in that era. You note these similarities and build a model that includes the characteristics that are common to the locations of these sunken treasures. With these models in hand you sail off looking for treasure where your model indicates it most likely might be given a similar situation in the past. Hopefully, if you've got a good model, you find your treasure.

This act of model building is thus something that people have been doing for a long time, certainly before the advent of computers or data mining technology. What happens on computers, however, is not much different than the way people build models. Computers are loaded up with lots of information about a variety of situations where an answer is known and then the data mining software on the computer must run through
that data and distill the characteristics of the data that should go into the model. Once the model is built it can then be used in similar situations where you don't know the answer. For example, say that you are the director of marketing for a telecommunications company and you'd like to acquire some new long distance phone customers. You could just randomly go out and mail coupons to the general population - just as you could randomly sail the seas looking for sunken treasure. In neither case would you achieve the results you desired and of course you have the opportunity to do much better than random - you could use your business experience stored in your database to build a model.

As the marketing director you have access to a lot of information about all of your customers: their age, sex, credit history and long distance calling usage. The good news is that you also have a lot of information about your prospective customers: their age, sex, credit history etc. Your problem is that you don't know the long distance calling usage of these prospects (since they are most likely now customers of your competition). You'd like to concentrate on those prospects who have large amounts of long distance usage. You can accomplish this by building a model. Table 2 illustrates the data used for building a model for new customer prospecting in a data warehouse.

| | Customers | Prospects |
|---|---|---|
| General information (e.g. demographic data) | Known | Known |
| Proprietary information (e.g. customer transactions) | Known | Target |

Table 2 - Data Mining for Prospecting

The goal in prospecting is to make some calculated guesses about the information in the lower right hand quadrant based on the model that we build going from Customer General Information to Customer Proprietary Information. For instance, a simple model for a telecommunications company might be:

98% of my customers who make more than $60,000/year spend more than $80/month on long distance

This model could then be applied to the prospect data to try to tell something about the proprietary information that this telecommunications company does not currently have access to. With this model in hand new customers can be selectively targeted.

Test marketing is an excellent source of data for this kind of modeling. Mining the results of a test market representing a broad but relatively small sample of prospects can provide a foundation for identifying good prospects in the overall market. Table 3 shows another common scenario for building models: predict what is going to happen in the future.

| | Yesterday | Today | Tomorrow |
|---|---|---|---|
| Static information and current plans (e.g. demographic data, marketing plans) | Known | Known | Known |
| Dynamic information (e.g. customer transactions) | Known | Known | Target |

Table 3 - Data Mining for Predictions

If someone told you that he had a model that could predict customer usage how would you know if he really had a good model? The first thing you might try would be to ask him to apply his model to your customer base - where you already knew the answer.
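
A check of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Customer` record, its field names, and the five sample customers are hypothetical, and the $60,000 income and $80/month spend cutoffs are borrowed from the example rule given earlier rather than from any real data.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    income: float         # annual income in dollars (hypothetical field)
    monthly_spend: float  # known long distance spend per month (hypothetical field)


def predicts_heavy_user(customer: Customer, income_cutoff: float = 60_000) -> bool:
    """The model under test: income above the cutoff implies heavy usage."""
    return customer.income > income_cutoff


def rule_confidence(customers, spend_cutoff: float = 80.0) -> float:
    """Share of customers flagged by the model whose known spend exceeds the cutoff."""
    flagged = [c for c in customers if predicts_heavy_user(c)]
    if not flagged:
        return 0.0  # the model flagged nobody, so there is nothing to confirm
    hits = sum(1 for c in flagged if c.monthly_spend > spend_cutoff)
    return hits / len(flagged)


# Made-up customer base where the answer (monthly spend) is already known.
known_customers = [
    Customer(75_000, 95.0),
    Customer(82_000, 120.0),
    Customer(64_000, 40.0),
    Customer(30_000, 15.0),
    Customer(45_000, 90.0),
]

confidence = rule_confidence(known_customers)
print(f"Rule held for {confidence:.0%} of flagged customers")
```

If the share measured on known customers falls well short of the share claimed for the model, that is a reason to doubt the model before applying it to prospects.
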
With data mining, the best way to accomplish this is by setting aside some of your data in a vault to isolate it from the mining process. Once the mining is complete, the results can be tested against the data held in the vault to confirm the model's validity. If the model works, its observations should hold for the vaulted data.

An Architecture for Data Mining

To best apply these advanced techniques, they must be fully integrated with a data warehouse as well as flexible interactive business analysis tools. Many data mining tools currently operate outside of the warehouse, requiring extra steps for extracting, importing, and analyzing the data. Furthermore, when new insights require operational implementation, integration with the warehouse simplifies the application of results from data mining. The resulting analytic data warehouse can be applied to improve business processes throughout
the organization, in areas such as promotional campaign management, fraud detection, new product rollout, and so on. Figure 1 illustrates an architecture for advanced analysis in a large data warehouse.

Figure 1 - Integrated Data Mining Architecture

The ideal starting point is a data warehouse containing a combination of internal data tracking all customer contact coupled with external market data about competitor activity. Background information on potential customers also provides an excellent basis for prospecting. This warehouse can be implemented in a variety of relational database systems: Sybase, Oracle, Redbrick, and so on, and should be optimized for flexible and fast data access.

An OLAP (On-Line Analytical Processing) server enables a more sophisticated end-user business model to be applied when navigating the data warehouse. The multidimensional structures allow the user to analyze the data as they want to view their business, summarizing by product line, region, and other key perspectives of their business. The Data Mining Server must be integrated with the data warehouse and the OLAP server to embed ROI-focused business analysis directly into this infrastructure. An advanced, process-centric metadata template defines the data mining objectives for specific business issues like campaign management, prospecting, and promotion optimization. Integration with the data warehouse enables operational decisions to be directly implemented and tracked. As the warehouse grows with new decisions and results, the organization can continually mine the best practices and apply them to future decisions.

This design represents a fundamental shift from conventional decision support systems. Rather than simply delivering data to the end user through query and reporting software, the Advanced Analysis Server applies users' business models directly to the warehouse and returns a proactive analysis of the most relevant information. These results enhance the metadata in the OLAP Server by providing a dynamic metadata layer that represents a distilled view of the data. Reporting, visualization, and other analysis tools can then be applied to plan future actions and confirm the impact of those plans.

Profitable Applications

A wide range of companies have deployed successful applications of data mining. While early adopters of this technology have tended to be in information-intensive industries such as financial services and direct mail marketing, the technology is applicable to any company looking to leverage a large data warehouse to better manage their customer relationships. Two critical factors for success with data mining are: a large, well-integrated data warehou