Data Mining: Concepts and Techniques
Slides for Textbook Chapter 3
Jiawei Han and Micheline Kamber
Department of Computer Science, University of Illinois at Urbana-Champaign

Chapter 3: Data Preprocessing
- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary

Why Data Preprocessing?
Data in the real world is dirty:
- incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  e.g., occupation = ""

- noisy: containing errors or outliers
  e.g., Salary = "-10"
- inconsistent: containing discrepancies in codes or names
  e.g., Age = "42" but Birthday = "03/07/1997"
  e.g., was rating "1, 2, 3", now rating "A, B, C"
  e.g., discrepancies between duplicate records

Why Is Data Dirty?
- Incomplete data comes from:
  - "n/a" data values when collected
  - different considerations between the time the data was collected and the time it is analyzed
  - human, hardware, or software problems

- Noisy data comes from the processes of data collection, entry, and transmission
- Inconsistent data comes from:
  - different data sources
  - functional dependency violations

Why Is Data Preprocessing Important?
- No quality data, no quality mining results!
- Quality decisions must be based on quality data

  - e.g., duplicate or missing data may cause incorrect or even misleading statistics
- A data warehouse needs consistent integration of quality data
- "Data extraction, cleaning, and transformation comprises the majority of the work of building a data warehouse." -- Bill Inmon

Multi-Dimensional Measure of Data Quality
A well-accepted multidimensional view:
- Accuracy
- Completeness
- Consistency
- Timeliness
- Believability
- Value added
- Interpretability
- Accessibility
Broad categories: intrinsic, contextual, representational, and accessibility.

Major Tasks in Data Preprocessing
- Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration: integration of multiple databases, data cubes, or files
- Data transformation: normalization and aggregation
- Data reduction: obtains a reduced representation, much smaller in volume, that produces the same or similar analytical results

- Data discretization: part of data reduction, with particular importance for numerical data

Forms of data preprocessing [figure]

Data Cleaning
- Importance:
  - "Data cleaning is one of the three biggest problems in data warehousing." -- Ralph Kimball

  - "Data cleaning is the number one problem in data warehousing." -- DCI survey
- Data cleaning tasks:
  - fill in missing values
  - identify outliers and smooth out noisy data
  - correct inconsistent data
  - resolve redundancy caused by data integration

Missing Data
- Data is not always available: e.g., many tuples have no recorded value for several attributes, such as customer income in sales data
- Missing data may be due to:
  - equipment malfunction
  - inconsistency with other recorded data, leading to deletion
  - data not entered due to misunderstanding

  - certain data not being considered important at the time of entry
  - failure to register history or changes of the data
- Missing data may need to be inferred.

How to Handle Missing Data?
- Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably

- Fill in the missing value manually: tedious + infeasible?
- Fill it in automatically with (sketched below):
  - a global constant, e.g., "unknown" (effectively a new class?!)
  - the attribute mean
  - the attribute mean for all samples belonging to the same class: smarter
  - the most probable value: inference-based, e.g., using a Bayesian formula or a decision tree
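
As a minimal pandas sketch of the mean-based strategies (the table and its cls/income columns are made up for illustration), contrasting the global attribute mean with the smarter per-class mean:

```python
import pandas as pd

# Toy sales data (hypothetical columns): a class label and an income
# attribute with missing values.
df = pd.DataFrame({
    "cls":    ["low", "low", "high", "high", "high"],
    "income": [30.0, None, 90.0, None, 110.0],
})

# Global attribute mean: every missing income gets the overall mean.
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Smarter: the attribute mean over samples belonging to the same class.
df["income_class_mean"] = df["income"].fillna(
    df.groupby("cls")["income"].transform("mean")
)
print(df)
```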

Noisy Data
- Noise: random error or variance in a measured variable
- Incorrect attribute values may be due to:
  - faulty data collection instruments
  - data entry problems
  - data transmission problems

  - technology limitations
  - inconsistency in naming conventions
- Other data problems that require data cleaning:
  - duplicate records
  - incomplete data
  - inconsistent data

How to Handle Noisy Data?
- Binning method: first sort the data and partition it into (equi-depth) bins; then smooth by bin means, bin medians, bin boundaries, etc.

- Clustering: detect and remove outliers
- Combined computer and human inspection: detect suspicious values and have a human check them (e.g., to deal with possible outliers)
- Regression: smooth by fitting the data to regression functions

Simple Discretization Methods: Binning
- Equal-width (distance) partitioning:
  - divides the range into N intervals of equal size: a uniform grid

  - if A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B - A)/N
  - the most straightforward approach, but outliers may dominate the presentation, and skewed data is not handled well
- Equal-depth (frequency) partitioning (see the sketch below):
  - divides the range into N intervals, each containing approximately the same number of samples
  - gives good data scaling
  - managing categorical attributes can be tricky
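
For illustration, a small sketch (the attribute values are made up) contrasting how the two schemes place their interval edges:

```python
import numpy as np

# Made-up attribute values; note the outlier at 215.
data = np.array([5, 7, 8, 9, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215])
N = 4

# Equal-width: N intervals of width W = (B - A) / N.
A, B = data.min(), data.max()
width_edges = A + (B - A) / N * np.arange(N + 1)

# Equal-depth: N intervals each holding ~len(data)/N samples.
depth_edges = np.quantile(data, np.linspace(0, 1, N + 1))

print(width_edges)  # the outlier stretches the uniform grid
print(depth_edges)
```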

Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34

* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
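
The example is easy to reproduce; a minimal NumPy sketch (variable names are ours) that regenerates both smoothed versions:

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equi-depth partitioning: 3 bins of 4 sorted values each.
bins = np.sort(prices).reshape(3, 4)

# Smoothing by bin means: replace every value with its bin's mean.
by_means = np.repeat(np.round(bins.mean(axis=1)), 4).reshape(3, 4)

# Smoothing by bin boundaries: snap each value to the nearest bin edge.
lo, hi = bins[:, :1], bins[:, -1:]
by_bounds = np.where(bins - lo <= hi - bins, lo, hi)

print(by_means)   # [[ 9  9  9  9] [23 23 23 23] [29 29 29 29]]
print(by_bounds)  # [[ 4  4  4 15] [21 21 25 25] [26 26 26 34]]
```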

Cluster Analysis [figure]

Regression [figure: data points (x, y) fitted by the line y = x + 1]

Data Integration
- Data integration: combines data from multiple sources into a coherent store
- Schema integration: integrate metadata from different sources
- Entity identification problem: identify real-world entities across multiple data sources, e.g., A.cust-id = B.cust-#
- Detecting and resolving data value conflicts:

  - for the same real-world entity, attribute values from different sources differ
  - possible reasons: different representations, different scales, e.g., metric vs. British units

Handling Redundancy in Data Integration
- Redundant data occurs often when integrating multiple databases:

  - the same attribute may have different names in different databases
  - one attribute may be a "derived" attribute in another table, e.g., annual revenue
- Redundant data may be detectable by correlation analysis

- Careful integration of data from multiple sources may help reduce or avoid redundancies and inconsistencies, and improve mining speed and quality

Data Transformation
- Smoothing: remove noise from the data
- Aggregation: summarization, data cube construction

- Generalization: concept hierarchy climbing
- Normalization: scale values to fall within a small, specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling
- Attribute/feature construction: new attributes constructed from the given ones

Data Transformation: Normalization
- min-max normalization:
  v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
- z-score normalization:
  v' = (v - mean_A) / stand_dev_A
- normalization by decimal scaling:
  v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
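
A short sketch of the three normalizations on made-up values; the loop finds the smallest j with max(|v'|) < 1:

```python
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization onto the new range [0, 1].
new_min, new_max = 0.0, 1.0
minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# Z-score normalization: subtract the mean, divide by the std deviation.
zscore = (v - v.mean()) / v.std()

# Decimal scaling: divide by 10^j, with j the smallest integer
# such that max(|v'|) < 1.
j = 0
while np.abs(v).max() / 10**j >= 1:
    j += 1
decimal = v / 10**j

print(minmax, zscore, decimal, sep="\n")
```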

Example of Decision Tree Induction (attribute subset selection)
- Initial attribute set: {A1, A2, A3, A4, A5, A6}
- Reduced attribute set: {A1, A4, A6}

Heuristic Feature Selection Methods
- There are 2^d possible sub-features of d features
- Several heuristic feature selection methods:
  - best single features under the feature independence assumption: choose by significance tests
  - best step-wise feature selection: the best single feature is picked first, then the next best feature conditioned on the first, and so on
  - step-wise feature elimination: repeatedly eliminate the worst feature

  - best combined feature selection and elimination
  - optimal branch and bound: use feature elimination and backtracking

Data Compression
- String compression:
  - extensive theories and well-tuned algorithms exist
  - typically lossless
  - but only limited manipulation is possible without expansion

- Audio/video compression:
  - typically lossy compression, with progressive refinement
  - sometimes small fragments of the signal can be reconstructed without reconstructing the whole
- Time sequences are not audio: they are typically short and vary slowly with time

Data Compression [figure: original data reduced losslessly vs. approximated (lossy)]

Wavelet Transformation
- Discrete wavelet transform (DWT): linear signal processing, multiresolution analysis

- Compressed approximation: store only a small fraction of the strongest wavelet coefficients
- Similar to the discrete Fourier transform (DFT), but gives better lossy compression and is localized in space
- Method:
  - the length L must be an integer power of 2 (pad with 0s when necessary)
  - each transform has two functions: smoothing and difference

  - apply the two functions to pairs of data, yielding two data sets of length L/2
  - apply the two functions recursively until the desired length is reached
- Example wavelet families: Haar-2, Daubechies-4

DWT for Image Compression [figure: an image recursively split into low-pass and high-pass sub-bands]
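
A minimal sketch of the Haar-2 case under the assumptions above (power-of-2 length; the function name is ours):

```python
import numpy as np

def haar_dwt(signal):
    """Full Haar-2 decomposition: a smoothing (average) and a difference
    function are applied to pairs, then recursively to the smoothed half."""
    data = np.asarray(signal, dtype=float)   # length must be a power of 2
    details = []
    while len(data) > 1:
        pairs = data.reshape(-1, 2)
        smooth = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)   # low-pass
        diff   = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2)   # high-pass
        details.append(diff)
        data = smooth
    return data, details[::-1]   # overall average + coefficients per level

approx, coeffs = haar_dwt([2, 2, 0, 2, 3, 5, 4, 4])
# Keeping only the strongest coefficients gives the compressed approximation.
```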

Principal Component Analysis
- Given N data vectors in k dimensions, find c <= k orthogonal vectors that can best be used to represent the data
- The original data set is reduced to one consisting of N data vectors on c principal components (reduced dimensions)

- Each data vector is a linear combination of the c principal component vectors
- Works for numeric data only
- Used when the number of dimensions is large

Principal Component Analysis [figure: axes X1, X2 with principal directions Y1, Y2]
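
A compact eigen-decomposition sketch of the idea on synthetic data (names are ours, not a fixed API):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))     # N = 100 data vectors, k = 5 dimensions
c = 2                             # keep c <= k principal components

Xc = X - X.mean(axis=0)                    # center the data
cov = np.cov(Xc, rowvar=False)             # k x k covariance matrix
vals, vecs = np.linalg.eigh(cov)           # orthogonal eigenvectors
W = vecs[:, np.argsort(vals)[::-1][:c]]    # the c strongest components

reduced = Xc @ W           # N vectors expressed on c principal components
approx  = reduced @ W.T    # each row: linear combination of the c vectors
```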

Numerosity Reduction
- Parametric methods:
  - assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
  - log-linear models: obtain the value at a point in m-dimensional space as a product over appropriate marginal subspaces
- Non-parametric methods:

  - do not assume models
  - major families: histograms, clustering, sampling

Regression and Log-Linear Models
- Linear regression: data are modeled to fit a straight line; often uses the least-squares method to fit the line
- Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector

- Log-linear model: approximates discrete multidimensional probability distributions

Regression Analysis and Log-Linear Models
- Linear regression: Y = α + βX
  - the two parameters α and β specify the line and are estimated from the data at hand, by applying the least-squares criterion to the known values of Y1, Y2, ... and X1, X2, ...

- Multiple regression: Y = b0 + b1·X1 + b2·X2
  - many nonlinear functions can be transformed into the above
- Log-linear models:
  - the multi-way table of joint probabilities is approximated by a product of lower-order tables
  - probability: p(a, b, c, d) = αab · βac · χad · δbcd
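
For the simple linear case, a sketch of the least-squares estimates on made-up sample values:

```python
import numpy as np

# Known values of X and Y; only alpha and beta need to be stored afterwards.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

# Least-squares estimates for Y = alpha + beta * X.
beta = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean()) ** 2).sum()
alpha = Y.mean() - beta * X.mean()

print(f"Y = {alpha:.2f} + {beta:.2f} X")
```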

Histograms
- A popular data reduction technique
- Divide the data into buckets and store the average (or sum) for each bucket
- Can be constructed optimally in one dimension using dynamic programming
- Related to quantization problems
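
A minimal bucket sketch with NumPy (synthetic data): only the edges and one aggregate per bucket are kept.

```python
import numpy as np

data = np.random.default_rng(0).exponential(scale=100.0, size=1000)

# Ten equi-width buckets: store the edges plus an aggregate per bucket.
counts, edges = np.histogram(data, bins=10)
sums, _ = np.histogram(data, bins=edges, weights=data)
bucket_means = sums / np.maximum(counts, 1)   # average per bucket
```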

Clustering
- Partition the data set into clusters, and store only the cluster representation
- Can be very effective if the data is clustered, but not if the data is "smeared"
- Hierarchical clustering is possible, with the result stored in multi-dimensional index tree structures

- There are many choices of clustering definitions and clustering algorithms; further detailed in Chapter 8

Sampling
- Allows a mining algorithm to run with complexity potentially sub-linear in the size of the data
- Choose a representative subset of the data:
  - simple random sampling may perform very poorly in the presence of skew

- Develop adaptive sampling methods:
  - stratified sampling (sketched below): approximate the percentage of each class (or subpopulation of interest) in the overall database; used in conjunction with skewed data
- Sampling may not reduce database I/Os (a page is fetched at a time)
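
A pandas sketch of the three schemes on a deliberately skewed, made-up table:

```python
import pandas as pd

# Skewed data: 90 tuples of class "a", 10 of class "b".
df = pd.DataFrame({"cls": ["a"] * 90 + ["b"] * 10, "value": range(100)})

srswor = df.sample(n=10, replace=False, random_state=0)   # SRSWOR
srswr  = df.sample(n=10, replace=True,  random_state=0)   # SRSWR

# Stratified sample: preserve each class's share (here 9 "a" and 1 "b").
stratified = df.groupby("cls", group_keys=False).apply(
    lambda g: g.sample(frac=0.1, random_state=0)
)
```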

Sampling [figure: raw data drawn by SRSWOR (simple random sampling without replacement) and SRSWR (with replacement)]

Sampling [figure: raw data vs. a cluster/stratified sample]

Hierarchical Reduction
- Use a multi-resolution structure with different degrees of reduction
- Hierarchical clustering is often performed, but it tends to define partitions of data sets rather than "clusters"
- Parametric methods are usually not amenable to hierarchical representation
- Hierarchical aggregation:

  - an index tree hierarchically divides a data set into partitions by the value ranges of some attributes
  - each partition can be considered a bucket
  - thus an index tree with aggregates stored at each node is a hierarchical histogram

Discretization
- Three types of attributes:
  - nominal: values from an unordered set
  - ordinal: values from an ordered set
  - continuous: real numbers

- Discretization: divide the range of a continuous attribute into intervals
  - some classification algorithms only accept categorical attributes
  - reduces data size
  - prepares for further analysis

Discretization and Concept Hierarchy
- Discretization: reduce the number of values for a given continuous attribute by dividing its range into intervals; interval labels can then replace actual data values
- Concept hierarchies: reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior)

Discretization and Concept Hierarchy Generation for Numeric Data
- Binning (see the earlier sections)
- Histogram analysis (see the earlier sections)
- Clustering analysis (see the earlier sections)
- Entropy-based discretization
- Segmentation by natural partitioning

Entropy-Based Discretization
- Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is
  E(S, T) = (|S1| / |S|) · Ent(S1) + (|S2| / |S|) · Ent(S2)

- The boundary that minimizes the entropy function over all possible boundaries is selected as the binary discretization
- The process is applied recursively to the partitions obtained, until some stopping criterion is met, e.g., the information gain Ent(S) - E(S, T) falls below a threshold δ
- Experiments show that it may reduce data size and improve classification accuracy
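
A brute-force sketch of the boundary search on toy data (function names are ours):

```python
import numpy as np

def ent(labels):
    """Shannon entropy Ent(S) of a class-label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def best_split(values, labels):
    """Boundary T minimizing E(S, T) = |S1|/|S| Ent(S1) + |S2|/|S| Ent(S2)."""
    order = np.argsort(values)
    values, labels = values[order], labels[order]
    n = len(values)
    best_t, best_e = None, np.inf
    for i in range(1, n):
        t = (values[i - 1] + values[i]) / 2          # candidate boundary
        e = (i * ent(labels[:i]) + (n - i) * ent(labels[i:])) / n
        if e < best_e:
            best_t, best_e = t, e
    return best_t, best_e

vals = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
labs = np.array(["a", "a", "a", "b", "b", "b"])
print(best_split(vals, labs))   # (6.5, 0.0): a clean binary discretization
```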

Segmentation by Natural Partitioning
- A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, "natural" intervals:
  - if an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals

  - if it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals
  - if it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals
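
The rule itself reduces to a small lookup; a sketch (the function name is hypothetical):

```python
def partitions_345(distinct_msd_values: int) -> int:
    """Number of equi-width intervals the 3-4-5 rule prescribes for a range
    covering this many distinct values at the most significant digit."""
    if distinct_msd_values in (3, 6, 7, 9):
        return 3
    if distinct_msd_values in (2, 4, 8):
        return 4
    if distinct_msd_values in (1, 5, 10):
        return 5
    raise ValueError("count not covered by the 3-4-5 rule")

# (-$1,000, $2,000) covers 3 distinct values at msd = 1,000 -> 3 intervals,
# as in the worked example below.
print(partitions_345(3))
```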

Example of the 3-4-5 Rule
- Step 1: profits range from Min = -$351 to Max = $4,700, with Low (5th percentile) = -$159 and High (95th percentile) = $1,838
- Step 2: msd = 1,000, so round to Low' = -$1,000 and High' = $2,000
- Step 3: (-$1,000 - $2,000) covers 3 distinct values at the msd, so partition into 3 equi-width intervals: (-$1,000 - 0), (0 - $1,000), ($1,000 - $2,000)
- Step 4: adjust the boundary intervals to cover Min and Max, giving (-$400 - 0), (0 - $1,000), ($1,000 - $2,000), ($2,000 - $5,000); each is then refined by the same rule: (-$400 - 0) into 4 sub-intervals of width $100, (0 - $1,000) into 5 of width $200, ($1,000 - $2,000) into 5 of width $200, and ($2,000 - $5,000) into 3 of width $1,000

Concept Hierarchy Generation for Categorical Data
- Specification of a partial ordering of attributes explicitly at the schema level by users or experts, e.g., street < city < state < country
- Specification of a portion of a hierarchy by explicit data grouping, e.g., {Urbana, Champaign, Chicago} < Illinois

- Specification of a set of attributes: the system automatically generates a partial ordering by analyzing the number of distinct values, e.g., street < city < state < country
- Specification of only a partial set of attributes, e.g., only street < city, and no others

Automatic Concept Hierarchy Generation
- Some concept hierarchies can be generated automatically based on an analysis of the number of distinct values per attribute in the given data set
- The attribute with the most distinct values is placed at the lowest level of the hierarchy
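
A sketch of the distinct-value heuristic on a made-up location table (fewer distinct values means a higher level):

```python
import pandas as pd

# Hypothetical location data.
df = pd.DataFrame({
    "country": ["USA"] * 6,
    "state":   ["IL", "IL", "IL", "NY", "NY", "CA"],
    "city":    ["Urbana", "Champaign", "Chicago", "NYC", "NYC", "LA"],
    "street":  [f"Street {i}" for i in range(6)],
})

# Sort attributes by distinct-value count; the most distinct attribute
# lands at the lowest level of the hierarchy.
order = df.nunique().sort_values().index.tolist()
print(" < ".join(reversed(order)))   # street < city < state < country
```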
