Lecture 5: Strategies for handling missing values (course slides)
Outline of the problem
- Missing values in longitudinal trials are a big issue.
- The first aim should be to reduce their proportion.
- Ethics dictate that missing values cannot be avoided entirely.
- There is no magic method to fix the problem.
- The magnitude of the problem varies across areas:
  - 8-week depression trial: 25% to 50% may drop out by the final visit
  - 12-week asthma trial: maybe only 5% to 10%

Outline of the lecture
- Part I: Missing data
- Part II: Multiple imputation

Example: The analgesic trial
[Figure slides; content not recovered.]

Part I: Missing data
In real datasets, such as surveys and clinical trials, it is quite common to have observations with missing values for one or more input features. The first issue in dealing with the problem is determining whether the missing data mechanism has distorted the observed data.

Little and Rubin (1987) and Rubin (1987) distinguish between three basic missing data mechanisms. Data are said to be missing at random (MAR) if, given the observed values, the mechanism resulting in their omission is independent of their (unobserved) values. If the omission is also independent of the observed values, the missingness process is said to be missing completely at random (MCAR). In any other case the process is missing not at random (MNAR), i.e., the missingness process depends on the unobserved values.

http://www.emea.europa.eu/pdfs/human/ewp/177699EN.pdf

Part I: 1. Introduction to missing data
[Figure: data matrix of variables by cases; "?" = missing.]

What is missing data?
The missingness hides a real value that is useful for analysis purposes.
Survey questions:
- What is your total annual income for FY 2008?
- Who are you voting for in the 2009 election for the European Parliament?

Clinical trials:
[Figure: time axis from Start to Finish, with a subject censored at an intermediate point in time.]

Missingness
It matters why data are missing. Suppose you are modelling weight (Y) as a function of sex (X). Some respondents wouldn't disclose their weight, so you are missing some values for Y. There are three possible mechanisms for the nondisclosure:
- There may be no particular reason why some respondents told you their weights and others didn't. That is, the probability that Y is missing has no relationship to X or Y. In this case the data are missing completely at random.
- One sex may be less likely to disclose its weight. That is, the probability that Y is missing depends only on the value of X. Such data are missing at random.
- Heavy (or light) people may be less likely to disclose their weight. That is, the probability that Y is missing depends on the unobserved value of Y itself. Such data are not missing at random.

Missing data patterns & mechanisms
- Pattern: Which values are missing?
- Mechanism: Is missingness related to the response?

Complete data: $(Y_i, R_i)$
- $Y = (Y_{ij})$ = data matrix
- $R = (R_{ij})$ = missing data indicator matrix, with $R_{ij} = 1$ if $Y_{ij}$ is missing and $R_{ij} = 0$ if $Y_{ij}$ is observed
- $Y_{obs}$ = observed part of $Y$
- $Y_{mis}$ = missing part of $Y$

"Pattern" concerns the distribution of $R$. "Mechanism" concerns the distribution of $R$ given $Y$.
Rubin (Biometrika 1976) distinguishes between:
- Missing Completely at Random (MCAR): $P(R \mid Y) = P(R)$ for all $Y$
- Missing at Random (MAR): $P(R \mid Y) = P(R \mid Y_{obs})$ for all $Y_{mis}$
- Not Missing at Random (NMAR): $P(R \mid Y)$ depends on $Y_{mis}$

Missing At Random (MAR)
What are the most general conditions under which a valid analysis can be done using only the observed data, and no information about the missing value mechanism? The answer is: when, given the observed data, the missingness mechanism does not depend on the unobserved data. Mathematically, $P(R \mid Y_{obs}, Y_{mis}) = P(R \mid Y_{obs})$. This is termed Missing At Random, and is equivalent to saying that two units who share observed values have the same statistical behaviour on the other observations, whether observed or not.

Example
[Example table of units and variables not recovered.]
As units 1 and 2 have the same values where both are observed, given these observed values, under MAR, variables 3, 5 and 6 from unit 2 have the same distribution (NB not the same value!) as variables 3, 5 and 6 from unit 1.
Note that under MAR the probability of a value being missing will generally depend on observed values, so it does not correspond to the intuitive notion of "random". The important idea is that the missing value mechanism can be expressed solely in terms of values that are observed. Unfortunately, this can rarely be definitively determined from the data at hand!

If data are MCAR or MAR, you can ignore the missing data mechanism and use multiple imputation and maximum likelihood. If data are NMAR, you cannot ignore the missing data mechanism.
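
The weight and sex example can be made concrete with a small simulation. The sketch below is illustrative only: the function names, the 0/1 coding of sex, the effect sizes and the missingness probabilities are all assumptions chosen for the demonstration, not values from the lecture. It compares the mean of the disclosed weights with the mean of all weights, overall and within each sex, under the three mechanisms.

```python
import numpy as np


def simulate_population(n=20000, seed=0):
    """Simulate sex (X) and weight in pounds (Y); all numbers are illustrative."""
    rng = np.random.default_rng(seed)
    sex = rng.integers(0, 2, size=n)                     # 0/1 coding is arbitrary
    weight = 140 + 40 * sex + rng.normal(0, 20, size=n)  # group 1 is heavier on average
    return rng, sex, weight


def missing_mask(rng, sex, weight, mechanism):
    """Return R, where True means the weight was not disclosed."""
    if mechanism == "MCAR":      # P(R | X, Y) is constant
        p = np.full(weight.shape, 0.3)
    elif mechanism == "MAR":     # P(R | X, Y) depends on the observed X only
        p = np.where(sex == 1, 0.5, 0.1)
    elif mechanism == "NMAR":    # P(R | X, Y) depends on the unobserved Y itself
        p = 1.0 / (1.0 + np.exp(-(weight - 160) / 10))
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    return rng.random(weight.shape) < p


def complete_case_bias(mechanism, seed=0):
    """Bias of the observed-data mean of Y: overall and within each sex."""
    rng, sex, weight = simulate_population(seed=seed)
    observed = ~missing_mask(rng, sex, weight, mechanism)
    overall = weight[observed].mean() - weight.mean()
    within = [
        weight[observed & (sex == s)].mean() - weight[sex == s].mean()
        for s in (0, 1)
    ]
    return overall, within


for mech in ("MCAR", "MAR", "NMAR"):
    overall, within = complete_case_bias(mech)
    print(f"{mech}: overall bias {overall:+.1f} lb, within-sex bias "
          f"{within[0]:+.1f} / {within[1]:+.1f} lb")
```

Under these assumptions, the MAR case should show a biased overall mean but essentially unbiased means within each sex, which is the sense in which conditioning on the observed X makes the mechanism ignorable. Under NMAR the bias remains even within each sex.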

Two approaches to NMAR data are selection models and pattern mixture.

Suppose Y is weight in pounds; if someone has a heavy weight, they may be less inclined to report it. So the value of Y affects whether Y is missing; the data are NMAR.

Selection models. In a selection model, you simultaneously model Y and the probability that Y is missing. Unfortunately, a number of practical difficulties are often encountered in estimating selection models.

Pattern mixture (Rubin 1987). When data are NMAR, an alternative to selection models is multiple imputation with pattern mixture. In this approach, you perform multiple imputations under a variety of assumptions about the missing data mechanism. In ordinary multiple imputation, you assume that those people who report their weights are similar to those who don't.
In a pattern-mixture model, you may assume that people who don't report their weights are on average 20 pounds heavier. This is of course an arbitrary assumption; the idea of pattern mixture is to try out a variety of plausible assumptions and see how much they affect your results. Pattern mixture is a more natural, flexible, and interpretable approach.

Simple analysis strategies
(1) Complete Case (CC) analysis
When some variables are not observed for some of the units, one can omit these units from the analysis. These so-called "complete cases" are then analyzed as they are.
[Figure: data matrix in which the complete cases are kept and cases containing "?" are discarded.]
Advantages:
- Easy
- Does not invent data
Disadvantages:
- Inefficient
- Discarding data is bad
- CC are often biased samples

Analysis strategies
(2) Analyze as incomplete (summary measures, GEE, etc.)
Advantages:
- Does not invent data
Disadvantages:
- Restricted in what you can infer
- Maximum likelihood methods may be computationally intensive or not feasible for certain types of models.

Analysis strategies
(3) Analysis after single imputation
[Figure: data matrix in which "?" = imputation.]
Advantages:
- Rectangular file
- Good for multiple users
Disadvantages:
- Naive imputations are not good
- Invents data: inference is distorted by treating imputations as the truth

Simple methods of analysis of incomplete data
- CC (complete cases)
- LOCF (last observation carried forward)

Various strategies
[Slide content not recovered.]
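
The two simple methods named above, CC and LOCF, can be sketched on a toy longitudinal data set. This is a minimal illustration: the function names are hypothetical and the scores are invented, with NaN marking visits after a subject has dropped out.

```python
import numpy as np


def complete_cases(y):
    """Keep only the rows (subjects) with no missing visit."""
    y = np.asarray(y, dtype=float)
    return y[~np.isnan(y).any(axis=1)]


def locf(y):
    """Replace each missing visit by the subject's last observed value."""
    filled = np.array(y, dtype=float)          # copy, so the input is untouched
    for visit in range(1, filled.shape[1]):
        gap = np.isnan(filled[:, visit])
        filled[gap, visit] = filled[gap, visit - 1]
    return filled


# Rows are subjects, columns are visits; NaN marks visits after drop-out.
# The scores are invented for illustration only.
scores = np.array([
    [20.0, 18.0, 15.0, 12.0],
    [22.0, 21.0, np.nan, np.nan],
    [19.0, 17.0, 16.0, np.nan],
    [25.0, 20.0, 18.0, 14.0],
])

cc_mean = complete_cases(scores)[:, -1].mean()
locf_mean = locf(scores)[:, -1].mean()
print("final-visit mean, CC:", cc_mean, " LOCF:", locf_mean)
```

In this toy data the scores fall over time, so carrying the last value forward freezes drop-outs at a higher level than they would probably have reached. The two methods therefore give different final-visit means, and neither is guaranteed to be right.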

Notation
[Figure slide labelled "DROPOUT"; content not recovered.]

Ignorability
In a likelihood setting the term ignorable is often used to refer to a MAR mechanism. It is the mechanism which is ignorable, not the missing data!

Direct likelihood maximisation
[Slide content not recovered.]

Example 1: Growth data
[Figure slides; content not recovered.]

Example: The depression trial
Patients are evaluated both pretreatment and posttreatment with the 17-item Hamilton Rating Scale for Depression (Ham-D-17).
[Further depression trial slides not recovered.]

Part II: Multiple imputation
[Figure: data set with missing values -> completed sets -> result.]

General principles, Informal justification, The algorithm
[Slide content not recovered.]

Pooling information, Hypothesis testing
[Slide content not recovered.]

MI in practice
A simulation-based approach to missing data:
1. Generate M > 1 plausible versions of the missing data.
2. Analyze each of the M datasets by standard complete-data methods.
3. Combine the results across the M datasets (M = 3 to 5 is usually OK).
[Figure: completed data matrix; "?" = imputation for the m-th dataset.]

MI in practice. Step 1
Generate M > 1 plausible versions of the missing data via software, i.e. obtain M different datasets.
An assumption we make: the data are MCAR or MAR, i.e. the missing data mechanism is ignorable.
One should use as much information as is available in order to achieve the best imputation.
If the percentage of missing data is high, we need to increase M.

How many datasets to create?
The efficiency of an estimator based on M imputations is $(1 + \gamma/M)^{-1}$, where $\gamma$ is the fraction of missing information.

Efficiency of multiple imputation (%)

          gamma
   M    0.1   0.3   0.5   0.7   0.9
   3     97    91    86    81    77
   5     98    94    91    88    85
  10     99    97    95    93    92
  20    100    99    98    97    96

MI in practice. Step 2
Analyze each of the M datasets by standard complete-data methods.
Let $b$ be the parameter of interest.
$\hat{b}_m$ is the estimate of $b$ from the complete-data analysis of the m-th dataset ($m = 1, \dots, M$).
$U_m$ is the variance of $\hat{b}_m$ from the analysis of the m-th dataset.

MI in practice. Step 3
Combine the results across the M datasets.
$\bar{b} = \frac{1}{M} \sum_{m=1}^{M} \hat{b}_m$ is the combined inference for $b$.
The variance for $\bar{b}$ is
$$T = \underbrace{\frac{1}{M} \sum_{m=1}^{M} U_m}_{\text{within}} + \underbrace{\left(1 + \frac{1}{M}\right) \frac{1}{M-1} \sum_{m=1}^{M} (\hat{b}_m - \bar{b})^2}_{\text{between}}$$

Software
1. Joe Schafer's software from his web site ($0)
/%7Ejls/misoftwa.html
Schafer has written publicly available software primarily for S-plus. There is a stand-alone Windows package for data that are multivariate normal. This web site contains much useful information regarding multiple imputation.

2. SAS software (experimental)
It is part of SAS/STAT version 8.02.
- SAS Institute paper on multiple imputation, which gives an example and SAS code: /rnd/app/papers/multipleimputation.pdf
- SAS documentation on PROC MI: /rnd/app/papers/miv802.pdf
- SAS documentation on PROC MIANALYZE: /rnd/app/papers/mianalyzev802.pdf

3. SOLAS version 3.0 ($1K)
http://www.statsol.ie/index.php?pageID=5
Windows-based software that performs different types of imputation:
- Hot-deck imputation
- Predictive OLS/discriminant regression
- Nonparametric based on propensity scores
- Last value carried forward
It will also combine parameter results across the M analyses.

MI Analysis of the Orthodontic Growth Data
[Slide content not recovered.]

Properties of methods
MCAR: drop-out independent of response
- CC is valid, though it ignores information
- LOCF is valid if there are no trends with time
MAR: drop-out depends only on observations
- CC, LOCF, GEE invalid
- MI, MNLM, weighted GEE valid
MNAR: drop-out depends also on unobserved values
- CC, LOCF, GEE, MI, MNLM invalid
- SM (selection models), PMM (pattern-mixture models) valid if (uncheckable) assumptions are true

References
- Allison, P. (2002). Missing data. Thousand Oaks, CA: Sage greenback.
- Horton, N.J. & Lipsitz, S.R. (2001). Multiple imputation in practice: Comparison of software packages for regression models with missing variables. The American Statistician 55(3): 244-254.
- Little, R.J.A. (1992). Regression with missing X's: A review. Journal of the American Statistical Association 87(420): 1227-1237.
- Little, R.J.A. & Rubin, D.B. (2002). Statistical Analysis with Missing Data, 2nd edition.
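
Worked sketch for "MI in practice". The three steps (impute, analyze, combine) and the efficiency formula can be tried out in a few lines. Everything below is an assumption made for illustration rather than the lecture's own algorithm or any package's API: the function names are hypothetical, the imputation model is a simple normal linear regression of y on a fully observed x with bootstrapped parameters, the parameter of interest is the mean of y, and the data are simulated.

```python
import numpy as np


def efficiency(gamma, m):
    """Relative efficiency (1 + gamma/M)^-1 of an estimate based on M imputations."""
    return 1.0 / (1.0 + gamma / m)


def impute_once(x, y, rng):
    """Step 1 (one draw): stochastic regression imputation of y given x.

    Assumption for this sketch: y is normal-linear in x and missing at random
    given x. Bootstrapping the observed cases lets the imputations reflect
    uncertainty about the regression parameters.
    """
    obs = ~np.isnan(y)
    idx = rng.choice(np.flatnonzero(obs), size=int(obs.sum()), replace=True)
    slope, intercept = np.polyfit(x[idx], y[idx], 1)
    resid_sd = np.std(y[idx] - (intercept + slope * x[idx]), ddof=2)
    completed = y.copy()
    completed[~obs] = (intercept + slope * x[~obs]
                       + rng.normal(0.0, resid_sd, size=int((~obs).sum())))
    return completed


def analyze(y_completed):
    """Step 2: complete-data estimate of the mean of y and its variance."""
    n = y_completed.size
    return y_completed.mean(), y_completed.var(ddof=1) / n


def pool(estimates, variances):
    """Step 3: Rubin's rules. Returns (pooled estimate, within, between, total)."""
    estimates = np.asarray(estimates, dtype=float)
    m = estimates.size
    within = float(np.mean(variances))
    between = float(estimates.var(ddof=1))
    total = within + (1.0 + 1.0 / m) * between
    return float(estimates.mean()), within, between, total


def multiple_imputation(x, y, m=5, seed=0):
    """Run steps 1 to 3 with M = m imputations."""
    rng = np.random.default_rng(seed)
    results = [analyze(impute_once(x, y, rng)) for _ in range(m)]
    estimates, variances = zip(*results)
    return pool(estimates, variances)


# Simulated data: the true mean of y is 2, and y is more often missing
# when x is large, so the mechanism is MAR given x.
rng = np.random.default_rng(1)
x = rng.normal(size=2000)
y = 2.0 + x + rng.normal(size=2000)
y[rng.random(2000) < 1.0 / (1.0 + np.exp(-x))] = np.nan

b_bar, within, between, total = multiple_imputation(x, y, m=5)
print("complete-case mean:", np.nanmean(y))
print("MI pooled estimate:", b_bar, " total variance:", total)
print("efficiency at gamma=0.5, M=5:", efficiency(0.5, 5))
```

Because missingness here depends on x, the complete-case mean should fall below the true value, while the pooled MI estimate uses x to correct for this. The total variance exceeds the within-imputation variance by the between-imputation term, which is how the uncertainty due to the missing values enters the inference.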