False Discovery Rate in Large Multiplicity Problems
Yoav Benjamini
Tel Aviv University
math.tau.ac.il/ybenja
16 Aug 2001

Organization of the talk
- Motivating Examples
- The general thresholding problem
- FDR controlling procedures and their properties
- Use of FDR in High Throughput Screening
- Use of FDR in Data Mining
- Concluding Remarks

Based on joint work with
- Yosi Hochberg
- Daniel Yekutieli
- Felix Abramovich
- Anat Reiner
- Dave Donoho
- Iain Johnstone
- Abba Krieger
- Frank Bretz

Motivating Examples
- High throughput screening
  - Of chemical compounds
  - Of gene expression
- Data Mining
  - Mining of Association Rules
  - Model Selection

High throughput screening of Chemical Compounds
- Purpose: at early stages of drug development, screen a large number of potential chemical compounds, in order to find any interaction with a given class of compounds (a "hit")
- The classes may be substructures of libraries of compounds involving up to 105 members.
- Each potential compound interaction with class member is tested once and only once

High Throughput Screening with Microtiters
[Figure: microtiter plate layout. Labels: plate i, i = 1, ..., 74; row j, j = 1, ..., 8; negative control; positive control; k = 1, 2; 10 x 8 potential compounds.]
High Throughput Screening
- Step 1: Analyzing the negative control data (74 plates x 8 rows). Get comparison values per plate and s.e.
- Step 2: Conduct individual comparisons (74 plates x 80 potential compounds)
- Note positive dependency within plate because of

Gene-expression micro-arrays
- Example: Dudoit et al (2000): Statistical analysis of a lipid metabolism study in mice.
- Treatment: 8 low HDL level knockout mice
- Control: 8 inbred mice
- Purpose: Identification of single differentially expressed genes in replicated cDNA microarray experiments.

Microarrays and their Statistical Analysis
- The microarray data consisted in this case of 6359 individual DNA sequences (out of 6384 printed in a high density array on glass).
- Both treatment and control on a single chip
- The ratio of the fluorescence intensity measured for each spot in the array is indicative of the relative abundance of the corresponding DNA sequence in the two nucleic acid samples.
- Data was suitably standardized using a lowess smoother.
- A t-statistic is calculated for comparing the mean of each gene expression in the control and treatment groups.

Microarrays and Multiplicity
- Neglecting multiplicity issues, i.e. working at the individual 0.05 level, would identify, on the average, 6359 * 0.05 = 318 differentially expressed genes, even if really no such gene exists.
- Addressing multiplicity with Bonferroni at 0.05 identifies 8.

Mining of association rules in Basket Analysis
- A basket bought at the food store consists of: (Apples, Bread, Coke, Milk, Tissues)
- Data on all baskets is available (through cash registers)
- The goal: Discover association rules of the form Bread & Milk => Coke & Tissue
- Also called linkage analysis or item analysis

Properties of association rules
- The support of the rule is the proportion of baskets with Bread & Milk & Coke & Tissue
- The confidence of the rule is Sup(Bread & Milk & Coke & Tissue) / Sup(Bread & Milk) (simply the estimated conditional probability in statistical terms)
- The lift of the rule is Sup(B & M & C & T) / [Sup(B & M) Sup(C & T)]
- Search for rules with high confidence and support

More on Association Rules
- Will the results be affected by randomness?
- Add the requirement that the rule is statistically significant in the test against independence (i.e. against lift = 1)
- The number of such tests to be performed in a moderate problem reaches tens of thousands

Model Selection
- Paralyzed Veterans of America
- Mailing list of 3.5 M potential donors
- 200K made their last donation 1-2 years ago
- Is there something better than mailing all 200K?
- If all mailed, net donation is $10,500
- Using data mining.

Model Selection
- Some 300 variables to be considered for the model - more when transformations were considered
- Which variables should be included in the model?
- (Foster and Stein model personal bankruptcy using 200 original variables plus all 200 x 200 variables capturing interactions)
- The winning performer GainSmart (by Yaacov Zehavi) used logistic regression to model the prob. of response for each individual

Model Selection in large problems
Known approaches to model selection:
- AIC and Cp
- .05 in testing "forward selection" or "backward elimination"
- The Universal Threshold of Donoho and Johnstone

Other examples
- Genetics: mapping (Hsu's talk, but instead of 63 markers 800-2000; instead of 1 gene a few; instead of 1 endpoint a few.)
- Behavioral genetics
- Functional MRI
- Image processing and wavelets analysis
- Multiple endpoints in medical studies

What's in common?
- Size of the problem: large to huge (m small, n large; m = n large; m large, n small)
- Question 1: Is there a real effect at a specific gene/site/location/association rule?
- Question 2: If there is an effect, of what size?
- Discoveries are further studied; negative results are usually ignored
- Results should be communicated compactly to a wide audience
- A threshold is being used for question 1.

The setting of a threshold
- Gene expression: "practical"
- Chemistry: "no hypotheses testing, just look and consider"
- Functional MRI: "practical", adjusted to the individual at test
- Association Rule: unadjusted tests
How should a threshold be chosen?

Significance testing as thresholding
- The problem is closest to classical significance testing, possibly followed by estimation.
- We should worry about multiplicity!
- What error-rate to control?

Challenges
- Controlling the FWE is too restrictive
- There is almost always at least one "real" effect
- Not important to protect against even a single error
- Why should a researcher be penalized for conducting a more informative study?
- Not controlling for multiplicity:

"guidelines for interpreting" Lander and Kruglyak 95
- "Adopting too lax a standard guarantees a burgeoning literature of false positive linkage claims, each with its own symbol ... Scientific disciplines erode their credibility when substantial proportion of claims cannot be replicated ..."
- i.e. when the False Discovery Rate was too high!
- They suggested control of FWE instead, but are ready to live with level .5 (half!), to overcome loss of power.

So, we suggest, use FDR hypotheses testing to set the threshold
- Multiplicity can no longer be ignored
- Not by Frequentists nor by Bayesians
- Not because of skepticism, but because it is a better way
  - to deal with uncertainty in large data sets
  - to summarize the data
- See theoretical support later on.

Historical perspective
- Tukey, when expressing support for the use of FDR, points back to his own (1953) as the roots of the idea!(?)
- He clearly was looking over these years for some approach in between the too soft PCE and the too harsh FWE.

Next, how do we infer on the selected set?
- Hypotheses testing followed by estimation (point and/or confidence intervals)
- In short: "Testimation with confidence"

How does it work? Does it make sense?
Before doing that:
- One more comment about the FDR criterion
- Two comments on the Linear StepUp procedure
- Other FDR controlling procedures

The comment about FDR criterion
- With all respect to the other TLAs we have seen these past days, FDR is a catchy name not because of our inventiveness

As a result:
- Genovese and Wasserman emphasize the sample quantity V/R
- Storey emphasizes E(V/R | R > 0)
- But both keep the term FDR for their versions

1. Properties of the Linear StepUp Procedure
If the test statistics are:
- Independent: YB & Yekutieli (01)
- Independent and continuous: YB & Hochberg (95)
- Positive dependent: YB & Yekutieli (01)
- General: YB & Yekutieli (01)

Positive dependency
Positive Regression Dependency on the subset of true null hypotheses:
- If the test statistics are X = (X1, X2, ..., Xm):
- For any increasing set D, and H0i true, Prob(X in D | Xi = s) is increasing in s
Important examples:
- Multivariate Normal with positive correlation
- Absolute Studentized independent normal
- (Studentized PRDS distribution, for q < .5)

More about dependency
If the test statistics are:
- All Pairwise Comparisons: xi - xj, i, j = 1, 2, ..., k
- even though correlations between pairs of comparisons are both + and -
- Based on many simulation studies: Williams, Jones, & Tukey (94, 99); YB, Hochberg, & Kling (94+); Kesselman, Cribbie, & Holland (99).
- And limited theoretical evidence: Yekutieli (99+)
- so the theoretical problem is still open.

2. Scalability
- The procedure is stable as the size of the problem increases.
- The discoveries in the combined study are (about) the same as when analyzed separately.

Scalability (contd)
For scalability to hold:
- Sub-studies should be large
- Not totally null
Theorem (Abramovich, YB, Donoho, & Johnstone (98+)): Using the linear step-up procedure to test L families of hypotheses separately, each family of size mi, and if in each family m0i hypotheses are true, m0i / mi approaching some c < 1 as mi increases to infinity,
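
The linear step-up procedure that these slides keep referring to can be sketched as below. The sketch assumes its usual form: sort the m p-values, find the largest k with p(k) <= qk/m, and reject the k hypotheses with the smallest p-values. The function names and the example p-values are illustrative assumptions, not taken from the talk; a Bonferroni rule at q/m is included only for comparison.

```python
from typing import List, Sequence


def linear_step_up(p_values: Sequence[float], q: float) -> List[bool]:
    """Return a reject flag per p-value, in the original input order."""
    m = len(p_values)
    # Indices of the p-values, ordered from the smallest p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up: keep the LARGEST rank k (1-based) with p_(k) <= q * k / m.
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= q * rank / m:
            k = rank
    # Reject the k hypotheses with the smallest p-values.
    reject = [False] * m
    for idx in order[:k]:
        reject[idx] = True
    return reject


def bonferroni(p_values: Sequence[float], q: float) -> List[bool]:
    """FWE control for comparison: reject when p <= q / m."""
    m = len(p_values)
    return [p <= q / m for p in p_values]


if __name__ == "__main__":
    # Hypothetical p-values, chosen only to show the difference in thresholds.
    p = [0.02, 0.03, 0.035, 0.04]
    print("linear step-up:", linear_step_up(p, q=0.05))
    print("Bonferroni:    ", bonferroni(p, q=0.05))
```

In this toy input the smallest p-values exceed their own thresholds, yet all four hypotheses are rejected because the largest one satisfies p(4) <= q; this is the "step-up" behaviour, and Bonferroni at the same level rejects none of them.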
3. Adaptive procedures that control FDR
- Recall the m0/m factor of conservativeness
- Hence: if m0 is known, using the linear step-up procedure with q i / m (m / m0) = q i / m0 controls the FDR at level q exactly.
- The adaptive procedure BY & Hochberg (00): Estimate m0 from the uniform q-q plot of the p-values
- This is FDR controlling under independence (via simulations)

The two-staged procedure BY, Krieger, Yekutieli (00)
- Use the linear step-up at level q once and get r1.
- Estimate m0 (somewhat conservatively) by (m - r1) / (1 - q)
- Use the linear step-up the second time at level q2 = q (1 - q) m / (m - r1)
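
A minimal sketch of the two-stage rule exactly as the slide states it: run the linear step-up at level q to get r1, estimate m0 by (m - r1)/(1 - q), then rerun at q2 = q(1 - q)m/(m - r1). The function names, the example p-values, and the handling of the case r1 = m (where q2 is undefined) are assumptions made here for illustration.

```python
from typing import List, Sequence


def linear_step_up(p_values: Sequence[float], level: float) -> List[bool]:
    """Reject the k smallest p-values, k = max{i : p_(i) <= level * i / m}."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= level * rank / m:
            k = rank
    reject = [False] * m
    for idx in order[:k]:
        reject[idx] = True
    return reject


def two_stage(p_values: Sequence[float], q: float) -> List[bool]:
    """Two-stage adaptive procedure, following the formulas on the slide."""
    m = len(p_values)
    stage_one = linear_step_up(p_values, q)
    r1 = sum(stage_one)
    if r1 == m:
        # Assumption: m - r1 = 0 leaves q2 undefined; everything is already
        # rejected at stage one, so return that result unchanged.
        return stage_one
    m0_hat = (m - r1) / (1.0 - q)   # somewhat conservative estimate of m0
    q2 = q * m / m0_hat             # equals q * (1 - q) * m / (m - r1)
    return linear_step_up(p_values, q2)


if __name__ == "__main__":
    # Hypothetical p-values: three clear signals, one borderline, four nulls.
    p = [0.001, 0.008, 0.012, 0.03, 0.2, 0.5, 0.7, 0.9]
    print("stage one rejections:", sum(linear_step_up(p, 0.05)))
    print("two-stage rejections:", sum(two_stage(p, 0.05)))
```

When stage one rejects nothing, q2 = q(1 - q) is smaller than q, so the second pass also rejects nothing and no special case is needed.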
- FDR is proved to be controlled at level q in the independent case
- The FDR is conjectured to be controlled at level q for positive dependent test statistics (PRDS)
- Proof for m = 3
- Simulations for constant positive correlations

Non-parametric step-down procedure
- BY & Liu (00+)
- Discussed by Sarkar

Resampling procedure
- Yekutieli & BY (99)
- Demonstrated later

Organization of the talk
- Motivating Examples
- The general thresholding problem
- FDR controlling procedures and their properties
- Use of FDR in High Throughput Screening
- Use of FDR in Data Mining
- Concluding Remarks

FDR screening of chemical compounds
- Uniform q-q plot of test results, zooming into the smallest 150 p-values (largest 150 interactions)
- Applying multiple testing at level .05:
- FWE control: 103 significant (using Bonferroni)
- FDR control: 125 significant (using Linear StepUp)
- Jointly  Separately  121  134

FDR in Micro-arrays
- Dudoit et al account for multiple testing by using the Westfall and Young step-down resampling algorithm to calculate adjusted p-values while controlling the FWE (avoiding t-distribution assumption and utilizing correlation)
- FDR considered (but not used) because of dependency
- This need not be a limitation

FDR in High Throughput Screening
Particular Remarks (I)
- Positive dependency does not harm in both examples, but has been utilized only in the analysis of Micro-arrays
- In the chemical example there is constant positive dependency within plate.
- We plan to use new FDR controlling procedures for this setting (with F. Bretz)

FDR in High Throughput Screening
General Remarks
- An interpretation of FDR: Exp(expenses wasted chasing "red herrings" / expenses made on follow-up studies) <= q
- But FDR with 0.2?

FDR in High Throughput Screening
FDR with 0.2?
- Makes sense in screening experiments which are followed by an independent study
- Second study can be conducted on the set of identified genes, (FWE) controlling for multiplicity at, say, .05 / 0.2 = .25 (!).
- Still the overall (FWE) level is .05.

Inference on the selected set: testimation with confidence
- Test using linear step-up procedure: p(k) <= qk/m
- Estimate using Xk(FDR) = 0 if |Xk| [...] infinity
- If prop(non-zero coefficients) -> 0, or
- If size of sorted coefficients decays fast (while the others need not be exactly 0),
- THEN thresholding by FDR testing of the coefficients is adaptively minimax over bodies of sparse signals
- Where performance measured by any loss 0 [...] infinity
- If prop(non-zero coefficients) -> 0, as before
- Abramovich, YB, Donoho, & Johnstone (00+)
- Under non orthogonal regression? Non linear? Non Normal?
- What about q? We know q should be [...] 0 slowly (as required in current proof?)
- Many open problems, but the direction is clear:

Model Selection and FDR - Practical Theory
- The theory is being developed for the minimizer of the following penalized Sum of Squared Residuals:
- The Linear Step-Up is essentially "backwards elimination" (and close to "forward selection") with the above penalty function
- AIC

Model Selection and FDR
- Reiner (00+) studied (via simulations) the testing of up to 128 regression coefficients in a logistic regression.
- The linear step-up procedure was found to offer FDR control, and higher power to discover "real" terms, even in face of correlation
- Nevertheless classification error was not assessed
- Foster and Stein studied linear model regression selection problem using a penalty function which is closely related to FDR.

Mining of association rules via FDR
- Zembovich & Zytkov (97) developed the 49er software to mine association rules using chi-square tests of significance for the independence assumption
- They find that usually "too many" of the m rules are significant
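
The association-rule quantities defined earlier in the talk (support, confidence, lift) and a chi-square test against independence can be illustrated as follows. This is only a sketch: the basket data are invented, the function names are hypothetical, and the test shown is a plain Pearson chi-square on the 2 x 2 table without continuity correction, which is an assumption since the slides do not say which variant the 49er software uses.

```python
import math
from typing import Iterable, List, Set, Tuple


def support(baskets: List[Set[str]], items: Iterable[str]) -> float:
    """Proportion of baskets that contain every item in `items`."""
    wanted = set(items)
    return sum(1 for basket in baskets if wanted <= basket) / len(baskets)


def confidence(baskets: List[Set[str]], lhs: Iterable[str], rhs: Iterable[str]) -> float:
    """Sup(lhs & rhs) / Sup(lhs): estimated conditional probability."""
    lhs, rhs = set(lhs), set(rhs)
    return support(baskets, lhs | rhs) / support(baskets, lhs)


def lift(baskets: List[Set[str]], lhs: Iterable[str], rhs: Iterable[str]) -> float:
    """Sup(lhs & rhs) / [Sup(lhs) Sup(rhs)]; equals 1 under independence."""
    lhs, rhs = set(lhs), set(rhs)
    return support(baskets, lhs | rhs) / (support(baskets, lhs) * support(baskets, rhs))


def chi_square_independence(
    baskets: List[Set[str]], lhs: Iterable[str], rhs: Iterable[str]
) -> Tuple[float, float]:
    """Pearson chi-square statistic (1 d.f.) and p-value for lhs vs rhs."""
    lhs, rhs = set(lhs), set(rhs)
    n = len(baskets)
    a = sum(1 for basket in baskets if lhs <= basket and rhs <= basket)
    b = sum(1 for basket in baskets if lhs <= basket and not rhs <= basket)
    c = sum(1 for basket in baskets if not lhs <= basket and rhs <= basket)
    d = n - a - b - c
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # Assumption: a degenerate table is treated as carrying no evidence.
        return 0.0, 1.0
    stat = n * (a * d - b * c) ** 2 / denom
    # Upper tail of chi-square with 1 d.f.: P(|Z| > sqrt(stat)).
    return stat, math.erfc(math.sqrt(stat / 2.0))


if __name__ == "__main__":
    # Invented baskets, far too few for the chi-square approximation to be reliable.
    baskets = [
        {"Apples", "Bread", "Coke", "Milk", "Tissues"},
        {"Bread", "Milk", "Coke", "Tissues"},
        {"Bread", "Milk", "Coke", "Tissues"},
        {"Coke", "Tissues"},
        {"Apples", "Milk"},
        {"Apples", "Bread", "Milk"},
        {"Apples", "Coke", "Tissues"},
        {"Apples"},
    ]
    lhs, rhs = {"Bread", "Milk"}, {"Coke", "Tissues"}
    print("confidence:", confidence(baskets, lhs, rhs))
    print("lift:", lift(baskets, lhs, rhs))
    print("chi-square, p-value:", chi_square_independence(baskets, lhs, rhs))
```

In the setting of the slides, one such p-value would be computed for each candidate rule, and the whole collection of p-values would then be passed to an FDR controlling procedure rather than judged one test at a time.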