False Discovery Rate in Large Multiplicity Problems
Yoav Benjamini
Tel Aviv University
math.tau.ac.il/ybenja
16 Aug 2001

Organization of the talk
  Motivating Examples
  The general thresholding problem
  FDR controlling procedures and their properties
  Use of FDR in High Throughput Screening
  Use of FDR in Data Mining
  Concluding Remarks

Based on joint work with
  Yosi Hochberg
  Daniel Yekutieli
  Felix Abramovich
  Anat Reiner
  Dave Donoho
  Iain Johnstone
  Abba Krieger
  Frank Bretz

Motivating Examples
  High throughput screening
    Of chemical compounds
    Of gene expression
  Data Mining
    Mining of Association Rules
    Model Selection

High throughput screening of Chemical Compounds
  Purpose: at early stages of drug development, screen a large number of potential chemical compounds, in order to find any interaction with a given class of compounds (a "hit").
  The classes may be substructures of libraries of compounds involving up to 10^5 members.
  Each potential compound interaction with a class member is tested once and only once.

High Throughput Screening with Microtiters
  [Figure: microtiter plate layout. Plates i = 1, ..., 74; rows j = 1, ..., 8; negative control and positive control (k = 1, 2); 10 x 8 potential compounds per plate.]

High Throughput Screening
  Step 1: Analyzing the negative control data
    74 plates x 8 rows
    Get comparison values per plate and s.e.
  Step 2: Conduct individual comparisons
    74 plates x 80 potential compounds
    Note positive dependency within plate because of [missing in source]

Gene-expression micro-arrays
  Example: Dudoit et al (2000): Statistical analysis of a lipid metabolism study in mice.
  Treatment: 8 low HDL level knockout mice
  Control: 8 inbred mice
  Purpose: Identification of single differentially expressed genes in replicated cDNA microarray experiments.

Microarrays and their Statistical Analysis
  The microarray data consisted in this case of 6359 individual DNA sequences (out of 6384 printed in a high density array on a glass).
  Both treatment and control on a single chip.
  The ratio of the fluorescence intensity measured for each spot in the array is indicative of the relative abundance of the corresponding DNA sequence in the two nucleic acid samples.
  Data was suitably standardized using a lowess smoother.
  A t-statistic is calculated for comparing the mean of each gene expression in the control and treatment groups.

Microarrays and Multiplicity
  Neglecting multiplicity issues, i.e. working at the individual 0.05 level, would identify, on the average, 6359*0.05=318 differentially expressed genes, even if really no such gene exists.
  Addressing multiplicity with Bonferroni at 0.05 identifies 8.

Mining of association rules in Basket Analysis
  A basket bought at the food store consists of: (Apples, Bread, Coke, Milk, Tissues)
  Data on all baskets is available (through cash registers).
  The goal: Discover association rules of the form
    Bread & Milk => Coke & Tissue
  Also called linkage analysis or item analysis.

Properties of association rules
  The support of the rule is the proportion of baskets with Bread & Milk & Coke & Tissue.
  The confidence of the rule is
    Sup(Bread & Milk & Coke & Tissue) / Sup(Bread & Milk)
  (simply the estimated conditional probability in statistical terms).
  The lift of the rule is
    Sup(B & M & C & T) / (Sup(B & M) Sup(C & T))
  Search for rules with high confidence and support.

More on Association Rules
  Will the results be affected by randomness?
  Add the requirement that the rule is statistically significant in the test against independence (i.e. against lift=1).
  The number of such tests to be performed in a moderate problem reaches tens of thousands.

Model Selection
  Paralyzed Veterans of America
  Mailing list of 3.5 M potential donors
  200K made their last donation 1-2 years ago
  Is there something better than mailing all 200K?
  If all mailed, net donation is $10,500
  Using data mining.

Model Selection
  Some 300 variables to be considered for the model, more when transformations were considered.
  Which variables should be included in the model?
  (Foster and Stein model personal bankruptcy using 200 original variables plus all 200 x 200 variables capturing interactions.)
  The winning performer GainSmart (by Yaacov Zehavi) used logistic regression to model the prob. of response for each individual.

Model Selection in large problems
  Known approaches to model selection:
    AIC and Cp
    .05 in testing "forward selection" or "backward elimination"
    The Universal Threshold of Donoho and Johnstone

Other examples
  Genetics: mapping (Hsu's talk, but instead of 63 markers 800-2000; instead of 1 gene a few; instead of 1 endpoint a few.)
  Behavioral genetics
  Functional MRI
  Image processing and wavelets analysis
  Multiple endpoints in medical studies

What's in common?
  Size of the problem: large to huge (m small n large; m=n large; m large n small)
  Question 1: Is there a real effect at a specific gene/site/location/association rule?
  Question 2: If there is an effect, of what size?
  Discoveries are further studied; negative results are usually ignored.
  Results should be communicated compactly to a wide audience.
  A threshold is being used for question 1.

The setting of a threshold
  Gene expression: "practical"
  Chemistry: "no hypotheses testing, just look and consider"
  Functional MRI: "practical", adjusted to the individual at test
  Association Rule: unadjusted tests
  How should a threshold be chosen?

Significance testing as thresholding
  The problem is closest to classical significance testing, possibly followed by estimation.
  We should worry about multiplicity!
  What error-rate to control?

Challenges
  Controlling the FWE is too restrictive:
    There is almost always at least one "real" effect.
    Not important to protect against even a single error.
    Why should a researcher be penalized for conducting a more informative study?
  Not controlling for multiplicity:

"guidelines for interpreting" Lander and Kruglyak 95
  "Adopting too lax a standard guarantees a burgeoning literature of false positive linkage claims, each with its own symbol. Scientific disciplines erode their credibility when substantial proportion of claims cannot be replicated."
  i.e. when the False Discovery Rate was too high!
  They suggested control of FWE instead, but are ready to live with level .5 (half!), to overcome loss of power.

So, we suggest, use FDR hypotheses testing to set the threshold
  Multiplicity can no longer be ignored
    Not by Frequentists nor by Bayesians
    Not because of skepticism, but because it is a better way
      to deal with uncertainty in large data sets
      to summarize the data
  See theoretical support later on.

Historical perspective
  Tukey, when expressing support for the use of FDR, points back to his own (1953) as the roots of the idea!(?)
  He clearly was looking over these years for some approach in between the too soft PCE and the too harsh FWE.

Next, how do we infer on the selected set?
  Hypotheses testing followed by estimation (point and/or confidence intervals)
  In short: "Testimation with confidence"

How does it work? Does it make sense?
  Before doing that:
    One more comment about the FDR criterion
    Two comments on the Linear StepUp procedure
    Other FDR controlling procedures

The comment about FDR criterion
  With all respect to the other TLAs we have seen these past days, FDR is a catchy name not because of our inventiveness.
  As a result:
    Genovese and Wasserman emphasize the sample quantity V/R
    Storey emphasizes E(V/R | R > 0)
  But both keep the term FDR for their versions.

1. Properties of the Linear StepUp Procedure
  If the test statistics are:
    Independent: YB & Yekutieli (01)
    Independent and continuous: YB & Hochberg (95)
    Positive dependent: YB & Yekutieli (01)
    General: YB & Yekutieli (01)

Positive dependency
  Positive Regression Dependency on the subset of true null hypotheses:
  If the test statistics are X = (X1, X2, ..., Xm):
    For any increasing set D, and H0i true,
    Prob(X in D | Xi = s) is increasing in s
  Important Examples
    Multivariate Normal with positive correlation
    Absolute Studentized independent normal
    (Studentized PRDS distribution, for q < .5)

More about dependency
  If the test statistics are:
    All Pairwise Comparisons: xi - xj, i,j = 1, 2, ..., k
    even though correlations between pairs of comparisons are both + and -
  Based on many simulation studies: Williams, Jones, & Tukey (94, 99); YB, Hochberg, & Kling (94+); Kesselman, Cribbie, & Holland (99).
  And limited theoretical evidence: Yekutieli (99+),
  so the theoretical problem is still open.

2. Scalability
  The procedure is stable as the size of the problem increases.
  The discoveries in the combined study are (about) the same as when analyzed separately.

Scalability (contd)
  For scalability to hold:
    Sub-studies should be large
    Not totally null
  Theorem (Abramovich, YB, Donoho, & Johnstone (98+)):
    Using the linear step-up procedure to test L families of hypotheses separately, each family of size m_i, and if in each family m_0i hypotheses are true, m_0i / m_i approaching some c < 1 as m_i increases to infinity, [conclusion missing in source]
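
The arithmetic on the "Microarrays and Multiplicity" slide can be checked with a short sketch. The 6359 tests and the 0.05 level are taken from the slides; the variable names are illustrative only, and the Bonferroni threshold is simply the per-test level that controls the FWE at 0.05 over all tests.

```python
# Sketch of the multiplicity arithmetic for the microarray example.
m = 6359        # number of DNA sequences tested (from the slides)
alpha = 0.05    # individual significance level (from the slides)

# Expected number of false discoveries when no gene is truly
# differentially expressed and every test is run at level alpha.
expected_false_positives = m * alpha

# Bonferroni: run each test at alpha / m to control the FWE at alpha.
bonferroni_threshold = alpha / m

print(round(expected_false_positives))   # 318
print(f"{bonferroni_threshold:.2e}")     # per-test level under Bonferroni
```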

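The support, confidence and lift defined on the "Properties of association rules" slide can be computed directly from a list of baskets. The following is a minimal sketch; the function names and the four toy baskets are made up for illustration and are not data from the talk.

```python
# Sketch: support, confidence and lift of a rule lhs => rhs.
# The baskets below are hypothetical toy data.

def support(baskets, items):
    """Proportion of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(b) for b in baskets) / len(baskets)

def confidence(baskets, lhs, rhs):
    """Estimated conditional probability of rhs given lhs.
    Assumes lhs occurs in at least one basket."""
    return support(baskets, set(lhs) | set(rhs)) / support(baskets, lhs)

def lift(baskets, lhs, rhs):
    """Ratio of joint support to the product of the two supports.
    A lift of 1 corresponds to independence of lhs and rhs."""
    joint = support(baskets, set(lhs) | set(rhs))
    return joint / (support(baskets, lhs) * support(baskets, rhs))

baskets = [
    {"Apples", "Bread", "Coke", "Milk", "Tissues"},
    {"Bread", "Milk"},
    {"Bread", "Milk", "Coke", "Tissues"},
    {"Apples", "Coke"},
]
lhs, rhs = {"Bread", "Milk"}, {"Coke", "Tissues"}

print(support(baskets, lhs | rhs))    # 0.5
print(confidence(baskets, lhs, rhs))  # about 0.667
print(lift(baskets, lhs, rhs))        # about 1.333
```

A lift above 1 in a toy sample says nothing about significance; the slides propose testing each rule against lift = 1, which is where the multiplicity problem enters.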
3. Adaptive procedures that control FDR
  Recall the m0/m factor of conservativeness.
  Hence: if m0 is known, using the linear step-up procedure with (qi/m)(m/m0) = qi/m0 controls the FDR at level q exactly.
  The adaptive procedure, BY & Hochberg (00):
    Estimate m0 from the uniform q-q plot of the p-values.
    This is FDR controlling under independence (via simulations).

The two-staged procedure, BY, Krieger, Yekutieli (00)
  Use the linear step-up at level q once and get r1.
  Estimate m0 (somewhat conservatively) by (m - r1)/(1 - q).
  Use the linear step-up the second time at level q2 = q(1 - q)m/(m - r1).
  The FDR is proved to be controlled at level q in the independent case.
  The FDR is conjectured to be controlled at level q for positive dependent test statistics (PRDS):
    Proof for m=3
    Simulations for constant positive correlations

Non-parametric step-down procedure
  BY & Liu (00+)
  Discussed by Sarkar

Resampling procedure
  Yekutieli & BY (99)
  Demonstrated later

FDR screening of chemical compounds
  [Figure: uniform q-q plot of test results, zooming into the smallest 150 p-values (largest 150 interactions).]
  Applying multiple testing at level .05:
    FWE control: 103 significant (using Bonferroni)
    FDR control: 125 significant (using Linear StepUp)
    Jointly / Separately: 121 / 134

FDR in Micro-arrays
  Dudoit et al account for multiple testing by using the Westfall and Young step-down resampling algorithm to calculate adjusted p-values while controlling the FWE (avoiding the t-distribution assumption and utilizing correlation).
  FDR considered (but not used) because of dependency.
  This need not be a limitation.

FDR in High Throughput Screening
  Particular Remarks (I)
    Positive dependency does not harm in both examples, but has been utilized only in the analysis of Micro-arrays.
    In the chemical example there is constant positive dependency within plate.
    We plan to use new FDR controlling procedures for this setting (with F. Bretz).

FDR in High Throughput Screening
  General Remarks
    An interpretation of FDR:
      Exp(expenses wasted chasing "red herrings" / expenses made on follow-up studies) <= q
    But FDR with 0.2?

FDR in High Throughput Screening
  FDR with 0.2?
    Makes sense in screening experiments which are followed by an independent study.
    The second study can be conducted on the set of identified genes, controlling (FWE) for multiplicity at, say, .05 / 0.2 = .25 (!).
    Still the overall (FWE) level is .05.

Inference on the selected set: testimation with confidence
  Test using the linear step-up procedure: p(k) <= qk/m
  Estimate using Xk^FDR = 0 if |Xk| [missing in source]
  If prop(non-zero coefficients) -> 0,
  or if the size of sorted coefficients decays fast (while the others need not be exactly 0),
  THEN thresholding by FDR testing of the coefficients is adaptively minimax over bodies of sparse signals,
  where performance is measured by any loss [missing in source]
  If prop(non-zero coefficients) -> 0, as before.
  Abramovich, YB, Donoho, & Johnstone (00+)
  Under non orthogonal regression? Non linear? Non Normal?
  What about q? We know q should be [missing in source] 0 slowly (as required in current proof?)
  Many open problems, but the direction is clear:

Model Selection and FDR - Practical Theory
  The theory is being developed for the minimizer of the following penalized Sum of Squared Residuals: [formula missing in source]
  The Linear Step-Up is essentially "backwards elimination" (and close to "forward selection") with the above penalty function:
  AIC

Model Selection and FDR
  Reiner (00+) studied (via simulations) the testing of up to 128 regression coefficients in a logistic regression.
  The linear step-up procedure was found to offer FDR control, and higher power to discover "real" terms, even in face of correlation.
  Nevertheless classification error was not assessed.
  Foster and Stein studied the linear model regression selection problem using a penalty function which is closely related to FDR.

Mining of association rules via FDR
  Zembovich & Zytkov (97) developed the 49er software to mine association rules using chi-square tests of significance for the independence assumption.
  They find that usually "too many" of the m rules are significant [text truncated in source]
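
The linear step-up rule p(k) <= qk/m from the testimation slide can be sketched in a few lines. This is an illustrative implementation of the standard rule (find the largest k whose ordered p-value falls under its threshold, then reject the k smallest); the function name and the p-values are made up for the example.

```python
# Sketch of the linear step-up procedure: reject the k smallest p-values,
# where k is the largest rank with p_(k) <= q * k / m.

def linear_step_up(p_values, q):
    """Return the indices (in input order) of the rejected hypotheses."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= q * rank / m:
            k = rank          # keep the LARGEST rank that passes
    return sorted(order[:k])

# Hypothetical p-values, m = 5, q = 0.05; thresholds are 0.01, 0.02, ..., 0.05.
p = [0.012, 0.021, 0.028, 0.9, 0.5]
print(linear_step_up(p, 0.05))   # [0, 1, 2]
```

In this toy input the two smallest p-values miss their own thresholds (0.012 > 0.01, 0.021 > 0.02), but the third passes (0.028 <= 0.03), so all three are rejected. That "step-up" behaviour is what distinguishes the procedure from testing each ordered p-value in turn and stopping at the first failure.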

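The two-staged procedure described above can be sketched the same way, using the formulas exactly as they appear on the slide: m0 estimated by (m - r1)/(1 - q) and a second pass at q2 = q(1 - q)m/(m - r1). Reading r1 as the number of first-stage rejections, and stopping early when r1 is 0 or m, are assumptions of this sketch; the function names and p-values are hypothetical.

```python
# Sketch of the two-staged adaptive procedure, following the slide's formulas.

def count_step_up_rejections(p_values, level):
    """Number of rejections of the linear step-up procedure at `level`."""
    m = len(p_values)
    k = 0
    for rank, p in enumerate(sorted(p_values), start=1):
        if p <= level * rank / m:
            k = rank
    return k

def two_stage(p_values, q):
    """Return (r1, q2, r2): first-stage rejections, second-stage level,
    and final number of rejections."""
    m = len(p_values)
    r1 = count_step_up_rejections(p_values, q)
    if r1 == 0 or r1 == m:
        # Assumption: nothing to adapt when none or all are rejected.
        return r1, q, r1
    m0_hat = (m - r1) / (1 - q)   # conservative estimate of m0
    q2 = q * m / m0_hat           # equals q * (1 - q) * m / (m - r1)
    r2 = count_step_up_rejections(p_values, q2)
    return r1, q2, r2

# Hypothetical p-values, m = 10, q = 0.05.
p = [0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.045, 0.5, 0.7, 0.9]
r1, q2, r2 = two_stage(p, 0.05)
print(r1, round(q2, 5), r2)   # 6 0.11875 7
```

Here the first pass rejects six hypotheses, the estimated number of true nulls drops to about 4.2, and the relaxed second-stage level picks up a seventh rejection that the plain linear step-up misses.
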