Peking University Summer Course: Regression Analysis

Class 5: ANOVA (Analysis of Variance) and F-tests

I. What is ANOVA

ANOVA is the short name for the Analysis of Variance. The essence of ANOVA is to decompose the total variance of the dependent variable into two additive components, one for the structural part and the other for the stochastic part of a regression. Today we are going to examine the easiest case.

II. ANOVA: An Introduction

Let the model be $y = X\beta + \varepsilon$. Assuming $x_i$ is a column vector (of length p) of independent variable values for the ith observation, $y_i = x_i'\beta + \varepsilon_i$. Then $x_i'b$ is the predicted value.

Sum of squares total:

$$SST = \sum (y_i - \bar{y})^2 = \sum (y_i - x_i'b + x_i'b - \bar{y})^2$$
$$= \sum (y_i - x_i'b)^2 + 2\sum (y_i - x_i'b)(x_i'b - \bar{y}) + \sum (x_i'b - \bar{y})^2$$
$$= \sum e_i^2 + \sum (x_i'b - \bar{y})^2 = SSE + SSR$$

because $\sum (y_i - x_i'b)(x_i'b - \bar{y}) = \sum e_i (x_i'b - \bar{y}) = 0$. This is always true by OLS.

Important: the total variance of the dependent variable is decomposed into two additive parts: SSE, which is due to errors, and SSR, which is due to regression.

Geometric interpretation: blackboard.

Decomposition of Variance

If we treat X as a random variable, we can decompose the total variance into the between-group portion and the within-group portion in any population:

$$V(y_i) = V(x_i'\beta) + V(\varepsilon_i) \qquad (1)$$

Proof:

$$V(y_i) = V(x_i'\beta + \varepsilon_i) = V(x_i'\beta) + V(\varepsilon_i) + 2\,\mathrm{Cov}(x_i'\beta, \varepsilon_i) = V(x_i'\beta) + V(\varepsilon_i)$$

(by the assumption that $\mathrm{Cov}(x_k, \varepsilon) = 0$ for all possible k).

The ANOVA table estimates the three quantities of equation (1) from the sample. As the sample size gets larger and larger, the ANOVA table approaches the equation more and more closely. In a sample, the decomposition of the estimated variance is not strictly true. We thus need to decompose sums of squares and degrees of freedom separately. Is ANOVA a misnomer?

III. ANOVA in Matrix

I will try to give a simplified representation of ANOVA as follows:

$$SST = \sum (y_i - \bar{y})^2 = \sum y_i^2 - 2\bar{y}\sum y_i + n\bar{y}^2 = y'y - n\bar{y}^2$$

(because $\sum y_i = n\bar{y}$) (in your textbook, this has a monster look)

$$SSE = e'e$$

$$SSR = \sum (x_i'b - \bar{y})^2 = b'X'Xb - n\bar{y}^2$$

(because $\sum e_i = 0$, as always, so that $\sum x_i'b = n\bar{y}$) (in your textbook, this has a monster look)

IV. ANOVA Table

SOURCE       SS    DF      MS    F         with DF
Regression   SSR   DF(R)   MSR   MSR/MSE   DF(R), DF(E)
Error        SSE   DF(E)   MSE
Total        SST   DF(T)

Let us use a real example. Assume that we have a regression estimated to be y = -1.70 + 0.840 x.

ANOVA Table

SOURCE       SS     DF   MS     F                   with DF
Regression   6.44   1    6.44   6.44/0.19 = 33.89   1, 18
Error        3.40   18   0.19
Total        9.84   19

We know the sample sums needed for the calculation (among them $\sum x_i = 100$, $\sum x_i^2 = 509.12$, and $n\bar{y}^2 = 125.0$, as used below).

If we know that the DF for SST is 19, what is n? n = 20.

$$SSR = 20 \times 1.7 \times 1.7 + 0.84 \times 0.84 \times 509.12 - 2 \times 1.7 \times 0.84 \times 100 - 125.0 = 6.44$$

(the cross term is negative because the intercept is -1.70).

SSE = SST - SSR = 9.84 - 6.44 = 3.40

DF (degrees of freedom): demonstration. Note: discount the intercept when calculating the DF for SST, so that DF(T) = n - 1.

MS = SS/DF

p = 0.000 (ask students). What does the p-value say?

V. F-Tests

F-tests are more general than t-tests; t-tests can be seen as a special case of F-tests. If you have difficulty with F-tests, please ask your GSIs to review F-tests in the lab. An F-test takes the form of a ratio of two MS's.

An F statistic has two degrees of freedom associated with it: the degrees of freedom in the numerator and the degrees of freedom in the denominator. An F statistic is usually larger than 1. The interpretation of an F statistic is whether the variance explained under the alternative hypothesis is due to chance. In other words, the null hypothesis is that the explained variance is due to chance, or that all the coefficients (other than the intercept) are zero. The larger an F statistic, the stronger the evidence that the null hypothesis is not true. There is a table in the back of your book from which you can find exact probability values.

In our example, the F is 34, which is highly significant.

VI. R2

$R^2 = SSR / SST$: the proportion of variance explained by the model. In our example, R-sq = 65.4%.

VII. What happens if we add more independent variables

1. SST stays the same.
2. SSR always increases (strictly, it never decreases).
3. SSE always decreases (strictly, it never increases).
4. R2 always increases (strictly, it never decreases).
5. MSR usually increases.
6. MSE usually decreases.
7. The F-test statistic usually increases.

Exceptions to 5 and 7: irrelevant variables may not explain the variance but still take up degrees of freedom. We really need to look at the results.

VIII. Important: General Ways of Hypothesis Testing with F-Statistics

All tests in linear regression can be performed with F-test statistics. The trick is to run "nested models."

Two models are nested if the independent variables in one model are a subset, or linear combinations of a subset, of the independent variables in the other model. That is to say, if model A has independent variables $(1, x_1, x_2)$ and model B has independent variables $(1, x_1, x_2, x_3)$, A and B are nested. A is called the restricted model; B is called the less restricted or unrestricted model. We call A restricted because A implies that $\beta_3 = 0$. This is a restriction.
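As a minimal sketch of a nested pair, the snippet below takes A as (1, x1, x2) and B as (1, x1, x2, x3), fits both to simulated data, and compares their sums of squares. The data, the seed, the coefficients, and the helper name `ols_sums_of_squares` are illustrative assumptions and not part of the course material.

```python
import random


def ols_sums_of_squares(columns, y):
    """Fit y on a constant plus the given columns by OLS; return (SST, SSR, SSE)."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented p x (p + 1) matrix.
    A = [
        [sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
        + [sum(X[i][r] * y[i] for i in range(n))]
        for r in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for k in range(p):
        pivot = max(range(k, p), key=lambda r: abs(A[r][k]))
        A[k], A[pivot] = A[pivot], A[k]
        for r in range(p):
            if r != k:
                factor = A[r][k] / A[k][k]
                A[r] = [a - factor * b for a, b in zip(A[r], A[k])]
    b = [A[k][p] / A[k][k] for k in range(p)]
    fitted = [sum(bj * xj for bj, xj in zip(b, row)) for row in X]
    y_bar = sum(y) / n
    sst = sum((yi - y_bar) ** 2 for yi in y)
    ssr = sum((fi - y_bar) ** 2 for fi in fitted)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return sst, ssr, sse


# Simulated data (an assumption for illustration only).
random.seed(0)
n = 50
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
x3 = [random.gauss(0, 1) for _ in range(n)]
y = [1.0 + 0.5 * a + 0.3 * b + 0.8 * c + random.gauss(0, 1)
     for a, b, c in zip(x1, x2, x3)]

sst_a, ssr_a, sse_a = ols_sums_of_squares([x1, x2], y)      # restricted model A
sst_b, ssr_b, sse_b = ols_sums_of_squares([x1, x2, x3], y)  # unrestricted model B

print(f"A: SST={sst_a:.3f}  SSR={ssr_a:.3f}  SSE={sse_a:.3f}")
print(f"B: SST={sst_b:.3f}  SSR={ssr_b:.3f}  SSE={sse_b:.3f}")
```

Both fits share the same SST because it depends only on y. Each fit satisfies SST = SSE + SSR, and adding x3 cannot raise SSE.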

Another example: C has independent variables $(1, x_1, x_2 + x_3)$; D has $(1, x_2 + x_3)$.

C and A are not nested.
C and B are nested. One restriction in C: $\beta_2 = \beta_3$.
C and D are nested. One restriction in D: $\beta_1 = 0$.
D and A are not nested.
D and B are nested. Two restrictions in D: $\beta_2 = \beta_3$; $\beta_1 = 0$.

We can always test hypotheses implied in the restricted models. Steps: run two regressions for each hypothesis, one for the restricted model and one for the unrestricted model. The SST should be the same across the two models. What is different is SSE and SSR. That is, what is different is R2. Let the subscripts r and u denote the restricted and the unrestricted model, respectively. Use the following formulas:

$$F = \frac{(SSE_r - SSE_u) / [df(SSE_r) - df(SSE_u)]}{SSE_u / df(SSE_u)}$$

or

$$F = \frac{(SSR_u - SSR_r) / [df(SSR_u) - df(SSR_r)]}{SSE_u / df(SSE_u)}$$

(proof: use SST = SSE + SSR).

Note: $df(SSE_r) - df(SSE_u) = df(SSR_u) - df(SSR_r) = \Delta df$ is the number of constraints (not the number of parameters) implied by the restricted model,

or

$$F = \frac{(R_u^2 - R_r^2) / \Delta df}{(1 - R_u^2) / df(SSE_u)}$$

Note that $F_{1,\,df} = t_{df}^2$. That is, for 1-df tests, you can do either an F-test or a t-test. They yield the same result. Another way to look at it is that the t-test is a special case of the F-test, with the numerator DF being 1.

IX. Assumptions of F-tests

What assumptions do we need to make an ANOVA table work? Not much of an assumption. All we need is the assumption that (X'X) is not singular, so that the least squares estimate b exists.

The assumption of $\mathrm{Cov}(x_k, \varepsilon) = 0$ is needed if you want the ANOVA table to be an unbiased estimate of the true ANOVA (equation 1) in the population. Reason: we want b to be an unbiased estimator of $\beta$, and the covariance between b and $\varepsilon$ to disappear.

For reasons I discussed earlier, the assumptions of homoscedasticity and non-serial correlation are necessary for the estimation of $V(\varepsilon_i)$.

The normality assumption, that $\varepsilon_i$ follows a normal distribution, is needed for small samples.

X. The Concept of Increment

Every time you put one more independent variable into your model, you get an increase in $R^2$. We sometimes call the increase "incremental $R^2$." What it means is that more variance is explained: SSR is increased and SSE is reduced. What you should understand is that the incremental $R^2$ attributed to a variable is always smaller than the $R^2$ when other variables are absent.

XI. Consequences of Omitting Relevant Independent Variables

Say the true model is the following:

$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \varepsilon_i$$

But for some reason we only collect or consider data on $y$, $x_1$, and $x_2$. Therefore, we omit $x_3$ in the regression. That is, we omit $\beta_3 x_3$ in our model. We briefly discussed this problem before. The short story is that we are likely to have a bias due to the omission of a relevant variable in the model. This is so even though our primary interest is to estimate the effect of $x_1$ or $x_2$ on y. Why? We will have a formal presentation of this problem.

XII. Measures of Goodness-of-Fit

There are different ways to assess the goodness-of-fit of a model.

A. R2

R2 is a heuristic measure for the overall goodness-of-fit. It does not have an associated test statistic. R2 measures the proportion of the variance in the dependent variable that is "explained" by the model: $R^2 = SSR / SST$.

B. Model F-test

The model F-test tests the joint hypothesis that all the model coefficients except for the constant term are zero. Degrees of freedom associated with the model F-test:

Numerator: p - 1
Denominator: n - p

C. t-tests for individual parameters

A t-test for an individual parameter tests the hypothesis that a particular coefficient is equal to a particular number (commonly zero).

$t_k = (b_k - \beta_{k0}) / SE_k$, where $\beta_{k0}$ is the hypothesized value and $SE_k$ is the square root of the (k, k) element of $MSE\,(X'X)^{-1}$, with degrees of freedom n - p.

D. Incremental R2

Relative to a restricted model, the gain in R2 for the unrestricted model:

$$\Delta R^2 = R_u^2 - R_r^2$$

E. F-tests for Nested Models

This is the most general form of F-tests and t-tests. It is equal to a t-test if the unrestricted and restricted models differ by only one parameter. It is equal to the model F-test if we set the restricted model to the constant-only model.

Ask students: What are SST, SSE, and SSR, and their associated degrees of freedom, for the constant-only model?

Numerical Example

A sociological study is interested in understanding the social determinants of mathematical achievement among high school students. You are now asked to answer a series of questions. The data are real but have been tailored for educational purposes. The total number of observations is 400. The variables are defined as:

y: math score
x1: father's education
x2: mother's education
x3: family's socioeconomic status
x4: number of siblings
x5: class rank
x6: parents' total education (note: x6 = x1 + x2)

For the following regression models, we know:

Table 1

Model                        SST      SSR     SSE     DF    R2
(1) y on (1 x1 x2 x3 x4)     34863    4201
(2) y on (1 x6 x3 x4)        34863                    396   .1065
(3) y on (1 x6 x3 x4 x5)     34863    10426   24437   395   .2991
(4) x5 on (1 x6 x3 x4)       269753                   396   .0210

1. Please fill the missing cells in Table 1.
2. Test the hypothesis that the effects of father's edu
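For question 1, the known cells of Table 1 are enough to recover the missing ones from two identities used in these notes, SST = SSR + SSE and R2 = SSR / SST. The sketch below is one way to do the arithmetic. It assumes that the DF column reports the error degrees of freedom n - p, which matches the printed 396 and 395. Values derived from a four-decimal R2 are only approximate, and the names `TABLE`, `fill_row`, and `FILLED` are made up for the illustration.

```python
# Sketch: fill the missing cells of Table 1 from SST = SSR + SSE and R2 = SSR / SST.
# Assumption: the DF column is the error degrees of freedom, n - p.

N_OBS = 400  # total number of observations stated in the example

# Known cells copied from Table 1; None marks a missing cell.
# "p" is the number of regressors, counting the constant.
TABLE = {
    "(1) y on (1 x1 x2 x3 x4)": {"p": 5, "SST": 34863.0, "SSR": 4201.0, "SSE": None, "R2": None},
    "(2) y on (1 x6 x3 x4)": {"p": 4, "SST": 34863.0, "SSR": None, "SSE": None, "R2": 0.1065},
    "(3) y on (1 x6 x3 x4 x5)": {"p": 5, "SST": 34863.0, "SSR": 10426.0, "SSE": 24437.0, "R2": 0.2991},
    "(4) x5 on (1 x6 x3 x4)": {"p": 4, "SST": 269753.0, "SSR": None, "SSE": None, "R2": 0.0210},
}


def fill_row(row, n_obs):
    """Return a copy of the row with SSR, SSE, R2 and DF filled in."""
    filled = dict(row)
    if filled["SSR"] is None:
        filled["SSR"] = filled["R2"] * filled["SST"]  # R2 = SSR / SST
    if filled["SSE"] is None:
        filled["SSE"] = filled["SST"] - filled["SSR"]  # SST = SSR + SSE
    if filled["R2"] is None:
        filled["R2"] = filled["SSR"] / filled["SST"]
    filled["DF"] = n_obs - filled["p"]  # error degrees of freedom (assumed meaning)
    return filled


FILLED = {name: fill_row(row, N_OBS) for name, row in TABLE.items()}

for name, row in FILLED.items():
    print(f"{name:26s} SST={row['SST']:9.0f} SSR={row['SSR']:8.0f} "
          f"SSE={row['SSE']:9.0f} DF={row['DF']} R2={row['R2']:.4f}")
```

Row (3) is already complete in the table, so it serves as a check: the computed DF agrees with the printed 395, and 10426 + 24437 equals the printed SST.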
