Report

Group #4
AMS 572 – Data Analysis
Professor: Wei Zhu

Lin Wang (Lana), Xian Lin (Ben), Zhide Mo (Jeff), Miao Zhang, Yuan Bian, Juan E. Mojica, Ruofeng Wen, Hemal Khandwala, Lei Lei, Xiaochen Li (Joe)

ANCOVA: Analysis of Covariance
A merge of ANOVA (Analysis of Variance) and Linear Regression

ANOVA
• described by R. A. Fisher to assist in the analysis of data from agricultural experiments
• compares the means of any number of experimental conditions without any increase in Type I error (H0 is rejected when it is true)
• a way of determining whether the average scores of groups differ significantly
• in psychology: assesses the average effect of different experimental conditions on subjects in terms of a particular dependent variable

R. A. Fisher (Feb. 17, 1890 – July 29, 1962)
• an English statistician, evolutionary biologist, and geneticist
• known for Analysis of Variance (ANOVA), maximum likelihood, the F-distribution, etc.

Regression
• developed and applied in different areas from those of ANOVA
• developed in biology and psychology
• the term "regression" was coined by Francis Galton in the nineteenth century to describe a biological phenomenon
• Galton studied the heights of parents and their adult children: the children of short parents are usually shorter than average, but still taller than their parents (Regression toward the Mean)
[Figure: heights 5'4'', 5'6'' and 5'8'' shown against the average height 5'9'']
• applied to data obtained from correlational or non-experimental research
• helps us understand the effect of changing one independent variable on the value of the dependent variable

Francis Galton (Feb. 16, 1822 – Jan. 17, 1911)
• English anthropologist, eugenicist, and statistician
• widely promoted regression toward the mean
• created the statistical concept of correlation
• a pioneer in eugenics, coined the term in 1883
• the first to apply statistical methods to the study of human differences

ANCOVA
• a statistical technique that combines regression and ANOVA (analysis of variance)
• originally developed by R. A.
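Fisher to increase the precision of experimental analysis
• applied most frequently in quasi-experimental research, which involves variables that cannot be controlled directly

One-Way ANOVA: Data Layout
Balanced design if $n_i \equiv n$.

Treatment      1                    2                    …    a
               $y_{11}$             $y_{21}$             …    $y_{a1}$
               $y_{12}$             $y_{22}$             …    $y_{a2}$
               …                    …                         …
               $y_{1n_1}$           $y_{2n_2}$           …    $y_{an_a}$
Sample Mean    $\bar{y}_{1\cdot}$   $\bar{y}_{2\cdot}$   …    $\bar{y}_{a\cdot}$
Sample SD      $s_1$                $s_2$                …    $s_a$

One-Way ANOVA: Model
• $Y_{ij} = \mu_i + \varepsilon_{ij}$, where $i = 1, 2, \dots, a$; $j = 1, 2, \dots, n_i$
• $Y_{ij} \sim N(\mu_i, \sigma^2)$, $\varepsilon_{ij} \sim N(0, \sigma^2)$
• $\mu_i = \mu + \alpha_i$, where $\mu$ is the grand mean, so that $Y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$
• $\mu$ (grand mean), $\alpha_i = \mu_i - \mu$

One-Way ANOVA: Sums of Squares
• the factor A sum of squares: $SSA = \sum_{i=1}^{a} n_i (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2$
• the factor A mean square, with $a - 1$ d.f.: $MSA = \frac{SSA}{a-1} = \frac{\sum_{i=1}^{a} n_i (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2}{a-1}$
• the total sum of squares: $SST = \sum_{i=1}^{a} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{\cdot\cdot})^2$
• ANOVA identity, $SST = SSA + SSE$:
$\sum_{i=1}^{a} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{\cdot\cdot})^2 = \sum_{i=1}^{a} n_i (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2 + \sum_{i=1}^{a} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i\cdot})^2$

ANOVA Table (with $N = \sum_{i=1}^{a} n_i$)

Source of Variance   Sum of Squares                                              Degrees of Freedom   Mean Square           F
Treatments           $SSA = \sum_i n_i (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2$   $a - 1$              $MSA = SSA/(a-1)$     $F = MSA/MSE$
Error                $SSE = \sum_i \sum_j (y_{ij} - \bar{y}_{i\cdot})^2$              $N - a$              $MSE = SSE/(N-a)$
Total                $SST = \sum_i \sum_j (y_{ij} - \bar{y}_{\cdot\cdot})^2$          $N - 1$

From ANOVA and Regression to ANCOVA

ANOVA model: $Y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$
• $Y_{ij}$: data, the jth observation of the ith group
• $\mu$: grand mean of Y
• $\alpha_i$: effect of the ith group (we focus on whether $\alpha_i = 0$, $i = 1, \dots, a$)
• $\varepsilon_{ij}$: error, $N(0, \sigma^2)$

Regression model: $Y_{ij} = \beta_0 + \beta_1 X_{ij} + \varepsilon_{ij}$
• $Y_{ij}$: data, the (ij)th observation
• $X_{ij}$: predictor
• $\beta_1$, $\beta_0$: slope and intercept (we focus on the estimates)
• $\varepsilon_{ij}$: error

ANCOVA model: $Y_{ij} = \mu + \alpha_i + \beta (X_{ij} - \bar{X}_{\cdot\cdot}) + \varepsilon_{ij}$
• $\alpha_i$: effect of the ith group (we still focus on whether $\alpha_i = 0$, $i = 1, \dots, a$)
• $X_{ij}$: known covariate (What is this guy doing here?)

Move the covariate term to the left-hand side:
$Y_{ij}(\text{adjust}) = Y_{ij} - \beta (X_{ij} - \bar{X}_{\cdot\cdot})$
$Y_{ij}(\text{adjust}) = \mu + \alpha_i + \varepsilon_{ij}$ (This is just the ANOVA model!)

Estimating $\hat{\beta}$
Compare $Y_{ij} = \beta_0 + \beta_1 X_{ij} + \varepsilon_{ij}$ with $Y_{ij} = \mu + \alpha_i + \beta (X_{ij} - \bar{X}_{\cdot\cdot}) + \varepsilon_{ij}$. Within each group, consider $\alpha_i$ a constant, and notice that we only need the estimate of the slope $\beta$, not the intercept.

• Within each group, do least squares:
$\hat{\beta}_i = \frac{\sum_j (X_{ij} - \bar{X}_{i\cdot})(Y_{ij} - \bar{Y}_{i\cdot})}{\sum_j (X_{ij} - \bar{X}_{i\cdot})^2}$
• Assume that $\beta_1 = \dots = \beta_a = \beta$
• We use the pooled estimate of $\beta$:
$\hat{\beta} = \frac{\sum_i \hat{\beta}_i \sum_j (X_{ij} - \bar{X}_{i\cdot})^2}{\sum_i \sum_j (X_{ij} - \bar{X}_{i\cdot})^2} = \frac{\sum_i \sum_j (X_{ij} - \bar{X}_{i\cdot})(Y_{ij} - \bar{Y}_{i\cdot})}{\sum_i \sum_j (X_{ij} - \bar{X}_{i\cdot})^2}$

ANCOVA begins:
1. In each group, find the slope estimate via linear regression: $\hat{\beta}_i = \frac{\sum_j (X_{ij} - \bar{X}_{i\cdot})(Y_{ij} - \bar{Y}_{i\cdot})}{\sum_j (X_{ij} - \bar{X}_{i\cdot})^2}$
2. Pool them together: $\hat{\beta} = \frac{\sum_i \hat{\beta}_i \sum_j (X_{ij} - \bar{X}_{i\cdot})^2}{\sum_i \sum_j (X_{ij} - \bar{X}_{i\cdot})^2}$
3. Get rid of the covariate: $Y_{ij}(\text{adjust}) = Y_{ij} - \hat{\beta} (X_{ij} - \bar{X}_{\cdot\cdot})$
4. Do ANOVA on the model $Y_{ij}(\text{adjust}) = \mu + \alpha_i + \varepsilon_{ij}$
5. Go home and have dinner (yummy cheeseburger, iced Coke?).

General Linear Model
Covers Regression and ANOVA / ANCOVA.

Simple linear regression: $Y = \beta_0 + \beta X + \varepsilon$
• $Y$: response variable
• $X$: predictor
• $\beta_0$: intercept
• $\beta$: slope
• $\varepsilon$: error
All of them are scalars!

Matrix form: $Y = X\beta + \varepsilon$
$\begin{pmatrix} y_1 \\ \vdots \\ y_m \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1,(n-1)} & 1 \\ \vdots & & \vdots & \vdots \\ x_{m1} & \cdots & x_{m,(n-1)} & 1 \end{pmatrix} \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_n \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_m \end{pmatrix}$

Regression on a binary categorical variable: $Y_i = B_0 + B_1 Z_i + \varepsilon_i$
• $Y_i$: outcome of the ith unit
• $B_0$: coefficient for the intercept
• $B_1$: coefficient for the slope
• $Z_i$: categorical variable (binary); $Z_i = 1$ if the unit is in the treatment group, $Z_i = 0$ if the unit is in the control group
• $\varepsilon_i$: residual for the ith unit

Two-way ANOVA: $Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}$
• $Y_{ijk}$: response variable
• $\mu$: overall mean response
• $\alpha_i$: effect due to the ith level of factor A
• $\beta_j$: effect due to the jth level of factor B
• $(\alpha\beta)_{ij}$: effect due to any interaction between the ith level of A and the jth level of B
• $\varepsilon_{ijk}$: residual

General form: $y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_{p1} X_{i,p1} + \beta_{p2} X_{i,p2} + \varepsilon_i$
• $y_i$: the ith response variable
• the $X$ terms may be categorical variables or continuous variables
• $\varepsilon_i$: random error
The above formula can be simply denoted as $Y = X\beta + \varepsilon$.

What can this X be? Before we see an example of X, we have learned that the General Linear Model covers (1) Simple Linear Regression; (2) Multiple Linear Regression; (3) ANOVA; (4) 2-way/n-way ANOVA.

X in the GLM might be expanded as
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon$
where $X_3$ could be the INTERACTION between $X_1$ and $X_2$:
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \varepsilon$
Did you see the trick? Next, let us see what assumptions must be satisfied before using ANCOVA.

Before using ANCOVA…
1. Test the homogeneity of variance.
2. Test the homogeneity of regression, i.e. whether H0: $\beta_1 = \dots = \beta_i = \dots = \beta_a$ holds.
3. Test whether there is a linear relationship between the dependent variable and the covariate.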
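The recipe from the "ANCOVA begins" slide (per-group slopes, pooled slope, covariate adjustment, one-way ANOVA) can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function names `pooled_slope` and `ancova_by_adjustment` are hypothetical, and `toy_groups` is made-up data chosen only to exercise the formulas.

```python
from statistics import mean

def pooled_slope(groups):
    """Pooled within-group slope estimate (hypothetical helper).

    groups: list of (x_values, y_values) pairs, one pair per treatment group.
    """
    sxy = 0.0
    sxx = 0.0
    for xs, ys in groups:
        xbar, ybar = mean(xs), mean(ys)
        sxy += sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
        sxx += sum((x - xbar) ** 2 for x in xs)
    return sxy / sxx

def ancova_by_adjustment(groups):
    """Follow the slide recipe: adjust Y with the pooled slope, then do one-way ANOVA."""
    beta = pooled_slope(groups)
    x_grand = mean([x for xs, _ in groups for x in xs])
    # Get rid of the covariate: Y(adjust) = Y - beta * (X - grand mean of X)
    adjusted = [[y - beta * (x - x_grand) for x, y in zip(xs, ys)]
                for xs, ys in groups]
    a = len(adjusted)
    n_total = sum(len(g) for g in adjusted)
    y_grand = mean([y for g in adjusted for y in g])
    ssa = sum(len(g) * (mean(g) - y_grand) ** 2 for g in adjusted)
    sse = sum((y - mean(g)) ** 2 for g in adjusted for y in g)
    msa = ssa / (a - 1)          # a - 1 degrees of freedom, as in the ANOVA table
    mse = sse / (n_total - a)    # N - a degrees of freedom, as in the ANOVA table
    return {"beta": beta, "SSA": ssa, "SSE": sse, "F": msa / mse}

# Made-up toy data (not from the slides): two groups of three observations each.
toy_groups = [([1, 2, 3], [3, 5, 8]), ([2, 3, 4], [8, 10, 11])]
result = ancova_by_adjustment(toy_groups)
print(result)
```

The sketch uses the ANOVA degrees of freedom exactly as the recipe states. A stricter ANCOVA analysis would also charge one error degree of freedom for the estimated slope, giving N - a - 1; that refinement is left out here to stay close to the slides.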
Assumption 1: Test the homogeneity of variance
For each group i, calculate $MSE_i$:
$MSE_i = SSE_i / df = SSE_i / (n - 2)$
Use $\max_i(MSE_i)$ and $\min_i(MSE_i)$ in an Fmax test to make sure that $\sigma^2$ is constant across the different levels:
$F = \max_i(MSE_i) / \min_i(MSE_i)$

Assumption 2: Test the homogeneity of regression, H0: $\beta_1 = \dots = \beta_i = \dots = \beta_a$
(1) Define $SSE_G = \sum_{i=1}^{a} SSE_i$, the Sum of Squares of Errors within Groups. Each $SSE_i$ is calculated from its own $\hat{\beta}_i$, so $SSE_G$ is generated by the random error $\varepsilon$ alone.
(2) $SSE$, calculated from a common $\hat{\beta}$, is generated by
• the random error $\varepsilon$
• the differences between the distinct $\hat{\beta}_i$
(3) Let $SSB = SSE - SSE_G$, the Sum of Squares between Groups. $SSB$ is made up of the differences between the $\hat{\beta}_i$.
(4) Degrees of freedom and mean squares:
$df_b = df_e - df_{eG} = [a(n-1) - 1] - a(n-2) = a - 1$
$MSB = SSB / df_b = SSB / (a - 1)$ (Mean Square between Groups)
$MSE_G = SSE_G / df_{eG} = SSE_G / [a(n-2)]$ (Mean Square within Groups)
Do an F test on $MSB$ and $MSE_G$ to see whether we can reject H0:
$F = MSB / MSE_G$

Assumption 3: Test for a linear relationship between the dependent variable and the covariate
H0: $\beta = 0$
How to do it? An F test on SSR (Sum of Squares of Regression) and SSE.

How to calculate SSR and MSR?
From each $x_i$, compute the fitted value $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$. SSR is the sum of the squared differences between $\hat{y}_i$ and $\bar{y}$:
$SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$, $MSR = SSR / 1$

How to calculate SSE and MSE?
From each $x_i$, compute the fitted value $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$. SSE is the sum of the squared differences between $y_i$ and $\hat{y}_i$:
$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, $MSE = SSE / (n - 2)$

Test statistic:
$F = MSR / MSE$
Based on the test statistic, we decide whether or not to reject H0 ($\beta = 0$). Assume Assumptions 1 and 2 have already passed.
• If H0 is not rejected ($\beta = 0$), we do ANOVA.
• Otherwise, we do ANCOVA.
So, any time we want to use ANCOVA, we need to test the three assumptions first!
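The three assumption checks above can be sketched with the Python standard library. This is only an illustration under stated assumptions: the helper names (`fit_line`, `f_max`, `homogeneity_of_regression_f`, `linearity_f`) are hypothetical, `toy_groups` is made-up data, and the code returns test statistics only, so critical values or p-values still have to be looked up separately. The within-group degrees of freedom are written as N - 2a, which equals a(n - 2) in the balanced case.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares intercept, slope and SSE for y = b0 + b1 * x."""
    xbar, ybar = mean(xs), mean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return b0, b1, sse

def f_max(groups):
    """Assumption 1: max(MSE_i) / min(MSE_i), with MSE_i = SSE_i / (n_i - 2)."""
    mses = [fit_line(xs, ys)[2] / (len(xs) - 2) for xs, ys in groups]
    return max(mses) / min(mses)

def homogeneity_of_regression_f(groups):
    """Assumption 2: F = MSB / MSE_G for H0: beta_1 = ... = beta_a."""
    a = len(groups)
    n_total = sum(len(xs) for xs, _ in groups)
    # SSE_G: every group keeps its own slope.
    sse_g = sum(fit_line(xs, ys)[2] for xs, ys in groups)
    # SSE: one common (pooled) slope, separate group means.
    sxx = sxy = syy = 0.0
    for xs, ys in groups:
        xbar, ybar = mean(xs), mean(ys)
        sxx += sum((x - xbar) ** 2 for x in xs)
        sxy += sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
        syy += sum((y - ybar) ** 2 for y in ys)
    sse_common = syy - sxy ** 2 / sxx
    ssb = sse_common - sse_g
    msb = ssb / (a - 1)
    mse_g = sse_g / (n_total - 2 * a)
    return msb / mse_g

def linearity_f(xs, ys):
    """Assumption 3: F = MSR / MSE for H0: beta = 0 in a simple regression."""
    b0, b1, sse = fit_line(xs, ys)
    ybar = mean(ys)
    ssr = sum((b0 + b1 * x - ybar) ** 2 for x in xs)
    return (ssr / 1) / (sse / (len(xs) - 2))

# Made-up toy data (not from the slides): two groups of four observations each.
toy_groups = [([1, 2, 3, 4], [2, 4, 5, 8]), ([1, 2, 3, 4], [3, 4, 7, 8])]
all_x = [x for xs, _ in toy_groups for x in xs]
all_y = [y for _, ys in toy_groups for y in ys]

fmax_stat = f_max(toy_groups)
slopes_f = homogeneity_of_regression_f(toy_groups)
linear_f = linearity_f(all_x, all_y)
print(fmax_stat, slopes_f, linear_f)
```

Each statistic is then compared with the appropriate critical value: the Fmax table for the first, and the F distribution with (a - 1, N - 2a) and (1, n - 2) degrees of freedom for the second and third.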
Example: Tree-Watering Study
• In this hypothetical study, a sample of 36 teams (id in the data set) of 12-year-old children attending a summer camp participated in a study to determine which one of three different tree-watering techniques worked best to promote tree growth.

Techniques                                      Frequency                 Code
Watering the base with a hose                   10 minutes once per day   1
Watering the ground surrounding (drip system)   2 hours each day          2
Deep watering (sunk pipe)                       10 minutes every 3 days   3

• From a large set of equally sized and equally healthy fast-growing trees, each team was given a tree to plant at the start of the camp.
• Each team was responsible for the watering and general care of its tree.
• At the end of the summer, the height of each tree was measured.

Possible concerns:
• Some children might have had more gardening experience than others.
• Any knowledge gained as a result of that prior experience might affect the way the tree was planted, and perhaps even the way in which the children cared for the tree and carried out the watering regime.
How to approach this? Create an indicator for that knowledge (e.g. a 40-point scale of gardening experience).

Real Data

            Grouping (1,2,3)     Dependent Variable   Covariate Variable
id          watering technique   tree growth (dv)     gardening exp (cov)
1           1                    39                   24
2           1                    36                   18
3           1                    30                   21
4           1                    42                   24
…           …                    …                    …
32          3                    36                   15
33          3                    30                   18
34          3                    39                   18
35          3                    27                   9
36          3                    24                   6

Model for the data: $Y_{ij} = \mu + \alpha_i + \beta (X_{ij} - \bar{X}_{\cdot\cdot}) + \varepsilon_{ij}$
• $Y_{ij}$: response (tree growth, dv)
• $\mu$: overall mean
• $\alpha_i$: effect of the ith watering technique
• $\beta$: regression coefficient parameter
• $X_{ij}$: covariate (gardening experience, cov)
• $\varepsilon_{ij}$: residual error

Analysis in SAS
Checks before ANCOVA: homogeneity of regression, homogeneity of variance and normality of dv, linearity of regression.

Pearson correlation:
$\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$
$r_{XY} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2} \sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$
The Pearson correlation coefficient between the covariate and the dependent variable is .81150.

Assumptions
• Linearity of regression: there is clearly a strong linear component to the relationship, so the linearity of regression assumption appears to be met by the data set.
• Homogeneity of regression: this assumption is tested by examining the interaction of the covariate and the independent variable. If the interaction is not statistically significant, as is the case here, then the assumption is met.

ANCOVA Results
• The Model contains the effects of both the covariate and the independent variable. The effects of the covariate and the independent variable are separately evaluated in the summary table.
• Watering techniques coded as 1 (hose watering) and 3 (deep watering) are the only two groups whose means differ significantly.

Conclusions
• We can assert that prior gardening experience and knowledge was quite influential in how well the trees fared under the attention of the young campers.
• When we statistically control for or equate the gardening experience and knowledge of the children, the watering technique was a relatively strong factor in how much growth was seen in the trees.
• On the basis of the adjusted means, we may therefore conclude that, when we statistically control for gardening experience, deep watering is more effective than hose watering but is not significantly more effective than drip watering.

SAS Steps
• Data set variables: GROUP VARIABLE, DEPENDENT VARIABLE and COVARIATE
• Tasks->Graph->Scatter Plot
• Tasks->ANOVA->Linear Models
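The Pearson correlation used for the linearity check can also be computed directly from its definition. The sketch below is an illustration only: the function name `pearson_r` is hypothetical, and the inputs are just the nine rows visible in the data slide (ids 1 to 4 and 32 to 36), so the result is not expected to reproduce the reported .81150, which is based on all 36 teams.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Sample Pearson correlation r_XY computed from its definition."""
    xbar, ybar = mean(xs), mean(ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / (sqrt(sxx) * sqrt(syy))

# Only the rows visible on the data slide (ids 1-4 and 32-36); the other
# 27 rows are elided in the slide and are deliberately not filled in here.
cov = [24, 18, 21, 24, 15, 18, 18, 9, 6]     # gardening experience
dv = [39, 36, 30, 42, 36, 30, 39, 27, 24]    # tree growth

r_visible = pearson_r(cov, dv)
print(r_visible)
```

With the full data set loaded, the same function could be applied to the complete cov and dv columns to check the linearity of regression assumption before running the ANCOVA.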